Alpha risk is the risk of mistakenly concluding that there is a difference when really there is none. Alpha risk is controlled by adopting a probability level for the discrepancy between the results and the null hypothesis beyond which the result will be considered to be a significant departure from what the null hypothesis predicts. Other things being equal, reducing alpha risk entails increasing beta risk. To reduce both alpha risk and beta risk simultaneously for a given procedure, we require larger amounts of data. The 'p value' quoted in usual tests of significance is the level of alpha risk. Alpha risk is also called 'the risk of Type 1 error'.
Assessor. The term is a generic one referring to anyone who appraises samples by means of the senses.
The best estimate of the probability of success in a difference test is the proportion of trials that resulted in success. However, this value is only an estimate and there is uncertainty about its true value. A confidence interval around the best estimate gives a range of values within which the true answer is expected to lie.
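As a sketch of how such an interval can be calculated, the Wilson score method is one common approach for a proportion of successful trials (DiffTest's own calculation may differ; the function name and the example figures of 15 correct trials out of 30 are illustrative, not taken from DiffTest):

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% confidence interval (Wilson score method)
    for the true probability of success, given the observed counts."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return centre - half, centre + half

# Example: 15 successful trials out of 30
lo, hi = wilson_interval(15, 30)
```

With 15 successes in 30 trials the best estimate is 0.5, and the interval (roughly 0.33 to 0.67) shows how wide the uncertainty remains with that amount of data.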
Beta risk is the risk of failing to conclude that there is a difference when there really is one. Beta risk is related to the power of the test procedure and the size of difference that counts as a real one. Other things being equal, reducing beta risk entails increasing alpha risk. To reduce both alpha risk and beta risk simultaneously for a given procedure, we require larger amounts of data. The 'p value' quoted in usual tests of significance is the level of alpha risk. Beta risk is not as widely quoted as alpha risk but is an essential ingredient in some procedures for similarity testing and tables of beta risk are available.
Chance probability is the probability of a trial being correct if only chance determines its outcome. This would be the case, for instance, if no difference between samples is perceived by the assessor, who therefore must choose a response at random.
Confidence bounds are upper and lower limits to the values of any uncertain estimate. The range of values between the bounds is called the confidence interval, and we feel reasonably confident that the true value of the thing estimated lies within that interval. How confident we can be in that belief depends on how the bounds were calculated.
Confidence interval. This is often described as the range of values within which we can be 95% sure that the true answer lies when we estimate something with some uncertainty, though this description is only approximate. (A precise definition can be derived from the definition of the confidence bounds. See also the examples of the use of confidence bounds in DiffTest.) Although 95% is the most usual value, for some purposes we might be interested in some other confidence interval, most often 90% or 99%, if the degree of confidence we want is less or more, respectively. To achieve greater confidence that the answer lies within the calculated interval, the interval naturally has to be wider, with bounds further apart. DiffTest allows you to choose degrees of confidence quite flexibly by changing the probability in the 'Set the bounds' window. It is possible to set the probability of the upper bound being exceeded to any value from 0.005 to 0.25. However, only a few of the possible values for confidence bounds are in widespread use; the most usual confidence levels are 90%, 95% and 99%.
Detectors. When the results of a sensory difference test indicate that assessors are not choosing at random, it is often convenient to speak as though some (the detectors) are detecting the difference with certainty while others are choosing at random. This picture of what occurs should not be taken seriously; it is much more likely that all or most assessors are choosing with above-chance success. Nevertheless, it can be a convenient way to explain the results to non-specialists. The proportion of ‘detectors’ is numerically equivalent to the Discrimination index.
Difference test. In a sensory difference test, assessors are asked to perform a task such as matching one item to one of a selection of other items. The task can be performed with an above-chance probability of success only if at least some assessors detect a difference between the items. The main outcome of a sensory difference test is the number of trials whose outcome was successful.
Directional test. If only one particular direction of departure from the chance probability of success will count as evidence of a sensory difference, the appropriate significance test is said to be directional. In some contexts it may be referred to as a one-tailed test.
Discrimination index, D, is supposed to indicate the proportion of trials that are correct for some reason other than chance, though this concept should not be taken any more seriously than that of detectors. Both are just metaphors that simplify the description of results, but they can be useful in that capacity.
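The usual calculation behind this metaphor rescales the observed proportion of correct trials against the chance probability. The helper below is an illustrative sketch (the function name and the example counts are assumptions, not DiffTest's own code):

```python
def discrimination_index(correct, trials, p_chance):
    """Estimated proportion of trials 'correct for a reason other
    than chance': D = (p_observed - p_chance) / (1 - p_chance),
    floored at zero when performance falls below chance."""
    p_obs = correct / trials
    return max(0.0, (p_obs - p_chance) / (1 - p_chance))

# Example: 20 correct trials out of 30 in a triangle test (chance = 1/3)
d = discrimination_index(20, 30, 1 / 3)  # D = 0.5
```

Here two-thirds of trials were correct; after allowing for the third expected by chance alone, half the trials are attributed to 'detection', which is also the notional proportion of 'detectors'.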
Forced choice. In a sensory difference test, it is good practice to require the assessors to choose even when they are uncertain. If they are allowed to respond that they don’t know which sample to choose, or to respond that there are no differences among the samples, the personality variable of willingness to give an opinion becomes confused with the detectability of the sensory attribute being studied. Also, there is ambiguity about how the ‘don’t know’ responses should be used in the analysis. In a forced-choice procedure, assessors are instructed not to opt out of choosing and to guess at random if necessary.
The null hypothesis is the supposition that nothing but chance has influenced the data. In the case of a difference test, it is that there is no detectable difference between the samples, so the assessors have had to make their choices at random.
Statistical power is the ability of a procedure to reject the null hypothesis when it is mistaken. The power of a difference test depends on the nature of the task and the amount of data obtained. Power is greater if the probability of obtaining a correct response on each trial by chance alone is smaller or if the number of trials is greater.
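A sketch of how power can be computed for a one-tailed binomial difference test follows (function names, the alpha level and the example figures are illustrative assumptions, not DiffTest's internals):

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) when X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, p_chance, p_alt, alpha=0.05):
    """Power of a one-tailed binomial difference test with n trials,
    chance probability p_chance, against an alternative success
    probability p_alt."""
    # Smallest number of correct trials that is significant at level
    # alpha under the null (chance-only) hypothesis.
    k_crit = next(k for k in range(n + 1)
                  if upper_tail(k, n, p_chance) <= alpha)
    # Probability of reaching that count when the true success
    # probability is p_alt.
    return upper_tail(k_crit, n, p_alt)
```

For instance, a triangle test (chance probability 1/3) with 60 trials has higher power than one with 30 trials against the same alternative, illustrating that power grows with the number of trials; beta risk is simply one minus this power.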
Statistical significance. An outcome is said to be significant if the probability of obtaining the observed result (or results that are even more extreme) is sufficiently small if nothing but chance influences the outcome (that is, if the null hypothesis is true). Usually, a probability of 0.05 (1 in 20) is considered small enough but other probabilities such as 0.01 (1 in 100) or 0.1 (1 in 10) are sometimes used. If the outcome is agreed to be significant, the conclusion is drawn that something other than chance influenced the results. That is, the null hypothesis is rejected.
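The probability described here can be computed with an exact binomial tail sum; the sketch below assumes a directional (one-tailed) test, and the function name and example counts are illustrative:

```python
from math import comb

def p_value(correct, trials, p_chance):
    """One-tailed p-value: the probability of at least `correct`
    successes in `trials` when only chance (success probability
    p_chance on each trial) influences the outcome."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(correct, trials + 1))

# Example: 17 correct out of 30 triangle-test trials (chance = 1/3)
p = p_value(17, 30, 1 / 3)
```

If this p-value falls below the agreed level (usually 0.05), the null hypothesis is rejected and the result is declared significant.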
Task is a term used in DiffTest to refer to the requirements that the difference test makes of an assessor. For instance, in a triangle test, the ‘task’ is to select the one odd item from a set of three samples.
A trial is a single occasion on which a set of samples is presented to an assessor. In some procedures, several samples may be presented simultaneously, and the trial consists of their presentation and the assessor’s response or responses. The trial is scored as successful if all responses are correct and unsuccessful if any of the responses are wrong. For instance, in the ‘2 out of 5 test’, three samples of one kind (A) and two of another kind (B) are presented, and the assessor is instructed to separate them into groups of two and three with all samples in each group being of the same kind. In a triangle test, three samples are presented to each assessor. Two samples are identical and one is different in some way. The task for the assessor is to indicate which of the three is the odd one. If the difference is undetectable to the assessor, the probability of making the correct choice is one in three (0.333).
