Difference Testing Glossary

This glossary mainly contains terms relating to sensory difference and similarity testing. A larger, more general glossary of basic statistical terms, contained in the program Introduction to Statistics, can be viewed here.

Alpha risk is the risk of mistakenly concluding that there is a difference when really there is none. Alpha risk is controlled by adopting a probability level for the discrepancy between the results and the null hypothesis, beyond which the result is considered a significant departure from what the null hypothesis predicts. Other things being equal, reducing alpha risk entails increasing beta risk. To reduce both alpha risk and beta risk simultaneously for a given procedure, we require larger amounts of data. The 'p value' quoted in usual tests of significance is the level of alpha risk. Alpha risk is also called the risk of Type I error.
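
For illustration, here is a minimal sketch (our own code, not DiffTest's) of how an alpha level translates into a critical number of correct trials. The function names and the choice of 30 triangle-test trials are assumptions made for the example.

```python
from math import comb

def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for a binomial(n, p) count: the probability of k or more
    correct trials out of n when chance alone operates."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n: int, p_chance: float, alpha: float = 0.05) -> int:
    """Smallest number of correct trials that is significant at the given alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p_chance) <= alpha:
            return k
    return n + 1  # no attainable result would be significant

# Triangle test (chance probability 1/3), 30 trials, alpha = 0.05:
print(critical_value(30, 1/3))  # 15: at least 15 correct trials are needed
```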

Assessor. The term is a generic one referring to anyone who appraises samples by means of the senses.

The best estimate of the probability of success in a difference test is the proportion of trials that resulted in success. However, this value is only an estimate and there is uncertainty about its true value. A confidence interval around the best estimate gives a range of values within which the true answer is expected to lie.

Beta risk is the risk of failing to conclude that there is a difference when there really is one. Beta risk is the complement of the power of the test procedure and depends on the size of difference that counts as a real one. Other things being equal, reducing beta risk entails increasing alpha risk. To reduce both alpha risk and beta risk simultaneously for a given procedure, we require larger amounts of data. Beta risk is not as widely quoted as alpha risk, but it is an essential ingredient in some procedures for similarity testing, and tables of beta risk are available. Beta risk is also called the risk of Type II error.
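
Continuing the sketch above (again our own illustration, not DiffTest's code), beta risk can be computed once we fix the true probability of a correct trial that a 'real' difference would produce. The figures here (30 triangle trials, critical value 15 at alpha = 0.05, true success probability 0.5) are assumptions for the example.

```python
from math import comb

def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for a binomial(n, p) count."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 30 triangle trials; 15 or more correct is significant at alpha = 0.05.
# Suppose a difference that counts as real raises the probability of a
# correct trial from the chance level (1/3) to 0.5:
n, k_crit, p_true = 30, 15, 0.5
beta = 1 - upper_tail(k_crit, n, p_true)  # probability of fewer than 15 correct
print(round(beta, 2))  # about 0.43: the test often misses this difference
```

Note how large the beta risk is even with 30 trials; reducing it while holding alpha fixed requires more trials, as the entry above notes.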

Chance probability is the probability of a trial being correct if only chance determines its outcome. This would be the case, for instance, if no difference between samples is perceived by the assessor, who therefore must choose a response at random.
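
As a small illustration (task list and names chosen by us), the chance probabilities for a few standard tasks can be computed directly; the '2 out of 5' value follows from counting the possible ways to pick the pair group.

```python
from fractions import Fraction
from math import comb

# Chance probability of a correct trial when the assessor must guess:
triangle    = Fraction(1, 3)            # pick the odd sample out of three
duo_trio    = Fraction(1, 2)            # match a reference against two samples
two_of_five = Fraction(1, comb(5, 2))   # one correct split out of C(5,2) = 10
print(triangle, duo_trio, two_of_five)  # 1/3 1/2 1/10
```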

Confidence bounds are upper and lower limits to the values of any uncertain estimate. The range of values between the bounds is called the confidence interval and we feel reasonably confident that the true value of the thing estimated is within that interval. How confident we can be in that belief depends on how the bounds were calculated.
Once data have been obtained from a sensory difference test, we can estimate the probability of a single trial being correct. If this value exceeds the probability of being correct if nothing but chance has produced the result (that is, if the assessors were not systematically perceiving a difference) it constitutes evidence - though not necessarily persuasive evidence - that some difference was being perceived.
The best estimate of the probability of a single trial being correct is the same as the obtained proportion of correct results. However, there is uncertainty about this estimate and the true probability may be either higher or lower than the estimate. Confidence bounds give a range of values around the best estimate within which the true answer may reasonably be expected to lie. The width of this interval depends on the amount of data - the more data the narrower the bounds.
Strictly speaking, DiffTest calculates each bound as the value just different enough from the best estimate that, if the bound were the true probability of a correct trial, the observed data would differ from it by an amount that was just significant. For this reason, the bounds are usually not equal distances from the best estimate. In this respect especially, the exact calculations of DiffTest can differ noticeably from bounds calculated using the normal distribution as an approximation.
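
This way of inverting a significance test is the standard exact (Clopper-Pearson) construction, and the sketch below implements it by bisection. It is our own illustration of the idea; DiffTest's internal routine may differ in detail, and the example figures (18 correct out of 30) are assumed for demonstration.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for a binomial(n, p) count."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def lower_tail(k, n, p):
    """P(X <= k) for a binomial(n, p) count."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_bounds(x, n, tail=0.025):
    """Each bound is the probability that would make the observed count x
    out of n just significant, with probability `tail` in each tail."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # lower bound: solve P(X >= x) = tail by bisection
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if upper_tail(x, n, mid) < tail else (lo, mid)
    lower = (lo + hi) / 2
    lo, hi = 0.0, 1.0
    for _ in range(60):  # upper bound: solve P(X <= x) = tail by bisection
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if lower_tail(x, n, mid) < tail else (mid, hi)
    upper = (lo + hi) / 2
    return lower, upper

# 18 correct trials out of 30; 0.025 in each tail gives a 95% interval:
lo, hi = exact_bounds(18, 30)
print(round(lo, 2), round(hi, 2))  # about 0.41 and 0.77, asymmetric around 0.6
```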

Confidence interval. This is often described as the range of values within which we can be 95% sure that the true answer lies when we estimate something with some uncertainty, though this description is only approximate. (A precise definition can be derived from the definition of the confidence bounds. See also examples here of the use of confidence bounds in DiffTest.)

Although 95% is the most usual value, for some purposes we might be interested in some other confidence interval, most often 90% or 99%, if the degree of confidence we want is less or more, respectively. To achieve greater confidence that the answer lies within the calculated interval, the interval naturally has to be wider, with bounds further apart.

DiffTest allows you to choose degrees of confidence quite flexibly by changing the probability in the Set the bounds window. It is possible to set the probability of the upper bound being exceeded to any value from 0.005 to 0.25, although only a few of the possible values for confidence bounds are in widespread use. Because the same probability applies to each bound, the confidence level is 1 − 2p. The most usual values are:

p (each bound)   0.005   0.025   0.05
Confidence       99%     95%     90%

Detectors. When the results of a sensory difference test indicate that assessors are not choosing at random it is often convenient to speak as though some (the detectors) are detecting the difference with certainty while others are choosing at random. This picture of what occurs should not be taken seriously - it is much more likely that all or most are choosing with above-chance success - but it can be a convenient way to explain the results to non-specialists.  The proportion of ‘detectors’ is numerically equivalent to the Discrimination index.

Difference test. In a sensory difference test, assessors are asked to perform a task such as matching one item to one of a selection of other items. The task can be performed with an above-chance probability of success only if at least some assessors detect a difference between the items. The main result of a sensory difference test is the number of trials whose outcome was successful.

Directional test. If only one particular direction of departure from the chance probability of success will count as evidence of a sensory difference, the appropriate significance test is said to be directional. In some contexts it may be referred to as a one-tailed test.

Discrimination index, D, is supposed to indicate the proportion of trials that are correct for some reason other than chance, though this concept should not be taken any more seriously than that of detectors. Both are just metaphors to simplify the description of results but can be useful in that capacity.
The discrimination index, D, is calculated from C, the observed proportion of correct choices, and p, the probability of a correct choice by chance alone, thus:

D = (C − p) / (1 − p)
A negative value of D is interpreted as zero since it occurs when the observed proportion of correct trials is below the chance level and hence gives no evidence that any discrimination is occurring. Values for D always lie between 0 (no discrimination) and 1 (perfect discrimination). The quantity D is also referred to as the proportion of Detectors or the Degree of discrimination.
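
A minimal sketch of the calculation (our own, with illustrative figures), including the clamp at zero described above:

```python
def discrimination_index(c: float, p: float) -> float:
    """D = (C - p) / (1 - p), taken as zero when the observed proportion
    correct falls below the chance probability."""
    return max(0.0, (c - p) / (1 - p))

# 18 correct out of 30 triangle trials: C = 0.6 with chance probability 1/3,
# so D = (0.6 - 1/3) / (1 - 1/3) = 0.4, i.e. 40% 'detectors'.
print(round(discrimination_index(0.6, 1/3), 2))  # 0.4
```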

Forced choice. In a sensory difference test, it is good practice to require the assessors to choose even when they are uncertain. If they are allowed to respond that they don't know which sample to choose, or that there are no differences among the samples, the personality variable of willingness to give an opinion becomes confounded with the detectability of the sensory attribute being studied. There is also ambiguity about how the 'don't know' responses should be used in the analysis. In a forced-choice procedure, assessors are instructed not to opt out of choosing and to guess at random if necessary.

The null hypothesis is the supposition that nothing but chance has influenced the data.  In the case of a difference test, it is that there is no detectable difference between the samples so the assessors have had to make their choices at random.
A test of statistical significance takes the null hypothesis as a starting point and calculates the probability of obtaining the observed data, or any result that departs even further from what the null hypothesis predicts. If that probability is small enough (conventionally, less than 0.05 or 5%), the data are said to be significantly different from the predictions of the null hypothesis, and the null hypothesis is rejected.
In the case of a difference test, rejecting the null hypothesis of no detectable difference leads us to conclude that there is a detectable difference between the samples.
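
A worked example (figures assumed for illustration): with 20 triangle-test trials, the null hypothesis fixes the probability of a correct trial at 1/3, and the significance test is an exact binomial tail calculation.

```python
from math import comb

def p_value(x, n, p_chance):
    """Probability, under the null hypothesis, of x or more correct trials
    out of n when each trial is correct with probability p_chance."""
    return sum(comb(n, i) * p_chance**i * (1 - p_chance)**(n - i)
               for i in range(x, n + 1))

# 12 correct choices in 20 triangle trials (chance probability 1/3):
print(round(p_value(12, 20, 1/3), 3))  # about 0.013 < 0.05: reject the null
```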

Statistical power is the ability of a procedure to reject the null hypothesis when it is false; numerically, power is 1 minus the beta risk. The power of a difference test depends on the nature of the task and the amount of data obtained: power is greater if the probability of obtaining a correct response on each trial by chance alone is smaller, or if the number of trials is greater.
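
The sketch below (our own illustration) makes this concrete by comparing two tasks at the same number of trials and the same degree of discrimination, using the detector metaphor to fix the true success probability; all figures are assumptions for the example.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for a binomial(n, p) count."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, p_chance, d, alpha=0.05):
    """Probability of a significant result when a proportion d of trials are
    'true detections', so the true success probability is p + d(1 - p)."""
    k_crit = next(k for k in range(n + 1) if upper_tail(k, n, p_chance) <= alpha)
    return upper_tail(k_crit, n, p_chance + d * (1 - p_chance))

# 30 trials and discrimination D = 0.3 in both cases; the task with the
# smaller chance probability gives the more powerful test:
print(round(power(30, 1/3, 0.3), 2))   # triangle test (chance 1/3): about 0.71
print(round(power(30, 1/10, 0.3), 2))  # 2 out of 5 test (chance 1/10): about 0.96
```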

Statistical significance. An outcome is said to be significant if the probability of obtaining the observed result (or results that are even more extreme) would be sufficiently small were nothing but chance influencing the outcome (that is, were the null hypothesis true). Usually, a probability of 0.05 (1 in 20) is considered small enough, but other probabilities such as 0.01 (1 in 100) or 0.1 (1 in 10) are sometimes used. If the outcome is agreed to be significant, the conclusion is drawn that something other than chance influenced the results; that is, the null hypothesis is rejected.

Task is a term used in DiffTest to refer to the requirements that the difference test makes of an assessor. For instance, in a triangle test, the ‘task’ is to select the one odd item from a set of three samples.

A trial is a single occasion on which a set of samples is presented to an assessor. In some procedures, several samples may be presented simultaneously, and the trial consists of their presentation and the assessor's response or responses. The trial is scored as successful if all responses are correct and unsuccessful if any response is wrong. For instance, in the '2 out of 5 test', three samples of one kind (A) and two of another kind (B) are presented, and the assessor is instructed to separate them into groups of two and three with all samples in each group being of the same kind.
The grouping (AB) (BAA) counts as a 'wrong' trial, and so does the grouping (AA) (BBA), even though the group (AA) consists only of items of the same kind. Only (AAA) (BB) counts as a correct trial.
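
The chance probability for this task can be confirmed by brute force (a small sketch of our own; the sample labels are arbitrary): a guessing assessor is effectively picking which two of the five samples form the pair group, and only one of the C(5,2) = 10 possible pairs is correct.

```python
from itertools import combinations

# Five samples: three of kind A and two of kind B, individually labelled.
samples = ['A1', 'A2', 'A3', 'B1', 'B2']
pairs = list(combinations(samples, 2))   # every possible group of two
correct = [p for p in pairs if p == ('B1', 'B2')]
print(len(correct), '/', len(pairs))     # 1 / 10: chance probability 0.1
```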

In a triangle test, three samples are presented to each assessor. Two samples are identical and one is different in some way. The task for the assessor is to indicate which of the three is the odd one. If the difference is undetectable to the assessor, the probability of making the correct choice is one in three (0.333).
If correct choices are made more than one time in three, the best estimate of the probability of making a correct choice is greater than 0.333 and this constitutes some degree of evidence that the difference between the two types of sample is detectable.
Other things being equal, this evidence is more convincing if there is a greater proportion of correct choices or if the number of trials is greater. If the proportion correct and the number of trials are sufficiently great, the result may be statistically significant.
The triangle test is a commonly used sensory difference test. When DiffTest starts, the settings in its Task panel are those for a triangle test.
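
A quick simulation (our own illustration) of the chance level: if the assessor perceives no difference and guesses, the long-run proportion of correct trials settles at the chance probability of one in three.

```python
import random

def guessed_triangle_trial(rng: random.Random) -> bool:
    """One simulated triangle trial in which the assessor perceives no
    difference and simply picks one of the three positions at random."""
    odd_position = rng.randrange(3)
    return rng.randrange(3) == odd_position

rng = random.Random(1)  # fixed seed so the run is reproducible
trials = 10_000
correct = sum(guessed_triangle_trial(rng) for _ in range(trials))
print(correct / trials)  # close to 1/3, the chance probability
```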
