Examples of DiffTest applications

The examples in this section are only illustrative – DiffTest can be used in other ways too – but working through specific examples is a good way to learn how to use DiffTest. The examples are:

Example 1: Product difference testing (a directional test)
Example 2. Comparison of consumer groups (a non-directional test)
Example 3. How much data for a similarity test?
Example 1: Product difference testing (a directional test)

A Duo-trio test has been run with a panel of 18 assessors and 13 of them have identified the correct member of the pair. Is the evidence strong enough to conclude that there is a perceptible difference between the two types of sample?

In a Duo-trio test, each assessor is first presented with a sample identified as the control, and then selects from a pair the sample that matches it. This is described to DiffTest by entering Select 1 out of 2 in the Task panel. The Chance probability of each correct trial becomes 0.500. The results are entered by setting the Total trials to 18 and the Number correct to 13. The Number wrong is immediately shown as 5 and the Exact significance of these results becomes 0.048. Since 0.048 is less than 0.05, the result meets the usual standard for being declared significant.
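The Exact significance here appears to be the upper tail of a binomial distribution: the probability of 13 or more correct responses out of 18 when each trial has a 0.5 chance of being correct. A minimal sketch of that calculation in Python (the function name is mine, not DiffTest's):

```python
from math import comb

def exact_significance(total_trials, number_correct, p_chance):
    # Probability of number_correct or more correct responses out of
    # total_trials when each trial is correct with probability p_chance.
    return sum(comb(total_trials, k) * p_chance**k
               * (1 - p_chance)**(total_trials - k)
               for k in range(number_correct, total_trials + 1))

# Duo-trio example: 13 correct out of 18 trials, chance probability 0.5
print(round(exact_significance(18, 13, 0.5), 3))  # 0.048
```

Because the calculation sums the exact binomial probabilities rather than using a normal approximation, it remains valid even for small panels.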
Example 2. Comparison of consumer groups (a non-directional test)

Large-scale consumer testing has shown that a certain product is preferred to its main competitor by 28.6% of all consumers in the USA. A much smaller trial in the UK resulted in 39 out of 100 consumers preferring it. How good is the evidence that UK consumers differ from USA consumers in this respect? Since the USA data came from a very large group, we can treat it as a population and ask if the UK sample of 100 consumers could plausibly have been drawn from a population with the same degree of preference for this product.

Note: If the USA data came from a group that was not much bigger than the UK group, this approach would not be appropriate. The analysis would require calculations that DiffTest does not provide. DiffTest compares the results from a single sample consisting of two types of outcome with their theoretical probabilities of occurrence. It does not provide for comparison of two samples.

If the UK group reflects the preference pattern of USA consumers, their probability of preferring the product is 0.286. When this is entered into the Chance p of each trial window, DiffTest's labelling changes in the way shown below. The Task description disappears and is replaced by a button permitting a return to the former display. Rather than correct and wrong, the outcomes are labelled just A or not A, and the particular respect in which the two types of outcome differ is not specified. Here, 'A' means 'preferring the product'.

Beware that the Exact significance of these results window is not an appropriate test of the hypothesis that the UK sample comes from a population with a probability of 0.286 of preferring the product. This is because the window shows directional significance. That is, it shows the probability of the mean of a sample of 100 departing from a population mean of 0.286 in only one direction, upwards – 39 or more of the consumer sample preferring the product.
But of course the UK consumers might prefer the product less often than the USA consumers, and that too would constitute a difference between them. Nor is it appropriate just to double the probability shown: with data consisting of proportions, the distribution of outcomes is not symmetrical, especially if the frequencies are small or the chance probability is very large or small. To deal with this, we use the Confidence bounds. In the Set the bounds window, the probability of each bound being exceeded has been set to 0.025. Consequently, the probability of one or the other being exceeded is twice that, 0.05. If the resulting bounds span an interval that does not include the chance probability (0.286), the result is significant at the 5% (0.05) level. Since the 95% confidence interval here spans 0.294 to 0.493, it does not include the USA proportion of preferences (0.286), so the UK results differ from it at least at the 5% level of significance. To see whether they differ at a more exacting level of significance, we can adjust the bounds to have a smaller probability of being exceeded.
If we modify the Set the bounds window to 0.005, the confidence interval is 99% rather than 95%. If it does not include 0.286 (the USA proportion of preferences) then the UK data will differ from that at the 1% (0.01) level of significance.
However, we see that the 99% bounds are 0.267 and 0.524 so the USA proportion does lie between them. Because it is inside the 99% confidence interval, the USA proportion is not extreme enough to make the UK result significantly different from it at the 1% level (though we have already seen that it is significantly different at the 5% level).
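DiffTest does not spell out its algorithm here, but bounds behaving as described can be computed as exact (Clopper-Pearson-style) binomial limits, found by bisection on the binomial tail probabilities. A sketch in Python, assuming that definition (function names are mine):

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_increasing(g, target):
    # Bisection for g(p) = target on [0, 1], with g increasing in p.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_bounds(n, k, tail_prob):
    # Each bound has probability tail_prob of being exceeded:
    # lower solves P(X >= k | p) = tail_prob,
    # upper solves P(X <= k | p) = tail_prob.
    lower = solve_increasing(lambda p: 1 - binom_cdf(k - 1, n, p), tail_prob)
    upper = solve_increasing(lambda p: 1 - binom_cdf(k, n, p), 1 - tail_prob)
    return lower, upper

lo95, hi95 = exact_bounds(100, 39, 0.025)  # 95% interval, ~0.294 to ~0.493
lo99, hi99 = exact_bounds(100, 39, 0.005)  # 99% interval, ~0.267 to ~0.524
print(round(lo95, 3), round(hi95, 3), round(lo99, 3), round(hi99, 3))
```

With tail probabilities of 0.025 the interval excludes 0.286, while with 0.005 it includes it, matching the two conclusions above.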
Example 3. How much data for a similarity test?

A sensory analyst hopes to show that a recent change in the source of supply for the product's raw materials has not had a noticeable effect on its sensory character. How much reassurance is desirable and how much is achievable? It is known that the new product is different – the question concerns only customer perceptions – and a moderate degree of detectability when the new product is compared with the old is acceptable. However, the difference should not be so noticeable that everyone detects it. Suppose company policy says that no more than 30% of experienced assessors should notice the difference even when old and new samples are directly compared. The analyst proposes to conduct triangle tests using old and new samples. How many triangles are needed?

When the proportion of assessors detecting the difference is determined from data coming from only a few individuals, it is essential to recognise that the answer can only be an estimate of what is true for a wider population. And because it is an estimate, it is unlikely to be exactly correct and may possibly be quite far wrong. We can calculate how far wrong the estimate may be – small errors are more likely than large errors. One way to express this is by the confidence interval for the estimate. If we are willing to accept a wide interval, we can arrange to be pretty confident that it includes the true proportion of detectors – but of course a wide interval means that we are not very sure what the true proportion is.

Note: The confidence interval is used when we are considering the accuracy of an estimate, so it allows for the estimate being either too high or too low, taking account of both types of error. For this reason, DiffTest labels the bounds as 95% if the probability of each type of error is 0.025, resulting in a 95% confidence interval. But in the present example, the analyst does not care if the true answer is lower than the estimate.
That would not be an error with adverse consequences for the company so only the probability of exceeding the upper bound is relevant. In the present example, the analyst wants to be pretty confident that no more than 30% of people resembling those used in the triangle tests will detect the difference, thus the upper bound of the estimate should at most be 30% (a probability of 0.30) with an acceptable degree of confidence. What is 'acceptable'? There is no answer that suits everyone, but it is quite common to be satisfied with an answer that has no more than a 5% chance (a probability of 0.05) of generating data more extreme than the results actually obtained. This may be referred to as a 95% level of confidence. All these considerations can be entered into the DiffTest calculator in the following way. The choice of triangle test is entered in the Task panel as a choice of 1 item from 3. The Chance probability of each correct trial is immediately calculated by DiffTest to be 0.333 (1/3).
The required level of confidence is selected by entering a probability of 0.05 in the Set the bounds window. This is appropriate since the only error the analyst is worried about is the true proportion being greater than the estimate. That is, only the probability of exceeding the upper bound is of concern and the analyst wants that to be less than 0.05. DiffTest labels the confidence interval as 90% because it allows for either bound being exceeded.
Now the analyst can explore the inferences that would follow from various sets of imaginary data. Let's begin by seeing what would follow if 24 triangles are conducted and they happen to result in exactly the chance expectation of correct choices – that is, 8 correct out of 24.
After entering 24 into the Total trials window and 8 into the Number correct window, the Number wrong window adjusts automatically to 16. At the same time, all the other windows adjust to show the consequences of all the above settings.
The Best estimate (in the light of the data) of the probability of being correct on a single trial is 0.333. This reflects the fact that in these imaginary results, 8 trials out of 24 (that is, 1/3 of them) were correct. The consequent estimated proportion of assessors detecting the difference is given by the Discrimination index and is 0.000 – in other words, the best estimate is that none of them detect the difference.

The exact interpretation of the Upper bound result is that there is a probability of 0.05 of having 8 or fewer correct trials out of 24 even if the probability of a correct trial is as high as 0.521, with the consequent proportion of 'detectors' being 0.282. This proportion meets the company's requirement of being below 30% – so such an outcome would give the degree of reassurance that was wanted. Note that the Upper bound is labelled as a 90% bound, since that relates to the confidence interval and the program does not know that the analyst is here concerned only with the upper bound. The probability showing in the Set the bounds window reminds us that the probability of exceeding the upper bound is 0.05.

The figure in the Amount of data required window is worrying. It tells us that if the analyst goes ahead and carries out 24 trials, there is a probability of 0.406 that the actual outcome will be less satisfactory than this. That is, on 40.6% of occasions when 24 triangles are run, even when comparing identical samples, the upper bound will show more than 28.2% 'detectors' and will not meet the company's requirement. This means that the analyst cannot run 24 triangles and confidently expect a satisfactory outcome. If the analyst assumes a less favourable outcome, supposing 12 (rather than 8) of the 24 trials to be correct, the resulting Amount of data required would probably be considered satisfactory.
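The window values just described for 8 correct out of 24 can be reproduced under two assumptions: the Upper bound is the exact one-sided binomial bound with a 0.05 probability of being exceeded, and the Discrimination index is the usual chance-corrected proportion of detectors, (p - p0)/(1 - p0) with p0 = 1/3 for the triangle test. A sketch (names are mine):

```python
from math import comb

P_CHANCE = 1 / 3  # triangle test: chance probability of a correct trial

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(n, k, tail_prob=0.05):
    # Largest p for which k or fewer correct out of n still occurs with
    # probability tail_prob: solves P(X <= k | p) = tail_prob by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def discrimination_index(p_correct, p_chance=P_CHANCE):
    # Chance-corrected proportion of assessors detecting the difference.
    return max(0.0, (p_correct - p_chance) / (1 - p_chance))

n, k = 24, 8
best = k / n                           # best estimate, ~0.333
p_u = upper_bound(n, k)                # ~0.521
d_u = discrimination_index(p_u)        # ~0.282, below the 30% target
worse = 1 - binom_cdf(k, n, P_CHANCE)  # chance of a less satisfactory outcome, ~0.406
print(round(best, 3), round(p_u, 3), round(d_u, 3), round(worse, 3))
```

The last figure corresponds to the Amount of data required window: since it is well above 0.05, 24 triangles give no confident prospect of a satisfactory outcome even when the samples are in fact identical.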
The probability of getting more than 12 right is now only 0.028 (less than 3%) if the difference is completely undetectable. So far so good – but for these results the upper bound for the proportion of 'detectors' is now 0.520 (more than 50%), and this does not meet the company's target. If the Number correct is set to 11 (rather than 12), the upper bound on the proportion of 'detectors' drops only to 0.464, and the probability of that being exceeded rises to 0.068. Clearly, something has to be changed for the analyst to achieve 95% confidence with an upper bound below 30% and a probability of obtaining such a result of 0.95 or more. Two further variables are available for manipulation, and either of them may be able to achieve the desired result. They are the Task and the Total trials, which up to now have been assumed to be fixed.

Total trials

By trial and error, the number of trials and the number correct can be adjusted to seek a satisfactory set of values. Doubling the numbers above to 48 trials with 24 correct brings the probability of the bound being exceeded down to 0.006 (probably smaller than necessary) but brings the upper bound down only to 0.440 – still not low enough. But by decreasing the hypothetical Number correct we can lower it further without the probability of getting a worse bound rising to unacceptable levels. For instance, with 21 correct out of 48, the Amount of data required window shows 0.048, still meeting the objective of a 95% chance of achieving a result that is no worse, while the Upper bound drops to 0.349. These figures still do not quite meet the analyst's objectives, though they are much better than those considered so far. By juggling in this way with the Total trials and Number correct until the upper bound is below 0.300 and the probability of it being exceeded, if that number of trials are run, is below 0.050, we can quickly reach a satisfactory result.
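This juggling can be automated. The sketch below (same assumptions as before: exact one-sided binomial upper bound, chance-corrected discrimination index; names are mine) picks, for a given Total trials, the smallest assumed Number correct whose chance of being exceeded under pure guessing is at most 0.05, and then checks whether the resulting upper bound on the proportion of detectors is below 0.30:

```python
from math import comb

P_CHANCE = 1 / 3  # triangle test

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(n, k, tail_prob=0.05):
    # Solve P(X <= k | p) = tail_prob for p by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def plan_ok(n, p_chance=P_CHANCE, d_max=0.30, alpha=0.05):
    # Smallest assumed Number correct k whose chance of being exceeded
    # under pure guessing is at most alpha, then check whether the upper
    # bound on the discrimination index meets d_max.
    for k in range(n + 1):
        if 1 - binom_cdf(k, n, p_chance) <= alpha:
            d_u = (upper_bound(n, k, alpha) - p_chance) / (1 - p_chance)
            return d_u <= d_max, k, d_u
    return False, None, None

for n in (24, 48, 75):
    ok, k, d_u = plan_ok(n)
    print(n, ok, k, round(d_u, 3))
```

Running this reproduces the trial-and-error findings: 24 and 48 triangles fail the 30% target, while 75 triangles succeed.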
With 75 triangles, there is only a 3.5% chance of the upper-bound estimate of the percentage of detectors turning out to be more than 29.2% if the difference is completely undetectable. However, the sensory analyst might be unhappy to discover that so many triangles are required, especially if each triangle is obtained from a different assessor.

Task

Instead of increasing the number of trials, the analyst might use a task with a lower probability of being correct by chance alone. One that is sometimes used in sensory analysis is the Two-out-of-five test, where instead of three samples the assessor is given five samples, two of one kind and three of another, and must sort them into their two groups. The probability of sorting the samples into the correct groups of two and three by chance alone (that is, if no sensory difference is perceptible) is only 0.1. This increases the power of the test.
With this task, the 0.047 in the bottom panel shows that there is more than a 95% chance of obtaining an upper-bound discrimination index of under 30% (0.278) using only 27 trials, a big improvement on 75, and it is possible to do even better using a task such as 3 out of 7, where the probability of chance success drops further (to 0.029).

Unfortunately, even the 2 out of 5 test is rarely used, because analysts fear that it requires the assessors to deal with more samples than they can manage without confusion. Thus it is difficult to get a degree of reassurance that will be considered satisfactory: either too many assessors are needed or the task is too demanding for the assessors. The apparent solution of testing each assessor repeatedly in order to increase the number of trials, and treating the data as independent trials, mixes real differences between people with random variation attributable to chance. This invalidates some of the assumptions that are usually made in this type of analysis.

If even a low probability of noticing a difference is important (if the difference will be regarded as evidence of contamination, say), the problem is much worse. For instance, to bring the upper bound below 5% 'detectors', with a 5% probability of the bound being exceeded and a probability below 5% of obtaining a worse result in practice, requires more than 2100 triangles. It is unlikely that a sensory analyst will consider this a reasonable task, so some compromise in the requirements will have to be accepted. DiffTest provides a tool to evaluate trade-offs among all the variables: the task, the amount of data, the degree of confidence desired, the maximum allowable proportion of detectors, and the probability that, if the envisaged amount of data is collected, the resulting upper bound will be no worse than that.
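The effect of the task on the amount of data needed can be explored the same way. The sketch below (assumptions as before; names are mine) computes each sorting task's chance probability, 1/C(total, m) for sorting m samples of one kind out of `total` into the correct group, and repeats the feasibility check at 27 trials for the two-out-of-five and triangle tasks:

```python
from math import comb

def chance_p(m, total):
    # Chance probability of correctly sorting m samples of one kind out
    # of `total` into their group: 1 / C(total, m).
    return 1 / comb(total, m)

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(n, k, tail_prob=0.05):
    # Solve P(X <= k | p) = tail_prob for p by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def plan_ok(n, p_chance, d_max=0.30, alpha=0.05):
    # Smallest assumed Number correct whose chance tail is <= alpha, then
    # check the upper bound on the discrimination index against d_max.
    for k in range(n + 1):
        if 1 - binom_cdf(k, n, p_chance) <= alpha:
            d_u = (upper_bound(n, k, alpha) - p_chance) / (1 - p_chance)
            return d_u <= d_max, d_u

print(chance_p(1, 3), chance_p(2, 5), chance_p(3, 7))  # 1/3, 0.1, ~0.029
print(plan_ok(27, chance_p(2, 5)))  # two-out-of-five: 27 trials suffice
print(plan_ok(27, chance_p(1, 3)))  # triangle: 27 trials do not
```

The lower the chance probability, the fewer trials are needed for the same reassurance, which is the trade-off against task difficulty discussed above.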
