Confidence bounds

Suppose we are going to carry out a difference test with 12 assessors and count how many correct answers are given. If correct answers are likely, we expect to get a large number of correct answers – maybe 10, 11 or 12. If correct answers are unlikely, we expect few correct answers – perhaps 0, 1 or 2.
Now suppose you have carried out 12 trials and found that 7 of the answers were correct. For some probabilities of an individual making a correct answer, this result is quite a likely one. For others, it is a very unlikely outcome. For instance, if the probability of giving a correct answer is 1/100 for every assessor, the probability of getting any correct answers from 12 trials is small – and the probability of getting 7 or more correct answers is very tiny indeed. But if the probability of a correct answer on one trial is 1/2, 7 correct answers out of 12 is a fairly likely outcome. Somewhere between 1/100 and 1/2 there will be some probability, p(C), that is just small enough to make the probability of 7 or more correct answers exactly 0.025. This probability is the Lower 95% bound for the estimated probability of giving correct answers.
So, if we have actually obtained 7 correct answers out of 12, we have a result that is very unlikely if the probability of correct answers is 0.277 (and the result is even more surprising if it is anything less than that). Thus, observing 7 correct answers makes us inclined to disbelieve that the probability of being correct is very low – not lower than 0.277, anyway.
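The lower bound described above can be found numerically. A minimal sketch in Python (the function names and the bisection search are illustrative, not from the text): it computes the binomial tail probability of 7 or more correct answers out of 12 and then searches for the p(C) that makes that tail exactly 0.025.

```python
from math import comb

def prob_k_or_more(k, n, p):
    """P(X >= k) when X ~ Binomial(n, p): chance of k or more correct answers."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def lower_bound(k, n, tail=0.025):
    """Bisect for the p that makes P(X >= k) equal the tail criterion.

    P(X >= k) rises steadily as p rises, so bisection is enough.
    """
    lo, hi = 0.0, 1.0
    for _ in range(60):  # 60 halvings pin p down far below rounding error
        mid = (lo + hi) / 2
        if prob_k_or_more(k, n, mid) < tail:
            lo = mid   # tail still too small: the boundary p must be higher
        else:
            hi = mid
    return (lo + hi) / 2

print(round(lower_bound(7, 12), 3))  # 0.277, the lower 95% bound in the text
```

With p = 1/2 the same tail function gives about 0.39, confirming that 7 out of 12 is a fairly likely outcome at that probability.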
But seven out of twelve is also an unlikely number of correct answers if their probability is very high. For instance, if the probability of a correct answer is 0.9, the probability of getting only 7 or fewer correct answers is only about four in a thousand (0.004). A similar argument to the one for the lower bound lets us put an Upper bound on the estimated probability of correct answers. There will be some probability of being correct which is sufficiently high to make 7 or fewer correct answers unlikely enough to make us disbelieve that the true probability is any higher than that. That is, there will be some value between 0.5 and 0.9 for the probability of giving correct answers that makes the chance of getting 7 or fewer exactly 0.025.
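The same search works for the upper bound: now we want the probability that makes 7 or fewer correct answers have probability exactly 0.025. A sketch (again with illustrative names), which also checks the four-in-a-thousand figure quoted above:

```python
from math import comb

def prob_k_or_fewer(k, n, p):
    """P(X <= k) when X ~ Binomial(n, p): chance of k or fewer correct answers."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

# With p = 0.9, getting only 7 or fewer correct out of 12 is very unlikely:
print(round(prob_k_or_fewer(7, 12, 0.9), 3))  # about 0.004

def upper_bound(k, n, tail=0.025):
    """Bisect for the p that makes P(X <= k) equal the tail criterion.

    P(X <= k) falls steadily as p rises, so bisection works here too.
    """
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if prob_k_or_fewer(k, n, mid) > tail:
            lo = mid   # k-or-fewer still too likely: the boundary p must be higher
        else:
            hi = mid
    return (lo + hi) / 2

print(round(upper_bound(7, 12), 3))  # 0.848, the upper 95% bound
```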
So, if we have actually obtained 7 correct answers out of 12, we have a result that is very unlikely if the probability of correct answers is 0.848 (and the result is even less likely if it is any higher).

Confidence intervals

We now have a range of possible estimates of the probability of a correct answer that are not so extreme as to be unbelievable. This range runs from 0.277 to 0.848. We reject values lower than 0.277 or higher than 0.848 because all values outside that range require us to believe that something very unlikely has happened. But unlikely things can happen, so we may have been wrong to reject them. The two criterion probabilities of 0.025 add to 0.05, or 5%, so if we always reject any result outside these bounds we will do so correctly 95% of the time but will be wrong on the other 5% of occasions. For this reason, the range of believable values is usually referred to as the 95% Confidence Interval. (Unfortunately, there is no way to know which of our decisions were correct and which were mistaken!)

If we want to be more confident that the range we have calculated includes the true probability of correct answers being given, we can set a more stringent criterion, such as 0.005. Doing so gives us a 99% confidence interval, which is wider than the 95% confidence interval, but by including additional possible values gives us more confidence that the correct answer lies within it. Conversely, we could relax the criterion, say to 0.05, giving a 90% confidence interval, which covers a narrower range of probabilities for correct answers but with less assurance that the true answer is actually in the interval.

A confidence interval has some similarities to the sort of significance test that might routinely be used to decide whether the number of correct answers is great enough to conclude that something other than guessing is required to account for them. They have features in common, but they are not exactly equivalent.
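Putting the two bounds together gives the interval, and repeating the search with different tail criteria shows the widening and narrowing described above. A self-contained sketch (the function names are illustrative):

```python
from math import comb

def tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def interval(k, n, confidence):
    """Exact two-sided confidence interval for p, found by bisection."""
    tail = (1 - confidence) / 2          # e.g. 0.025 in each tail for 95%

    def solve(p_is_too_small):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if p_is_too_small(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = solve(lambda p: tail_ge(k, n, p) < tail)
    # P(X <= k) = 1 - P(X >= k + 1); the upper bound makes it equal the tail
    upper = solve(lambda p: 1 - tail_ge(k + 1, n, p) > tail)
    return round(lower, 3), round(upper, 3)

for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} interval for 7/12: {interval(7, 12, conf)}")
```

The 95% line reproduces (0.277, 0.848); the 99% interval encloses it and the 90% interval sits inside it.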
A significance test begins from the Null Hypothesis that answers are given at random, so the probability of a correct answer is 1/3 (for a triangle test). It calculates the chance of getting so many correct answers if that is so. Confidence bounds are calculated by discovering what probability of correct answers would make the observed result just significant. Their relationship is shown by the following graphs.
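For comparison, that significance test is a single tail calculation under the null hypothesis. A sketch for the triangle-test example (illustrative code, not from the text):

```python
from math import comb

def p_value(k, n, p_null=1/3):
    """One-tailed p-value: chance of k or more correct answers under the null."""
    return sum(comb(n, i) * p_null**i * (1 - p_null)**(n - i)
               for i in range(k, n + 1))

# 7 correct out of 12 in a triangle test (null hypothesis: guessing, p = 1/3)
print(round(p_value(7, 12), 3))  # about 0.066: not significant at the 5% level
```

Note how this agrees with the bounds: the chance level 1/3 lies inside the 95% interval from 0.277 to 0.848, and correspondingly the observed count does not reach significance at the 5% level.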
