
Confidence bounds

Suppose we are going to carry out a difference test with 12 assessors and count how many correct answers are given. If correct answers are likely, we expect to get a large number of correct answers – maybe 10, 11 or 12. If correct answers are unlikely, we expect few correct answers – perhaps 0, 1 or 2.

[Animated graph: the probability of each number of correct answers, 0 to 12, for varying p(C)]

For any probability of individual answers being correct – written p(C) for short – we can calculate the probability of 0, 1, 2, 3 ... up to 12 of the 12 answers being correct.  These probabilities are shown in this dynamic graph.
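
For readers who want to reproduce the graph's numbers, here is a minimal sketch in Python (not part of the original page; the scipy library and the example value of p(C) are assumptions of the sketch). It lists the probability of every possible count of correct answers out of 12 for one chosen p(C):

    # Sketch: binomial probabilities of 0..12 correct answers out of 12 trials,
    # for an assumed example value of p(C). Requires scipy.
    from scipy.stats import binom

    n = 12
    p_c = 0.5          # example value of p(C); change it to see the shape move

    for k in range(n + 1):
        print(f"P({k} correct) = {binom.pmf(k, n, p_c):.4f}")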


Now suppose you have carried out 12 trials and found that 7 of the answers were correct. For some probabilities of an individual giving a correct answer, this result is quite a likely one. For others, it is a very unlikely outcome. For instance, if the probability of giving a correct answer is 1/100 for every assessor, the probability of getting any correct answers at all from 12 trials is small – and the probability of getting 7 or more correct answers is tiny indeed. But if the probability of a correct answer on one trial is 1/2, 7 correct answers out of 12 is a fairly likely outcome. Somewhere between 1/100 and 1/2 there will be some probability, p(C), that makes the probability of 7 or more correct answers exactly 0.025. This value of p(C) is the Lower 95% bound for the estimated probability of giving correct answers.

 

[Lower bound graph: the distribution of correct answers when p(C) = 0.277, with the probability of 7 or more shaded red]

If we have actually obtained 7 correct trials out of 12, the Lower 95% bound for the estimated probability of a correct answer is the value of p(C) that makes the probability of 7 or more correct out of 12 exactly 0.025.

The probability of 7 or more, written p(7+), is represented by the area coloured red in the graph.  It is 0.025 when the probability of being correct is 0.277 on every trial.

So, if we have actually obtained 7 correct answers out of 12, we have a result that is very unlikely if the probability of correct answers is 0.277 (and the result is even more surprising if it is anything less than that).

Thus, observing 7 correct answers inclines us to disbelieve that the probability of being correct is very low – lower than 0.277, at any rate.
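
As a rough illustration of how that bound can be found numerically (an assumed approach relying on scipy, not the page's own code), we can search for the p(C) at which the probability of 7 or more correct answers equals 0.025:

    # Sketch: find the Lower 95% bound for 7 correct out of 12 by root-finding.
    # Assumes scipy is available; the 0.025 criterion matches the text above.
    from scipy.stats import binom
    from scipy.optimize import brentq

    n, x = 12, 7

    def prob_x_or_more(p):
        # P(X >= 7) when each of the 12 trials is correct with probability p
        return binom.sf(x - 1, n, p)

    lower = brentq(lambda p: prob_x_or_more(p) - 0.025, 1e-9, 1 - 1e-9)
    print(round(lower, 3))   # about 0.277, as quoted above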


 

But seven out of twelve is also an unlikely number of correct answers if their probability is very high.  For instance, if the probability of a correct answer is 0.9, the probability of getting 7 or fewer correct answers is only about four in a thousand (0.004).

A similar argument to the one for the lower bound lets us put an Upper bound on the estimated probability of correct answers. There will be some probability of being correct that is high enough to make 7 or fewer correct answers so unlikely that we disbelieve any value higher than it. That is, there will be some value between 0.5 and 0.9 for the probability of giving correct answers that makes the chance of getting 7 or fewer exactly 0.025.

[Upper bound graph: the distribution of correct answers when p(C) = 0.848, with the probability of 7 or fewer shaded red]

The probability of 7 or fewer, written p(7-), is represented by the area coloured red in the graph.  It is 0.025 when the probability of being correct is 0.848 on every trial.

So, if we have actually obtained 7 correct answers out of 12, we have a result that is very unlikely if the probability of correct answers is 0.848 (and the result is even less likely if it is any higher).
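
The same numerical search, pointed at the other tail, gives the upper bound (again an assumed sketch using scipy, not the page's own code):

    # Sketch: find the Upper 95% bound for 7 correct out of 12, and check the
    # 0.004 figure quoted for p(C) = 0.9. Assumes scipy is available.
    from scipy.stats import binom
    from scipy.optimize import brentq

    n, x = 12, 7

    def prob_x_or_fewer(p):
        return binom.cdf(x, n, p)            # P(X <= 7)

    upper = brentq(lambda p: prob_x_or_fewer(p) - 0.025, 1e-9, 1 - 1e-9)
    print(round(upper, 3))                   # about 0.848
    print(round(binom.cdf(7, 12, 0.9), 3))   # about 0.004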

Confidence intervals

We now have a range of possible estimates of the probability of a correct answer that are not so extreme as to be unbelievable. This range runs from 0.277 to 0.848.

We reject values lower than 0.277 or higher than 0.848 because all values outside that range require us to believe that something very unlikely has happened.  But unlikely things can happen, so we may have been wrong to reject them.  The two criterion probabilities of 0.025 add to 0.05, or 5%, so if we always reject any result outside these bounds we will do so correctly 95% of the time but will be wrong on the other 5% of occasions. For this reason, the range of believable values is usually referred to as the 95% Confidence Interval. (Unfortunately, there is no way to know which of our decisions were correct and which were mistaken!)

If we want to be more confident that the range we have calculated includes the true probability of correct answers being given, we can set a more stringent criterion, such as 0.005.  Doing so gives us a 99% confidence interval, which is wider than the 95% confidence interval, but by including additional possible values gives us more confidence that the true probability lies within it.

Conversely, we could relax the criterion, say to 0.05, giving a 90% confidence interval, which will cover a narrower range of probabilities for correct answers but with less assurance that the true probability is actually in the interval.
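
To see this numerically, the same bounds can be written in a standard closed form using Beta quantiles (the exact, or Clopper-Pearson, interval), which is equivalent to the root-finding sketched above. This is again an assumed illustration relying on scipy, not part of the original page:

    # Sketch: exact (Clopper-Pearson) bounds for x correct out of n, for three
    # different criterion probabilities. Assumes scipy is available.
    from scipy.stats import beta

    def exact_interval(x, n, criterion):
        lower = beta.ppf(criterion, x, n - x + 1) if x > 0 else 0.0
        upper = beta.ppf(1 - criterion, x + 1, n - x) if x < n else 1.0
        return lower, upper

    for criterion, label in [(0.05, "90%"), (0.025, "95%"), (0.005, "99%")]:
        lo, hi = exact_interval(7, 12, criterion)
        print(f"{label} interval: {lo:.3f} to {hi:.3f}")
    # The 99% interval is the widest and the 90% interval the narrowest.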

A confidence interval has some similarities to the sort of significance test that might routinely be used to decide whether the number of correct answers is great enough to conclude that something other than guessing is required to account for them. The two have features in common, but they are not exactly equivalent.

A significance test begins from the Null Hypothesis that answers are given at random, so the probability of a correct answer is 1/3 (for a triangle test). It calculates the chance of getting at least as many correct answers as were observed if that is so.  Confidence bounds are calculated by discovering what probability of correct answers would make the observed result just significant.  Their relationship is shown by the following graphs.

[Graph of upper and lower limits]

A significance test invokes one probability distribution. Its shape depends on the Null Hypothesis.  For a triangle test, the Null Hypothesis is that the three samples are equally likely to be selected and consequently the probability of selecting the right one is 1/3 or 0.333.

Observed frequencies that are far enough from the most likely frequency may be unlikely enough to be judged significant.
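
As a concrete illustration of this (an assumed sketch using scipy, not the page's own code), the calculation below gives the chance of 7 or more correct answers under the triangle-test Null Hypothesis, and then finds the smallest count that would be judged significant at the 5% level:

    # Sketch: significance test under the Null Hypothesis p(C) = 1/3 for a
    # triangle test with 12 assessors. Assumes scipy is available.
    from scipy.stats import binom

    n, p_null = 12, 1/3

    p_value = binom.sf(6, n, p_null)   # P(7 or more correct) under the null
    print(round(p_value, 3))           # about 0.066, not significant at 5%

    # Smallest number of correct answers that would be significant at 5%
    for x in range(n + 1):
        if binom.sf(x - 1, n, p_null) < 0.05:   # P(X >= x) under the null
            print(x)                            # prints 8
            break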

[Graph of upper limit]

A confidence interval usually invokes two probability distributions, which for difference test data are rarely of the same shape (since the shape depends on where the peak is).

These distributions are selected – by choosing p(C) appropriately – to make the observed frequency significant (but only just so).
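
Finally, a small check (again an assumed sketch relying on scipy) that the two distributions chosen for the bounds really do make 7 correct answers out of 12 only just significant:

    # Sketch: the tail probabilities at the two confidence bounds quoted above.
    # Assumes scipy is available.
    from scipy.stats import binom

    print(round(binom.sf(6, 12, 0.277), 3))   # P(7 or more) at the lower bound
    print(round(binom.cdf(7, 12, 0.848), 3))  # P(7 or fewer) at the upper bound
    # Both are approximately 0.025, the criterion probability used throughout.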