Confidence Intervals (μ_y)

Motivation

Let y denote a measurement of interest from some population with mean μ_y and standard deviation σ_y. Consider the sampling distribution of the sample mean ȳ for n large. For the sake of this motivation, let us consider the interval [L, U] constructed about μ_ȳ (the mean of that sampling distribution) thus:

L: μ_ȳ - 2σ_ȳ
U: μ_ȳ + 2σ_ȳ

Since the sampling distribution of ȳ is approximately normal for large n (by the central limit theorem), with mean μ_ȳ = μ_y and standard deviation σ_ȳ = σ_y/sqrt(n), L and U define boundaries between which about 95.45% of the ȳ's for all possible samples of size n will fall. You may think of these boundaries as dividing the ȳ's into those that are relatively close to μ_ȳ (and so to μ_y) and those that are relatively distant from μ_ȳ (and so from μ_y).
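The 95.45% figure can be checked directly against the standard normal distribution. The following is a minimal sketch using Python's standard library; the variable names are illustrative and not part of these notes.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist()

# Probability that a normal variable falls within 2 standard
# deviations of its mean: P(-2 < Z < 2).
within_two_sd = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"{100 * within_two_sd:.2f}%")  # prints 95.45%
```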

For a particular sample we would like to know if the sample mean (ȳ) is one of those that is close to μ_y. Unfortunately we do not know μ_y and so we cannot tell whether the sample mean is close to μ_y or not. However we can address this issue indirectly by making use of our understanding of sampling distributions and proceeding thus:

Note: To keep the argument simple we will assume that the population standard deviation σ_y is known:

  1. Construct an interval [L', U'] as we did above but about ȳ:
    L': ȳ - 2σ_ȳ
    U': ȳ + 2σ_ȳ
  2. Claim that μ_y is in this interval. Notice that this claim will be true if our ȳ happens to be one of those that is close (as defined above) to μ_y.
  3. State that we are 95.45% confident in our claim. We can safely do this since 95.45% of all possible samples will result in a ȳ which will make our claim true.
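
The claim in step 3 can be checked by simulation: draw many samples, build the interval about each sample mean, and count how often the interval contains the population mean. The sketch below assumes a normal population with made-up parameters (mean 9.9, standard deviation 0.5) and treats the population standard deviation as known, as in the note above; the function name coverage is hypothetical.

```python
import random
from math import sqrt

def coverage(mu, sigma, n, trials, seed=0):
    """Fraction of samples whose interval ybar +/- 2*sigma_ybar contains mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    sigma_ybar = sigma / sqrt(n)       # sigma is assumed known here
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = sum(sample) / n
        lower = ybar - 2 * sigma_ybar
        upper = ybar + 2 * sigma_ybar
        if lower <= mu <= upper:
            hits += 1
    return hits / trials

# Assumed population parameters, chosen only for illustration.
fraction = coverage(mu=9.9, sigma=0.5, n=100, trials=2000)
print(f"{100 * fraction:.1f}% of the intervals contained the population mean")
```

With enough trials the printed percentage should settle near 95.45%, although any single run will differ from it slightly.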

This strategy captures the basic ideas of confidence interval theory. The only difference is that we will construct intervals that allow us to claim whatever level of confidence we desire. Also, since the population standard deviation σ_y is usually not known, we will estimate it with our sample standard deviation s_y.

 

Definition

An α% confidence interval for the population parameter μ_y is an interval constructed from a sample mean (ȳ) within which you expect μ_y to be with α% confidence. There are two cases that need to be considered:

  1. n large:
    L': ȳ - z_α(σ_ȳ)
    U': ȳ + z_α(σ_ȳ)
    where σ_ȳ is estimated by s_y/sqrt(n) and z_α is the z value from the standard normal table such that the area between -z_α and z_α is α%.
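  2. n small:
    L': ȳ - t_{n-1; (100-α)/2}(σ_ȳ)
    U': ȳ + t_{n-1; (100-α)/2}(σ_ȳ)
    where σ_ȳ is estimated by s_y/sqrt(n-1) and t_{n-1; (100-α)/2} is the t value from the standard "t" table indexed by n-1 degrees of freedom such that the area to the right of t is (100-α)/2%. Dividing by sqrt(n-1) is appropriate when s_y is computed with divisor n; if s_y is computed with the more common divisor n-1, the equivalent expression is s_y/sqrt(n).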
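
The two cases above can be sketched in Python as follows. The function names are hypothetical, and the input values in the usage lines are made up for illustration. The z value is computed with the standard library's NormalDist; the t value is passed in from a t table because the standard library has no t quantile function.

```python
from math import sqrt
from statistics import NormalDist

def z_value(confidence_pct):
    """z such that the area between -z and z is confidence_pct percent."""
    tail = (100 - confidence_pct) / 200  # area in each tail, as a proportion
    return NormalDist().inv_cdf(1 - tail)

def interval_large_n(ybar, s, n, confidence_pct):
    """Large-sample interval: ybar +/- z * s / sqrt(n)."""
    margin = z_value(confidence_pct) * s / sqrt(n)
    return ybar - margin, ybar + margin

def interval_small_n(ybar, s, n, t_table_value):
    """Small-sample interval as defined above: ybar +/- t * s / sqrt(n - 1).

    t_table_value is read from a t table with n - 1 degrees of freedom
    and right-tail area (100 - confidence)/2 percent.
    """
    margin = t_table_value * s / sqrt(n - 1)
    return ybar - margin, ybar + margin

# Illustrative inputs only: ybar = 50, s = 4.
low, high = interval_large_n(50.0, 4.0, 64, 95)
print(f"[{low:.2f}, {high:.2f}]")   # prints [49.02, 50.98]

# 95% interval with n = 17; 2.12 is the table value for 16 df, 2.5% tail.
low_t, high_t = interval_small_n(50.0, 4.0, 17, 2.12)
print(f"[{low_t:.2f}, {high_t:.2f}]")  # prints [47.88, 52.12]
```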

 

Terminology

  1. σ_ȳ, estimated by s_y/sqrt(n) or s_y/sqrt(n-1), is often referred to as the "standard error" (or simply the "error") in your estimate of μ_y.
  2. z_95(σ_ȳ) or t_{n-1; 2.5}(σ_ȳ), the half-width of a 95% confidence interval, is often referred to as the "margin of error".
  3. The term "point estimate" refers to the value of the statistic used to estimate the corresponding parameter.

 

Problems:

Consider the population of disk drives manufactured from a particular production run. You are interested in estimating the mean seek time of this population and select a sample of n=100 drives for examination. You discover that the sample mean is 9.9ms with a sample standard deviation of 0.5ms.

  1. Construct and interpret a 90% confidence interval for the population mean (μ_y).
  2. Construct and interpret a 99% confidence interval for the population mean (μ_y).
  3. Let the sample size be 26 instead of 100. Construct and interpret a 90% confidence interval for the population mean (μ_y).
  4. Let the sample size be 17 instead of 100. Construct and interpret a 99% confidence interval for the population mean (μ_y).

Solutions:

  1. Since s_y = 0.5 and n is large, σ_ȳ = s_y/sqrt(n) = 0.5/10 = 0.05. Also, z_α = z_90 = 1.65, so the 90% confidence interval for μ_y is [9.9 - 1.65(0.05), 9.9 + 1.65(0.05)] = [9.8175, 9.9825]. Hence, you are 90% confident that the mean seek time for the population is between 9.8175ms and 9.9825ms.
  2. In this case σ_ȳ = 0.05 as above but z_α = z_99 = 2.58, so the 99% confidence interval for μ_y is [9.9 - 2.58(0.05), 9.9 + 2.58(0.05)] = [9.771, 10.029]. Hence, you are 99% confident that the mean seek time for the population is between 9.771ms and 10.029ms.
  3. Since n is small, σ_ȳ = s_y/sqrt(n-1) = 0.5/5 = 0.1. Also, t_{n-1; (100-α)/2} = t_{25; 5} = 1.71, so the 90% confidence interval for μ_y is [9.9 - 1.71(0.1), 9.9 + 1.71(0.1)] = [9.729, 10.071]. Hence, you are 90% confident that the mean seek time for the population is between 9.729ms and 10.071ms.
  4. Since n is small, σ_ȳ = s_y/sqrt(n-1) = 0.5/4 = 0.125. Also, t_{n-1; (100-α)/2} = t_{16; 0.5} = 2.92, so the 99% confidence interval for μ_y is [9.9 - 2.92(0.125), 9.9 + 2.92(0.125)] = [9.535, 10.265]. Hence, you are 99% confident that the mean seek time for the population is between 9.535ms and 10.265ms.
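
The four solutions can be reproduced with a short script. This is only a sketch: the helper names are hypothetical, the t values (1.71 and 2.92) are the table values quoted in solutions 3 and 4, and the z values are computed rather than read from a table, so the endpoints for solutions 1 and 2 differ slightly from the table-rounded ones above.

```python
from math import sqrt
from statistics import NormalDist

YBAR, S = 9.9, 0.5  # sample mean and sample standard deviation, in ms

def z_value(confidence_pct):
    """z such that the area between -z and z is confidence_pct percent."""
    tail = (100 - confidence_pct) / 200
    return NormalDist().inv_cdf(1 - tail)

def interval(critical_value, std_error):
    """Interval YBAR +/- critical_value * std_error."""
    margin = critical_value * std_error
    return YBAR - margin, YBAR + margin

solutions = {
    "1. 90%, n=100": interval(z_value(90), S / sqrt(100)),
    "2. 99%, n=100": interval(z_value(99), S / sqrt(100)),
    "3. 90%, n=26": interval(1.71, S / sqrt(26 - 1)),  # t table, 25 df
    "4. 99%, n=17": interval(2.92, S / sqrt(17 - 1)),  # t table, 16 df
}

for label, (lower, upper) in solutions.items():
    print(f"{label}: [{lower:.3f}, {upper:.3f}] ms")
```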

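Note: Notice the tradeoff between confidence and interval length: for a fixed sample size, a higher level of confidence requires a longer interval (compare solutions 1 and 2). Also, observe that for a particular interval length you must increase the sample size to obtain a higher level of confidence.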
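
The second observation in the note can be made concrete by solving the large-sample margin z(s/sqrt(n)) for n. The sketch below assumes the large-sample formula applies and uses a sample standard deviation of 0.5ms with an illustrative target margin of 0.1ms; required_n is a hypothetical helper, not something defined in these notes.

```python
from math import ceil
from statistics import NormalDist

def z_value(confidence_pct):
    """z such that the area between -z and z is confidence_pct percent."""
    tail = (100 - confidence_pct) / 200
    return NormalDist().inv_cdf(1 - tail)

def required_n(s, margin, confidence_pct):
    """Smallest n for which z * s / sqrt(n) is at most the given margin."""
    z = z_value(confidence_pct)
    return ceil((z * s / margin) ** 2)

s, margin = 0.5, 0.1  # ms; the margin is chosen only for illustration
for pct in (90, 95, 99):
    print(f"{pct}% confidence: n >= {required_n(s, margin, pct)}")
```

Holding the margin fixed, the required sample size grows as the confidence level rises.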

 

Summary

We initially set out to answer the question:

"How good is ȳ as a point estimate of μ_y?"

By "good" we mean how close ȳ is to μ_y. By considering the sampling distribution of ȳ we have proposed the following:

  1. σ_ȳ as the error in our estimate of μ_y.
  2. An α% confidence interval as an interval estimate for μ_y.

Notice that our confidence interval provides a plus/minus amount (the margin of error, also known as the level of accuracy) which expresses how close we think ȳ is to μ_y at a confidence level of α%.