Introduction/Review II
Following is the completion of the review started last class. Remember that this is material that you should have covered in a first class in data analysis or statistics.
Readings
Inferences about μ:
Last week I indicated that the objective of Data Analysis is to make inferences about a population based on information gleaned from a sample. These inferential problems may be classified into two broad categories, each of which has several dimensions:
Note also that these problems are usually formulated in terms of one or more population parameters. We will initially address the simplest of these problems, that is, problems that have to do with μ only.
Estimation:
This section addresses the Estimation problem. There are three dimensions to this problem:
Sample Selection:
For valid statistical inference we must select items for our sample in such a way that we do not introduce selection bias. That is, the items must be selected in a random manner. Such samples are referred to as random samples.
Definition: A random sample is a sample selected in such a way that every item in the population has the same chance of being selected for inclusion in the sample.
There are several ways in which a random sample may be selected (see page 21 of Ott for an overview). In any case, all samples discussed in this class should be taken to be random samples. We use random samples to ensure that the sample is representative of the population and so our sample statistics will be good estimators of the corresponding population parameters.
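For instance, a simple random sample can be drawn with a few lines of code. The following Python sketch is illustrative only (the population and the sample size are made-up values, not from the notes); it samples without replacement so that every item has the same chance of being included:

    import random

    population = list(range(1, 1001))     # e.g., serial numbers of 1000 manufactured items (illustrative)
    n = 25                                # desired sample size
    sample = random.sample(population, n) # each item has the same chance of selection (without replacement)
    print(sample)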
Confidence Intervals:
Essentially, the question of accuracy is this: how can we compute a quantity that allows us to state how close a particular statistic is to the corresponding unknown parameter? To answer this question we first need to understand the behavior of sample statistics; that is, we need to discuss Sampling Distributions.
Sampling Distributions (ybar)
Let n denote the sample size and let y denote a measurement of interest from some population with mean μ_y and standard deviation σ_y.
Theorem (σ_y known)
Let n be large (i.e., n >= 30). Consider all possible samples of size n that may be selected from the population and compute the sample mean (i.e., ybar) in each case. The distribution of the ybar's will be Normal with mean μ_ybar = μ_y and standard deviation σ_ybar = σ_y/sqrt(n).
Note: If n is small, then as long as the distribution of y is mound shaped, we can say that the ybar's are approximately Normal.
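The Theorem can be checked empirically. The sketch below is illustrative only (the exponential population, its parameters, and the number of replications are my choices, not part of the notes): it draws many samples of size n, computes ybar for each, and compares the mean and standard deviation of the ybar's with μ_y and σ_y/sqrt(n).

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative skewed population: exponential with mu_y = sigma_y = 2 (assumed values)
    mu_y, sigma_y = 2.0, 2.0
    n = 30           # "large" sample size per the Theorem
    reps = 100_000   # number of samples of size n

    ybars = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)

    print(ybars.mean())   # close to mu_y = 2
    print(ybars.std())    # close to sigma_y/sqrt(n) = 2/sqrt(30), about 0.365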
Problems
Solution: From the Theorem, μ_ybar = μ_y = 10 ms and σ_ybar = σ_y/sqrt(n) = 0.1/10 = 0.01 ms. Hence z = (9.98 - 10)/0.01 = -2 and so the desired proportion is 2.28%.
Solution: From the Theorem, μ_ybar = μ_y = 3 and σ_ybar = σ_y/sqrt(n) = 1/10 = 0.1. Hence z = (3.4 - 3)/0.1 = 4 and so the desired proportion is 0.003167%.
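The z calculations in these solutions can be reproduced with the standard Normal CDF. Since the problem statements are not shown here, the tail directions below (below 9.98 ms, above 3.4) are inferred from the solutions; the scipy calls are my choice.

    from scipy.stats import norm

    # First solution: sigma_ybar = 0.1/sqrt(100) = 0.01 ms
    z1 = (9.98 - 10) / 0.01
    print(z1, norm.cdf(z1))   # -2.0, ~0.0228 -> 2.28% of ybar's fall below 9.98 ms

    # Second solution: sigma_ybar = 1/sqrt(100) = 0.1
    z2 = (3.4 - 3) / 0.1
    print(z2, norm.sf(z2))    # 4.0, ~3.167e-05 -> 0.003167% of ybar's fall above 3.4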
Theorem (σ_y unknown)
Since σ_y is typically not known, we will use the following variation of the Theorem above. Notice that for this Theorem we assume that the sample standard deviation s_y has been determined from a sample selected from the population. We will consider both the case where n is large and the case where n is small.
Motivation - Confidence Intervals (μ_y)
Let y denote a measurement of interest from some population with mean μ_y and standard deviation σ_y. Consider the sampling distribution of ybar for n large. For the sake of this motivation, let us consider an interval [L, U] constructed about μ_y (i.e., μ_ybar) thus:
L: μ_y - 2σ_ybar
U: μ_y + 2σ_ybar
Since we know that the sampling distribution of ybar is normal with mean μ_ybar = μ_y and standard deviation σ_ybar = σ_y/sqrt(n), L and U define boundaries between which 95.45% of the ybar's from all possible samples of size n will fall. You may think of these boundaries as dividing the ybar's into those that are relatively close to μ_ybar (and so μ_y) and those that are relatively distant from μ_ybar (and so μ_y).
For a particular sample we would like to know if the sample mean (ybar) is one of those that is close to μ_y. Unfortunately we do not know μ_y, and so we cannot tell whether the sample mean is close to μ_y or not. However, we can address this issue indirectly by making use of our understanding of sampling distributions and proceeding thus:
Note: To keep the argument simple we will assume that σ_y is known. Construct about ybar the interval
L': ybar - 2σ_ybar
U': ybar + 2σ_ybar
Since 95.45% of all possible ybar's lie within 2σ_ybar of μ_y, the interval [L', U'] will contain μ_y for 95.45% of all possible samples of size n.
This strategy captures the basic ideas of confidence interval theory. The only difference is that we will construct intervals that allow us to claim whatever confidence we desire. Also, since σ_y is not known, we will estimate it with our sample standard deviation s_y. However, when n is small and s_y is used to estimate σ_y, the distribution of the ybar's is no longer approximately normal (see Theorem above). Instead, the ybar's follow a Student t distribution with n - 1 df (see Ott, section 5.7).
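The claim that an interval of the form ybar ± 2σ_ybar captures μ_y for about 95.45% of all samples can also be verified by simulation. A minimal sketch, assuming arbitrary illustrative values for μ_y, σ_y, and n:

    import numpy as np

    rng = np.random.default_rng(1)

    mu_y, sigma_y, n, reps = 10.0, 0.5, 100, 100_000   # assumed values for illustration
    sigma_ybar = sigma_y / np.sqrt(n)

    ybars = rng.normal(mu_y, sigma_y, size=(reps, n)).mean(axis=1)

    # Fraction of samples whose interval [ybar - 2*sigma_ybar, ybar + 2*sigma_ybar] contains mu_y
    covered = np.abs(ybars - mu_y) <= 2 * sigma_ybar
    print(covered.mean())    # ~0.9545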
Definition: An α% confidence interval for the population parameter μ_y is an interval constructed from a sample mean (ybar) within which you expect μ_y to lie with α% confidence. There are two cases that need to be considered:
Case 1 (n large):
L': ybar - z_α(s_ybar)
U': ybar + z_α(s_ybar)
where s_ybar = s_y/sqrt(n) and z_α is the z value from the standard normal table such that the area between -z_α and z_α is α%.
Case 2 (n small):
L': ybar - t_{n-1; (100-α)/2}(s_ybar)
U': ybar + t_{n-1; (100-α)/2}(s_ybar)
where s_ybar = s_y/sqrt(n) and t_{n-1; (100-α)/2} is the t value from the standard t table in the row indexed by n - 1 degrees of freedom and in the column for which the area to the right of t is (100-α)/2%.
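Both cases translate directly into code. The sketch below is one possible implementation (the function names are mine, and the confidence level is passed as a fraction such as 0.95 rather than as the α% of the definition):

    from math import sqrt
    from scipy.stats import norm, t

    def ci_large_n(ybar, s_y, n, conf=0.95):
        """Case 1: z-based interval, ybar +/- z * s_y/sqrt(n)."""
        z = norm.ppf(0.5 + conf / 2)            # area between -z and z equals conf
        half = z * s_y / sqrt(n)
        return ybar - half, ybar + half

    def ci_small_n(ybar, s_y, n, conf=0.95):
        """Case 2: t-based interval with n - 1 degrees of freedom."""
        tval = t.ppf(0.5 + conf / 2, df=n - 1)  # right-tail area is (1 - conf)/2
        half = tval * s_y / sqrt(n)
        return ybar - half, ybar + half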
Terminology
Problems:
Consider the population of disk drives manufactured from a particular production run. You are interested in estimating the mean seek time of this population and select a sample of n=100 drives for examination. You discover that the sample mean is 9.9ms with a standard deviation of 0.5ms.
Solutions:
Note: Notice the tradeoff between confidence and interval length. Also, observe that for a particular interval length you must increase the sample size to obtain a higher level of confidence.
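For the disk drive problem above (n = 100, ybar = 9.9 ms, s_y = 0.5 ms), the tradeoff in the Note can be seen by computing the interval at several confidence levels. A short sketch (the choice of 90%, 95%, and 99% is illustrative):

    from math import sqrt
    from scipy.stats import norm

    ybar, s_y, n = 9.9, 0.5, 100    # from the disk drive problem

    for conf in (0.90, 0.95, 0.99):
        z = norm.ppf(0.5 + conf / 2)
        half = z * s_y / sqrt(n)
        print(f"{conf:.0%}: [{ybar - half:.3f}, {ybar + half:.3f}] ms, length {2 * half:.3f} ms")

Higher confidence gives a longer interval for the same n, which is exactly the tradeoff noted above.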
Sample Size Determination
Motivation
Consider the situation where you would like to estimate a population parameter to some "level of accuracy" for a particular "level of confidence". You know that a random sample is necessary but you do not know the minimum sample size necessary to achieve the desired accuracy and confidence.
This is the sample size determination problem. We may utilize confidence interval theory to determine the sample size. We do so by treating the "level of accuracy" as the plus/minus amount in our confidence interval expression and solving for n.
Definition (μ_y)
Let n denote the sample size, D the level of accuracy, and α the level of confidence. An α% confidence interval for the population parameter μ_y may be expressed thus:
ybar ± D
where:
D = z_α(σ_ybar) = z_α(σ_y/sqrt(n))
hence, squaring and rearranging terms:
n = z_α²σ_y²/D²
Note that σ_y is unknown. However, we may address this in two ways: we may estimate σ_y from a previous (or pilot) study, or we may approximate it from the range of the measurements as σ_y ≈ Range/4.
Note: You will always round up to the nearest integer. Rounding down does not make sense since you would merely be ensuring that your sample is fractionally too small to achieve the accuracy and confidence required.
Problem:
Consider the disk drive problem. You would like to estimate the seek time of disk drives to an accuracy of 0.01 ms with 95% confidence. What is the minimum sample size required to achieve this level of accuracy and confidence? Assume that a previous study indicates that seek times range between 9.8 ms and 10.2 ms.
Solution:
Since α is 95%, z_α = 1.96. We may estimate σ_y from the range R. That is, R = 10.2 - 9.8 = 0.4, hence σ_y = R/4 = 0.1. Also, D = 0.01 and so n = 1.96²(0.1²/0.01²) = 384.16. Since you always round up, 385 drives are required to achieve a level of accuracy of 0.01 ms with 95% confidence.
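The same computation can be wrapped in a small function. A sketch (the function name is mine; σ_y, D, and the confidence level are passed in as arguments):

    from math import ceil
    from scipy.stats import norm

    def sample_size(sigma_y, D, conf=0.95):
        """Minimum n so that the +/- amount of a conf-level interval is at most D."""
        z = norm.ppf(0.5 + conf / 2)
        return ceil((z * sigma_y / D) ** 2)   # always round up

    # Disk drive example: sigma_y = Range/4 = 0.1, D = 0.01 ms, 95% confidence
    print(sample_size(0.1, 0.01))             # 385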