Introduction/Review II

What follows completes the review started last class. Remember that this is material you should have covered in a first class in data analysis or statistics.

Readings

  1. Ott, Chapter 4, sections 4.11 - 4.12
  2. Ott, Chapter 5, sections 5.2 - 5.4, 5.7

Inferences about m:

Last week I indicated that the objective of Data Analysis is to make inferences about a population based on information gleaned from a sample. These inferential problems may be classified into two broad categories each of which has several dimensions:

  1. The Estimation problem.
  2. The Hypothesis Testing problem.

Note also that these problems are usually formulated in terms of one or more population parameters. We will initially address the simplest of these problems. That is, problems that have to do with the population mean m only.

Estimation:

This section addresses the Estimation problem. There are three dimensions to this problem:

  1. Under what circumstances is it reasonable to use a statistic to estimate the corresponding parameter?
    i.e. The sample selection scheme question.
  2. How accurate is the estimate?
    i.e. The confidence interval question.
  3. What is the minimum sample size required to estimate a parameter to a predefined level of accuracy?
    i.e. The sample size determination question.

Sample Selection:

For valid statistical inference we must select items for our sample in such a way that we do not introduce selection bias. That is, the items must be selected in a random manner. Such samples are referred to as random samples.

Definition: A random sample is a sample selected in such a way that every item in the population has the same chance of being selected for inclusion in the sample.

There are several ways in which a random sample may be selected (see page 21 of Ott for an overview). In any case, all samples discussed in this class should be taken to be random samples. We use random samples to ensure that the sample is representative of the population and so our sample statistics will be good estimators of the corresponding population parameters.

Confidence Intervals:

Essentially, the question of accuracy is how to compute a quantity that lets us state how close a particular statistic is to the corresponding unknown parameter. To answer this question we first need to understand the behavior of sample statistics. That is, we need to discuss Sampling Distributions.

Sampling Distributions - (ybar)

Let n denote sample size and let y denote a measurement of interest from some population with mean my and standard deviation sy.

Theorem (sy known)

Let n be large (i.e. n >= 30). Consider all possible samples of size n that may be selected from the population. Compute the sample mean (i.e. ybar) in each case. The distribution of the ybar’s will be approximately Normal with mean mybar=my and standard deviation sybar=sy/sqrt(n).
Note: If n is small, then as long as the distribution of y is mound shaped, we can say that the ybars are approximately Normal.
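
The Theorem can be checked informally by simulation. The sketch below is only an illustration: the exponential population (mean 1, standard deviation 1), the number of replicates and the seed are all assumptions chosen for the example and do not come from Ott.

```python
import random
import statistics


def simulate_sample_means(draw, n, replicates, seed=0):
    """Return the means of `replicates` random samples, each of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(replicates)
    ]


def draw_exponential(rng):
    """Hypothetical skewed population with mean 1 and standard deviation 1."""
    return rng.expovariate(1.0)


n = 100
means = simulate_sample_means(draw_exponential, n, replicates=2000)
mean_of_means = statistics.fmean(means)   # should be close to my = 1
sd_of_means = statistics.stdev(means)     # should be close to sy/sqrt(n) = 0.1

print(f"mean of the ybar's: {mean_of_means:.3f}")
print(f"sd of the ybar's:   {sd_of_means:.3f}")
```

Even though this population is far from mound shaped, the simulated ybar's should centre near my with a spread near sy/sqrt(n), which is what the Theorem predicts for large n.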

Problems

  1. Consider the population of disk drives manufactured for a particular production run with mean seek time 10.00ms and standard deviation 0.10ms. What proportion of samples of size 100 would you expect to result in a ybar less than 9.98ms?

    Solution: From the Theorem mybar=my=10ms and sybar=sy/sqrt(n)=0.1/10=0.01ms. Hence z=(9.98-10)/0.01=-2 and so the desired proportion is 2.28%.

  2. Consider the population of C++ programs in a portfolio with a mean development cost per line of code (LOC) of $3.00 and standard deviation $1.00. What proportion of samples of size 100 would you expect to result in a mean greater than $3.40?

    Solution: From the Theorem mybar=my=3 and sybar=sy/sqrt(n)=1/10=0.1. Hence z=(3.4-3)/0.1=4 and so the desired proportion is 0.003167%.
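
The two solutions above can be reproduced without a printed table. This is a minimal sketch using the NormalDist class from Python's standard statistics module; the helper name proportion_below is made up for this example.

```python
from statistics import NormalDist


def proportion_below(cutoff, mu, sigma, n):
    """Proportion of samples of size n whose ybar falls below cutoff,
    using the sampling distribution Normal(mu, sigma/sqrt(n))."""
    sampling_distribution = NormalDist(mu, sigma / n ** 0.5)
    return sampling_distribution.cdf(cutoff)


# Problem 1: disk drive seek times, ybar below 9.98ms
p1 = proportion_below(9.98, mu=10.0, sigma=0.10, n=100)

# Problem 2: cost per LOC, ybar above $3.40 (upper tail = 1 - lower tail)
p2 = 1 - proportion_below(3.40, mu=3.0, sigma=1.0, n=100)

print(f"P(ybar < 9.98) = {p1:.4%}")
print(f"P(ybar > 3.40) = {p2:.6%}")
```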

 

Theorem (sy unknown)

Since sy is typically not known, we will use the following variation of the Theorem above. Notice that for this Theorem we assume that sy has been determined from a sample selected from the population. We will consider both the case where n is large and the case where n is small.

  1. Let n be large (i.e. n >= 30). Consider all possible samples of size n that may be selected from the population. Compute the sample mean (i.e. ybar) in each case. The distribution of the ybar’s will be Normal with mean mybar=my and standard deviation sybar=sy/sqrt(n).
  2. Let n be small (i.e. n < 30). Consider all possible samples of size n that may be selected from the population. Compute the sample mean (i.e. ybar) in each case. The distribution of the ybar’s will be Student t with n-1 degrees of freedom and with mean mybar=my and standard deviation sybar=sy/sqrt(n).

Motivation - Confidence Intervals (my)

Let y denote a measurement of interest from some population with mean my and standard deviation sy. Consider the sampling distribution of ybar for n large. For the sake of this motivation, let us consider an interval [L, U] constructed about my (i.e. mybar) thus:

L: my - 2(sybar)
U: my + 2(sybar)

Since we know that the sampling distribution of ybar is normal with mean mybar=my and standard deviation sybar=sy/sqrt(n) then L and U define boundaries between which 95.45% of the ybar's for all possible samples of size n will be. You may think of these boundaries dividing the ybar's into those that are relatively close to mybar (and so my) and those that are relatively distant from mybar (and so my).

For a particular sample we would like to know if the sample mean (ybar) is one of those that is close to my. Unfortunately we do not know my and so we cannot tell whether the sample mean is close to my or not. However we can address this issue indirectly by making use of our understanding of sampling distributions and proceeding thus:

Note: To keep the argument simple we will assume that sy is known:

  1. Construct an interval [L', U'] as we did above but about ybar:

    L': ybar - 2(sybar)
    U': ybar + 2(sybar)

  2. State that my is in this interval. Notice that this statement will be true if our ybar happens to be one of those that is close (as defined above) to my.
  3. Assert that we are 95.45% confident in our statement. We can safely do this since 95.45% of all possible samples will result in a ybar which will make our statement true.
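
The claim that about 95.45% of all possible samples lead to a true statement can be illustrated by simulation. The sketch below assumes a Normal population with made-up values (my = 10, sy = 0.1, n = 50) and a fixed seed; it counts how often the interval ybar plus or minus 2 standard errors actually contains my.

```python
import random
import statistics


def coverage(mu, sigma, n, replicates, seed=1):
    """Fraction of simulated samples whose interval
    [ybar - 2*sybar, ybar + 2*sybar] contains the true mean mu."""
    rng = random.Random(seed)
    standard_error = sigma / n ** 0.5   # sy is treated as known here
    hits = 0
    for _ in range(replicates):
        ybar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if ybar - 2 * standard_error <= mu <= ybar + 2 * standard_error:
            hits += 1
    return hits / replicates


rate = coverage(mu=10.0, sigma=0.1, n=50, replicates=4000)
print(f"fraction of intervals containing my: {rate:.4f}")
```

The printed fraction should land close to 0.9545, though it will not match exactly because only a finite number of samples is drawn.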

This strategy captures the basic ideas of confidence interval theory. The only difference is that we will construct intervals that allow us to claim whatever level of confidence we desire. Also, since the population standard deviation sy is typically not known, we will estimate it with the standard deviation computed from our sample (also written sy in these notes). However, when n is small and the sample standard deviation is used in this way, the distribution of the ybars is no longer approximately normal (see Theorem above). Instead, the ybars are Student t distributed with n - 1 df (see Ott, section 5.7).

Definition: An a% confidence interval for the population parameter my is an interval constructed from a sample mean (ybar) within which you expect my to be with a% confidence. There are two cases that need to be considered:

  1. n large:

    L': ybar - za(sybar)
    U': ybar + za(sybar)

    where sybar=sy/sqrt(n) and za is the z value from the standard normal table such that the area between -za and za is a%.

  2. n small:

    L': ybar - tn-1;(100-a)/2(sybar)
    U': ybar + tn-1;(100-a)/2(sybar)

    where sybar=sy/sqrt(n) and tn-1;(100-a)/2 is the t value from the standard "t" table in the row indexed by n-1 degrees of freedom and where the area to the right of t is (100-a)/2%.
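
For the large-n case, za can be computed instead of read from a table. The following is a small sketch using NormalDist from Python's standard statistics module; the function name z_critical is an assumption for this example. The standard library has no Student t distribution, so for the small-n case the t value is still taken from the printed table.

```python
from statistics import NormalDist


def z_critical(confidence_pct):
    """Return z such that the area between -z and z under the standard
    normal curve is confidence_pct percent."""
    upper_tail = (100 - confidence_pct) / 200   # area to the right of z
    return NormalDist().inv_cdf(1 - upper_tail)


standard_normal = NormalDist()
for a in (90, 95, 99):
    z = z_critical(a)
    area = standard_normal.cdf(z) - standard_normal.cdf(-z)
    print(f"a = {a}%: za = {z:.3f}, area between -za and za = {area:.4f}")
```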

Terminology

  1. sybar=sy/sqrt(n) is often referred to as the "error" (more precisely, the standard error) in your estimate of my.
  2. z95(sybar) or tn-1;2.5(sybar) is often referred to as the "margin of error".
  3. The term "point estimate" refers to the value of the statistic used to estimate the corresponding parameter.

Problems:

Consider the population of disk drives manufactured from a particular production run. You are interested in estimating the mean seek time of this population and select a sample of n=100 drives for examination. You discover that the sample mean is 9.9ms with a standard deviation of 0.5ms.

  1. Construct and interpret a 90% confidence interval for the population mean (my).
  2. Construct and interpret a 99% confidence interval for the population mean (my).
  3. Let the sample size be 25 instead of 100. Construct and interpret a 90% confidence interval for the population mean (my).
  4. Let the sample size be 16 instead of 100. Construct and interpret a 99% confidence interval for the population mean (my).
This standard normal table provides the proportion under the standard normal curve between -z and z and so may be easier to use for these problems than the one supplied in Ott.

Solutions:

  1. Since sy=0.5 and n is large then sybar=sy/sqrt(n)=0.5/10=0.05. Also, za=z90=1.65, hence the 90% confidence interval for my is [9.9-1.65(0.05), 9.9+1.65(0.05)]=[9.8175, 9.9825]. Hence, you are 90% confident that the mean seek time for the population is between 9.8175ms and 9.9825ms.
  2. In this case sybar=0.05 as above but za=z99=2.58, hence the 99% confidence interval for my is [9.9-2.58(0.05), 9.9+2.58(0.05)]=[9.771, 10.029]. Hence, you are 99% confident that the mean seek time for the population is between 9.771ms and 10.029ms.
  3. Since n is small then sybar=sy/sqrt(n)=0.5/5=0.1. Also, tn-1;(100-a)/2=t24;5=1.711, hence the 90% confidence interval for my is [9.9-1.711(0.1), 9.9+1.711(0.1)]=[9.7289, 10.0711]. Hence, you are 90% confident that the mean seek time for the population is between 9.7289ms and 10.0711ms.
  4. Since n is small then sybar=sy/sqrt(n)=0.5/4=0.125. Also, tn-1;(100-a)/2=t15;0.5=2.947, hence the 99% confidence interval for my is [9.9-2.947(0.125), 9.9+2.947(0.125)]=[9.531625, 10.268375]. Hence, you are 99% confident that the mean seek time for the population is between 9.531625ms and 10.268375ms.
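
The four intervals can also be checked with a few lines of code. This is a sketch under stated assumptions: the helper names are invented for the example, the z values are computed from the standard normal distribution rather than rounded from a table (so the large-n endpoints may differ from the hand calculation in the third decimal place), and the t values 1.711 and 2.947 are the table values quoted in the solutions.

```python
from statistics import NormalDist


def z_critical(confidence_pct):
    """z with confidence_pct percent of the area between -z and z."""
    return NormalDist().inv_cdf(1 - (100 - confidence_pct) / 200)


def confidence_interval(ybar, s, n, critical_value):
    """Interval ybar +/- critical_value * s / sqrt(n)."""
    margin = critical_value * s / n ** 0.5
    return ybar - margin, ybar + margin


ybar, s = 9.9, 0.5   # sample mean and standard deviation, in ms
cases = [
    ("90%, n=100", 100, z_critical(90)),
    ("99%, n=100", 100, z_critical(99)),
    ("90%, n=25", 25, 1.711),    # t table value, 24 df, 5% upper tail
    ("99%, n=16", 16, 2.947),    # t table value, 15 df, 0.5% upper tail
]
for label, n, critical_value in cases:
    lower, upper = confidence_interval(ybar, s, n, critical_value)
    print(f"{label}: [{lower:.4f}, {upper:.4f}]")
```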

Note: Notice the tradeoff between confidence and interval length. Also, observe that for a particular interval length you must increase sample size to obtain a higher level of confidence.

 

Sample Size Determination

Motivation

Consider the situation where you would like to estimate a population parameter to some "level of accuracy" for a particular "level of confidence". You know that a random sample is necessary but you do not know the minimum sample size necessary to achieve the desired accuracy and confidence.

This is the sample size determination problem. We may utilize confidence interval theory to determine the sample size. We do so by treating the "level of accuracy" as the plus/minus amount in our confidence interval expression and solving for n.

Definition (my)

Let n denote the sample size, D the level of accuracy and a the level of confidence. An a% confidence interval for the population parameter my may be expressed thus:

L': ybar - D
U': ybar + D

where:

D = za(sybar) = za(sy/sqrt(n))

hence, squaring and rearranging terms:

D^2 = za^2 (sy^2/n)
n = za^2 (sy^2/D^2)

Note that sy is unknown. However, we may address this in two ways:

  1. Conduct a pilot study. That is, select a sample (about 30 items) and determine sy from the sample.
  2. sy may be available from another study.

    Note: In some cases another study is available but sy is not. If the minimum and maximum values are known then we may compute R (i.e. the range) and estimate sy by sy=R/4.

Note: You will always round up to the next integer. Rounding down does not make sense since you would merely be ensuring that your sample is fractionally too small to achieve the accuracy and confidence required.
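
The sample size formula, including the round-up rule, can be sketched as a small function. The name min_sample_size and the inputs below (sy = 2.0, 90% confidence) are hypothetical values chosen only to show how n grows as the required accuracy D shrinks.

```python
import math
from statistics import NormalDist


def min_sample_size(sigma, accuracy, confidence_pct):
    """Smallest n with z * sigma / sqrt(n) <= accuracy,
    i.e. n = z^2 * sigma^2 / D^2 rounded up."""
    z = NormalDist().inv_cdf(1 - (100 - confidence_pct) / 200)
    return math.ceil((z * sigma / accuracy) ** 2)


# Halving D roughly quadruples the required sample size.
for accuracy in (1.0, 0.5, 0.25):
    n = min_sample_size(sigma=2.0, accuracy=accuracy, confidence_pct=90)
    print(f"D = {accuracy}: n = {n}")
```

Because D appears squared in the denominator, tightening the accuracy is expensive: each halving of D costs about four times as many observations.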

Problem:

Consider the disk drive problem. You would like to estimate the seek time of disk drives to an accuracy of 0.01ms with 95% confidence. What is the minimum sample size required to achieve this level of accuracy and confidence? Assume that a previous study indicates that seek times range between 9.8ms and 10.2ms.

Solution:

Since a is 95% then za=1.96. We may estimate sy from the range R. That is, R=10.2-9.8=0.4 hence sy=R/4=0.1. Also, D = 0.01 and so n=1.96^2(0.1^2/0.01^2)=384.16. Since you always round up, 385 drives are required to achieve a level of accuracy of 0.01ms with 95% confidence.
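
As a check on this kind of calculation, the sketch below works the disk drive problem with the z value computed from the standard normal distribution (about 1.96 for 95% confidence) and sy estimated by the range rule R/4. The variable names are made up for the example.

```python
import math
from statistics import NormalDist

low, high = 9.8, 10.2            # seek time range from the previous study, in ms
sigma_estimate = (high - low) / 4    # range rule: sy is roughly R/4
accuracy = 0.01                  # D, in ms
z = NormalDist().inv_cdf(0.975)  # 95% of the area lies between -z and z

n_exact = (z * sigma_estimate / accuracy) ** 2
n_required = math.ceil(n_exact)  # always round up

print(f"z = {z:.3f}, sy estimate = {sigma_estimate:.3f}")
print(f"n before rounding = {n_exact:.2f}, drives required = {n_required}")
```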