CSC423/324 Data Analysis

Introduction/Review II

Following is the completion of the review started last class. Remember that this is material that you should have covered in a first class in data analysis or statistics.

Readings

Ott, Chapter 4, section 4.11 - 4.12
Ott, Chapter 5, section 5.2 - 5.4, 5.7

Inferences about m:

Last week I indicated that the objective of Data Analysis is to make inferences about a population based on information gleaned from a sample. These inferential problems may be classified into two broad categories each of which has several dimensions:

The Estimation problem.
The Hypothesis Testing problem.

Note also that these problems are usually formulated in terms of one or more population parameters. We will initially address the simplest of these problems, that is, problems that have to do with m only.

Estimation:

This section addresses the Estimation problem. There are three dimensions to this problem:

Under what circumstances is it reasonable to use a statistic to estimate the corresponding parameter?
i.e. The sample selection scheme question.
How accurate is the estimate?
i.e. The confidence interval question.
What is the minimum sample size required to estimate a parameter to a predefined level of accuracy?
i.e. The sample size determination question.

Sample Selection:

For valid statistical inference we must select items for our sample in such a way that we do not introduce selection bias. That is, the items must be selected in a random manner. Such samples are referred to as random samples.

Definition: A random sample is a sample selected in such a way that every item in the population has the same chance of being selected for inclusion in the sample.

There are several ways in which a random sample may be selected (see page 21 of Ott for an overview). In any case, all samples discussed in this class should be taken to be random samples. We use random samples to ensure that the sample is representative of the population and so our sample statistics will be good estimators of the corresponding population parameters.
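As a minimal sketch of one way to select such a sample, the following Python fragment draws a simple random sample without replacement. The population of drive serial numbers, the function name and the seed are all hypothetical and are used only for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every item is equally likely to be chosen."""
    rng = random.Random(seed)  # the seed is only there to make the run repeatable
    return rng.sample(population, n)

# Hypothetical population: serial numbers of 1,000 disk drives.
drive_ids = list(range(1, 1001))
sample = simple_random_sample(drive_ids, 100, seed=42)
print(len(sample), sample[:5])
```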

Confidence Intervals:

Essentially, the question of accuracy asks how we can compute a quantity that states how close a particular statistic is to the corresponding unknown parameter. To answer this question we first need to understand the behavior of sample statistics. That is, we need to discuss Sampling Distributions.

Sampling Distributions - (ybar)

Let n denote sample size and let y denote a measurement of interest from some population with mean m_y and standard deviation s_y.

Theorem (s_y known)

Let n be large (i.e. n >= 30). Consider all possible samples of size n that may be selected from the population. Compute the sample mean (i.e. ybar) in each case. The distribution of the ybar's will be approximately Normal with mean m_ybar=m_y and standard deviation s_ybar=s_y/sqrt(n).
Note: If n is small, then as long as the distribution of y is mound shaped, we can say that the ybar's are approximately Normal.
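The Theorem can be checked informally by simulation. The sketch below is an illustration under stated assumptions, not a proof: the population (uniform on [0, 1]) is hypothetical and deliberately not Normal, and 5,000 simulated samples stand in for "all possible samples". The mean and standard deviation of the simulated ybar's should land close to m_y and s_y/sqrt(n).

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: uniform on [0, 1].
# Its mean is 0.5 and its standard deviation is sqrt(1/12), about 0.2887.
m_y = 0.5
s_y = (1 / 12) ** 0.5
n = 36              # sample size (large, n >= 30)
num_samples = 5000  # stand-in for "all possible samples"

ybars = []
for _ in range(num_samples):
    sample = [rng.random() for _ in range(n)]
    ybars.append(statistics.fmean(sample))

mean_of_ybars = statistics.fmean(ybars)
sd_of_ybars = statistics.stdev(ybars)

print("mean of ybars:", mean_of_ybars, "theory:", m_y)
print("sd of ybars:  ", sd_of_ybars, "theory:", s_y / n ** 0.5)
```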

Problems

Consider the population of disk drives manufactured for a particular production run with mean seek time 10.00ms and standard deviation 0.10ms. What proportion of samples of size 100 would you expect to result in a ybar less than 9.98ms?

Solution: From the Theorem m_ybar=m_y=10ms and s_ybar=s_y/sqrt(n)=0.1/10=0.01ms. Hence z=(9.98-10)/0.01=-2 and so the desired proportion is 2.28%.

Consider the population of C++ programs in a portfolio with a mean development cost per line of code (LOC) of $3.00 and standard deviation $1.00. What proportion of samples of size 100 would you expect to result in a mean greater than $3.40?

Solution: From the Theorem m_ybar=m_y=3 and s_ybar=s_y/sqrt(n)=1/10=0.1. Hence z=(3.4-3)/0.1=4 and so the desired proportion is 0.003167%.
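The two proportions can also be obtained without a table. The sketch below uses Python's standard library class statistics.NormalDist; the helper name sampling_dist is hypothetical. The printed values should agree with the table-based answers above up to rounding.

```python
from statistics import NormalDist

def sampling_dist(m_y, s_y, n):
    """Sampling distribution of ybar from the Theorem: Normal(m_y, s_y / sqrt(n))."""
    return NormalDist(mu=m_y, sigma=s_y / n ** 0.5)

# Problem 1: proportion of samples of size 100 with ybar below 9.98ms.
p_drives = sampling_dist(10.00, 0.10, 100).cdf(9.98)

# Problem 2: proportion of samples of size 100 with mean cost above $3.40.
p_cost = 1 - sampling_dist(3.00, 1.00, 100).cdf(3.40)

print(f"{p_drives:.4%}")
print(f"{p_cost:.6%}")
```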

Theorem (s_y unknown)

Since the population standard deviation s_y is typically not known, we will use the following variation of the Theorem above. Notice that for this Theorem we assume that s_y has been estimated by the standard deviation of a sample selected from the population. We will consider both the case where n is large and the case where n is small.

Let n be large (i.e. n >= 30). Consider all possible samples of size n that may be selected from the population. Compute the sample mean (i.e. ybar) in each case. The distribution of the ybar's will be approximately Normal with mean m_ybar=m_y and standard deviation s_ybar=s_y/sqrt(n).
Let n be small (i.e. n < 30) and let the distribution of y be approximately Normal. Consider all possible samples of size n that may be selected from the population. Compute the sample mean (i.e. ybar) in each case. The distribution of the ybar's will be Student t with n-1 degrees of freedom and with mean m_ybar=m_y and standard deviation s_ybar=s_y/sqrt(n).
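The small-sample case can be illustrated with a simulation. In the sketch below (hypothetical Normal population, n = 5, 5,000 simulated samples) each ybar is standardized using the standard deviation estimated from its own sample. If these standardized values were Normal, about 5% would fall beyond +/-1.96. Under the Student t distribution with 4 degrees of freedom, about 5% should fall beyond the t table value 2.776 instead, and noticeably more than 5% beyond 1.96.

```python
import random
import statistics

rng = random.Random(7)

m_y, s_y = 10.0, 0.1   # hypothetical Normal population
n = 5                  # small sample
num_samples = 5000
z_crit = 1.96          # standard normal: area between -z and z is 95%
t_crit = 2.776         # t table: 4 degrees of freedom, 2.5% in the right tail

beyond_z = 0
beyond_t = 0
for _ in range(num_samples):
    sample = [rng.gauss(m_y, s_y) for _ in range(n)]
    ybar = statistics.fmean(sample)
    s = statistics.stdev(sample)            # standard deviation estimated from the sample
    stat = (ybar - m_y) / (s / n ** 0.5)    # ybar standardized with the estimate
    if abs(stat) > z_crit:
        beyond_z += 1
    if abs(stat) > t_crit:
        beyond_t += 1

frac_beyond_z = beyond_z / num_samples
frac_beyond_t = beyond_t / num_samples
print("beyond +/-1.96: ", frac_beyond_z)
print("beyond +/-2.776:", frac_beyond_t)
```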

Motivation - Confidence Intervals (m_y)

Let y denote a measurement of interest from some population with mean m_y and standard deviation s_y. Consider the sampling distribution of ybar for n large. For the sake of this motivation, let us consider an interval [L, U] constructed about m_y (i.e. m_ybar) thus:

L: m_y - 2s_ybar
U: m_y + 2s_ybar

Since we know that the sampling distribution of ybar is normal with mean m_ybar=m_y and standard deviation s_ybar=s_y/sqrt(n) then L and U define boundaries between which 95.45% of the ybar's for all possible samples of size n will be. You may think of these boundaries dividing the ybar's into those that are relatively close to m_ybar (and so m_y) and those that are relatively distant from m_ybar (and so m_y).

For a particular sample we would like to know if the sample mean (ybar) is one of those that is close to m_y. Unfortunately we do not know m_y and so we cannot tell whether the sample mean is close to m_y or not. However, we can address this issue indirectly by making use of our understanding of sampling distributions and proceeding thus:

Note: To keep the argument simple we will assume that s_y is known:

Construct an interval [L', U'] as we did above but about ybar:

L': ybar - 2s_ybar
U': ybar + 2s_ybar

State that m_y is in this interval. Notice that this statement will be true if our ybar happens to be one of those that is close (as defined above) to m_y.
Assert that we are 95.45% confident in our statement. We can safely do this since 95.45% of all possible samples will result in a ybar which will make our statement true.
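The argument above can be sketched as a simulation: draw many samples, build [L', U'] about each ybar, and count how often the statement "m_y is in this interval" is true. The population values below are hypothetical, s_y is assumed known as in the argument, and 2,000 simulated samples stand in for all possible samples. The observed fraction should be close to 95.45%.

```python
import random
import statistics

rng = random.Random(3)

m_y, s_y = 10.0, 0.1     # hypothetical population values (s_y assumed known)
n = 100
s_ybar = s_y / n ** 0.5
num_samples = 2000

hits = 0
for _ in range(num_samples):
    sample = [rng.gauss(m_y, s_y) for _ in range(n)]
    ybar = statistics.fmean(sample)
    lower = ybar - 2 * s_ybar   # L'
    upper = ybar + 2 * s_ybar   # U'
    if lower <= m_y <= upper:   # the statement "m_y is in [L', U']" is true
        hits += 1

coverage = hits / num_samples
print("fraction of intervals containing m_y:", coverage)
```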

This strategy captures the basic ideas of confidence interval theory. The only difference is that we will construct intervals that allow us to claim whatever confidence we desire. Also, since the population standard deviation s_y is typically not known, we will estimate it with the standard deviation computed from our sample. However, when n is small and the sample standard deviation is used in place of the population value, the distribution of the ybar's is no longer approximately normal (see Theorem above). Instead, the ybar's are Student t distributed with n - 1 df (see Ott, section 5.7).

Definition: An a% confidence interval for the population parameter m_y is an interval constructed from a sample mean (ybar) within which you expect m_y to be with a% confidence. There are two cases that need to be considered:

n large:

L': ybar - z_a(s_ybar)
U': ybar + z_a(s_ybar)

where s_ybar=s_y/sqrt(n) and z_a is the z value from the standard normal table so that the area between -z_a and z_a is a%.

n small:

L': ybar - t_{n-1;(100-a)/2}(s_ybar)
U': ybar + t_{n-1;(100-a)/2}(s_ybar)

where s_ybar=s_y/sqrt(n) and t_{n-1;(100-a)/2} is the t value from the standard "t" table in the row indexed by n-1 degrees of freedom and where the area to the right of t is (100-a)/2%.
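For the large-n case, z_a can be computed rather than read from a table. The sketch below uses statistics.NormalDist from the Python standard library; the function name z_value is hypothetical. If the area between -z_a and z_a is a%, then the area to the left of z_a is 50% + a/2%, which is the quantity passed to inv_cdf. The standard library has no equivalent for the t distribution, so t values are still taken from the t table here.

```python
from statistics import NormalDist

def z_value(a):
    """z such that the area between -z and z under the standard normal curve is a%."""
    return NormalDist().inv_cdf(0.5 + a / 200)

for a in (90, 95, 99):
    print(a, round(z_value(a), 3))
```

Tables that list z to two decimal places will show slightly rounded versions of these values.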

Terminology

s_ybar=s_y/sqrt(n) is often referred to as the "error" (more precisely, the standard error) in your estimate of m_y.
z₉₅(s_ybar) or t_{n-1;2.5}(s_ybar) is often referred to as the "margin of error".
The term "point estimate" refers to the value of the statistic used to estimate the corresponding parameter.

Problems:

Consider the population of disk drives manufactured from a particular production run. You are interested in estimating the mean seek time of this population and select a sample of n=100 drives for examination. You discover that the sample mean is 9.9ms with a standard deviation of 0.5ms.

Construct and interpret a 90% confidence interval for the population mean (m_y).
Construct and interpret a 99% confidence interval for the population mean (m_y).
Let the sample size be 25 instead of 100. Construct and interpret a 90% confidence interval for the population mean (m_y).
Let the sample size be 16 instead of 100. Construct and interpret a 99% confidence interval for the population mean (m_y).

This standard normal table provides the proportion under the standard normal curve between -z and z and so may be easier to use for these problems than the one supplied in Ott.

Solutions:

Since s_y=0.5 and n is large then s_ybar=s_y/sqrt(n)=0.5/10=0.05. Also, z_a=z₉₀=1.65, hence the 90% confidence interval for m_y is [9.9-1.65(0.05), 9.9+1.65(0.05)]=[9.8175, 9.9825]. Hence, you are 90% confident that the mean seek time for the population is between 9.8175ms and 9.9825ms.
In this case s_ybar=0.05 as above but z_a=z₉₉=2.58, hence the 99% confidence interval for m_y is [9.9-2.58(0.05), 9.9+2.58(0.05)]=[9.771, 10.029]. Hence, you are 99% confident that the mean seek time for the population is between 9.771ms and 10.029ms.
Since n is small then s_ybar=s_y/sqrt(n)=0.5/5=0.1. Also, t_{n-1;(100-a)/2}=t_{24;5}=1.711, hence the 90% confidence interval for m_y is [9.9-1.711(0.1), 9.9+1.711(0.1)]=[9.7289, 10.0711]. Hence, you are 90% confident that the mean seek time for the population is between 9.7289ms and 10.0711ms.
Since n is small then s_ybar=s_y/sqrt(n)=0.5/4=0.125. Also, t_{n-1;(100-a)/2}=t_{15;0.5}=2.947, hence the 99% confidence interval for m_y is [9.9-2.947(0.125), 9.9+2.947(0.125)]=[9.531625, 10.268375]. Hence, you are 99% confident that the mean seek time for the population is between 9.531625ms and 10.268375ms.

Note: Notice the tradeoff between confidence and interval length. Also, observe that for a particular interval length you must increase sample size to obtain a higher level of confidence.
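The four intervals can be recomputed with a small helper, which also makes the tradeoff between confidence and interval length visible. This is a sketch under stated assumptions: the names interval and z_value are hypothetical, the z values are computed exactly rather than read from a table (so the large-n results may differ from table-based answers in the third decimal place), and the two t values are the t table values quoted in the solutions.

```python
from statistics import NormalDist

def z_value(a):
    """z such that the area between -z and z under the standard normal curve is a%."""
    return NormalDist().inv_cdf(0.5 + a / 200)

def interval(ybar, s_y, n, crit):
    """Return (L', U') = ybar -/+ crit * s_y / sqrt(n)."""
    margin = crit * s_y / n ** 0.5
    return ybar - margin, ybar + margin

ybar, s_y = 9.9, 0.5
cases = [
    ("90%, n=100", 100, z_value(90)),
    ("99%, n=100", 100, z_value(99)),
    ("90%, n=25", 25, 1.711),   # t table: 24 df, 5% in the right tail
    ("99%, n=16", 16, 2.947),   # t table: 15 df, 0.5% in the right tail
]

results = {}
for label, n, crit in cases:
    low, high = interval(ybar, s_y, n, crit)
    results[label] = (low, high)
    print(f"{label}: [{low:.4f}, {high:.4f}], length {high - low:.4f}")
```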

Sample Size Determination

Motivation

Consider the situation where you would like to estimate a population parameter to some "level of accuracy" for a particular "level of confidence". You know that a random sample is necessary but you do not know the minimum sample size necessary to achieve the desired accuracy and confidence.

This is the sample size determination problem. We may utilize confidence interval theory to determine the sample size. We do so by treating the "level of accuracy" as the plus/minus amount in our confidence interval expression and solving for n.

Definition (m_y)

Let n denote the sample size, D the level of accuracy and a the level of confidence. An a% confidence interval for the population parameter m_y may be expressed thus:

L': ybar - D
U': ybar + D

where:

D = z_a(s_ybar) = z_a(s_y/sqrt(n))

hence, squaring and rearranging terms:

D² = z_a²(s_y²/n)
n = z_a²(s_y²/D²)

Note that s_y is unknown. However, we may address this in two ways:

Conduct a pilot study. That is, select a sample (about 30 items) and determine s_y from the sample.
s_y may be available from another study.

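Note: When only the range R of the measurements is known, s_y may be roughly approximated from it, i.e. s_y = R/4, as is done in the problem below.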

Note: You will always round up to the next integer. Rounding down does not make sense since you would merely be ensuring that your sample is fractionally too small to achieve the accuracy and confidence required.
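The formula n = z_a²(s_y²/D²) together with the rounding rule can be sketched as follows. The function name min_sample_size and the input values (s_y = 0.5 from a pilot study, D = 0.1, 95% confidence) are hypothetical and chosen only for illustration; z_a is computed with statistics.NormalDist rather than read from a table.

```python
import math
from statistics import NormalDist

def min_sample_size(a, s_y, D):
    """Smallest n for which z_a * s_y / sqrt(n) is at most D (always rounded up)."""
    z = NormalDist().inv_cdf(0.5 + a / 200)   # area between -z and z is a%
    return math.ceil((z * s_y / D) ** 2)

# Hypothetical inputs: s_y = 0.5 from a pilot study, accuracy D = 0.1, 95% confidence.
n_required = min_sample_size(95, 0.5, 0.1)
print(n_required)
```

Because D appears squared in the denominator, halving D roughly quadruples the required sample size.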

Problem:

Consider the disk drive problem. You would like to estimate the mean seek time of disk drives to an accuracy of 0.01ms with 95% confidence. What is the minimum sample size required to achieve this level of accuracy and confidence? Assume that a previous study indicates that seek times range between 9.8ms and 10.2ms.

Solution:

Since a is 95% then z_a=1.96. We may estimate s_y from the range R. That is, R=10.2-9.8=0.4 hence s_y=R/4=0.1. Also, D = 0.01 and so n=1.96²(0.1²/0.01²)=384.16. Since you always round up, 385 drives are required to achieve a level of accuracy of 0.01ms with 95% confidence.