Lecture 10/14

Sample Size Determination

Motivation

Consider the situation where you would like to estimate a population parameter to some "level of accuracy" for a particular "level of confidence". You know that a probability sample is necessary but you do not know the minimum sample size necessary to achieve the desired accuracy and confidence.

This is the sample size determination problem. We may utilize confidence interval theory to determine the sample size. We do so by treating the "level of accuracy" as the plus/minus amount in our confidence interval expression and solving for n.

Definition (m_y)

Let n denote the sample size, D the level of accuracy and a the level of confidence. An a% confidence interval for the population parameter m_y may be expressed thus:

L': ybar - D

U': ybar + D

where:

D = z_a(s_ybar) = z_a(s_y/sqrt(n))

hence, squaring and rearranging terms:

D² = z_a²(s_y²/n)

n = z_a²(s_y²/D²)

Note that s_y is unknown. However, we may address this in one of three ways:

s_y may be available from another study.
The range (i.e. R=max - min) may be available from another study in which case we estimate s_y=R/4.
Conduct a pilot study. That is, select a small sample and determine s_y from the sample.

Note: Always round n up to the next integer. Rounding down does not make sense, since the sample would then be fractionally too small to achieve the accuracy and confidence required.
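The formula and the rounding rule above can be sketched in Python as follows. This is a minimal illustration, not part of the lecture: the function names `z_value`, `sigma_from_range` and `sample_size_mean` are hypothetical, the confidence level is passed as a fraction (0.95 rather than 95), and the critical value z_a is taken from the standard normal quantile function in Python's `statistics` module instead of a table.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided critical value z_a for a confidence level given as a fraction."""
    # A two-sided interval leaves (1 - confidence) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def sigma_from_range(minimum, maximum):
    """Rough estimate of s_y from a reported range, using s_y = R / 4."""
    return (maximum - minimum) / 4.0


def sample_size_mean(accuracy, sigma, confidence):
    """Smallest integer n with z_a * sigma / sqrt(n) <= accuracy."""
    z = z_value(confidence)
    n = (z * sigma / accuracy) ** 2
    return math.ceil(n)  # always round up, never down


if __name__ == "__main__":
    # Hypothetical inputs: values reported between 40 and 48,
    # desired accuracy 0.5, 90% confidence.
    s_y = sigma_from_range(40.0, 48.0)
    print(sample_size_mean(0.5, s_y, 0.90))
```

Taking s_y as an argument keeps the function usable with any of the three sources listed above (another study, the range rule, or a pilot study).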

Problem:

Consider the disk drive problem. You would like to estimate the mean seek time of disk drives to an accuracy of 0.01ms with 95% confidence. What is the minimum sample size required to achieve this level of accuracy and confidence? Assume that a previous study indicates that seek times range between 9.8ms and 10.2ms.

Solution:

Since a is 95%, z_a=1.96. We may estimate s_y from the range R. That is, R=10.2-9.8=0.4 hence s_y=R/4=0.1. Also, D = 0.01 and so n=1.96²(0.1²/0.01²)=384.16. Since you always round up, 385 drives are required to achieve a level of accuracy of 0.01ms with 95% confidence.
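Because D enters the formula squared, the required sample size is very sensitive to the accuracy demanded. The sketch below tabulates n for the disk drive setting (s_y = 0.1) at three accuracies. It assumes the commonly tabulated critical value 1.96 for 95% confidence, and the function name `n_for_accuracy` is made up for illustration.

```python
import math


def n_for_accuracy(z, sigma, accuracy):
    """Sample size n = z^2 * sigma^2 / D^2, rounded up to an integer."""
    n = (z * sigma / accuracy) ** 2
    # round() strips floating-point noise so an exact integer is not bumped up
    return math.ceil(round(n, 9))


Z_95 = 1.96  # assumed table value for 95% confidence
S_Y = 0.1    # range rule: (10.2 - 9.8) / 4

for accuracy in (0.02, 0.01, 0.005):
    print(accuracy, n_for_accuracy(Z_95, S_Y, accuracy))
```

Each halving of D roughly quadruples n, so the accuracy requirement is usually the main driver of sampling cost.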

Definition (p)

Let n denote the sample size, D the level of accuracy and a the level of confidence. An a% confidence interval for the population parameter p may be expressed thus:

L': p - D

U': p + D

where:

D = z_a(s_p) = z_a(sqrt(p(1-p)/n))

hence, squaring and rearranging terms:

D² = z_a²(p(1-p)/n)

n = z_a²(p(1-p)/D²)

Note that p is unknown. However, we may address this in one of three ways:

p may be available from another study.
Conduct a pilot study. That is, select a small sample and determine p from the sample.
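Let p=0.5 (i.e. 50%). Since p(1-p) is largest when p=0.5, this choice gives a conservative estimate of n. That is, n is large enough to achieve the desired accuracy and confidence whatever the true value of p, so it is the minimum sample size in the absence of prior knowledge of p. Prior knowledge that p differs from 0.5 will result in a smaller value of n.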
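To see why p=0.5 is the conservative choice, the following sketch evaluates the formula over a few values of p. It is an illustration under stated assumptions: the function name `sample_size_proportion` is hypothetical, the confidence level is passed as a fraction, the critical value comes from the standard normal quantile function rather than a table, and the accuracy of 0.03 at 95% confidence is an arbitrary example setting.

```python
import math
from statistics import NormalDist


def sample_size_proportion(accuracy, confidence, p=0.5):
    """Smallest integer n with z_a * sqrt(p(1-p)/n) <= accuracy."""
    # Two-sided critical value for the given confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n = z ** 2 * p * (1.0 - p) / accuracy ** 2
    return math.ceil(n)  # always round up


if __name__ == "__main__":
    # Hypothetical setting: accuracy 0.03 at 95% confidence.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, sample_size_proportion(0.03, 0.95, p))
```

The printed sample sizes peak at p=0.5 and fall off symmetrically on either side, which is why defaulting to p=0.5 is safe when nothing is known about p.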

Problem:

Consider the disk drive problem. You would like to estimate the proportion of defective drives to an accuracy of 0.05 (i.e. 5%) with 99% confidence. What is the minimum sample size required to achieve this level of accuracy and confidence?

Solution:

In this case a is 99%, so z_a=2.6 (2.576 rounded to one decimal place). No estimate of p is available from another study, so we use p=0.5 (i.e. 50%). Since D = 0.05, n=2.6²(0.5(1-0.5)/0.05²)=676. So 676 drives are required to achieve a level of accuracy of 0.05 (i.e. 5%) with 99% confidence.

Problem:

Consider the disk drive problem again. You would like to estimate the proportion of defective drives to an accuracy of 0.05 (i.e. 5%) with 99% confidence as before, but you happen to know that the proportion of defective drives from a previous sample is 20%. How does this knowledge affect the minimum sample size required?

Solution:

In this case p=0.2 (i.e. 20%). Since D = 0.05 then n=2.6²(0.2(1-0.2)/0.05²)=432.64. Hence only 433 drives are required to achieve a level of accuracy of 0.05 (i.e. 5%) with 99% confidence given this prior knowledge of p.
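The two proportion problems can be compared side by side with a short script. This is a sketch only: `n_for_proportion` is a hypothetical helper, and it takes the critical value as an argument so that the table value 2.6 used in the solutions can be passed in directly.

```python
import math


def n_for_proportion(z, p, accuracy):
    """Sample size n = z^2 * p(1-p) / D^2, rounded up to an integer."""
    n = z ** 2 * p * (1.0 - p) / accuracy ** 2
    # round() strips floating-point noise so an exact integer is not bumped up
    return math.ceil(round(n, 9))


Z_99 = 2.6  # table value used in the solutions above

n_no_prior = n_for_proportion(Z_99, 0.5, 0.05)  # no knowledge of p
n_prior = n_for_proportion(Z_99, 0.2, 0.05)     # previous sample: p = 0.2
saving = n_no_prior - n_prior

print(n_no_prior, n_prior, saving)
```

With these inputs the prior estimate of p reduces the requirement by 243 drives, a little over a third of the conservative figure.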