Inferences - Two Populations ($\mu$)

Readings:

Ott; 6.2-6.3, 6.4-6.6

Inferences about 2 population means:

Consider two populations where $y_1$, $y_2$ are the measurements of interest of the respective populations. Let $\mu_{y_1}$, $\sigma_{y_1}$ be the mean and standard deviation of $y_1$, and $\mu_{y_2}$, $\sigma_{y_2}$ the mean and standard deviation of $y_2$.

We are interested in inferences about the difference between $\mu_{y_1}$ and $\mu_{y_2}$. As with inferences about a single mean $\mu_y$ (see week 2 and week 3 notes), these inferential problems may be classified into two broad categories:

  1. The Estimation problem.
  2. The Hypothesis Testing problem.

To illustrate, let us consider the following examples:

  1. CTI-02 graduates problem.

    Let us say we are interested in CS and SE graduates only. We may think of these two groups as two populations, CTI-02(CS) and CTI-02(SE), where $y_{CS}$ and $y_{SE}$ correspond to $y_1$ and $y_2$ mentioned above and represent the starting salary of each member of the respective group.

    Estimation: We may be interested in estimating the difference in the mean starting salary (i.e. $\mu_{CS} - \mu_{SE}$). If so, we may need to address the accuracy of our estimate, or we may need to determine the minimum sample size required to estimate this difference to a predefined level of accuracy.

    Hypothesis Testing: We may be interested in determining which of two contrasting points of view about the difference in the mean starting salary (i.e. $\mu_{CS} - \mu_{SE} = 0$ vs. $\mu_{CS} - \mu_{SE} < 0$) is more reasonable.

  2. Algorithm implementation problem.

    Consider Project #1. We may think of the set of all decoding times for each of the two implementations of the MLP decoding algorithm as a population. That is, $y_{IHD}$ and $y_{OTS}$ correspond to $y_1$ and $y_2$ mentioned above and represent decoding times.

    Estimation: We may be interested in estimating the difference in the mean decoding time (i.e. $\mu_{IHD} - \mu_{OTS}$). If so, we may need to address the accuracy of our estimate, or we may need to determine the minimum sample size required to estimate this difference to a predefined level of accuracy.

    Hypothesis Testing: (see Project #1).

     

It should be clear that as long as our samples are random samples, we may use $(\bar{y}_1 - \bar{y}_2)$ to estimate $(\mu_{y_1} - \mu_{y_2})$. It should also be clear that to address questions of accuracy and sample size determination, as well as the hypothesis testing problem, we must first address the sampling distribution of $(\bar{y}_1 - \bar{y}_2)$.

Note however that problems that involve inferences about two population means fall into two distinct categories that require very different approaches to solving the problems. These categories are referred to as:

  1. Paired Sample problem:

    The algorithm implementation problem mentioned above falls into this category (see Project #1). For paired sample problems, each value in the sample selected from one population is paired, or matched, with a specific value in the sample selected from the other population. Often, as in Project #1, the corresponding values are taken from the same item (for Project #1, an audio datastream). However, this is not always the case. The important point is that the values are paired or matched in some way: either the values come from the same item, or the items from which the values are obtained share some underlying matching characteristic. Pairing (or matching) is used as an experimental device to mitigate the effect of extraneous sources of variation.

    e.g. Consider the problem of comparing two different GUIs. Let us say we decide to use twins for our study, where one twin is observed completing a series of tasks with one GUI and the other twin is observed completing the same tasks with the other GUI. This is a paired sample problem since the values obtained for each sample are matched.

    Note: Matching may occur in a variety of ways. One common approach is to match by some form of testing, such as a psychological test or a standardized test.

  2. Independent Sample problem:

    For these problems, values in the sample selected from one population are independent of the values in the sample selected from the other population.

    e.g. Consider the CTI-02(CS), CTI-02(SE) problem above. The values in each sample are clearly unrelated to the values in the other sample.

Note:

If you are uncertain if a particular problem is a paired or independent sample problem then there are a few simple things that you can do. First, examine the size of the respective samples. If the sizes are different then it is an independent sample problem. If the sample sizes are the same then, from the description of the problem, see if pairing or matching is suggested. If it is then it is a paired problem and if not it is an independent sample problem.

 

Paired Sample Problem

As you discovered for Project #1, the paired sample problem may be solved using the techniques developed for problems that involve inferences about a single mean. Since the values are paired/matched, we can analyze the differences between the paired values instead of the values themselves. By so doing, we transform each pair of values into a single value, and so we need not consider the sampling distribution of $\bar{y}_1 - \bar{y}_2$. Instead we simply consider the sampling distribution of $\bar{d}$, which is equivalent to considering the sampling distribution of $\bar{y}$ for a single sample. That is, rather than analyzing the difference between means, we simply analyze the mean of differences.

Note: To see this, let $d_i = y_{1i} - y_{2i}$. Using basic algebra, show that $\bar{d} = \bar{y}_1 - \bar{y}_2$. Similarly, you may show that $\mu_d = \mu_{y_1} - \mu_{y_2}$.
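The identity in the note can be checked numerically before proving it. The sketch below uses made-up paired decoding times (the values and the variable names are illustrative assumptions, not data from Project #1) and compares the mean of the differences with the difference of the means.

```python
from statistics import mean

# Hypothetical paired measurements: the same five items measured under
# two implementations (illustrative values only).
y1 = [12.1, 10.4, 11.8, 13.0, 9.7]
y2 = [11.5, 10.9, 11.1, 12.2, 9.9]

# One difference per pair: d_i = y_1i - y_2i
d = [a - b for a, b in zip(y1, y2)]

d_bar = mean(d)                      # mean of the differences
diff_of_means = mean(y1) - mean(y2)  # difference of the means

print(f"d_bar         = {d_bar:.4f}")
print(f"y1bar - y2bar = {diff_of_means:.4f}")
```

The two printed values agree up to floating-point rounding, which is why a paired problem reduces to a single-mean problem on the differences.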

 

Independent Sample Problem

Since values are not paired or matched, we cannot use the differences approach mentioned above. We must first investigate the sampling distribution of $\bar{y}_1 - \bar{y}_2$ in order to address questions of accuracy, sample size determination or hypothesis testing.

Recall: The sampling distribution of $\bar{y}_1 - \bar{y}_2$ refers to the distribution of $\bar{y}_1 - \bar{y}_2$ when we consider $\bar{y}_1$ for ALL possible samples of size $n_1$ from population 1 and $\bar{y}_2$ for ALL possible samples of size $n_2$ from population 2, and determine $\bar{y}_1 - \bar{y}_2$ for each combination.

Five distinct settings of the independent sample problem exist:

  1. Sample sizes $n_1$ and $n_2$ both large.

    Note: As you would expect, in this case we need not be concerned about the distribution of $y_1$ or $y_2$. We will solve these problems manually and will also discuss how to use SAS to solve these problems.

  2. Sample size $n_1$ small or $n_2$ small.

    a. $y_1$ and $y_2$ normally distributed and $\sigma_{y_1} = \sigma_{y_2}$.

      Note: We will solve these problems manually and will also discuss how to use SAS to solve these problems.

    b. $y_1$ and $y_2$ normally distributed and $\sigma_{y_1} \neq \sigma_{y_2}$.

      Note: We will not solve these problems manually. Instead we will use PROC TTEST from SAS (see week 7 notes).

    c. $y_1$ and $y_2$ not normally distributed and $\sigma_{y_1} = \sigma_{y_2}$.

      Note: We will not solve these problems manually. Instead we will use PROC NPAR1WAY from SAS (see week 7 notes). This procedure provides a non-parametric technique known as the Wilcoxon Rank Sum technique.

    d. $y_1$ and $y_2$ not normally distributed and $\sigma_{y_1} \neq \sigma_{y_2}$.

      Note: We will not discuss this situation.

 

 

Case 1: n1 and n2 both large:

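Remember that since the sample sizes are large we need not be concerned about the distribution of $y_1$ and $y_2$. Also, we need not be concerned about the relative size of $\sigma_{y_1}$ and $\sigma_{y_2}$.

Theorem 1: In this case $(\bar{y}_1 - \bar{y}_2)$ is approximately normally distributed with mean and standard deviation:

$\mu_{\bar{y}_1 - \bar{y}_2} = \mu_{y_1} - \mu_{y_2}$

$\sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\sigma_{y_1}^2/n_1 + \sigma_{y_2}^2/n_2}$

Since the population variances $\sigma_{y_1}^2$ and $\sigma_{y_2}^2$ are unknown, we estimate them by the sample variances $s_{y_1}^2$ and $s_{y_2}^2$, and so we estimate $\sigma_{\bar{y}_1 - \bar{y}_2}$ with $s_{\bar{y}_1 - \bar{y}_2}$ thus:

$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{s_{y_1}^2/n_1 + s_{y_2}^2/n_2}$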

Remember that we may use this theorem to address both hypothesis testing and accuracy problems.
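A small simulation can make Theorem 1 concrete. The sketch below is only an illustration under assumed populations: population 1 is exponential with mean 2 (standard deviation 2) and population 2 is uniform on [0, 6] (mean 3, variance 3). Neither is normal, and the function name, sample sizes, and seed are hypothetical choices. It repeatedly draws independent samples and compares the simulated mean and standard deviation of the difference in sample means with the values the theorem gives.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_differences(n1, n2, reps, seed=0):
    """Draw `reps` pairs of independent samples and return y1bar - y2bar for each."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Assumed population 1: exponential, mean 2, sd 2.
        sample1 = [rng.expovariate(0.5) for _ in range(n1)]
        # Assumed population 2: uniform on [0, 6], mean 3, variance 3.
        sample2 = [rng.uniform(0, 6) for _ in range(n2)]
        diffs.append(mean(sample1) - mean(sample2))
    return diffs

n1, n2 = 40, 50
diffs = simulate_differences(n1, n2, reps=3000)

# Values given by Theorem 1 for these assumed populations.
theory_mean = 2.0 - 3.0
theory_sd = sqrt(2.0**2 / n1 + 3.0 / n2)

print(f"simulated mean {mean(diffs):.3f} vs theory {theory_mean:.3f}")
print(f"simulated sd   {stdev(diffs):.3f} vs theory {theory_sd:.3f}")
```

With samples of this size the simulated values should sit close to the theoretical ones even though neither population is normal, which is the point of the large-sample case.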

 

Problem:

Consider the CTI-02(CS) and CTI-02(SE) two independent sample problem mentioned above. Let us say that you are interested in the following hypothesis testing problem:

H0: $\mu_{CS} - \mu_{SE} = 0$
Ha: $\mu_{CS} - \mu_{SE} < 0$

You select a sample of 40 SE and 36 CS graduates and discover that the mean starting salary of the SE graduates is $60K with a standard deviation of $2K and the mean starting salary of the CS graduates is $58K with a standard deviation of $3.6K. Conduct a test of hypotheses.

Solution:

  1. The null and alternative hypotheses are:

    H0: $\mu_{CS} - \mu_{SE} = 0$
    Ha: $\mu_{CS} - \mu_{SE} < 0$

  2. Examining the sample (all values in dollars):

    1. $n_{SE} = 40$; $\bar{y}_{SE} = 60000$ and $s_{SE} = 2000$
      $n_{CS} = 36$; $\bar{y}_{CS} = 58000$ and $s_{CS} = 3600$
    2. Since $(\bar{y}_{CS} - \bar{y}_{SE}) = -2000$, the sample difference is consistent with Ha and we may proceed.

  3. Since $(\bar{y}_{CS} - \bar{y}_{SE})$ is consistent with Ha:

    1. Assume H0 true (i.e. $\mu_{CS} - \mu_{SE} = 0$).
    2. Given that H0 is true, determine the p-value.
    3. Both $n_{CS}$ and $n_{SE}$ are large and so, from Theorem 1, $(\bar{y}_{CS} - \bar{y}_{SE})$ is approximately normally distributed. Also:

      $\mu_{\bar{y}_{CS} - \bar{y}_{SE}} = \mu_{CS} - \mu_{SE} = 0$
      $s_{\bar{y}_{CS} - \bar{y}_{SE}} = \sqrt{s_{CS}^2/n_{CS} + s_{SE}^2/n_{SE}} = \sqrt{3600^2/36 + 2000^2/40} = 678.23$

      Our test statistic is therefore:

      $z = (-2000 - 0)/678.23 = -2.95$

      Hence the required p-value is $P(z \le -2.95) = 0.0016$, i.e. 0.16%.

  4. Apply the decision rule to your p-value.

    Since the p-value is <= 1%, the result is highly significant, and so we reject H0 and conclude that, on average, SE graduates receive better starting salaries than CS graduates.
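The arithmetic in this solution can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `two_sample_z` and its parameters are illustrative assumptions, and the lower-tail p-value matches the direction of Ha in this problem.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2, null_diff=0.0):
    """Large-sample z test for mu1 - mu2 (lower-tail alternative).

    Returns the estimated standard error, the z statistic and the
    lower-tail p-value P(Z <= z).
    """
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = ((mean1 - mean2) - null_diff) / se
    p_lower = NormalDist().cdf(z)
    return se, z, p_lower

# Population 1 = CS, population 2 = SE, as in the worked problem.
se, z, p_value = two_sample_z(58000, 3600, 36, 60000, 2000, 40)

print(f"standard error = {se:.2f}")
print(f"z              = {z:.2f}")
print(f"p-value        = {p_value:.4f}")
```

The results should agree with the hand calculation up to rounding. The p-value may differ slightly in the last digit because the hand calculation rounds z to two decimals before using the table.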

 

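Definition 1: Given Theorem 1, an $\alpha$% confidence interval for $(\mu_{y_1} - \mu_{y_2})$ may be obtained thus:

L: $(\bar{y}_1 - \bar{y}_2) - z_\alpha \, s_{\bar{y}_1 - \bar{y}_2}$
U: $(\bar{y}_1 - \bar{y}_2) + z_\alpha \, s_{\bar{y}_1 - \bar{y}_2}$

where $z_\alpha$ is the standard normal value that encloses the central $\alpha$% of the distribution (e.g. $z_{90} = 1.65$).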

 

Problem:

Consider the CTI-02(CS) and CTI-02(SE) problem above. Construct and interpret a 90% CI for $(\mu_{CS} - \mu_{SE})$.

Solution:

A 90% confidence interval for $(\mu_{CS} - \mu_{SE})$ is:

L: $(\bar{y}_{CS} - \bar{y}_{SE}) - z_{90} \, s_{\bar{y}_{CS} - \bar{y}_{SE}}$
U: $(\bar{y}_{CS} - \bar{y}_{SE}) + z_{90} \, s_{\bar{y}_{CS} - \bar{y}_{SE}}$
L: $(-2000) - 1.65(678.23) = -3119.08$
U: $(-2000) + 1.65(678.23) = -880.92$
[-3119.08, -880.92]

I am 90% confident that $(\mu_{CS} - \mu_{SE})$ is between -3119.08 and -880.92. That is, I am 90% confident that, on average, CS graduates receive starting salaries between $880.92 and $3119.08 less than SE graduates.
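The interval can be checked in the same way. In the sketch below the function name `large_sample_ci` is an illustrative assumption. The first call uses the table value 1.65 from the solution, and the second uses the unrounded critical value from the standard library to show how little the rounding matters.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_ci(mean1, sd1, n1, mean2, sd2, n2, z_crit):
    """Return (lower, upper) for mu1 - mu2 using the large-sample interval."""
    diff = mean1 - mean2
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    margin = z_crit * se
    return diff - margin, diff + margin

# Table value used in the worked solution for 90% confidence.
lower, upper = large_sample_ci(58000, 3600, 36, 60000, 2000, 40, z_crit=1.65)

# Unrounded critical value: 90% in the middle leaves 5% in each tail.
z_exact = NormalDist().inv_cdf(0.95)
lower_exact, upper_exact = large_sample_ci(
    58000, 3600, 36, 60000, 2000, 40, z_crit=z_exact
)

print(f"z = 1.65     : [{lower:.2f}, {upper:.2f}]")
print(f"z = {z_exact:.4f}: [{lower_exact:.2f}, {upper_exact:.2f}]")
```

The interval with the unrounded critical value is slightly narrower, since that value is a little below 1.65.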

 

 

Case 2: n1 small or n2 small:

Remember that if at least one of the sample sizes is small then, in order to determine the sampling distribution of $(\bar{y}_1 - \bar{y}_2)$, we must know the distribution of $y_1$ and $y_2$ as well as the relative size of $\sigma_{y_1}$ and $\sigma_{y_2}$. Although there are four scenarios, we will only address three.

Theorem 2a: Let $y_1$ and $y_2$ be normally distributed and let $\sigma_{y_1} = \sigma_{y_2}$. In this case $(\bar{y}_1 - \bar{y}_2)$, once standardized using its estimated standard deviation, is Student t distributed with $(n_1 + n_2 - 2)$ degrees of freedom, and $(\bar{y}_1 - \bar{y}_2)$ has mean and standard deviation:

$\mu_{\bar{y}_1 - \bar{y}_2} = \mu_{y_1} - \mu_{y_2}$

$\sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\sigma_{y_1}^2/n_1 + \sigma_{y_2}^2/n_2}$

Since $\sigma_{y_1} = \sigma_{y_2} = \sigma$ we may simplify thus:

$\sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\sigma^2/n_1 + \sigma^2/n_2} = \sigma\sqrt{1/n_1 + 1/n_2}$

But $\sigma$ is unknown and so we estimate $\sigma$ by $s_p$, where $s_p$ is referred to as a pooled estimator. That is, think of $s_p^2$ as a weighted average of the sample variances $s_{y_1}^2$ and $s_{y_2}^2$:

$s_p = \sqrt{\frac{(n_1 - 1)s_{y_1}^2 + (n_2 - 1)s_{y_2}^2}{n_1 + n_2 - 2}}$

Therefore we estimate $\sigma_{\bar{y}_1 - \bar{y}_2}$ with $s_{\bar{y}_1 - \bar{y}_2}$ thus:

$s_{\bar{y}_1 - \bar{y}_2} = s_p\sqrt{1/n_1 + 1/n_2}$

 

Problem:

Consider the CTI-02(CS) and CTI-02(SE) two independent sample hypothesis testing problem mentioned above.

In this case you select a sample of 26 SE and 17 CS graduates and discover that the mean starting salary of the SE graduates is $60K with standard deviation of $2K and the mean starting salary of CS graduates is $58K with a standard deviation of $2.1K. Conduct a test of hypotheses.

Solution:

  1. The null and alternative hypotheses are:

    H0: $\mu_{CS} - \mu_{SE} = 0$
    Ha: $\mu_{CS} - \mu_{SE} < 0$

  2. Examining the sample (all values in dollars):

    1. $n_{SE} = 26$; $\bar{y}_{SE} = 60000$ and $s_{SE} = 2000$
      $n_{CS} = 17$; $\bar{y}_{CS} = 58000$ and $s_{CS} = 2100$
    2. Since $(\bar{y}_{CS} - \bar{y}_{SE}) = -2000$, the sample difference is consistent with Ha and we may proceed.

  3. Since $(\bar{y}_{CS} - \bar{y}_{SE})$ is consistent with Ha:

    1. Assume H0 true (i.e. $\mu_{CS} - \mu_{SE} = 0$).
    2. Given that H0 is true, determine the p-value.
    3. In this case $n_{CS}$ is small and $n_{SE}$ is also small. If we assume that $y_{CS}$ and $y_{SE}$ are both normally distributed and also assume that $\sigma_{CS} = \sigma_{SE}$, then Theorem 2a applies and so the standardized $(\bar{y}_{CS} - \bar{y}_{SE})$ is Student t distributed with $(n_{CS} + n_{SE} - 2) = 41$ degrees of freedom. Also:

      $\mu_{\bar{y}_{CS} - \bar{y}_{SE}} = \mu_{CS} - \mu_{SE} = 0$
      Since $s_p = \sqrt{\frac{(n_{CS} - 1)s_{CS}^2 + (n_{SE} - 1)s_{SE}^2}{n_{CS} + n_{SE} - 2}} = \sqrt{\frac{16(2100^2) + 25(2000^2)}{41}} = 2039.608$ then:
      $s_{\bar{y}_{CS} - \bar{y}_{SE}} = 2039.608\sqrt{1/17 + 1/26} = 636.16$

      Our test statistic is therefore:

      $t = (-2000 - 0)/636.16 = -3.14$

      Hence the required p-value is less than 1%.

  4. Apply the decision rule to your p-value.

    Since the p-value is less than 1%, the result is highly significant, and so we again reject H0 and conclude that, on average, SE graduates receive better starting salaries than CS graduates.
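The pooled calculation can be sketched in Python as follows. The function names are illustrative assumptions. Because the standard library has no Student t distribution, the lower-tail p-value here is approximated by integrating the t density numerically with Simpson's rule. That is a stand-in for a t table or a statistics package, not the method used in these notes.

```python
from math import exp, lgamma, pi, sqrt

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled estimate s_p of the common standard deviation."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi := 0.0)
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=20000, lower=-60.0):
    """Approximate P(T <= t) by Simpson's rule on [lower, t] (steps must be even)."""
    h = (t - lower) / steps
    total = t_pdf(lower, df) + t_pdf(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(lower + i * h, df)
    return total * h / 3

# Population 1 = CS, population 2 = SE, as in the worked problem.
n_cs, n_se = 17, 26
sp = pooled_sd(2100, n_cs, 2000, n_se)
se = sp * sqrt(1 / n_cs + 1 / n_se)
t_stat = (58000 - 60000 - 0) / se
df = n_cs + n_se - 2
p_value = t_lower_tail(t_stat, df)

print(f"s_p = {sp:.3f}, standard error = {se:.2f}")
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
print(f"approximate lower-tail p-value = {p_value:.4f}")
```

The approximate p-value should fall below 1%, in line with the conclusion of the worked solution.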

 

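Definition 2a: Given Theorem 2a, an $\alpha$% confidence interval for $(\mu_{y_1} - \mu_{y_2})$ may be obtained thus:

L: $(\bar{y}_1 - \bar{y}_2) - t_{n_1+n_2-2;\,(100-\alpha)/2} \, s_{\bar{y}_1 - \bar{y}_2}$
U: $(\bar{y}_1 - \bar{y}_2) + t_{n_1+n_2-2;\,(100-\alpha)/2} \, s_{\bar{y}_1 - \bar{y}_2}$

where $t_{n_1+n_2-2;\,(100-\alpha)/2}$ is the value of the Student t distribution with $(n_1 + n_2 - 2)$ degrees of freedom that leaves $(100-\alpha)/2$% in the upper tail, and:

$s_{\bar{y}_1 - \bar{y}_2} = s_p\sqrt{1/n_1 + 1/n_2}$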

Problem:

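Consider the CTI-02(CS) and CTI-02(SE) hypothesis testing (small sample) problem above. Given the statistics and assumptions, construct and interpret a 90% CI for $(\mu_{CS} - \mu_{SE})$.

Solution:

Given the statistics and assumptions for the hypothesis testing problem above, a 90% confidence interval for $(\mu_{CS} - \mu_{SE})$ is:

L: $(\bar{y}_{CS} - \bar{y}_{SE}) - t_{41;\,5} \, s_{\bar{y}_{CS} - \bar{y}_{SE}}$
U: $(\bar{y}_{CS} - \bar{y}_{SE}) + t_{41;\,5} \, s_{\bar{y}_{CS} - \bar{y}_{SE}}$
L: $(-2000) - 1.684(636.16) = -3071.29$
U: $(-2000) + 1.684(636.16) = -928.71$
[-3071.29, -928.71]

Here 1.684 is the tabulated upper 5% t value for 40 degrees of freedom; the value for 41 degrees of freedom is nearly identical.

I am 90% confident that $(\mu_{CS} - \mu_{SE})$ is between -3071.29 and -928.71. That is, I am 90% confident that, on average, CS graduates receive starting salaries between $928.71 and $3071.29 less than SE graduates.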

 

Sample Size Determination:

We may also address the question of how to determine the minimum sample size required to estimate $(\mu_{y_1} - \mu_{y_2})$ to a predefined level of accuracy and confidence. Remember that we will only be concerned with the situation where the sample sizes are both large.

In addition, we will make two simplifying assumptions. First, we will assume that both samples are of the same size (i.e. $n_1 = n_2 = n$). Second, we will assume that the population standard deviations are the same (i.e. $\sigma_{y_1} = \sigma_{y_2} = \sigma$).

Given these assumptions, we may easily obtain an expression for $n$ from Definition 1 above. Let $D$ denote the desired accuracy, that is, the half-width of the confidence interval:

$D = z_\alpha\sqrt{\sigma_{y_1}^2/n_1 + \sigma_{y_2}^2/n_2}$

Since we assume $n_1 = n_2 = n$ and $\sigma_{y_1} = \sigma_{y_2} = \sigma$, this simplifies to:

$D = z_\alpha \, \sigma\sqrt{1/n + 1/n}$
$D = z_\alpha \, \sigma\sqrt{2/n}$
$D^2 = z_\alpha^2 \, \sigma^2 \, (2/n)$
$n = 2(z_\alpha \, \sigma / D)^2$

In practice $\sigma$ is unknown and is replaced by an estimate, for example from a previous study.

Problem:

Consider the CTI-02(CS) and CTI-02(SE) problem above. Determine the minimum sample size required to ensure that we estimate the difference in means to an accuracy of $500.00 with 90% confidence. Let us say that, from a previous study, the common standard deviation is estimated to be 2000.00.

Solution:

From our expression above:

$n = 2(z_\alpha \, \sigma / D)^2$
$n = 2(1.65 \times 2000/500)^2$
$n = 87.12$

Since $n$ must be a whole number, we round up. Hence, we need 88 students from each concentration to achieve the desired level of accuracy and confidence.
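The sample size expression translates directly into code. In this minimal sketch the function name `min_sample_size` and its parameter names are illustrative assumptions, and the critical value 1.65 is the table value used in the solution above.

```python
from math import ceil

def min_sample_size(z_crit, sigma, accuracy):
    """Per-group sample size for estimating mu1 - mu2 to within `accuracy`.

    Assumes equal group sizes and a common standard deviation `sigma`.
    Returns the unrounded value and the value rounded up to a whole number.
    """
    n_exact = 2 * (z_crit * sigma / accuracy) ** 2
    return n_exact, ceil(n_exact)

n_exact, n_required = min_sample_size(z_crit=1.65, sigma=2000, accuracy=500)

print(f"unrounded n = {n_exact:.2f}")
print(f"required n per group = {n_required}")
```

Rounding up rather than to the nearest integer matters: a smaller sample would give an interval slightly wider than the target accuracy.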

 

Note: See the Non-Parametric Methods lecture notes for a discussion of cases 2b and 2c.