CSC423/324 Data Analysis

Inferences - Two Populations (m)

Readings:

Ott; 6.2-6.3, 6.4-6.6

Inferences about 2 population means:

Consider two populations where y₁, y₂ are the measurements of interest of the respective populations. Let us say m_y1, s_y1 are the mean and standard deviation of y₁, and m_y2, s_y2 are the mean and standard deviation of y₂.

We are interested in inferences about the difference between m_y1 and m_y2. As with inferences about a single mean m_y (see week 2 and week 3 notes), these inferential problems may be classified into two broad categories:

The Estimation problem.
The Hypothesis Testing problem.

To illustrate, let us consider the following examples:

CTI-02 graduates problem.

Let us say we are interested in CS and SE graduates only. We may think of these two groups as two populations. That is, CTI-02(CS) and CTI-02(SE) where y_CS, y_SE correspond to y₁, y₂ mentioned above and represent the starting salaries of each member in these groups.

Estimation: We may be interested in estimating the difference in the mean starting salary (i.e. m_CS - m_SE). If so, we may need to address the accuracy of our estimate or we may need to determine the minimum sample size required to estimate this difference to a predefined level of accuracy.

Hypothesis Testing: We may be interested in determining which of two contrasting points of view about the difference in the mean starting salary (i.e. m_CS - m_SE=0 vs. m_CS - m_SE<0) is more reasonable.

Algorithm implementation problem.

Consider Project #1. We may think of the set of all decoding times for each implementation of the MLP decoding algorithm as one of the two populations. That is, y_IHD, y_OTS correspond to y₁, y₂ mentioned above and represent decoding times.

Estimation: We may be interested in estimating the difference in the mean decoding time (i.e. m_IHD - m_OTS). If so, we may need to address the accuracy of our estimate or we may need to determine the minimum sample size required to estimate this difference to a predefined level of accuracy.

Hypothesis Testing: (see Project #1).

It should be clear that as long as our samples are random samples, we may use (y_1bar - y_2bar) to estimate (m_y1 - m_y2). It should also be clear that to address questions of accuracy and sample size determination as well as the hypothesis testing problem we must first address the sampling distribution of (y_1bar - y_2bar).

Note however that problems that involve inferences about two population means fall into two distinct categories that require very different approaches to solving the problems. These categories are referred to as:

Paired Sample problem:

The algorithm implementation problem mentioned above falls into this category (see Project #1). For paired sample problems, each value in the sample selected from one population is paired, or matched, with a specific value in the sample selected from the other population. Often, as in Project #1, the corresponding values are taken from the same item (i.e. for Project #1, an audio datastream). However, this is not always the case. The important point to note is that the values are paired or matched in some way. It may be because the values are from the same item or it could be because the items from which the values are obtained have some underlying matching characteristic. Pairing (or matching) is used as an experimental device to ensure that the effect of extraneous sources of variation is mitigated.

e.g. Consider the problem of comparing two different GUIs. Let us say we decide to use twins for our study, where one twin is observed completing a series of tasks with one GUI and the other twin is observed completing the same tasks with the other GUI. This is a paired sample problem since the values obtained for each sample are matched.

Note: Matching may occur in a variety of ways. One common approach is to match by some form of testing. It may be a psychological test or a standardized test etc.

Independent Sample problem:

For these problems, values in the sample selected from one population are independent of the values in the other population.

e.g. Consider the CTI-02(CS), CTI-02(SE) problem above. The values in each sample are clearly unrelated to the values in the other sample.

Note:

If you are uncertain if a particular problem is a paired or independent sample problem then there are a few simple things that you can do. First, examine the size of the respective samples. If the sizes are different then it is an independent sample problem. If the sample sizes are the same then, from the description of the problem, see if pairing or matching is suggested. If it is then it is a paired problem and if not it is an independent sample problem.

Paired Sample Problem

As you discovered for Project #1, the paired sample problem may be solved using the techniques developed for problems that involve inferences about a single mean. This is simply because, since values are paired/matched, we can analyze the differences between the paired values instead of the values themselves. By so doing, we transform each pair of values into a single value and so, we need not consider the sampling distribution of y_1bar - y_2bar. Instead we simply consider the sampling distribution of dbar which is equivalent to considering the sampling distribution of ybar. That is, rather than analyzing the difference between means, we simply analyze the mean of differences.

Note: To see this, let d_i=y_1i - y_2i. Using basic algebra, show that dbar=y_1bar - y_2bar. Similarly, you may show that m_d=(m_y1 - m_y2).
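To illustrate the note above, here is a minimal sketch using made-up paired decoding times (the data and variable names are hypothetical and are not taken from Project #1). It checks numerically that the mean of the differences equals the difference of the means, and then applies the single-mean machinery to the differences.

```python
import math
import statistics

# Hypothetical paired decoding times (ms) for the same five datastreams.
y1 = [12.1, 15.3, 11.8, 14.0, 13.2]   # implementation 1
y2 = [11.4, 14.9, 11.9, 13.1, 12.5]   # implementation 2

# One difference per pair: d_i = y_1i - y_2i
d = [a - b for a, b in zip(y1, y2)]

d_bar = statistics.mean(d)
diff_of_means = statistics.mean(y1) - statistics.mean(y2)

# Single-mean techniques applied to the differences.
s_d = statistics.stdev(d)          # sample standard deviation of the d_i
se_d = s_d / math.sqrt(len(d))     # estimated standard deviation of dbar
t_stat = (d_bar - 0) / se_d        # test statistic assuming H0: m_d = 0
df = len(d) - 1                    # degrees of freedom for the t table

print(f"dbar = {d_bar:.4f}, y_1bar - y_2bar = {diff_of_means:.4f}")
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

Note that the two samples are never analyzed separately here; once the differences are formed, the problem is a one-sample problem.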

Independent Sample Problem

Since values are not paired or matched, we cannot use the differences approach mentioned above. We must first investigate the sampling distribution of y_1bar - y_2bar in order to address questions of accuracy, sample size determination or hypothesis testing.

Recall: The sampling distribution of y_1bar - y_2bar refers to the distribution of y_1bar - y_2bar when we consider y_1bar for ALL possible samples of size n₁ from population 1 and y_2bar for ALL possible samples of size n₂ from population 2 and determine y_1bar - y_2bar for each combination.

Five distinct settings of the independent sample problem exist:

1. Sample sizes n₁ and n₂ both large.

Note: As you would expect, in this case we need not be concerned about the distribution of y₁ or y₂. We will solve these problems manually and will also discuss how to use SAS to solve these problems.

2. Sample size n₁ small or n₂ small. Here there are four settings:

(a) y₁ and y₂ normally distributed and s_y1=s_y2.

Note: We will solve these problems manually and will also discuss how to use SAS to solve these problems.

(b) y₁ and y₂ normally distributed and s_y1 not equal to s_y2.

Note: We will not solve these problems manually. Instead we will use PROC TTEST from SAS (see week 7 notes).

(c) y₁ and y₂ not normally distributed and s_y1=s_y2.

Note: We will not solve these problems manually. Instead we will use PROC NPAR1WAY from SAS (see week 7 notes). This procedure provides a non-parametric technique known as the Wilcoxon Rank Sum technique.

(d) y₁ and y₂ not normally distributed and s_y1 not equal to s_y2.

Note: We will not discuss this situation.

Case 1: n₁ and n₂ both large:

Remember that since the sample sizes are large we need not be concerned about the distribution of y₁ and y₂. Also, we need not be concerned about the relative size of s_y1 and s_y2.

Theorem 1: In this case (y_1bar-y_2bar) is approximately normally distributed with mean and standard deviation:

m_{(y1bar-y2bar)}=(m_y1 - m_y2)

s_{(y1bar-y2bar)}=sqrt(s²_y1/n₁ + s²_y2/n₂)

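Since the population variances s²_y1 and s²_y2 are unknown, we estimate them by the sample variances computed from the two samples, and so we estimate the standard deviation s_{(y1bar-y2bar)} by the same expression evaluated at the sample variances:

s_{(y1bar-y2bar)}=sqrt(s²_y1/n₁ + s²_y2/n₂) (with the sample variances in place of the population variances)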

Remember that we may use this theorem to address both hypothesis testing and accuracy problems.
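Theorem 1 may also be checked by simulation. The sketch below is only an illustration under assumed conditions: the two populations are taken to be exponential (so clearly not normal), and the means, sample sizes and number of repetitions are arbitrary choices that do not come from these notes. It draws many pairs of samples, records y_1bar - y_2bar each time, and compares the simulated mean and standard deviation with the values given by the theorem.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed populations: exponential with means m1 and m2.
# For an exponential distribution the standard deviation equals the mean.
m1, m2 = 10.0, 7.0
n1, n2 = 50, 40
reps = 5000

diffs = []
for _ in range(reps):
    ybar1 = statistics.fmean(random.expovariate(1 / m1) for _ in range(n1))
    ybar2 = statistics.fmean(random.expovariate(1 / m2) for _ in range(n2))
    diffs.append(ybar1 - ybar2)

# Values given by Theorem 1.
theory_mean = m1 - m2
theory_sd = math.sqrt(m1**2 / n1 + m2**2 / n2)

# Values observed across the simulated samples.
sim_mean = statistics.fmean(diffs)
sim_sd = statistics.stdev(diffs)

print(f"mean: theory {theory_mean:.3f}, simulated {sim_mean:.3f}")
print(f"sd:   theory {theory_sd:.3f}, simulated {sim_sd:.3f}")
```

A histogram of diffs should look roughly bell shaped even though the underlying populations are strongly skewed, which is why the distribution of y₁ and y₂ is not a concern when both samples are large.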

Problem:

Consider the CTI-02(CS) and CTI-02(SE) two independent sample problem mentioned above. Let us say that you are interested in the following hypothesis testing problem:

H₀: m_CS-m_SE=0
H_a: m_CS-m_SE<0

You select a sample of 40 SE and 36 CS graduates and discover that the mean starting salary of the SE graduates is $60K with a standard deviation of $2K and the mean starting salary of the CS graduates is $58K with a standard deviation of $3.6K. Conduct a test of hypotheses.

Solution:

The null and alternative hypotheses are:

H₀: m_CS-m_SE=0
H_a: m_CS-m_SE<0

Examining the sample:

n_SE=40; y_SEbar=$60K and s_SE=$2K
n_CS=36; y_CSbar=$58K and s_CS=$3.6K
Since (y_CSbar- y_SEbar)=-2000 then (y_CSbar- y_SEbar) is consistent with H_a and we may proceed.

Since (y_CSbar- y_SEbar) is consistent then:

Assume H₀ true (i.e. m_CS-m_SE=0).

Given that H₀ is true, determine the p-value.

Both n₁ and n₂ are large and so from Theorem 1 (y_CSbar- y_SEbar) is normally distributed. Also:

m_{(CSbar-SEbar)}=(m_CS - m_SE)=0
s_{(y1bar-y2bar)}=sqrt{s²_CS/n_CS + s²_SE/n_SE}=678.23

Our test statistic is therefore:

z=(-2000 - 0)/678.23=-2.95

Hence the required p-value is 0.16%.

Apply the decision rule to your p-value.

Since the p-value is <= 1% then the p-value is highly significant and so we reject H₀ and conclude that SE graduates receive better starting salaries than CS graduates.
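The arithmetic in this solution can be reproduced with a short script. This is a sketch only: the variable names are our own, and the p-value is taken from the standard normal distribution using NormalDist from the Python standard library rather than from a z table.

```python
import math
from statistics import NormalDist

# Sample statistics from the problem above.
n_cs, ybar_cs, s_cs = 36, 58_000.0, 3_600.0
n_se, ybar_se, s_se = 40, 60_000.0, 2_000.0

# Estimated standard deviation of (y_CSbar - y_SEbar) from Theorem 1.
se = math.sqrt(s_cs**2 / n_cs + s_se**2 / n_se)

# Test statistic, assuming H0: m_CS - m_SE = 0 is true.
z = ((ybar_cs - ybar_se) - 0) / se

# H_a is m_CS - m_SE < 0, so the p-value is the lower-tail area.
p_value = NormalDist().cdf(z)

print(f"se = {se:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

Because the alternative hypothesis is one-sided, only the lower tail is used; a two-sided alternative would double this area.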

Definition 1: Given Theorem 1, an a% confidence interval for (m_y1 - m_y2), where z_a is the standard normal value that leaves (100-a)/2 % in each tail, may be obtained thus:

L: (y_1bar-y_2bar) - z_a(s_{(y1bar-y2bar)})
U: (y_1bar-y_2bar) + z_a(s_{(y1bar-y2bar)})

Problem:

Consider the CTI-02(CS) and CTI-02(SE) above, construct and interpret a 90% CI for (m_CS-m_SE).

Solution:

A 90% confidence interval for (m_CS-m_SE) is:

L: (y_CSbar-y_SEbar) - z₉₀(s_{(y1bar-y2bar)})
U: (y_CSbar-y_SEbar) + z₉₀(s_{(y1bar-y2bar)})
L: (-2000) - 1.65(678.23)
U: (-2000) + 1.65(678.23)
[-3119.08, -880.92]

I am 90% confident that (m_CS-m_SE) is between -3119.08 and -880.92. That is, I am 90% confident that, on average, CS graduates receive starting salaries between $880.92 and $3119.08 less than SE graduates.
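As a sketch, Definition 1 may be written as a small function. The function name and its arguments are our own. Note that it computes z_a from the standard normal distribution (about 1.645 for 90%) instead of using the rounded table value 1.65, so the limits it prints are slightly narrower than those in the solution above.

```python
import math
from statistics import NormalDist

def large_sample_ci(diff, s1, n1, s2, n2, confidence):
    """Return (L, U) for m_y1 - m_y2 when both samples are large.

    diff is y_1bar - y_2bar; confidence is a proportion such as 0.90.
    """
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    # z_a leaves (1 - confidence)/2 in each tail of the standard normal.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se

# CTI-02 example: CS is sample 1, SE is sample 2.
lower, upper = large_sample_ci(-2000.0, 3600.0, 36, 2000.0, 40, 0.90)
print(f"[{lower:.2f}, {upper:.2f}]")
```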

Case 2: n₁ small or n₂ small:

Remember that if at least one of the sample sizes is small then, in order to determine the sampling distribution of (y_1bar-y_2bar), we must know the distribution of y₁ and y₂ as well as the relative size of s_y1 and s_y2. Although there are four scenarios we will only address three.

Theorem 2a: Let y₁ and y₂ be normally distributed and let s_y1=s_y2. In this case (y_1bar-y_2bar) is Student t distributed with (n₁+n₂-2) degrees of freedom and with mean and standard deviation:

m_{(y1bar-y2bar)}=(m_y1 - m_y2)

s_{(y1bar-y2bar)}=sqrt(s²_y1/n₁ + s²_y2/n₂)

Since s_y1=s_y2=s we may simplify thus:

s_{(y1bar-y2bar)}=sqrt(s²/n₁ + s²/n₂)=s(sqrt(1/n₁ + 1/n₂))

But the common population standard deviation s=s_y1=s_y2 is unknown and so we estimate s by s_p, where s_p is referred to as a pooled estimator. That is, think of s²_p as a weighted average of the two sample variances, with weights (n₁-1) and (n₂-1):

s_p=sqrt{((n₁-1)s²_y1 + (n₂-1)s²_y2)/(n₁+n₂-2)}

Therefore we estimate the standard deviation s_{(y1bar-y2bar)} by:

s_{(y1bar-y2bar)}=s_p{sqrt(1/n₁ + 1/n₂)}

Problem:

Consider the CTI-02(CS) and CTI-02(SE) two independent sample hypothesis testing problem mentioned above.

In this case you select a sample of 26 SE and 17 CS graduates and discover that the mean starting salary of the SE graduates is $60K with standard deviation of $2K and the mean starting salary of CS graduates is $58K with a standard deviation of $2.1K. Conduct a test of hypotheses.

Solution:

The null and alternative hypotheses are:

H₀: m_CS-m_SE=0
H_a: m_CS-m_SE<0

Examining the sample:

n_SE=26; y_SEbar=$60K and s_SE=$2K
n_CS=17; y_CSbar=$58K and s_CS=$2.1K
Since (y_CSbar- y_SEbar)=-2000 then (y_CSbar- y_SEbar) is consistent with H_a and we may proceed.

Since (y_CSbar- y_SEbar) is consistent then:

Assume H₀ true (i.e. m_CS-m_SE=0).

Given that H₀ is true, determine the p-value.

In this case n_CS is small and n_SE is also small. If we assume that y_CS is normally distributed and that y_SE is normally distributed and also assume that s_CS=s_SE then Theorem 2a applies and so (y_CSbar- y_SEbar) is Student t distributed with (n_CS+n_SE-2)=41 degrees of freedom. Also:

m_{(CSbar-SEbar)}=(m_CS - m_SE)=0
Since s_p=sqrt{((n_CS-1)s²_CS + (n_SE-1)s²_SE)/(n_CS+n_SE-2)}=2039.608 then:
s_{(y1bar-y2bar)}=2039.608(sqrt(1/17 + 1/26))=636.16

Our test statistic is therefore:

t=(-2000 - 0)/636.16=-3.14

Hence the required p-value is less than 1%.

Apply the decision rule to your p-value.

Since the p-value is less than 1% then the p-value is highly significant and so we again reject H₀ and conclude that SE graduates receive better starting salaries than CS graduates.
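The pooled calculation above can be reproduced as follows. This is a minimal sketch with variable names of our own choosing. It stops at the test statistic and the degrees of freedom, since the Python standard library has no Student t distribution; the p-value is then read from a t table as in the solution.

```python
import math

# Sample statistics from the small-sample problem above.
n_cs, ybar_cs, s_cs = 17, 58_000.0, 2_100.0
n_se, ybar_se, s_se = 26, 60_000.0, 2_000.0

# Degrees of freedom for Theorem 2a.
df = n_cs + n_se - 2

# Pooled estimator: weighted average of the sample variances, then square root.
s_p = math.sqrt(((n_cs - 1) * s_cs**2 + (n_se - 1) * s_se**2) / df)

# Estimated standard deviation of (y_CSbar - y_SEbar).
se = s_p * math.sqrt(1 / n_cs + 1 / n_se)

# Test statistic, assuming H0: m_CS - m_SE = 0 is true.
t = ((ybar_cs - ybar_se) - 0) / se

print(f"df = {df}, s_p = {s_p:.3f}, se = {se:.2f}, t = {t:.2f}")
```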

Definition 2a: Given Theorem 2a, an a % confidence interval for (m_y1 - m_y2) may be obtained thus:

L: (y_1bar-y_2bar) - t_{n₁+n₂-2; (100-a)/2}(s_{(y1bar-y2bar)})
U: (y_1bar-y_2bar) + t_{n₁+n₂-2; (100-a)/2}(s_{(y1bar-y2bar)})

where:

s_{(y1bar-y2bar)}=s_p(sqrt(1/n₁ + 1/n₂))

Problem:

Consider the CTI-02(CS) and CTI-02(SE) hypothesis testing (small sample) problem above. Given the statistics and assumptions, construct and interpret a 90% CI for (m_CS-m_SE).

Solution:

Given the statistics and assumptions for the hypothesis testing problem above, a 90% confidence interval for (m_CS-m_SE) is:

L: (y_CSbar-y_SEbar) - t_{n₁+n₂-2; (100-a)/2}(s_{(y1bar-y2bar)})
U: (y_CSbar-y_SEbar) + t_{n₁+n₂-2; (100-a)/2}(s_{(y1bar-y2bar)})
L: (-2000) - 1.684(636.16)
U: (-2000) + 1.684(636.16)
[-3071.29, -928.71]

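I am 90% confident that (m_CS-m_SE) is between -3071.29 and -928.71. That is, I am 90% confident that, on average, CS graduates receive starting salaries between $928.71 and $3071.29 less than SE graduates.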
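Definition 2a may be sketched as a function in the same way. The function name is our own, and the t value is passed in by the caller because it must be looked up in a t table; here we use the 1.684 from the solution above. Since the script carries the standard deviation estimate at full precision rather than rounding it to 636.16, the limits it prints may differ from those above in the last decimal place.

```python
import math

def pooled_ci(diff, s1, n1, s2, n2, t_crit):
    """Return (L, U) for m_y1 - m_y2 under the assumptions of Theorem 2a.

    diff is y_1bar - y_2bar; t_crit is the tabled t value for
    n1 + n2 - 2 degrees of freedom and the chosen confidence level.
    """
    df = n1 + n2 - 2
    s_p = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    se = s_p * math.sqrt(1 / n1 + 1 / n2)
    return diff - t_crit * se, diff + t_crit * se

# CTI-02 small-sample example: CS is sample 1, SE is sample 2.
lower, upper = pooled_ci(-2000.0, 2100.0, 17, 2000.0, 26, t_crit=1.684)
print(f"[{lower:.2f}, {upper:.2f}]")
```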

Sample Size Determination:

We may also address the question of how to determine the minimum sample size required to estimate (m_y1 - m_y2) to a predefined level of accuracy and confidence. Remember that we will only be concerned with the situation where the sample sizes are both large.

However, in addition, we will make two simplifying assumptions. First, we will assume that both samples are of the same size (i.e n₁=n₂=n). Second, we will assume that the population standard deviations are the same (i.e s_y1=s_y2=s).

Given these assumptions, we may easily obtain an expression for n from Definition 1 above. Let D denote the required accuracy, that is, the half-width of the confidence interval:

D = z_a sqrt(s²_y1/n₁ + s²_y2/n₂)

Since we assume n₁=n₂=n and s_y1=s_y2=s (hence s²_y1=s²_y2=s²) then this simplifies to:

D = z_a s{sqrt(1/n + 1/n)}
D = z_a s{sqrt(2/n)}
D² = z²_a *s²*(2/n)
n = 2*(z_a *s/D)²

Problem:

Consider the CTI-02(CS) and CTI-02(SE) above, determine the minimum sample size required to ensure that we estimate the difference in means to an accuracy of $500.00 with 90% confidence. Let us say that from a previous study s=2000.00.

Solution:

From our expression above:

n = 2*(z_a *s/D)²
n = 2*(1.65*2000/500)²
n = 87.12

Hence, we need 88 students from each concentration to achieve the desired level of accuracy and confidence.
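The sample size expression can be sketched as a small function (the function and argument names are our own). It returns both the exact value of n and the value rounded up, since we must always round up to guarantee the required accuracy.

```python
import math

def min_sample_size(z, s, D):
    """Return (exact n, n rounded up) per sample for accuracy D.

    Assumes n1 = n2 = n and a common standard deviation s, as in the notes.
    """
    n_exact = 2 * (z * s / D) ** 2
    return n_exact, math.ceil(n_exact)

# CTI-02 example: 90% confidence (z = 1.65), s = 2000, accuracy D = 500.
n_exact, n_required = min_sample_size(z=1.65, s=2000.0, D=500.0)
print(f"n = {n_exact:.2f}, so use {n_required} per sample")
```

Because D appears squared in the denominator, halving D (i.e. doubling the accuracy) multiplies the required sample size by four.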

Note: See the Non-Parametric Methods lecture notes for a discussion of case 2b and 2c.