CSC423/324 Data Analysis

Regression

Readings:

Ott; 11.1(pg 531-534), 11.2 - 11.4

Motivation:

Consider some population where, for each item in the population, we are interested in p numeric characteristics. That is, for the i^th item in the population, we observe the p-tuple:

(y_i, x_i,1,..., x_i,p-1)

We believe that each y_i is linearly related to the corresponding x_i,j's (j=1,...,p-1). That is, an expression that describes the relationship would have the following form:

y = b₀ + b₁x₁ + ... + b_p-1 x_p-1

So, if you think of each item in our population as a point in p-dimensional space, then this expression defines a hyper-plane that passes through the items in our population. That is, there are points above the hyper-plane and points below it, and the hyper-plane describes how y changes as the x_i,j's (j=1,...,p-1) change.
Note: We refer to the y variable as the dependent, or response, variable and the x variables (i.e. x₁,..., x_p-1) as the independent, or explanatory, variables.

e.g.: Consider the population of CTI-02 graduates. For each graduate we are interested in the 3-tuple:

(salary, gpa, experience)

where experience refers to work experience in months.

In this case, salary is the dependent variable and gpa and experience are explanatory variables. It is reasonable to believe that salary is related to gpa and experience. However, we would not expect this relationship to be deterministic. That is, for a particular setting of gpa and experience (say gpa=3.0 and experience=36 months) we may be able to identify several graduates but would expect them to have different salaries. Now, if we were to consider another setting of gpa and experience (say gpa=4.0 and experience=36 months) we may also be able to identify several graduates with different salaries but would expect higher salaries overall, perhaps with some overlap. The question then is: can we derive an expression that describes the general change in salary as gpa and experience change?

Note that any such expression may be used to predict y for a particular setting of the x_i,j's. However, we should not expect to predict y exactly since, as described in our example, for a particular setting of the x_i,j's we have multiple y's.

This is the Regression problem, sometimes known as the prediction problem. We will address this problem in two parts. First, we will thoroughly address the simplest of these Regression problems. That is, the Simple Linear Regression (SLR) problem where we have a single explanatory variable x₁ (or just x). We will then generalize our discussion to the Multiple Regression problem.

Simple Linear Regression:

For SLR we have a population of pairs (y, x) and we believe that the relationship between x and y is linear. For example, using the CTI-02 problem above, our pairs could be (salary, gpa).

If x and y are linearly related then, from a scatter plot of x and y, we should be able to discern a straight line that describes this relationship. That is, we should be able to imagine a line which describes the trend, or the change, in y as x changes. Such a line is referred to as the regression line. It is of the following form:

y = b₀ + b₁x

However, note that this is just the equation of the regression line. To address the SLR problem we must first understand the SLR Model. The SLR Model is an expression which relates each (y, x) pair in our population to the regression line plus a set of axioms (or assumptions) that must hold in order for SLR to be appropriate.

Simple Linear Regression Model:

Consider a population of pairs (y, x). SLR is appropriate iff:

y = b₀ + b₁x + e

where for any particular setting of x, the associated e:

are normally distributed.
have mean zero.
have standard deviation s_e and the standard deviation is the same for all settings of x.

If these axioms are true then, since the e are normally distributed with mean zero and standard deviation s_e for a particular setting of x, the corresponding y values must also be normally distributed but with mean b₀ + b₁x and standard deviation s_e.

Note:

The e's are known as residuals and represent the deviation of each y in the population from the population regression line.
b₀, b₁, and s_e are known as parameters of the SLR model.
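
The model can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the parameter values (intercept 14000, slope 12000, residual standard deviation 3600) and the function name simulate_y_at_x are hypothetical and are not part of the notes. It draws many y values at one fixed setting of x and checks that they centre on b₀ + b₁x with spread close to s_e.

```python
import random
import statistics


def simulate_y_at_x(b0, b1, s_e, x, n, seed=0):
    """Draw n values of y = b0 + b1*x + e at one fixed x, with e ~ Normal(0, s_e)."""
    rng = random.Random(seed)
    return [b0 + b1 * x + rng.gauss(0.0, s_e) for _ in range(n)]


# Hypothetical population parameters, chosen only for illustration.
ys = simulate_y_at_x(b0=14000.0, b1=12000.0, s_e=3600.0, x=3.0, n=20000)

# The sample mean should be close to b0 + b1*x = 50000 and the
# sample standard deviation close to s_e = 3600.
print(round(statistics.mean(ys)), round(statistics.stdev(ys)))
```

Repeating the call with a different x shifts the centre of the y values along the regression line while the spread stays roughly the same, which is what the equal standard deviation axiom requires.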

Theorem 1:

Consider a random sample from this population of pairs. Let us say that SLR is appropriate and let y=b₀+b₁x denote the sample regression line. We may use the sample regression line to estimate the population regression line.

Now, let r be the Pearson correlation coefficient between x and y. If s_y is the standard deviation of y, and s_x the standard deviation of x, and ybar the mean of y, and xbar the mean of x, then the slope b₁ and intercept b₀ of the sample regression line may be determined thus:

b₁=r(s_y/s_x)
the point (xbar, ybar) is on the sample regression line y=b₀+b₁x hence:
b₀=ybar - b₁(xbar)

Also, if we let s_e denote the standard deviation of the residuals about the sample regression line then:

s_e = s_y sqrt(1-r²)

Observe that b₁, b₀, and s_e, as computed from the sample, are statistics. Since we have a random sample of pairs, these statistics may be used to estimate the corresponding population parameters of the SLR model.

Note:

Remember that the slope is the change in y for a unit increase in x and the intercept is the value of y when x is zero. Remember also that some thought may be needed, given the context of the problem, to determine if the intercept makes sense.
The Pearson correlation coefficient r is a quantity that may be computed from any set of numeric pairs and is a quantitative measure of the strength of the linear relationship between the variables defined in the pair. Remember that we use r to denote this measure when we compute it from a sample but use the symbol ρ (rho) when we compute it from a population. Also, remember that the Pearson correlation coefficient ranges between -1 and +1.
The expressions for b₀ and b₁ above are known as the least squares estimates of the corresponding parameters. The least squares method requires knowledge of partial derivative calculus. To see how this method is used to derive these expressions see the minimizing residuals link.
The expression for b₁ above is equivalent to the expression for b₁hat on page 542 of the text. On page 596 of the text, the authors show how the above expression may be derived from the expression on page 542.
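
The formulas in Theorem 1 can be applied directly to raw data. The following is a minimal sketch using only the Python standard library; the function name slr_estimates and the five (gpa, income) pairs are made up for illustration and do not come from the notes or the text.

```python
import math
import statistics


def slr_estimates(xs, ys):
    """Return (b0, b1, s_e, r) using the formulas of Theorem 1."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Pearson correlation: sample covariance divided by the product
    # of the two sample standard deviations.
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    b1 = r * (s_y / s_x)              # slope
    b0 = ybar - b1 * xbar             # line passes through (xbar, ybar)
    s_e = s_y * math.sqrt(1 - r * r)  # residual standard deviation, as in the notes
    return b0, b1, s_e, r


# Hypothetical data, for illustration only.
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
income = [40000.0, 43000.0, 52000.0, 55000.0, 60000.0]

b0, b1, s_e, r = slr_estimates(gpa, income)
print(round(b0, 2), round(b1, 2), round(s_e, 2), round(r, 4))
```

Note that s_e here follows the formula given in these notes. Regression software usually divides the sum of squared residuals by n-2 rather than n-1, so its reported value can differ slightly, especially for small samples.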

Problem:

Consider the CTI-02 problem. Given the following statistics, derive the equation of the SLR line:

         r=0.8
         income: mean=50000; std dev=6000
         gpa   : mean=3.0; std dev=0.4

Solution:

Since b₁=r(s_y/ s_x) then:

b₁=0.8(6000/0.4)=12000
Also, b₀=ybar-b₁(xbar) hence:
b₀=50000-12000(3.0)=14000.
Hence the equation of the SLR line is:

income=14000 + 12000(gpa)
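
The arithmetic in this solution can be checked with a few lines of code. This is a small sketch; the function name slr_line_from_summary is hypothetical, and the inputs are the summary statistics given in the problem.

```python
def slr_line_from_summary(r, mean_y, sd_y, mean_x, sd_x):
    """Return (b0, b1) from summary statistics: b1 = r(s_y/s_x), b0 = ybar - b1(xbar)."""
    b1 = r * (sd_y / sd_x)
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = slr_line_from_summary(r=0.8, mean_y=50000, sd_y=6000, mean_x=3.0, sd_x=0.4)
print(round(b0), round(b1))  # prints 14000 12000
```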

Inferences about SLR Parameters:

Since b₀ and b₁ are statistics that are used to estimate parameters then it is not surprising that hypothesis testing and confidence interval concepts also apply. However, to develop these concepts we need to know the behavior of these statistics. That is, we need to know the sampling distribution of b₀ and b₁.

Theorem 2.1:

Consider a sample of size n selected from the population of pairs. Now, consider all other possible samples of size n that may be selected from the population of pairs but let the x values be the same as in the first sample. For each sample compute b₁. If SLR is appropriate then:

n large:
These b₁'s are normally distributed with mean and standard deviation:
m_b1 = b₁ (the population slope)
s_b1 = s_e(sqrt(1/((n-1)s²_x))), where s_e is the population standard deviation of the residuals.
n small:
These b₁'s are Student t distributed with n-2 degrees of freedom and mean and standard deviation:
m_b1 = b₁ (the population slope)
s_b1 = s_e(sqrt(1/((n-1)s²_x))).

Theorem 2.2:

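Consider the same samples of size n as in Theorem 2.1, with the x values held the same in every sample. For each sample compute b₀. If SLR is appropriate then:

n large:
These b₀'s are normally distributed with mean and standard deviation:
m_b0 = b₀ (the population intercept)
s_b0 = s_e(sqrt(1/n + xbar²/((n-1)s²_x))), where s_e is the population standard deviation of the residuals.
n small:
These b₀'s are Student t distributed with n-2 degrees of freedom and mean and standard deviation:
m_b0 = b₀ (the population intercept)
s_b0 = s_e(sqrt(1/n + xbar²/((n-1)s²_x))).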
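
Theorems 2.1 and 2.2 can be checked empirically by simulation. The sketch below is an illustration only: the population values (intercept 14000, slope 12000, residual standard deviation 3600), the 21 fixed x values, and all function names are hypothetical. It holds the x values fixed, generates many samples from the SLR model, fits each sample by least squares (the slope formula Sxy/Sxx used here is algebraically the same as r(s_y/s_x)), and compares the spread of the fitted slopes and intercepts with the standard deviation formulas above.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least squares fit; returns (b0, b1)."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def simulate_fits(xs, b0_pop, b1_pop, s_e_pop, reps, seed=0):
    """Generate reps samples with the same x values and fit each one."""
    rng = random.Random(seed)
    b0s, b1s = [], []
    for _ in range(reps):
        ys = [b0_pop + b1_pop * x + rng.gauss(0.0, s_e_pop) for x in xs]
        b0, b1 = fit_line(xs, ys)
        b0s.append(b0)
        b1s.append(b1)
    return b0s, b1s


# Hypothetical population and fixed x values (gpa from 2.0 to 4.0).
xs = [2.0 + 0.1 * i for i in range(21)]
b0_pop, b1_pop, s_e_pop = 14000.0, 12000.0, 3600.0
b0s, b1s = simulate_fits(xs, b0_pop, b1_pop, s_e_pop, reps=5000)

n = len(xs)
xbar = statistics.mean(xs)
s_x = statistics.stdev(xs)
# Standard deviations predicted by Theorems 2.1 and 2.2.
sd_b1_theory = s_e_pop * math.sqrt(1.0 / ((n - 1) * s_x ** 2))
sd_b0_theory = s_e_pop * math.sqrt(1.0 / n + xbar ** 2 / ((n - 1) * s_x ** 2))

print("slope:     mean", round(statistics.mean(b1s)),
      "sd", round(statistics.stdev(b1s)), "theory sd", round(sd_b1_theory))
print("intercept: mean", round(statistics.mean(b0s)),
      "sd", round(statistics.stdev(b0s)), "theory sd", round(sd_b0_theory))
```

The simulated means should land close to the population slope and intercept, and the simulated standard deviations close to the values given by the formulas, up to simulation noise.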

Problem:

Consider the CTI-02 problem above where the SLR equation was found to be:

income=14000 + 12000(gpa)

Given the statistics in the problem above, and assuming a sample size of 100 graduates, answer the following:

Construct a 90% confidence interval for the parameter b₁.
Given the following hypotheses, conduct a test of hypotheses.
H₀: b₁=10500
H_a: b₁>10500

Solution:

Since n=100 is large, a 90% confidence interval is:
b₁ +/- z_a s_b1
We estimate the standard deviation of b₁ by s_b1 where:
s_b1 = s_e(sqrt(1/((n-1)s²_x))) = s_y sqrt(1-r²)(sqrt(1/((n-1)s²_x)))
s_b1 = 6000(sqrt(1-0.8²))(sqrt(1/(99*0.4²))) = 904.534
Hence the 90% confidence interval is:
12000 +/- 1.65(904.534)
[10507.52, 13492.48]
We apply the standard four step procedure. Notice that b₁=12000 is consistent with H_a and so we may assume H₀ true and proceed. To compute the p-value we first compute z:
z = (12000 - 10500)/904.534 = 1.66
Hence, our p-value is 4.85% and so we reject H₀ and conclude that the change in income for a unit increase in gpa is more than $10500.
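
The same calculation can be sketched in code using the normal distribution from the Python standard library. The variable names are chosen for illustration. Because the code uses the exact normal quantile (about 1.645) rather than the table value 1.65, and does not round z to 1.66 before looking up the p-value, its results differ slightly from the hand calculation above.

```python
import math
from statistics import NormalDist

# Summary statistics from the CTI-02 problem.
r, s_y, s_x, n = 0.8, 6000.0, 0.4, 100
b1 = 12000.0

# Estimated standard deviation of the slope.
s_e = s_y * math.sqrt(1 - r ** 2)
se_b1 = s_e * math.sqrt(1.0 / ((n - 1) * s_x ** 2))

# 90% confidence interval: 5% in each tail of the standard normal.
z_crit = NormalDist().inv_cdf(0.95)
ci = (b1 - z_crit * se_b1, b1 + z_crit * se_b1)

# One-sided test of H0: slope = 10500 against Ha: slope > 10500.
z = (b1 - 10500.0) / se_b1
p_value = 1.0 - NormalDist().cdf(z)

print("s_b1 =", round(se_b1, 3))
print("90% CI =", [round(ci[0], 2), round(ci[1], 2)])
print("z =", round(z, 2), "p-value =", round(p_value, 4))
```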

Note: We will usually use the SAS procedure proc reg to solve problems like the problems above. This SAS procedure will compute all parameter estimates as well as standard deviations needed for hypothesis testing and confidence intervals.