Regression
Readings:
Ott; 11.1(pg 531-534), 11.2 - 11.4
Motivation: Consider some population where, for each item in the population, we are interested in p numeric characteristics. That is, for the ith item in the population, we observe the p-tuple:
$$(y_i, x_{i,1}, x_{i,2}, \dots, x_{i,p-1})$$
We believe that each $y_i$ is linearly related to the corresponding $x_{i,j}$'s ($j = 1, \dots, p-1$). That is, an expression that describes the relationship would have the following form:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1}$$
So, if you think of each item in our population as a point in p-dimensional space, then this expression defines a hyper-plane that cuts through the items in our population. That is, there are points above the hyper-plane and points below the hyper-plane, and the hyper-plane describes how y changes as the $x_{i,j}$'s ($j = 1, \dots, p-1$) change.
Note: We refer to the y variable as the dependent, or response, variable and the x variables (i.e. $x_1, \dots, x_{p-1}$) as the independent, or explanatory, variables.
e.g.: Consider the population of CTI-02 graduates. For each graduate we are interested in the 3-tuple (salary, gpa, experience), where experience refers to work experience in months.
In this case, salary is the dependent variable and gpa and experience are explanatory variables. It is reasonable to believe that salary is related to gpa and experience. However, we would not expect this relationship to be deterministic. That is, for a particular setting of gpa and experience (say gpa = 3.0 and experience = 36 months) we may be able to identify several graduates but would expect them to have different salaries. Now, if we were to consider another setting of gpa and experience (say gpa = 4.0 and experience = 36 months) we may also be able to identify several graduates with different salaries but would expect higher salaries overall, perhaps with some overlap. The question then is: can we derive an expression that describes the general change in salary as gpa and experience change?
Note that any such expression may be used to predict y for a particular setting of the $x_{i,j}$'s. However, we should not expect to predict y exactly since, as described in our example, for a particular setting of the $x_{i,j}$'s we have multiple y's.
This is the Regression problem, sometimes known as the prediction problem. We will address this problem in two parts. First, we will thoroughly address the simplest of these Regression problems. That is, the Simple Linear Regression (SLR) problem where we have a single explanatory variable x1 (or just x). We will then generalize our discussion to the Multiple Regression problem.
Simple Linear Regression:
For SLR we have a population of pairs (y, x) and we believe that the relationship between x and y is linear. For example, using the CTI-02 problem above, our pairs could be (salary, gpa).
If x and y are linearly related then, from a scatter plot of x and y, we should be able to discern a straight line that describes this relationship. That is, we should be able to imagine a line which describes the trend, or the change, in y as x changes. Such a line is referred to as the regression line. It is of the following form:
$$y = \beta_0 + \beta_1 x$$
However, note that this is just the equation of the regression line. To address the SLR problem we must first understand the SLR Model. The SLR Model is an expression which relates each (y, x) pair in our population to the regression line plus a set of axioms (or assumptions) that must hold in order for SLR to be appropriate.
Simple Linear Regression Model:
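Consider a population of pairs (y, x). SLR is appropriate iff:
$$y = \beta_0 + \beta_1 x + e$$
where for any particular setting of x, the associated e are normally distributed with mean zero and standard deviation $\sigma_e$.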
If these axioms are true then, since the e are normally distributed with mean zero and standard deviation $\sigma_e$ for a particular setting of x, the corresponding y values must also be normally distributed but with mean $\beta_0 + \beta_1 x$ and standard deviation $\sigma_e$.
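The consequence just stated can be checked with a small simulation. The sketch below is only an illustration: the values of BETA0, BETA1 and SIGMA_E are hypothetical choices, not quantities given in these notes, and the helper simulate_y is a name introduced here. It draws many y values at two fixed settings of x and compares their mean and standard deviation with the line's height at that x and with SIGMA_E.

```python
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
BETA0 = 14000.0    # intercept of the population regression line
BETA1 = 12000.0    # slope of the population regression line
SIGMA_E = 3600.0   # standard deviation of the errors at every setting of x


def simulate_y(x, n, rng):
    """Draw n values of y = BETA0 + BETA1*x + e, with e ~ Normal(0, SIGMA_E)."""
    return [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA_E) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable

ys_at_3 = simulate_y(3.0, 20000, rng)
ys_at_4 = simulate_y(4.0, 20000, rng)

mean_at_3 = statistics.mean(ys_at_3)
mean_at_4 = statistics.mean(ys_at_4)
sd_at_3 = statistics.stdev(ys_at_3)
sd_at_4 = statistics.stdev(ys_at_4)

# The means should sit close to the line; the spreads should both be near SIGMA_E.
print(f"x=3.0: mean={mean_at_3:.0f}, line={BETA0 + BETA1 * 3.0:.0f}, sd={sd_at_3:.0f}")
print(f"x=4.0: mean={mean_at_4:.0f}, line={BETA0 + BETA1 * 4.0:.0f}, sd={sd_at_4:.0f}")
```

The two groups have different centers but roughly the same spread, which is the "higher overall, perhaps with some overlap" picture from the motivation.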
Note:
Theorem 1:
Consider a random sample from this population of pairs. Let us say that SLR is appropriate and let $\hat{y} = b_0 + b_1 x$ denote the sample regression line. We may use the sample regression line to estimate the population regression line.
Now, let r be the Pearson correlation coefficient between x and y. If $s_y$ is the standard deviation of y, $s_x$ the standard deviation of x, $\bar{y}$ the mean of y, and $\bar{x}$ the mean of x, then the slope $b_1$ and intercept $b_0$ of the sample regression line may be determined thus:
$$b_1 = r \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
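Also, if we let $s_e$ denote the standard deviation of the residuals about the sample regression line then, with n the sample size and $\hat{y} = b_0 + b_1 x$ the fitted value for each pair:
$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$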
Observe that $b_1$, $b_0$, and $s_e$ are statistics. Since we have a random sample of pairs, these statistics may be used to estimate the parameters $\beta_1$, $\beta_0$, and $\sigma_e$.
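As a rough sketch of how these three statistics might be computed from raw pairs, the function below follows the route described in Theorem 1: first r, the standard deviations and the means, then the slope and intercept, then the residual standard deviation. The function name slr_fit and the five data pairs are made up for illustration, and dividing the residual sum of squares by n-2 is assumed here as the usual convention.

```python
import math


def slr_fit(xs, ys):
    """Return (b0, b1, se, r) for the sample regression line of y on x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

    r = sxy / math.sqrt(sxx * syy)    # Pearson correlation coefficient
    sx = math.sqrt(sxx / (n - 1))     # sample standard deviation of x
    sy = math.sqrt(syy / (n - 1))     # sample standard deviation of y

    b1 = r * (sy / sx)                # slope
    b0 = ybar - b1 * xbar             # intercept

    # Residual standard deviation (assumed divisor: n - 2).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    return b0, b1, se, r


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, se, r = slr_fit(xs, ys)
print(f"b0={b0:.3f}, b1={b1:.3f}, se={se:.3f}, r={r:.3f}")
```

Computing the slope as r(sy/sx) gives the same value as the least-squares ratio of the cross-product sum to the sum of squares of x; the form above is used only because it mirrors the theorem.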
Note:
Problem:
Consider the CTI-02 problem. Given the following statistics, derive the equation of the SLR line:
r = 0.8
income: mean = 50000; std dev = 6000
gpa: mean = 3.0; std dev = 0.4
Solution:
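Since $b_1 = r(s_y / s_x)$ then:
$$b_1 = 0.8 \times (6000 / 0.4) = 12000$$
and, since $b_0 = \bar{y} - b_1 \bar{x}$:
$$b_0 = 50000 - 12000 \times 3.0 = 14000$$
So the equation of the SLR line is:
income = 14000 + 12000(gpa)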
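The arithmetic for this problem can be sketched in a few lines. The variable names and the helper predict_income are introduced here for illustration; the numbers are the summary statistics given in the problem, and the intercept is taken from the standard relation b0 = ybar - b1(xbar).

```python
# Summary statistics from the problem statement.
r = 0.8
mean_income, sd_income = 50000.0, 6000.0
mean_gpa, sd_gpa = 3.0, 0.4

# Slope and intercept of the sample regression line.
b1 = r * (sd_income / sd_gpa)
b0 = mean_income - b1 * mean_gpa


def predict_income(gpa):
    """Height of the sample regression line at the given gpa."""
    return b0 + b1 * gpa


print(f"slope b1 = {b1:.0f}")
print(f"intercept b0 = {b0:.0f}")
print(f"predicted income at gpa 3.5 = {predict_income(3.5):.0f}")
```

Note that the line passes through the point of means: evaluating predict_income at the mean gpa returns the mean income.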
Inferences about SLR Parameters:
Since b0 and b1 are statistics that are used to estimate parameters then it is not surprising that hypothesis testing and confidence interval concepts also apply. However, to develop these concepts we need to know the behavior of these statistics. That is, we need to know the sampling distribution of b0 and b1.
Theorem 2.1:
Consider a sample of size n selected from the population of pairs. Now, consider all other possible samples of size n that may be selected from the population of pairs but let the x values be the same as in the first sample. For each sample compute b1. If SLR is appropriate then:
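If $\sigma_e$ is known, these $b_1$'s are normally distributed with mean $\beta_1$ and standard deviation:
$$\sigma_{b_1} = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$
If $\sigma_e$ is estimated by $s_e$, the standardized values $(b_1 - \beta_1) / s_{b_1}$ are Student t distributed with n-2 degrees of freedom, where the estimated standard deviation of $b_1$ is:
$$s_{b_1} = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$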
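The repeated-sampling idea in Theorem 2.1 can be imitated by simulation. In the sketch below the population parameters, the fixed x values and the helper sample_slope are all hypothetical choices made for illustration. The x values are held fixed, new y values are generated many times, and the resulting slopes are compared with the standard result that their mean is the population slope and their standard deviation is SIGMA_E divided by the square root of the sum of squared deviations of x.

```python
import math
import random
import statistics

# Hypothetical population parameters and fixed x values (illustration only).
BETA0, BETA1, SIGMA_E = 2.0, 0.5, 1.0
XS = [float(x) for x in range(1, 11)]


def sample_slope(xs, ys):
    """Least-squares slope; algebraically equal to r * (sy / sx)."""
    xbar = statistics.mean(xs)
    ybar = statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the run is repeatable

# Each pass is one "sample of size n" that reuses the same x values.
slopes = []
for _ in range(5000):
    ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA_E) for x in XS]
    slopes.append(sample_slope(XS, ys))

xbar = statistics.mean(XS)
theoretical_sd = SIGMA_E / math.sqrt(sum((x - xbar) ** 2 for x in XS))

print(f"mean of slopes = {statistics.mean(slopes):.4f} (population slope {BETA1})")
print(f"sd of slopes   = {statistics.stdev(slopes):.4f} (theory {theoretical_sd:.4f})")
```

A histogram of slopes would show the bell shape the theorem describes; only the center and spread are checked here to keep the sketch short.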
Theorem 2.2:
Consider a sample of size n selected from the population of pairs. Now, consider all other possible samples of size n that may be selected from the population of pairs but let the x values be the same as in the first sample. For each sample compute b0. If SLR is appropriate then:
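If $\sigma_e$ is known, these $b_0$'s are normally distributed with mean $\beta_0$ and standard deviation:
$$\sigma_{b_0} = \sigma_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x - \bar{x})^2}}$$
If $\sigma_e$ is estimated by $s_e$, the standardized values $(b_0 - \beta_0) / s_{b_0}$ are Student t distributed with n-2 degrees of freedom, where the estimated standard deviation of $b_0$ is:
$$s_{b_0} = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x - \bar{x})^2}}$$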
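As a hedged illustration of how these standard deviations might be estimated when only summary statistics are available, the sketch below uses the CTI-02 figures and the sample size of 100 stated in the problem that follows. It relies on two standard identities that are assumptions of this sketch rather than statements from the notes: the sum of squared deviations of x equals (n-1) times the variance of x, and the residual sum of squares equals (n-1)(1 - r^2) times the variance of y. The variable names are introduced here for illustration.

```python
import math

# Summary statistics from the CTI-02 problem, with n taken from the next problem.
n = 100
r = 0.8
mean_income, sd_income = 50000.0, 6000.0
mean_gpa, sd_gpa = 3.0, 0.4

# Sample regression line.
b1 = r * (sd_income / sd_gpa)
b0 = mean_income - b1 * mean_gpa

# Assumed identities linking summary statistics to sums of squares.
sxx = (n - 1) * sd_gpa ** 2                    # sum of squared deviations of x
sse = (n - 1) * (1 - r ** 2) * sd_income ** 2  # residual sum of squares

se = math.sqrt(sse / (n - 2))                      # residual standard deviation
s_b1 = se / math.sqrt(sxx)                         # estimated sd of the slope
s_b0 = se * math.sqrt(1 / n + mean_gpa ** 2 / sxx)  # estimated sd of the intercept

# t statistic for testing whether the population slope is zero (n - 2 df).
t_b1 = b1 / s_b1

print(f"se = {se:.1f}")
print(f"s_b1 = {s_b1:.1f}, s_b0 = {s_b0:.1f}")
print(f"t for slope = {t_b1:.2f} on {n - 2} degrees of freedom")
```

The t statistic would then be compared against a Student t table with n-2 degrees of freedom; that lookup is left out because the Python standard library has no t distribution.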
Problem:
Consider the CTI-02 problem above where the SLR equation was found to be:
income = 14000 + 12000(gpa)
Given the statistics in the problem above, and assuming a sample size of 100 graduates, answer the following:
Solution:
Note: We will usually use the SAS procedure proc reg to solve problems like the problems above. This SAS procedure will compute all parameter estimates as well as standard deviations needed for hypothesis testing and confidence intervals.