Lecture 11/4

Simple Linear Regression (contd.)

Readings:

As outlined in the previous Simple Linear Regression lecture notes. In addition, read SAS, Chapter 4, pp. 108-111, for an overview of the box plot.

Derivation of a and b (optional)

Given an SLR model, we may derive expressions for a and b by thinking of the true regression line as the line that best expresses the relationship between x and y. If this is so, then the line is the one which makes the residuals, taken together, as small as possible. The usual way to measure this is the sum of the squared residuals. By thinking of the regression line in this way, we may use calculus to find the values of a and b which minimize that sum. See this link for complete details.
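The calculus argument can be sketched as follows. This is the standard least-squares derivation and not necessarily the exact presentation given at the link. Here y_ij denotes the j^th observed y value at the i^th x value (the notation used for residuals later in these notes), and x-bar and y-bar are the means taken over all observations.

```latex
% Sum of squared residuals as a function of the candidate intercept a and slope b
Q(a, b) = \sum_{i} \sum_{j} \bigl( y_{ij} - (a + b x_i) \bigr)^2

% Set both partial derivatives to zero (the normal equations)
\frac{\partial Q}{\partial a} = -2 \sum_{i} \sum_{j} \bigl( y_{ij} - a - b x_i \bigr) = 0
\frac{\partial Q}{\partial b} = -2 \sum_{i} \sum_{j} x_i \bigl( y_{ij} - a - b x_i \bigr) = 0

% Solving the two equations for b and a
b = \frac{\sum_{i} \sum_{j} (x_i - \bar{x})(y_{ij} - \bar{y})}{\sum_{i} \sum_{j} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b \bar{x}
```

Note that the first normal equation says the residuals from the fitted line sum to zero.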

Residual Analysis

Given a bivariate sample from some population, we should first determine if the SLR model assumptions are reasonable before using the regression equation to make inferences about our population. Since the model assumptions are stated in terms of residuals, we must examine the sample residuals to see if they provide evidence to doubt the model assumptions. In this sense, residual analysis is very similar to our hypothesis testing procedures. However, we will be somewhat less formal and will depend more on heuristics and subjectivity in our assessment.

We will start by determining sample residuals. To do this we must first derive the regression equation. If we denote the j^th observed y value for our i^th x value by y_ij, and the corresponding residual by e_ij, then

e_ij = y_ij - (a + bx_i).

We may use the SAS reg procedure to derive our regression line and our sample residuals. See the income/gpa example for sample code and output.
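For readers who want to see the arithmetic behind the residuals, here is a minimal sketch in Python. It is for illustration only: the course itself uses the SAS reg procedure, the data values are made up (they are not the income/gpa data), and the function names fit_line and residuals are hypothetical. Repeated x values would simply appear as repeated entries in xs.

```python
# Hypothetical bivariate sample (illustrative values only, NOT the income/gpa data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(xs, ys):
    """Return the least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys, a, b):
    """Residual = observed y minus the fitted value a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


a, b = fit_line(xs, ys)
e = residuals(xs, ys, a, b)
print(f"a = {a:.3f}, b = {b:.3f}")
print([round(r, 3) for r in e])
```

A quick sanity check on any set of least-squares residuals is that they sum to zero, up to rounding.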

Once sample residuals have been obtained we may proceed with our residual analysis thus:

Step 1: Normality Assumption

We assess normality as we would in any other case except we are now concerned with residuals. Therefore, we may express the hypotheses for the normality test thus:

H₀: Sample residuals are taken from a normally distributed population
H_a: Sample residuals are not taken from a normally distributed population

As usual, we use the SAS univariate procedure to test these hypotheses. See the income/gpa example for sample code and output.

Step 2: Homoscedastic Assumption

Unlike the normality assumption, we do not have an analytic test for this assumption in this class. We must rely on a subjective assessment based on an examination of a plot of our residuals against our independent variable. Such a plot is known as a residual plot. The term residual plot is sometimes used to describe any plot which uses residuals but, for this class, our residual plots will always be constructed as follows.

Residuals on y-axis

Independent variable on x-axis

Use the vref=0 option to create a horizontal line at y=0.

Use the SAS plot procedure. See the income/gpa example for sample code and output.

The homoscedastic assumption has to do with the standard deviation of residuals for fixed values of x. Remember that standard deviation is a measure of spread. Since we are using a plot to do this assessment we will assess the assumption indirectly. We will use another measure of spread, the range, which may be easily assessed from our plot.

If the range of our residuals does not seem constant with increasing x, then we will reject the homoscedastic assumption. If there is a change in spread, then it must be both systematic and dramatic to warrant rejection. By systematic, we mean that the spread has to be funnel shaped. By dramatic, we mean that the difference in spread at the extremes of x must be large. That is, after bounding our points by straight lines, apply the 3:1 heuristic to the spread at the two extremes.
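The visual assessment can be mimicked numerically. The sketch below is a rough stand-in for reading the plot, not a formal test. It makes two loud assumptions: the 3:1 heuristic is read as "the larger range at the extremes of x is at least 3 times the smaller", and "funnel shaped" is approximated by the ranges changing in one direction only. The data and the names range_by_x, spread_ratio and is_monotonic are hypothetical.

```python
from collections import defaultdict

# Hypothetical (x, residual) pairs; values are illustrative only.
pairs = [
    (1, -0.2), (1, 0.1), (1, 0.2),
    (2, -0.5), (2, 0.0), (2, 0.4),
    (3, -0.9), (3, 0.1), (3, 0.8),
]


def range_by_x(pairs):
    """Group residuals by x and return {x: max residual - min residual}."""
    groups = defaultdict(list)
    for x, e in pairs:
        groups[x].append(e)
    return {x: max(es) - min(es) for x, es in sorted(groups.items())}


def spread_ratio(ranges):
    """Larger range divided by smaller range at the smallest and largest x.

    Needs at least two residuals at each extreme, otherwise a range is zero.
    """
    xs = sorted(ranges)
    first, last = ranges[xs[0]], ranges[xs[-1]]
    return max(first, last) / min(first, last)


def is_monotonic(values):
    """Rough proxy for a funnel: the ranges only grow or only shrink with x."""
    steps = list(zip(values, values[1:]))
    return all(u <= v for u, v in steps) or all(u >= v for u, v in steps)


ranges = range_by_x(pairs)
ratio = spread_ratio(ranges)
systematic = is_monotonic(list(ranges.values()))
# ASSUMPTION: the 3:1 heuristic is interpreted as "ratio of at least 3".
dramatic = ratio >= 3
print(ranges, round(ratio, 2), systematic, dramatic)
```

With these made-up values the spread grows steadily with x and the ratio at the extremes exceeds 3, so under the assumptions above we would reject the homoscedastic assumption.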

Step 3: Unbiased Assumption

As with the homoscedastic assumption, we do not have an analytic test for this assumption in this class. In this case we rely on a somewhat less subjective assessment based on an examination of a boxplot of the residuals. Our assessment will also be indirect. We will be looking for outliers. If there are outliers, then we reject the assumption.

Use the SAS univariate procedure. See the income/gpa example for sample code and output.
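To make the idea of a boxplot outlier concrete, here is a minimal Python sketch, assuming the conventional rule that a point is an outlier if it lies more than 1.5 interquartile ranges beyond the quartiles. The residual values and the name boxplot_outliers are hypothetical, and the quartile definition used by Python's statistics module may differ slightly from the one SAS uses, so borderline points could be classified differently.

```python
import statistics

# Hypothetical residuals; values are illustrative only.
resid = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.7, 1.1, 4.5]


def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k*IQR below Q1 or above Q3.

    ASSUMPTION: k = 1.5 is the conventional boxplot rule; quartiles are
    computed with the 'inclusive' method of statistics.quantiles.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = boxplot_outliers(resid)
print(outliers)
```

Here the single large residual falls beyond the upper fence, so by the rule in these notes we would reject the assumption. Dropping that value leaves no outliers.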