## Linear Regression

### The Regression Equation

• Example: A dataset consists of heights (x-variable) and weights (y-variable) of 977 men aged 18-24. Here are the summary statistics:

x̄ = 70 inches    SDx = 3 inches

ȳ = 162 pounds    SDy = 30 pounds

r = 0.5

• We want to derive an equation, called the regression equation, for predicting y from x.

• If x increases above x̄ = 70 by one SDx = 3 inches, how much will y increase, on average?

• Answer: it depends on the correlation r. What happens in these three scenarios?

1. r = 0.0

2. r = 1.0

3. r = 0.5

• In general, here is the formula for the regression equation:

ŷ − ȳ = (r · SDy / SDx) (x − x̄)

• Use this formula to derive the regression equation for the example at the top of this page.

• What are the predicted weights for these heights?

76    67    70
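The prediction formula above can be checked with a short script. It uses the summary statistics from the example at the top of this section; the function name `predict_weight` is just illustrative:

```python
# Summary statistics from the height/weight example above.
x_bar, sd_x = 70, 3      # mean and SD of heights (inches)
y_bar, sd_y = 162, 30    # mean and SD of weights (pounds)
r = 0.5

slope = r * sd_y / sd_x  # 0.5 * 30 / 3 = 5 pounds per inch

def predict_weight(x):
    """Predicted weight via y-hat = y_bar + slope * (x - x_bar)."""
    return y_bar + slope * (x - x_bar)

for height in (76, 67, 70):
    print(f"height {height} in -> predicted weight {predict_weight(height)} lb")
```

Note that a height at the mean (70 inches) predicts exactly the mean weight (162 pounds): the regression line always passes through the point of averages (x̄, ȳ).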

• The regression line can be thought of as a line of averages. It connects the averages of the y-values in each thin vertical strip.

• The regression line is the line that minimizes the sum of the squares of the residuals. For this reason, it is also called the least squares line.

• The regression line is also called the linear trend line.

• Beware of extrapolating beyond the range of the data points. The actual response curve may curve in an unexpected way.

### Predicted Values and Residuals

• The actual value of the dependent variable is yᵢ.

• The predicted value for yᵢ is defined to be ŷᵢ = a xᵢ + b, where ŷ = a x + b is the regression equation.

• The residual is the error that is not explained by the regression equation:

eᵢ = yᵢ − ŷᵢ.
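A useful consequence of the least-squares fit is that the residuals automatically average to zero overall. A minimal sketch, on a small made-up dataset (the x/y values here are illustrative, not from the notes):

```python
# Fit a least-squares line and compute its residuals.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.7]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope: sum of deviation products / sum of squared x-deviations.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residual e_i = y_i - y-hat_i.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(f"sum of residuals: {sum(residuals):.2e}")  # zero up to rounding error
```

Unbiasedness in a residual plot is the stronger condition that the residuals average to zero in *each* thin vertical strip, not just overall.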

• A residual plot plots the residuals on the y-axis vs. the predicted values of the dependent variable on the x-axis. We would like the residuals to be

unbiased: have an average value of zero in any thin vertical strip, and

homoscedastic, which means "same stretch": the spread of the residuals should be the same in any thin vertical strip.

• The residuals are heteroscedastic if they are not homoscedastic.

• Here are six residual plots and their interpretations:

(a) Unbiased and homoscedastic. The residuals average to zero in each thin vertical strip and the SD is the same all across the plot.

(b) Biased and homoscedastic. The residuals show a linear pattern, probably due to a lurking variable not included in the experiment.

(c) Biased and homoscedastic. The residuals show a quadratic pattern, possibly because of a nonlinear relationship. Sometimes a variable transformation will eliminate the bias.

(d) Unbiased, but heteroscedastic. The SD is small at the left of the plot and large at the right: the residuals are heteroscedastic.

(e) Biased and heteroscedastic. The pattern is linear.

(f) Biased and heteroscedastic. The pattern is quadratic.

• We would also like the residuals to be normally distributed. We check this by looking at the normal plot of the residuals.

### Root Mean Square Error

• The root mean square error (RMSE) for a regression model is similar to the standard deviation (SD) for the ideal measurement model.

• We can write this as a Miller analogy:

RMSE : regression model :: SD : ideal measurement model

• The SD estimates the deviation of the data from the sample mean x̄.

• The RMSE estimates the deviation of the actual y-values from the regression line. Another way to say this is that it estimates the standard deviation of the y-values in a thin vertical strip.

• The RMSE is computed as

RMSE = √[(e₁² + e₂² + … + eₙ²) / n]

where eᵢ = yᵢ − ŷᵢ.

• The RMSE can be computed more simply as RMSE = SDy √(1 − r²).

• Example: If SDy = 30 and r = 0.8, then

RMSE = SDy √(1 − r²) = 30 · √(1 − 0.8²) = 30 · 0.6 = 18.