Simple Linear Regression

 

Theorem 1a:

Consider a bivariate sample from some population. Let us say we are interested in the relationship between x and y where x is the independent variable and y the dependent variable.

If Simple Linear Regression (SLR) is appropriate, let r be the correlation between x and y, sx and sy the standard deviations of x and y, and xbar and ybar the means of x and y. Then the slope b and intercept a of the regression line may be determined thus:

  1. b = r(sy/sx)
  2. (xbar, ybar) is on the regression line hence:
    a = ybar - b(xbar)
Note: Interpretation - Slope: predicted change in y for a unit increase in x; Intercept: predicted value of y when x is zero. Some thought may be needed, given the context of the problem, to determine whether the intercept makes sense (for example, when x = 0 lies outside the range of the data).
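
As a quick check on Theorem 1a, the sketch below computes the slope and intercept from r, sx, sy, xbar and ybar, then compares the slope with the usual least-squares formula. The five data points are made up purely for illustration and do not come from these notes.

```python
import math
import statistics

# Hypothetical illustration data (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar, ybar = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)  # sample std devs (n - 1)

# Sample correlation: sum of cross-deviations over (n - 1) * sx * sy.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
r = sxy / ((n - 1) * sx * sy)

# Theorem 1a: slope from r, sy, sx; intercept from the point (xbar, ybar).
b = r * (sy / sx)
a = ybar - b * xbar

# Least-squares slope computed directly, for comparison.
b_ls = sxy / sum((xi - xbar) ** 2 for xi in x)

print(f"b = {b:.4f}, least-squares b = {b_ls:.4f}, a = {a:.4f}")
```

The two slopes should agree up to rounding, since r(sy/sx) is an algebraic rearrangement of the least-squares slope.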

Problem:

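Consider the income/gpa problem, with gpa as the independent variable x and income as the dependent variable y. Given the following statistics, derive the regression equation:

income: mean=50000; std dev=6000
gpa: mean=3.0; std dev=0.4
correlation between gpa and income: r=0.8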

Solution:

Since b=r(sy/sx), b=0.8(6000/0.4)=12000. Also, a=ybar-b(xbar)=50000-12000(3.0)=14000. Hence the regression equation is:

income=14000 + 12000(gpa)
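
The arithmetic in this solution can be reproduced with a few lines of Python. This is only a sketch; the helper name slr_from_summary is hypothetical and chosen for illustration.

```python
def slr_from_summary(xbar, sx, ybar, sy, r):
    """Return (intercept a, slope b) of the regression of y on x,
    using only summary statistics as in Theorem 1a."""
    b = r * (sy / sx)
    a = ybar - b * xbar
    return a, b

# x = gpa, y = income, values taken from the problem statement.
a, b = slr_from_summary(xbar=3.0, sx=0.4, ybar=50000.0, sy=6000.0, r=0.8)
print(f"income = {a:.0f} + {b:.0f}(gpa)")
```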

 

 

Theorem 1b:

Consider the bivariate population mentioned in Theorem 1a above. The population may be represented by the following SLR model if x and y are related linearly:

y = a + bx + e

where, for fixed x, e is assumed to:

  1. be normally distributed
  2. have constant standard deviation (denoted sigma_y|x)
  3. have mean zero

Notes:

  1. e's are deviations from the regression line
  2. a, b, sigma_y|x are parameters of the SLR model
  3. a, b, sigma_y|x may be determined thus (mu and sigma denote population means and standard deviations, and r the population correlation):
    1. b = r(sigma_y/sigma_x)
    2. the point (mu_x, mu_y) is on the regression line hence:
      a = mu_y - b(mu_x)
    3. sigma_y|x = sigma_y sqrt(1 - r^2)
  4. Since e is normally distributed for fixed x with mean zero and standard deviation sigma_y|x, then, for fixed x, y is also normally distributed with mean a + bx and standard deviation sigma_y|x. Hence, the regression line may be thought of as the "line of means", that is, the line connecting the means of those y's associated with fixed values of x.
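
To see the notes above at work, the following sketch simulates data from the SLR model and then recovers the slope, the intercept and the standard deviation about the line from the simulated r, means and standard deviations. The parameter values are borrowed from the income/gpa example. Drawing x from a normal distribution is an assumption made only for this simulation, since the model itself specifies only the distribution of y for fixed x.

```python
import math
import random

# Parameter values borrowed from the income/gpa example; the normal
# distribution for x is an assumption made only for this simulation.
a, b = 14000.0, 12000.0
sd_about_line = 3600.0          # standard deviation of e for fixed x
mu_x, sigma_x = 3.0, 0.4

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 20000
x = [rng.gauss(mu_x, sigma_x) for _ in range(n)]
# SLR model: y = a + b*x + e, with e ~ Normal(0, sd_about_line) for every x.
y = [a + b * xi + rng.gauss(0.0, sd_about_line) for xi in x]

def mean(v):
    return math.fsum(v) / len(v)

def sd(v):
    m = mean(v)
    return math.sqrt(math.fsum((vi - m) ** 2 for vi in v) / (len(v) - 1))

def corr(u, v):
    mu, mv = mean(u), mean(v)
    cov = math.fsum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / (len(u) - 1)
    return cov / (sd(u) * sd(v))

r = corr(x, y)
sy = sd(y)

# Recover the model parameters from r, the means and the std devs.
slope = r * sy / sd(x)
intercept = mean(y) - slope * mean(x)
recovered_sd = sy * math.sqrt(1 - r ** 2)

print(f"r ~ {r:.3f}, slope ~ {slope:.0f}, intercept ~ {intercept:.0f}, "
      f"sd about line ~ {recovered_sd:.0f}")
```

With a large simulated sample the recovered values should land close to 12000, 14000 and 3600, with r close to 0.8. Exact figures depend on the random draw.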

Problem:

Consider the income/gpa problem above. Using the sample statistics to estimate population parameters, determine the proportion of graduates with gpa=4.0 who received starting salaries of more than $50K.

Solution:

Since income=14000 + 12000(gpa), the mean income of these graduates would be 14000+12000(4.0)=62000, or $62K. Also, the std dev of these incomes would be 6000(sqrt(1-0.8^2))=6000(0.6)=3600. Since normality of these incomes follows if we assume the SLR model, z=(50000-62000)/3600=-3.33, and so about 99.96% of graduates with gpa=4.0 got starting salaries more than $50K.
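
The same calculation can be sketched in Python using the standard library's NormalDist class in place of a z table. The variable names are illustrative only.

```python
import math
from statistics import NormalDist

# Estimates carried over from the earlier income/gpa solution.
a, b = 14000.0, 12000.0     # intercept and slope
sy, r = 6000.0, 0.8         # std dev of income, correlation
gpa = 4.0
threshold = 50000.0

mean_income = a + b * gpa                  # mean of y for this fixed x
sd_income = sy * math.sqrt(1 - r ** 2)     # std dev of y for fixed x
z = (threshold - mean_income) / sd_income

# Proportion above the threshold under the SLR normality assumption.
p_above = 1 - NormalDist(mu=mean_income, sigma=sd_income).cdf(threshold)

print(f"mean = {mean_income:.0f}, sd = {sd_income:.0f}, "
      f"z = {z:.2f}, proportion above = {p_above:.4f}")
```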

 

Optional Readings:

SAS: Chapter 9, pp. 289-298

You will need to familiarize yourself with the output of the SAS reg procedure.