Multiple Regression contd.

General Linear Model (GLM)

Recall that given the $p$-tuple $(y_i, x_{i,1}, \ldots, x_{i,p-1})$, we may express each $y_i$ in terms of the corresponding $x_{i,j}$'s, $j = 1, \ldots, p-1$, by the following model.

$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \cdots + b_{p-1} x_{i,p-1} + e_i$

where for any setting of the $(p-1)$-tuple $(x_{i,1}, \ldots, x_{i,p-1})$ the corresponding $e_i$:

  1. are normally distributed.
  2. have mean zero.
  3. have constant standard deviation $\sigma_e$ (estimated by $s_e$).

We have seen that we may use matrix algebra to represent this model. That is, let $Y$ be the vector of the $n$ $y_i$ values. Let $X$ be an $n \times p$ matrix, where the first column of $X$ is a column of $n$ 1's and the remaining columns of $X$ are the vectors for each of the $p-1$ $x$'s. Let $b$ be the vector of coefficients (i.e. $b_0, b_1, \ldots, b_{p-1}$). Let $e$ be the vector of the $n$ residual values (i.e. the $e_i$). Then the model may be expressed in matrix terms thus:

$Y = Xb + e$

We may solve for b thus:

$b = (X^TX)^{-1}X^TY$

We may also obtain an expression for $s_e$. That is:

$s_e^2 = \frac{1}{n-p}\left(Y^TY - b^TX^TY\right)$

Recall that this quantity is also known as the MSE, and its positive square root as the Root MSE.
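To make these two formulas concrete, here is a minimal numpy sketch; the design matrix is built exactly as described above, and the data values are made up purely for illustration.

    import numpy as np

    # Toy data: n = 5 observations, p - 1 = 2 explanatory variables (values made up).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    Y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

    n = len(Y)
    X = np.column_stack([np.ones(n), x1, x2])  # first column is a column of 1's
    p = X.shape[1]                             # number of coefficients (here 3)

    # b = (X^T X)^{-1} X^T Y
    b = np.linalg.inv(X.T @ X) @ X.T @ Y

    # s_e^2 = (Y^T Y - b^T X^T Y) / (n - p), i.e. the MSE
    se2 = (Y @ Y - b @ (X.T @ Y)) / (n - p)
    root_mse = np.sqrt(se2)

    print(b, se2, root_mse)

Note that np.linalg.inv is used here only to mirror the formula; in practice np.linalg.solve (or np.linalg.lstsq) is numerically preferable.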

 

GLM Parameter Estimates and SAS

Consider a multiple regression problem where there are four explanatory variables. Let us say we have a sample of twelve observations and would like to estimate the parameters of the GLM for this problem.

Suppose the following SAS output was produced:


                             The REG Procedure
                               Model: MODEL1
                           Dependent Variable: y 

                           Analysis of Variance
 
                                 Sum of          Mean
 Source                DF       Squares        Square   F Value   Pr > F

 Model                  4    1379.29263     344.82316     11.81   0.0031
 Error                  7     204.37403      29.19629                   
 Corrected Total       11    1583.66667                                 


            Root MSE              5.40336    R-Square     0.8709
            Dependent Mean       32.83333    Adj R-Sq     0.7972
            Coeff Var            16.45693                       


                            Parameter Estimates
 
                         Parameter       Standard
    Variable     DF       Estimate          Error    t Value    Pr > |t|

    Intercept     1       -7.98075        8.14486      -0.98      0.3598
    x1            1       21.19048        4.86006       4.36      0.0033
    x2            1       -1.90504        1.98657      -0.96      0.3695
    x3            1        0.97208        1.63173       0.60      0.5701
    x4            1       10.00168        1.80694       5.54      0.0009

From this output, remember that we may obtain the following:

  1. Estimates of the coefficients of the regression equation are under the "Parameter Estimate" heading. That is, to 2 d.p.:

    $b^T = [-7.98,\; 21.19,\; -1.91,\; 0.97,\; 10.00]$

    Also, the estimate of the standard deviation about the regression surface is $s_e = 5.40$.

  2. We may use the F statistic (i.e. 11.81) and its p-value (i.e. 0.0031) for the following hypothesis test.

    $H_0$: $b_1 = b_2 = b_3 = b_4 = 0$
    $H_1$: not all $b_k = 0$ $(k = 1, \ldots, 4)$

    Since this p-value is very small, we reject the null hypothesis in favor of the alternative and conclude that the model does explain some of the variability in the response variable.

  3. We may test the following individual hypotheses by using the t statistics, and corresponding p-values, in the Parameter Estimates section of the output.

    $H_0$: $b_k = 0$ $(k = 0, 1, \ldots, 4)$
    $H_1$: $b_k \ne 0$ $(k = 0, 1, \ldots, 4)$

    Notice that only the coefficients of x1 and x4 are significantly different from zero.

  4. The coefficient of multiple determination is $R^2 = 0.8709$. This indicates that 87.09% of the variation in the response is explained by the model.

    Notice that the adjusted coefficient of determination is $R_a^2 = 0.7972$. Remember that we may obtain this value from $R^2$ thus:

    $R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)(1 - R^2)$

    Remember that this quantity accounts for the number of independent variables in the model. Since $R^2$ does not take the number of independent variables into account, it never decreases as variables are added to the model. (These quantities are checked numerically in the sketch following this list.)
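The quantities in items 2–4 can be checked numerically. Here is a small Python sketch (using scipy, which is an assumption of this note rather than anything in the SAS output) that recovers the adjusted $R^2$, the overall F-test p-value, and one of the t-test p-values from the numbers in the output above:

    from scipy import stats

    n, p = 12, 5        # 12 observations, 5 coefficients (intercept + 4 slopes)
    R2 = 0.8709

    # Adjusted R^2: 1 - ((n-1)/(n-p)) * (1 - R^2)
    Ra2 = 1 - ((n - 1) / (n - p)) * (1 - R2)
    print(round(Ra2, 4))    # 0.7971 (SAS shows 0.7972, from the unrounded R-Square)

    # p-value of the overall F test: F = 11.81 on (p-1, n-p) = (4, 7) df
    print(stats.f.sf(11.81, p - 1, n - p))      # approximately 0.0031

    # two-sided p-value of the t test for b1: t = 4.36 on n - p = 7 df
    print(2 * stats.t.sf(4.36, n - p))          # approximately 0.0033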

We may use matrix algebra to obtain all of these values with the exception of the p-values.

  1. F Statistic:

    $F = \dfrac{\frac{1}{p-1}\left(b^TX^TY - \frac{1}{n}Y^TJY\right)}{\frac{1}{n-p}\left(Y^TY - b^TX^TY\right)}$

    where $J$ is the $n \times n$ matrix of 1's.

  2. $R^2$:

    $R^2 = 1 - \dfrac{Y^TY - b^TX^TY}{Y^TY - \frac{1}{n}Y^TJY}$

  3. $s_b$:

    $s_b^2 = \text{diagonal of } s_e^2(X^TX)^{-1}$

    Hence:

    $s_b = \sqrt{s_b^2}$
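Continuing in the same vein, here is a self-contained numpy sketch of these three formulas, using the same made-up toy data as the earlier sketch:

    import numpy as np

    # Toy data (values made up for illustration).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    Y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

    n = len(Y)
    X = np.column_stack([np.ones(n), x1, x2])
    p = X.shape[1]
    b = np.linalg.inv(X.T @ X) @ X.T @ Y

    J = np.ones((n, n))                     # n x n matrix of 1's

    SSE  = Y @ Y - b @ (X.T @ Y)            # error sum of squares
    SSR  = b @ (X.T @ Y) - (Y @ J @ Y) / n  # model sum of squares
    SSTO = Y @ Y - (Y @ J @ Y) / n          # total sum of squares

    # 1. F statistic
    F = (SSR / (p - 1)) / (SSE / (n - p))

    # 2. Coefficient of multiple determination
    R2 = 1 - SSE / SSTO

    # 3. Standard errors of the coefficients
    se2 = SSE / (n - p)
    sb = np.sqrt(np.diag(se2 * np.linalg.inv(X.T @ X)))

    print(F, R2, sb)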

 

Miscellaneous

You should also note the following:

  1. Hat matrix:

    $\hat{Y} = Xb$

    $\hat{Y} = X(X^TX)^{-1}X^TY$

    We refer to $X(X^TX)^{-1}X^T$ as the hat matrix and denote it $H$. Hence:

    $\hat{Y} = HY$

    The elements along the diagonal of $H$ are known as leverage values. The $i$th value is denoted by $h_{ii}$, $i = 1, \ldots, n$. Note that $h_{ii}$ may be obtained thus:

    $h_{ii} = X_i^T(X^TX)^{-1}X_i$

    where $X_i$ corresponds to the $i$th row of $X$ and so refers to the particular setting of the explanatory variables associated with the $i$th observation. So, think of $h_{ii}$ as the leverage of the $i$th observation.

    Consider the $(p-1)$-dimensional space defined by the explanatory variables only. For example, for the 3-tuple $(y, x_1, x_2)$, consider the plane defined by $x_1$ and $x_2$. Now, $h_{ii}$ is a measure of the distance of the point defined by the explanatory variables of the $i$th observation from the center of all such points in this $(p-1)$-dimensional space. The significance of this is that the further away a particular point is from the center, the more influence it has on the coefficients of the regression equation. To see this for the two-dimensional case (i.e. one $x$ variable, so that the $(p-1)$-dimensional space is simply a line), try the Leverage Points & Regression Equation applet by R. W. West at the University of South Carolina.

    Note that the $h_{ii}$ have the following properties (both are verified numerically in the sketch after this list):

    1. $0 \le h_{ii} \le 1$
    2. $\sum_{i=1}^{n} h_{ii} = p$, where $p$ is the number of coefficients in the regression equation.

  2. Residuals:

    $e = Y - Xb$

    Hence:

    $e = Y - HY = (I - H)Y$
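To close, a short numpy sketch (same made-up toy data as above) that forms the hat matrix, checks the two leverage properties, and computes the residuals both ways:

    import numpy as np

    # Toy data (values made up for illustration).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    Y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

    n = len(Y)
    X = np.column_stack([np.ones(n), x1, x2])
    p = X.shape[1]

    # Hat matrix H = X (X^T X)^{-1} X^T
    H = X @ np.linalg.inv(X.T @ X) @ X.T

    Y_hat = H @ Y               # fitted values
    h = np.diag(H)              # leverage values h_ii

    print(h.min() >= 0.0 and h.max() <= 1.0)  # property 1: 0 <= h_ii <= 1
    print(np.isclose(h.sum(), p))             # property 2: sum of the h_ii equals p

    # Residuals: e = Y - Xb = (I - H)Y
    b = np.linalg.inv(X.T @ X) @ X.T @ Y
    e = (np.eye(n) - H) @ Y
    print(np.allclose(e, Y - X @ b))          # True: the two forms agree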