Multiple Regression contd.

General Linear Model (GLM)

Recall that given the $p$-tuple $(y_i, x_{i,1}, \ldots, x_{i,p-1})$, we may express each $y_i$ in terms of the corresponding $x_{i,j}$'s, $j = 1, \ldots, p-1$, by the following model.

$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \cdots + b_{p-1} x_{i,p-1} + e_i$

where for any setting of the $(p-1)$-tuple $(x_{i,1}, \ldots, x_{i,p-1})$ the corresponding $e_i$:

  1. are normally distributed.
  2. have mean zero.
  3. have constant standard deviation $\sigma_e$ (estimated by $s_e$).

We have seen that we may use matrix algebra to represent this model. That is, let $Y$ be the vector of the $n$ $y_i$ values. Let $X$ be an $n \times p$ matrix, where the first column of $X$ is a column of $n$ 1's and the remaining columns of $X$ are the vectors for each of the $p-1$ $x$'s. Let $b$ be the vector of coefficients (i.e. $b_0, b_1, \ldots, b_{p-1}$). Let $e$ be the vector of the $n$ residual values (i.e. the $e_i$). Then the model may be expressed in matrix terms thus:

$Y = Xb + e$

We may solve for b thus:

$b = (X^TX)^{-1}X^TY$

We may also obtain an expression for $s_e$. That is:

$s_e^2 = \frac{1}{n-p}\left(Y^TY - b^TX^TY\right)$

Recall that this quantity is also known as the MSE, and its positive square root as the Root MSE.
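To make these two formulas concrete, here is a minimal numpy sketch; the design matrix is built exactly as described above, and the data values are made up purely for illustration.

    import numpy as np

    # Toy data: n = 5 observations, p - 1 = 2 explanatory variables (values made up).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    Y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

    n = len(Y)
    X = np.column_stack([np.ones(n), x1, x2])  # first column is a column of 1's
    p = X.shape[1]                             # number of coefficients (here 3)

    # b = (X^T X)^{-1} X^T Y
    b = np.linalg.inv(X.T @ X) @ X.T @ Y

    # s_e^2 = (Y^T Y - b^T X^T Y) / (n - p), i.e. the MSE
    se2 = (Y @ Y - b @ (X.T @ Y)) / (n - p)
    root_mse = np.sqrt(se2)

    print(b, se2, root_mse)

Note that np.linalg.inv is used here only to mirror the formula; in practice np.linalg.solve (or np.linalg.lstsq) is numerically preferable.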

 

GLM Parameter Estimates and SAS

Consider a multiple regression problem where there are four explanatory variables. Let us say we have a sample of twelve observations and would like to estimate the parameters of the GLM for this problem.

Suppose the following SAS output was produced:


                             The REG Procedure
                               Model: MODEL1
                           Dependent Variable: y 

                           Analysis of Variance
 
                                 Sum of          Mean
 Source                DF       Squares        Square   F Value   Pr > F

 Model                  4    1379.29263     344.82316     11.81   0.0031
 Error                  7     204.37403      29.19629                   
 Corrected Total       11    1583.66667                                 


            Root MSE              5.40336    R-Square     0.8709
            Dependent Mean       32.83333    Adj R-Sq     0.7972
            Coeff Var            16.45693                       


                            Parameter Estimates
 
                         Parameter       Standard
    Variable     DF       Estimate          Error    t Value    Pr > |t|

    Intercept     1       -7.98075        8.14486      -0.98      0.3598
    x1            1       21.19048        4.86006       4.36      0.0033
    x2            1       -1.90504        1.98657      -0.96      0.3695
    x3            1        0.97208        1.63173       0.60      0.5701
    x4            1       10.00168        1.80694       5.54      0.0009

From this output, remember that we may obtain the following:

  1. Estimates of the coefficients of the regression equation are under the "Parameter Estimate" heading. That is, to 2 d.p.:

    $b^T = [-7.98,\; 21.19,\; -1.91,\; 0.97,\; 10.00]$

    Also, the estimate of the standard deviation about the regression surface is $s_e = 5.40$.

  2. We may use the F statistic (i.e. 11.81) and its p-value (i.e. 0.0031) for the following hypothesis test.

    $H_0$: $b_1 = b_2 = b_3 = b_4 = 0$
    $H_1$: not all $b_k = 0$ $(k = 1, \ldots, 4)$

    Since this p-value is very small, we reject the null hypothesis in favor of the alternative and conclude that the model does explain some of the variability in the response variable.

  3. We may test the following individual hypotheses by using the t statistics, and corresponding p-values, in the Parameter Estimates section of the output.

    $H_0$: $b_k = 0$ $(k = 0, 1, \ldots, 4)$
    $H_1$: $b_k \ne 0$ $(k = 0, 1, \ldots, 4)$

    Notice that only the coefficients of x1 and x4 are significantly different from zero.

  4. The coefficient of multiple determination is $R^2 = 0.8709$. This indicates that 87.09% of the variation in the response is explained by the model.

    Notice that the adjusted coefficient of determination is $R_a^2 = 0.7972$. Remember that we may obtain this value from $R^2$ thus:

    $R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)(1 - R^2)$

    Remember that this quantity accounts for the number of independent variables in the model. Since $R^2$ does not take the number of independent variables into account, it never decreases as variables are added to the model. (These quantities are checked numerically in the sketch following this list.)
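The quantities in items 2–4 can be checked numerically. Here is a small Python sketch (using scipy, which is an assumption of this note rather than anything in the SAS output) that recovers the adjusted $R^2$, the overall F-test p-value, and one of the t-test p-values from the numbers in the output above:

    from scipy import stats

    n, p = 12, 5        # 12 observations, 5 coefficients (intercept + 4 slopes)
    R2 = 0.8709

    # Adjusted R^2: 1 - ((n-1)/(n-p)) * (1 - R^2)
    Ra2 = 1 - ((n - 1) / (n - p)) * (1 - R2)
    print(round(Ra2, 4))    # 0.7971 (SAS shows 0.7972, from the unrounded R-Square)

    # p-value of the overall F test: F = 11.81 on (p-1, n-p) = (4, 7) df
    print(stats.f.sf(11.81, p - 1, n - p))      # approximately 0.0031

    # two-sided p-value of the t test for b1: t = 4.36 on n - p = 7 df
    print(2 * stats.t.sf(4.36, n - p))          # approximately 0.0033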

We may use matrix algebra to obtain all of these values with the exception of the p-values.

  1. F Statistic:

    $F = \dfrac{\frac{1}{p-1}\left(b^TX^TY - \frac{1}{n}Y^TJY\right)}{\frac{1}{n-p}\left(Y^TY - b^TX^TY\right)}$

    where $J$ is the $n \times n$ matrix of 1's.

  2. $R^2$:

    $R^2 = 1 - \dfrac{Y^TY - b^TX^TY}{Y^TY - \frac{1}{n}Y^TJY}$

  3. $s_b$:

    $s_b^2 = \text{diagonal of } s_e^2(X^TX)^{-1}$

    Hence:

    $s_b = \sqrt{s_b^2}$
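Continuing in the same vein, here is a self-contained numpy sketch of these three formulas, using the same made-up toy data as the earlier sketch:

    import numpy as np

    # Toy data (values made up for illustration).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    Y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

    n = len(Y)
    X = np.column_stack([np.ones(n), x1, x2])
    p = X.shape[1]
    b = np.linalg.inv(X.T @ X) @ X.T @ Y

    J = np.ones((n, n))                     # n x n matrix of 1's

    SSE  = Y @ Y - b @ (X.T @ Y)            # error sum of squares
    SSR  = b @ (X.T @ Y) - (Y @ J @ Y) / n  # model sum of squares
    SSTO = Y @ Y - (Y @ J @ Y) / n          # total sum of squares

    # 1. F statistic
    F = (SSR / (p - 1)) / (SSE / (n - p))

    # 2. Coefficient of multiple determination
    R2 = 1 - SSE / SSTO

    # 3. Standard errors of the coefficients
    se2 = SSE / (n - p)
    sb = np.sqrt(np.diag(se2 * np.linalg.inv(X.T @ X)))

    print(F, R2, sb)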

 

Miscellaneous

You should also note the following:

  1. Hat matrix:

    $\hat{Y} = Xb$

    $\hat{Y} = X(X^TX)^{-1}X^TY$

    We refer to $X(X^TX)^{-1}X^T$ as the hat matrix and denote it $H$. Hence:

    $\hat{Y} = HY$

    The elements along the diagonal of $H$ are known as leverage values. The $i$th value is denoted by $h_{ii}$, $i = 1, \ldots, n$. Note that $h_{ii}$ may be obtained thus:

    $h_{ii} = X_i^T(X^TX)^{-1}X_i$

    where $X_i$ corresponds to the $i$th row of $X$ and so refers to the particular setting of the explanatory variables associated with the $i$th observation. So, think of $h_{ii}$ as the leverage of the $i$th observation.

    Consider the $(p-1)$-dimensional space defined by the explanatory variables only. For example, for the 3-tuple $(y, x_1, x_2)$, consider the plane defined by $x_1$ and $x_2$. Now, $h_{ii}$ is a measure of the distance of the point defined by the explanatory variables of the $i$th observation from the center of all such points in this $(p-1)$-dimensional space. The significance of this is that the further away a particular point is from the center, the more influence it has on the coefficients of the regression equation. To see this for the two-dimensional case (i.e. one $x$ variable, so that the $(p-1)$-dimensional space is simply a line), try the Leverage Points & Regression Equation applet by R. W. West at the University of South Carolina.

    Note that the $h_{ii}$ have the following properties (both are verified numerically in the sketch after this list):

    1. $0 \le h_{ii} \le 1$
    2. $\sum_{i=1}^{n} h_{ii} = p$, where $p$ is the number of coefficients in the regression equation.

  2. Residuals:

    $e = Y - Xb$

    Hence:

    $e = Y - HY = (I - H)Y$
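To close, a short numpy sketch (same made-up toy data as above) that forms the hat matrix, checks the two leverage properties, and computes the residuals both ways:

    import numpy as np

    # Toy data (values made up for illustration).
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    Y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

    n = len(Y)
    X = np.column_stack([np.ones(n), x1, x2])
    p = X.shape[1]

    # Hat matrix H = X (X^T X)^{-1} X^T
    H = X @ np.linalg.inv(X.T @ X) @ X.T

    Y_hat = H @ Y               # fitted values
    h = np.diag(H)              # leverage values h_ii

    print(h.min() >= 0.0 and h.max() <= 1.0)  # property 1: 0 <= h_ii <= 1
    print(np.isclose(h.sum(), p))             # property 2: sum of the h_ii equals p

    # Residuals: e = Y - Xb = (I - H)Y
    b = np.linalg.inv(X.T @ X) @ X.T @ Y
    e = (np.eye(n) - H) @ Y
    print(np.allclose(e, Y - X @ b))          # True: the two forms agree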