Multiple Regression contd.
General Linear Model (GLM)
Recall that given the p-tuple (yi, xi,1, ..., xi,p-1), we may express each yi in terms of the corresponding xi,j's, j = 1, ..., p-1, by the following model:

    y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_{p-1} x_{i,p-1} + e_i,    i = 1, ..., n

where for any setting of the (p-1)-tuple (x_{i,1}, ..., x_{i,p-1}) the corresponding e_i are independent N(0, \sigma^2) random variables.
We have seen that we may use matrix algebra to represent this model. That is, let Y be the vector of the n yi values. Let X be an n x p matrix, where the first column of X is a column of n 1's and the remaining columns of X are vectors for each of the p-1 x's. Let \beta be the vector of the p parameters and e the vector of the n error terms, so that

    Y = X\beta + e

We may solve for b, the vector of estimates of \beta, thus:

    b = (X^T X)^{-1} X^T Y

We may also obtain an expression for s_e^2. That is:

    s_e^2 = (Y - Xb)^T (Y - Xb) / (n - p) = SSE / (n - p)
Recall that this quantity is also known as the MSE, and its positive square root as the Root MSE.
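To make these matrix formulas concrete, here is a minimal sketch in Python with NumPy. The data, the "true" parameter values, and all variable names are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n = 12 observations, p - 1 = 4 explanatory variables,
# so X is n x p = 12 x 5 once the column of 1's is prepended.
n, p = 12, 5
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
beta = np.array([-8.0, 21.0, -2.0, 1.0, 10.0])  # assumed "true" parameters
e = rng.normal(0, 5, size=n)                    # e_i ~ N(0, sigma^2), iid
Y = X @ beta + e

# b = (X^T X)^{-1} X^T Y  (solve is preferred to forming an explicit inverse)
b = np.linalg.solve(X.T @ X, X.T @ Y)

# s_e^2 = (Y - Xb)^T (Y - Xb) / (n - p), i.e. the MSE;
# its positive square root is the Root MSE.
resid = Y - X @ b
s2_e = resid @ resid / (n - p)
print("b =", b)
print("Root MSE =", np.sqrt(s2_e))
```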
GLM Parameter Estimates and SAS
Consider a multiple regression problem where there are four explanatory variables. Let us say we have a sample of twelve observations and would like to estimate the parameters of the GLM for this problem.
Let us say the following SAS output was produced:
                              The REG Procedure
                                Model: MODEL1
                            Dependent Variable: y

                             Analysis of Variance

                                    Sum of         Mean
Source                  DF         Squares       Square    F Value    Pr > F
Model                    4      1379.29263    344.82316      11.81    0.0031
Error                    7       204.37403     29.19629
Corrected Total         11      1583.66667

Root MSE              5.40336    R-Square     0.8709
Dependent Mean       32.83333    Adj R-Sq     0.7972
Coeff Var            16.45693

                             Parameter Estimates

                     Parameter      Standard
Variable     DF       Estimate         Error    t Value    Pr > |t|
Intercept     1       -7.98075       8.14486      -0.98      0.3598
x1            1       21.19048       4.86006       4.36      0.0033
x2            1       -1.90504       1.98657      -0.96      0.3695
x3            1        0.97208       1.63173       0.60      0.5701
x4            1       10.00168       1.80694       5.54      0.0009
From this output, remember that we may obtain the following estimated regression equation:

    \hat{y} = -7.98 + 21.19 x_1 - 1.91 x_2 + 0.97 x_3 + 10.00 x_4
Also, the estimate of the standard deviation about the regression surface is s_e = 5.40.
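The same table can be produced outside SAS. As a hedged sketch, assuming the data are available in files y.txt and x.txt (hypothetical names; the original twelve observations are not listed in these notes), the statsmodels package in Python yields an equivalent analysis of variance and parameter-estimates summary:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data files: y holds the 12 responses, X the 12 x 4 matrix
# of the explanatory variables x1, ..., x4.
y = np.loadtxt("y.txt")
X = np.loadtxt("x.txt")

X_design = sm.add_constant(X)    # prepend the column of 1's
fit = sm.OLS(y, X_design).fit()
print(fit.summary())             # ANOVA-style fit statistics and t-tests
print(fit.mse_resid ** 0.5)      # Root MSE, cf. 5.40336 in the SAS output
```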
First consider the overall F-test of the model:

    H0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0
    H1: Not all \beta_k (k = 1, ..., 4) = 0

From the Analysis of Variance table, F = 11.81 with Pr > F = 0.0031. Since this p-value is highly significant, we reject the null hypothesis in favor of the alternative and conclude that the model does explain some of the variability in the response variable.

Next consider the individual t-tests on the parameters:

    H0: \beta_k = 0
    H1: \beta_k != 0    (k = 0, 1, ..., 4)

From the Parameter Estimates table, notice that only x1 (p = 0.0033) and x4 (p = 0.0009) are significantly non-zero.

Notice also that the adjusted coefficient of determination is Ra^2 = 0.7972. Remember that we may obtain this value from R^2 thus:

    R_a^2 = 1 - (1 - R^2) \frac{n - 1}{n - p}

Here n = 12 and p = 5, so R_a^2 = 1 - (1 - 0.8709)(11/7) ≈ 0.797, as in the output. Remember that this quantity accounts for the number of independent variables in the model. Since R^2 does not take into account the number of independent variables in the model, it increases monotonically as the number of variables in the model increases.
We may use matrix algebra to obtain all of these values with the exception of the p-values.
Hence:

    b = (X^T X)^{-1} X^T Y
    s_e^2 = (Y - Xb)^T (Y - Xb) / (n - p)
    se(b_k) = s_e \sqrt{[(X^T X)^{-1}]_{kk}}    (the standard errors)
    t_k = b_k / se(b_k)                         (the t values)
    R^2 = 1 - SSE / SSTO
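As a sketch of these computations in Python with NumPy (the data below are hypothetical stand-ins, since the original twelve observations are not listed in these notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data standing in for the 12 observations on (y, x1, ..., x4).
n, p = 12, 5
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])
Y = X @ np.array([-8.0, 21.0, -2.0, 1.0, 10.0]) + rng.normal(0, 5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y                     # parameter estimates
resid = Y - X @ b
s2_e = resid @ resid / (n - p)            # MSE
se_b = np.sqrt(s2_e * np.diag(XtX_inv))   # standard errors of the b_k
t = b / se_b                              # t values
SSE = resid @ resid
SSTO = ((Y - Y.mean()) ** 2).sum()
R2 = 1 - SSE / SSTO
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p)
print(b, se_b, t, R2, R2_adj)
```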
Miscellaneous
You should also note the following:
We refer to X(X^T X)^{-1} X^T as the hat matrix and denote it H. Hence:

    \hat{Y} = Xb = X(X^T X)^{-1} X^T Y = HY

so H is the matrix that "puts the hat on" Y.
The elements along the diagonal of H are known as leverage values. The ith value is denoted by h_ii, i = 1, ..., n. Note that h_ii may be obtained thus:

    h_ii = X_i (X^T X)^{-1} X_i^T
where Xi corresponds to the ith row of X and so refers to the particular setting of the explanatory variables associated with the ith observation. So, think of hii as the leverage of the ith observation.
Consider the (p-1)-dimensional space defined by the explanatory variables only. For example, for the 3-tuple (y, x1, x2), consider the plane defined by x1 and x2. Now, h_ii is a measure of the distance of the point defined by the explanatory variables of the ith observation from the center of all n such points in this (p-1)-dimensional space. The significance of this is that the further a particular point is from the center, the more influence it has on the coefficients of the regression equation. To see this for the two-dimensional case (i.e. one x variable, so the (p-1)-dimensional space is simply a line), try the Leverage Points & Regression Equation applet by R. W. West at the University of South Carolina.
Note that the h_ii have the following properties:

    0 <= h_ii <= 1
    \sum_{i=1}^{n} h_ii = trace(H) = p

Hence the average leverage value is p/n, and a common rule of thumb is to regard the ith observation as a high-leverage point when h_ii > 2p/n.
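Here is a minimal sketch with NumPy (again with a hypothetical design matrix) that computes the leverage values from the hat matrix and checks the properties above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design matrix: an intercept column plus p - 1 = 2
# explanatory variables for n = 10 observations.
n, p = 10, 3
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverage values h_ii

print(h)
print(h.min() >= 0 and h.max() <= 1)   # 0 <= h_ii <= 1
print(np.isclose(h.sum(), p))          # sum of h_ii = trace(H) = p
print(h > 2 * p / n)                   # rule-of-thumb flag for high leverage
```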