Regression contd.
Readings:
SLR and the 2-Independent Sample Problem:
Consider the two independent sample problem discussed in week 6. If y1 and y2 are normally distributed and σy1 = σy2, then Theorem 2a applies if either sample is small and, if the samples are large, Theorem 1 applies as a special case.
Given these conditions, the two independent sample problem is a special case of the SLR problem. To see this, consider the two independent sample problem as a problem that involves a set of pairs (y, x) where x=0 indicates that the corresponding y is from population 1 and x=1 indicates that the corresponding y is from population 2. SLR is clearly appropriate in this case since the relationship between x and y is linear and furthermore, given our assumptions, all of the SLR model axioms hold.
We may deduce the following: since E(y | x) = β0 + β1x, we have E(y | x=0) = β0 = μ1 and E(y | x=1) = β0 + β1 = μ2, so that β1 = μ2 - μ1.
Therefore, inferences about β0 are equivalent to inferences about μ1 and, more importantly, inferences about β1 are equivalent to inferences about μ2 - μ1.
We may illustrate by considering the following hypotheses:
H0: μa - μb = 0
Ha: μa - μb ≠ 0

In terms of the SLR model, these hypotheses are equivalent to H0: β1 = 0 versus Ha: β1 ≠ 0.
Now let us say the following program has been written so that we can conduct a test of hypotheses:
Code:
options ps=55 ls=76;
data a;
  input group $ y @@;        * read (group, y) pairs, several per line;
  x=1;
  if group='a' then x=0;     * dummy coding: x=0 for group a, x=1 for group b;
datalines;
a 75 a 76 a 80 a 77 a 80 a 77 a 73
b 82 b 80 b 85 b 85 b 78 b 87 b 82
;
proc univariate normal;      * normality check for y within each group;
  var y;
  by group;
proc ttest;                  * two independent sample t-test;
  class group;
  var y;
proc reg;                    * SLR of y on the dummy variable x;
  model y=x;
  output out=new1 r=resid;
proc univariate normal;      * normality check for the SLR residuals;
  var resid;
run;

Notice that we have two small samples, each of size 7. Observe that the y values for group a are also identified by x=0 and the y values for group b by x=1. If you examine the output produced by this SAS code, the relevant section from proc reg is:
Parameter Estimates

                      Parameter     Standard
Variable      DF       Estimate        Error    t Value    Pr > |t|
Intercept      1       76.85714      1.08170      71.05      <.0001
x              1        5.85714      1.52975       3.83      0.0024

Notice that b0 = 76.85714 and b1 = 5.85714. Now, the Statistics and T-Tests sections of the proc ttest output are the sections needed for comparison:
Statistics

                           Lower CL              Upper CL    Lower CL
Variable  group         N      Mean      Mean       Mean     Std Dev    Std Dev
y         a             7    74.504    76.857     79.211      1.6399     2.5448
y         b             7    79.804    82.714     85.625      2.028      3.1472
y         Diff (1-2)         -9.19     -5.857     -2.524      2.0522     2.8619

T-Tests

Variable  Method          Variances       DF    t Value    Pr > |t|
y         Pooled          Equal           12      -3.83      0.0024
y         Satterthwaite   Unequal       11.5      -3.83      0.0026

Notice that ȳa = 76.857, which is equal to b0 = 76.85714. This is expected given the expressions above. Also, ȳb - ȳa = 82.714 - 76.857 = 5.857, which is equal to b1 = 5.85714.
Now, notice that the p-value for the estimate of the slope parameter (see the Parameter Estimates section of the proc reg output) is 0.0024. This matches the pooled p-value from proc ttest, which is also 0.0024.
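In fact, the two test statistics agree in magnitude, as a quick check with the values above confirms:

  t = b1 / SE(b1) = 5.85714 / 1.52975 ≈ 3.83

This equals the pooled two-sample t = -3.83 in absolute value; the sign differs only because proc ttest reports the difference as ȳa - ȳb while b1 estimates μb - μa.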
Hence, in both cases we reject our null hypothesis and so we may use either proc ttest or proc reg to solve these problems.
Multiple Regression:
Recall that we are interested in the situation where, for each item in some population, we have k+1 numeric characteristics of interest. That is, for the ith item in the population, we observe the (k+1)-tuple (yi, xi,1, xi,2, ..., xi,k).
We believe that each yi is linearly related to the corresponding xi,j's (j=1,...,k). That is, an expression that describes the relationship would have the following form:

yi = β0 + β1xi,1 + β2xi,2 + ... + βkxi,k + ei
So, if you think of each item in our population as a point in (k+1)-dimensional space, then this expression defines a hyperplane which describes how y changes as the xi,j's (j=1,...,k) change.
Now, remember also that the simple linear regression model expresses the relationship between a dependent variable and a single independent variable:

y = β0 + β1x + e
Multiple regression models, on the other hand, express the relationship between a dependent variable and either several independent variables or higher order terms of a single independent variable, for example:

y = β0 + β1x1 + β2x2 + e    (several independent variables)
y = β0 + β1x + β2x² + e     (higher order terms of a single independent variable)
Multiple Regression - General Linear Model
We may express any multiple regression model by the following general linear model, where any xi (i = 1, ..., k) term may be first order or higher order (remember that interaction terms are higher order terms), and any xi (i = 1, ..., k) term may be either a quantitative or a qualitative term:

y = β0 + β1x1 + β2x2 + ... + βkxk + e

where, for any setting of the k-tuple (x1, ..., xk), the corresponding e ~ N(0, σe). The parameters of the model are β0, β1, ..., βk and σe.

e.g. Consider a multiple regression equation in two independent variables:

y = β0 + β1x1 + β2x2
Note:
Remember that βj (j = 1, ..., k) is the expected change in y for a unit increase in xj when all other xi's (i ≠ j) are held constant.
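Equivalently, βj can be written as a difference of expected responses one unit apart in xj (with every other x held at the same values):

  βj = E(y | xj = a+1, other x's fixed) - E(y | xj = a, other x's fixed)

This holds for any value a, since every term not involving xj cancels in the subtraction.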
This equation defines a regression surface. For fixed x2, say x2 = 20, the surface reduces to a regression line in x1; suppose that line is y = 10 + 0.95x1. Thus, the intercept in this case is 10 and the change in y for a unit change in x1 is 0.95.
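In general, fixing one independent variable at a constant simply absorbs its term into the intercept. For the two-variable equation above, fixing x2 = c gives

  y = β0 + β1x1 + β2c = (β0 + β2c) + β1x1

so the slope in x1 is unchanged while the intercept shifts with c. With c = 20 as in the example, the coefficients satisfy β0 + 20β2 = 10 and β1 = 0.95.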
We will be using SAS to derive multiple regression equations. So, let us say that for some regression problem, we believe that a multiple regression model is appropriate. The following problem illustrates how we may use SAS to solve such a problem.
Problem:
A software development manager believes that she can develop a good model to predict work effort from system size. She examines project data for several recently completed projects and computes effort in thousands of work hours as well as system size in function points for each project.
Provide the SAS code necessary to analyze these data, ensuring that the following models are assessed:

Model 1: K = β0 + β1F + e
Model 2: K = β0 + β1F + β2F² + e
Model 3: K = β0 + β1F + β2F² + β3F³ + e

where K is effort in thousands of work hours and F is system size in function points.
Compare the proc reg output for model 1 and model 2. Which model do you think is better? Also, comment on the slope parameters for model 1 and model 2. In particular, comment on the following hypotheses for each model:
H0: βj = 0
Ha: βj ≠ 0
That is, for model 1, j=1 and for model 2, j=1,2.
Code:

options ps=55 ls=76;
data fpts1;
  infile 'albrecht.dat';
  input lang $ F K;
  sqF=F**2;                    * square of function points;
  cubeF=F**3;                  * cube of function points;
  label lang='Language'
        F='Function Points'
        sqF='Square of Function Points'
        cubeF='Cube of Function Points'
        K='Thousands of Work Hours';
title 'System Development Projects';
proc sort;
  by F;
proc reg;
  model K=F / r;               * model 1: first order;
  model K=F sqF / r;           * model 2: second order;
  model K=F sqF cubeF / r;     * model 3: third order;
  output out=new3 p=predict r=resid;   * predictions and residuals for the preceding model;
proc print label;
  var K predict resid;
proc univariate normal plot;   * normality check for the residuals;
  var resid;
run;
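For reference, the input statement above implies that each record of albrecht.dat holds a language name, a function point count F, and an effort value K, separated by spaces. The lines below are made-up placeholders that only illustrate the layout; they are not the actual data:

cobol 1750 102.4
pl1    902  31.1
dms    650  17.0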
Output:

The following output has been edited to show the proc reg output for models 1 and 2 only.
Discussion:
The R-square value for model 1 is 0.8743, which indicates that 87.43% of the variability in effort is accounted for by considering function points. The R-square value for model 2 is 0.9495, which indicates that 94.95% of the variability in effort is accounted for by considering function points and the square of function points. Since model 2 explains more of the variability in effort, model 2 seems to be better.
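These R-square values can be verified from the Analysis of Variance tables below, since R-square is the ratio of the model sum of squares to the total sum of squares:

  Model 1: R-square = 16238.57470 / 18574.26500 ≈ 0.8743
  Model 2: R-square = 17636.02059 / 18574.26500 ≈ 0.9495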
Considering the significance of the slope parameters is equivalent to considering the significance of the corresponding independent variables to the model. From the proc reg output, the p-values for the null hypotheses by model are:
Model 1:
Analysis of Variance

                         Sum of          Mean
Source          DF      Squares        Square     F Value    Prob>F
Model            1  16238.57470   16238.57470     152.952    0.0001
Error           22   2335.69030     106.16774
C Total         23  18574.26500

Root MSE    10.30377     R-square    0.8743
Dep Mean    21.87500     Adj R-sq    0.8685
C.V.        47.10296

Parameter Estimates

                  Parameter       Standard     T for H0:
Variable    DF     Estimate          Error    Parameter=0    Prob>|T|
INTERCEP     1   -13.387901     3.54308797         -3.779      0.0010
F            1     0.054450     0.00440268         12.367      0.0001
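As a quick consistency check on this table, the t value for F is its parameter estimate divided by its standard error and, for a model with a single slope parameter, the overall F statistic is the square of that t value:

  t = 0.054450 / 0.00440268 ≈ 12.367 and t² ≈ 152.95

which match the reported t value and F value to within rounding.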
Model 2:

Analysis of Variance

                         Sum of          Mean
Source          DF      Squares        Square     F Value    Prob>F
Model            2  17636.02059    8818.01029     197.367    0.0001
Error           21    938.24441      44.67831
C Total         23  18574.26500

Root MSE     6.68418     R-square    0.9495
Dep Mean    21.87500     Adj R-sq    0.9447
C.V.        30.55627

Parameter Estimates

                  Parameter       Standard     T for H0:
Variable    DF     Estimate          Error    Parameter=0    Prob>|T|
INTERCEP     1     8.477886     4.53528150          1.869      0.0756
F            1    -0.013682     0.01251256         -1.093      0.2866
SQF          1  0.000034368     0.00000615          5.593      0.0001
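Since model 2 contains an extra term, the adjusted R-square is the fairer basis for comparing the two models because it penalizes each added parameter. With n = 24 observations and k slope parameters:

  Adj R-sq = 1 - (1 - R-square)(n - 1)/(n - k - 1)

For model 2 this gives 1 - (1 - 0.9495)(23/21) ≈ 0.9447, matching the reported value.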
For model 1, the p-value for the slope parameter β1 is 0.0001, which is highly significant, and so we reject the null hypothesis: effort is dependent on function points.

For model 2, the p-value for the slope parameter β1 is 0.2866, and so we have insufficient evidence to reject the null hypothesis. The p-value for the slope parameter β2, however, is 0.0001, which is highly significant, and so we reject the null hypothesis.
Hence, we conclude that when function points and the square of function points are considered as candidate explanatory variables, effort is dependent on the square of function points but not on function points. (Note that F and F² are highly correlated, so each of these t-tests assesses a term's contribution given that the other term is already in the model.)
Multiple Regression - Model Building
Multiple regression models should be considered in most practical situations and, in particular, should be considered in the following cases:
Model building involves several steps: