Regression contd.

Readings:

  1. Ott; 12.2 - 12.4, 12.6
  2. Ott; 13.4
  3. Ott; 8.2

SLR and the 2-Independent Sample Problem:

Consider the two independent sample problem discussed in week 6. If y1 and y2 are normally distributed and sy1=sy2, then Theorem 2a applies when either sample is small, and Theorem 1 applies (as a special case) when both samples are large.

Given these conditions, the two independent sample problem is a special case of the SLR problem. To see this, consider the two independent sample problem as a problem involving a set of pairs (y, x), where x=0 indicates that the corresponding y is from population 1 and x=1 indicates that the corresponding y is from population 2. SLR is appropriate here since the relationship between x and y is linear and, given our assumptions, all of the SLR model axioms hold.

We may deduce the following:

  1. Let x=0:
    my1= b0 + b1(0) = b0
  2. Let x=1:
    my2= b0 + b1(1) = b0 + b1

    Hence:
    b1 = my2 - my1

Therefore, inferences about b0 are equivalent to inferences about my1 and, more importantly, inferences about b1 are equivalent to inferences about my2 - my1.
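
In fact, the regression t statistic for b1 is identical to the pooled two-sample t statistic. A sketch of the algebra, using the usual SLR standard error formula and writing n1 for the number of observations with x=0, n2 for the number with x=1, and n = n1 + n2:

    \bar{x} = \frac{n_2}{n}, \qquad
    S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2
           = n_1\left(\frac{n_2}{n}\right)^2 + n_2\left(\frac{n_1}{n}\right)^2
           = \frac{n_1 n_2}{n}

so that

    SE(\hat{b}_1) = \frac{s}{\sqrt{S_{xx}}} = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

Since the root MSE s from the regression equals the pooled standard deviation, this is exactly the standard error used by the pooled t test, and the two t statistics agree up to sign.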

We may illustrate by considering the following hypotheses:

H0: ma-mb=0
Ha: ma-mb!=0

Equivalently, in the SLR formulation, these are hypotheses about the slope: H0: b1=0 versus Ha: b1!=0.

Now, suppose that the following program has been written so that we can conduct this test of hypotheses:

Code:

options ps=55 ls=76;
data a;
input group $ y @@;
x=1;                      * indicator: x=0 for group a, x=1 for group b;
if group='a' then x=0;
datalines;
a 75 a 76 a 80 a 77 a 80 a 77 a 73
b 82 b 80 b 85 b 85 b 78 b 87 b 82
;
proc univariate normal;
 var y;
 by group;
proc ttest;
 class group;
 var y;
proc reg;
 model y=x;
 output out=new1 r=resid;
proc univariate normal;
 var resid;  
run;
Notice that we have two small samples, each of size 7. Observe that the y values for group a are also identified by x=0 and the y values for group b by x=1. If you examine the output produced by this SAS code, the relevant section from proc reg is:
                         Parameter Estimates

                       Parameter       Standard
 Variable     DF       Estimate          Error    t Value    Pr > |t|

 Intercept     1       76.85714        1.08170      71.05      <.0001
 x             1        5.85714        1.52975       3.83      0.0024
Notice that b0=76.85714 and b1=5.85714. Now, the Statistics and T-Tests sections of the proc ttest output are the sections needed for comparison:
                               Statistics
 
                           Lower CL          Upper CL  Lower CL
 Variable  group       N     Mean    Mean      Mean   Std Dev  Std Dev

 y         a           7    74.504  76.857    79.211    1.6399   2.5448
 y         b           7    79.804  82.714    85.625     2.028   3.1472
 y         Diff (1-2)        -9.19  -5.857    -2.524    2.0522   2.8619

                                  T-Tests

 Variable    Method           Variances      DF    t Value    Pr > |t|

 y           Pooled           Equal          12      -3.83      0.0024
 y           Satterthwaite    Unequal      11.5      -3.83      0.0026
Notice that yabar = 76.857, which (up to rounding) is equal to b0 = 76.85714. This is expected given the expressions above. Also, ybbar - yabar = 82.714 - 76.857 = 5.857, which (up to rounding) is equal to b1 = 5.85714.

Now, notice that the p-value for the estimate of the slope parameter (see the Parameter Estimates section of the proc reg output) is 0.0024, which matches the pooled proc ttest p-value of 0.0024. (The t statistics agree up to sign: proc ttest reports the Diff (1-2) estimate yabar - ybbar = -5.857 with t = -3.83, while proc reg reports b1 = ybbar - yabar = 5.857 with t = 3.83.)

Hence, in both cases we reject our null hypothesis and so we may use either proc ttest or proc reg to solve these problems.
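
The correspondence extends to interval estimates. As a minimal sketch (assuming the data set a created above is still available), the clb option on the model statement requests confidence limits for the parameter estimates; the interval for x should match the Diff (1-2) interval in the proc ttest Statistics section, apart from sign:

proc reg data=a;
  model y=x / clb;   * clb requests 95% confidence limits for b0 and b1;
run;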

 

Multiple Regression:

Recall that we are interested in the situation where, for each item in some population, we have k+1 numeric characteristics of interest. That is, for the ith item in the population, we observe the (k+1)-tuple:

(yi, xi,1,..., xi,k)

We believe that each yi is linearly related to the corresponding xi,j's (j=1,...,k). That is, an expression that describes the relationship would have the following form:

y = b0 + b1x1 + ... + bkxk

So, if you think of each item in our population as a point in (k+1)-dimensional space, then this expression defines a hyperplane which describes how y changes as the xj's (j=1,...,k) change.

Now, remember also that the simple linear regression model expresses the relationship between a dependent variable and a single independent variable:

y=b0 + b1x + e

Multiple regression models, on the other hand, express the relationship between a dependent variable and several independent variables, higher order terms of a single independent variable, or a combination of both:

  1. y=b0 + b1x1 + b2x2 + b3x3 + e
  2. y=b0 + b1x + b2x^2 + b3x^3 + e
  3. y=b0 + b1x1 + b2x2 + b3x1x2 + b4x2^2 + e
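
In proc reg, higher order and interaction terms must be created explicitly in a data step, since the model statement does not accept expressions such as x**2 (the problem below does exactly this with sqF=F**2 and cubeF=F**3). A minimal sketch, with hypothetical data set and variable names, for models 2 and 3:

data b;
  set a;           * hypothetical input data set containing y, x, x1, x2;
  xsq  = x**2;     * square of x (model 2);
  xcube= x**3;     * cube of x (model 2);
  x1x2 = x1*x2;    * interaction of x1 and x2 (model 3);
  x2sq = x2**2;    * square of x2 (model 3);
proc reg;
  model y = x xsq xcube;       * model 2;
  model y = x1 x2 x1x2 x2sq;   * model 3;
run;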

Multiple Regression - General Linear Model

We may express any multiple regression model by the following general linear model, where any xi (i=1,…,k) term may be first order or higher order (remember that interaction terms are higher order terms). Also, any xi (i=1,…,k) term may be either a quantitative or a qualitative term:

y=b0 + b1x1 + b2x2 + … + bkxk + e

where for any setting of the k-tuple (x1,..,xk) the corresponding errors e:

  1. are normally distributed.
  2. have mean zero.
  3. have constant standard deviation se.

The parameters of the model are:

  1. b0 - intercept
  2. bj - slope parameter, where j=1,…,k
  3. se - standard deviation about the regression surface for fixed (x1,..,xk)
Note: Remember that bj (j=1,..,k) is the expected change in y for a unit increase in xj when all other xi's (i!=j) are held constant.

e.g. Consider the following multiple regression equation:

y=20+0.95x1-0.5x2

This equation defines a regression surface. For fixed x2, say x2=20, the regression equation reduces to the line y = 10 + 0.95x1 (since 20 - 0.5(20) = 10). Thus, the intercept in this case is 10 and the change in y for a unit change in x1 is 0.95.
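
A minimal sketch (the data set name surface is hypothetical) that evaluates this surface at a few settings; note that changing x2 shifts the intercept of the line in x1 but leaves the slope on x1 fixed at 0.95:

data surface;
  do x2 = 20, 30;          * two fixed settings of x2;
    do x1 = 0, 10, 20;     * a few values of x1;
      y = 20 + 0.95*x1 - 0.5*x2;
      output;
    end;
  end;
run;
proc print data=surface;
run;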

We will be using SAS to derive multiple regression equations. So, let us say that for some regression problem, we believe that a multiple regression model is appropriate. The following problem illustrates how we may use SAS to solve such a problem.

Problem:

A software development manager believes that she can develop a good model to predict work effort from system size. She examines project data for several recently completed projects and computes effort in 'thousands of work hours' as well as system size in 'function points' for each project.

Provide the SAS code necessary to analyze these data ensuring that the following models are assessed:

  1. y=b0 + b1x + e
  2. y=b0 + b1x + b2x^2 + e
  3. y=b0 + b1x + b2x^2 + b3x^3 + e

Compare the proc reg output for model 1 and model 2. Which model do you think is better? Also, comment on the slope parameters for model 1 and model 2. In particular, comment on the following hypotheses for each model:

H0: bj=0
Ha: bj!=0

That is, for model 1, j=1 and for model 2, j=1,2.

Code:

options ps=55 ls=76;
data fpts1;
  infile 'albrecht.dat';
  input lang $ F K;
  sqF=F**2;
  cubeF=F**3;
  label lang='Language'
        F='Function Points'
        sqF='Square of Function Points'
        cubeF='Cube of Function Points'
        K='Thousands of Work Hours';
title 'System Development Projects';
proc sort;
  by F;
proc reg;
  model K=F / r;
  model K=F sqF / r;
  model K=F sqF cubeF / r;
    output out=new3 p=predict r=resid;
proc print label;
  var K predict resid;
proc univariate normal plot;
  var resid;
run;

Output:

The following output has been edited to reflect the proc reg output for models 1 and 2 only.

Model 1:
                     Analysis of Variance
                     Sum of         Mean
Source     DF      Squares       Square   F Value    Prob>F
Model       1  16238.57470  16238.57470   152.952    0.0001
Error      22   2335.69030    106.16774
C Total    23  18574.26500

       Root MSE      10.30377     R-square       0.8743
       Dep Mean      21.87500     Adj R-sq       0.8685
       C.V.          47.10296

                       Parameter Estimates
               Parameter    Standard    T for H0:      
Variable  DF    Estimate      Error   Parameter=0  Prob>|T|

INTERCEP   1  -13.387901   3.54308797   -3.779       0.0010
F          1    0.054450   0.00440268   12.367       0.0001


Model 2:
                     Analysis of Variance
                     Sum of         Mean
Source     DF      Squares       Square   F Value     Prob>F
Model       2  17636.02059   8818.01029   197.367     0.0001
Error      21    938.24441     44.67831
C Total    23  18574.26500

       Root MSE       6.68418     R-square       0.9495
       Dep Mean      21.87500     Adj R-sq       0.9447
       C.V.          30.55627

                      Parameter Estimates
                Parameter    Standard   T for H0:       
Variable  DF     Estimate       Error  Parameter=0  Prob>|T|

INTERCEP   1     8.477886   4.53528150     1.869      0.0756
F          1    -0.013682   0.01251256    -1.093      0.2866
SQF        1  0.000034368   0.00000615     5.593      0.0001
 

Discussion:

The R-square value for model 1 is 0.8743, which indicates that 87.43% of the variability in effort is accounted for by considering function points. The R-square value for model 2 is 0.9495, which indicates that 94.95% of the variability in effort is accounted for by considering function points and the square of function points. Since model 2 explains more of the variability in effort (its adjusted R-square, 0.9447 versus 0.8685, also favors it), model 2 seems to be better.

Considering the significance of the slope parameters is equivalent to considering the significance of the corresponding independent variables to the model. From the proc reg output, the p-values for the null hypotheses, by model, are:

  1. Model 1: Since the p-value for the slope parameter is 0.0001, it is highly significant, and so we reject the null hypothesis and conclude that the slope parameter for this model is distinctly non-zero. We therefore have strong evidence to believe that effort depends on function points.
  2. Model 2: The p-value for the slope parameter b1 is 0.2866, so we have insufficient evidence to reject the null hypothesis. The p-value for the slope parameter b2 is 0.0001, which is highly significant, and so we reject the null hypothesis. Hence, we conclude that when function points and the square of function points are both considered as candidate explanatory variables, effort depends on the square of function points but not on function points. (This is not unusual: F and sqF are highly correlated, so individual t tests can change markedly when both are in the model.)

 

Multiple Regression - Model Building

Multiple regression models need to be considered in most practical situations; in particular, they should be considered in the following cases:

  1. If a simple linear regression model explains less than 80% of the variability in the data and there are other candidate variables.
  2. If strong prior evidence of dependence on multiple independent variables exists.

However, we should always seek a balance between parsimony and accuracy: compact yet expressive models with interpretable parameters are usually preferred. In any event, the challenge is in determining the appropriate model for the problem. This is the model building problem.

Model building involves several steps:

  1. Construct a list of candidate independent variables.
    1. Qualitative and quantitative variables may be included.
    2. If prior knowledge of suitable candidates does not exist, then start with all available variables (for this class, assume that all provided variables are suitable).
  2. Generate all possible first order simple linear regression models. If there are n candidate variables, build n models (see the sketch after this list).
    1. Select the best simple linear regression model.
      Note: For now we will consider best to be the model with the highest R2.
    2. If more than 80% of the variation is explained, consider stopping and continuing at step 4.
  3. Generate all possible first order k variable multiple regression models, starting with k=2, by adding each remaining variable, one at a time, to the variable(s) already in the model.
    1. Select the best model.
    2. Increase k by 1 and repeat step 3 until there are no more variables or until a less than 5% increase in R2 is observed. (For example, moving from model 1 to model 2 above increased R-square from 0.8743 to 0.9495, well above 5%, so sqF would be kept.) If there are no more variables to be added, continue at step 4.
  4. Check the model assumptions by examining residual plots.
  5. If the model assumptions hold, model building is done. Complete the exercise by interpreting the parameters.
  6. If model assumptions do not hold, consider remedial measures:
    1. Transform the independent variables. If there is bias in the residual plots, consider higher order terms.
    2. Transform the dependent variable. Only transform the dependent variable if there is non-constant standard deviation. We will consider log transforms only (see the sketch below).
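
For concreteness, here is a minimal SAS sketch of steps 2 and 6 (the data set mydata and the variables x1-x3 are hypothetical):

* Step 2 sketch: fit each candidate simple linear regression model and
  compare the R-square values reported by proc reg;
proc reg data=mydata;
  model y = x1;
  model y = x2;
  model y = x3;
run;
* Step 6 sketch: if the residuals show non-constant standard deviation,
  log-transform the dependent variable and refit;
data mydata2;
  set mydata;
  logy = log(y);     * natural log of the dependent variable;
run;
proc reg data=mydata2;
  model logy = x1;   * refit the chosen model with the transformed y;
run;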