Tips

Exam Format:

Expect three questions: one SAS question and two Procedures/Theory questions. The exam starts at 6:00 pm, and you will have 90 minutes to complete all questions.

SAS Review

For the important sections of the SAS text, see the readings for SAS.

The following are particularly important:

DATA step:

Naming convention
SAS types
Data step statements:
1. The data statement
2. The set statement
3. The input statement
4. Simple assignment statements; basic arithmetic operators
5. The if statement; comparison operators
6. Comments
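
The statements listed above can be seen together in a short sketch. The data set names (scores, passed), the variable names and the data lines are all invented for illustration, and the else statement is included only to complete the if statement.

```sas
/* Hypothetical example: every name and data value here is made up. */
data scores;                        /* data statement: names the new data set */
   input id $ test1 test2;          /* input statement: $ marks a character variable */
   total = test1 + test2;           /* assignment with an arithmetic operator */
   average = total / 2;
   if average >= 50 then result = 'pass';   /* if statement with a comparison operator */
   else result = 'fail';
   * a comment can also start with an asterisk and end with a semicolon;
   datalines;
a01 62 48
a02 35 41
a03 77 90
;
run;

data passed;
   set scores;                      /* set statement: reads an existing SAS data set */
   if result = 'pass';              /* subsetting if: keeps only the matching rows */
run;
```

Note that the names follow the SAS naming convention (they start with a letter or underscore and contain only letters, digits and underscores), and that id and result are character variables while the rest are numeric.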

PROC step:

You should be aware of the basic syntax and output of:

proc means
proc univariate
proc ttest
proc anova
proc npar1way
proc plot
proc reg

Remember the format of SAS procedures:

proc <procedure_name> [option]…;
[procedure_statement];…

The following options are important:

data= option for all procedures
normal option for proc univariate
plot option for proc univariate
wilcoxon option for proc npar1way

You must be familiar with the following procedure statements and know where each is appropriate:

var procedure_statement
by procedure_statement
label procedure_statement
model procedure_statement
class procedure_statement
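
As a rough illustration of the procedure format, the options and some of the statements above, the sketch below assumes a data set called trial with a grouping variable group and a numeric variable lead. Those names and the label text are assumptions, and proc sort is included only because by-group processing expects sorted data.

```sas
/* Hypothetical data set "trial" with variables "group" and "lead". */
proc sort data=trial;
   by group;                        /* sort first so the by statement below works */
run;

proc means data=trial;              /* data= option: names the input data set */
   var lead;                        /* var: the variable(s) to analyse */
   by group;                        /* by: repeat the analysis for each group */
   label lead = 'Lead level';       /* label: descriptive text used in the output */
run;

proc univariate data=trial normal plot;
   /* normal: requests a test of normality
      plot:   requests the stem & leaf, box and normal plots */
   var lead;
run;
```

The model and class statements follow the same pattern, but they belong to the procedures that fit a model or compare groups (for example proc reg, proc ttest, proc anova and proc npar1way).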

Statistics Review

Review chapters 1-5 of the Knafl notes. In particular, review the algorithms presented on 1.2.35 - 1.2.39 and 1.4.72 - 1.4.77.

Remember that you may be presented with either hypothesis testing or prediction problems, so you should be comfortable with the following material.

Hypothesis testing problems:

One-sample, two-sample and ANOVA problems (ANOVA limited to 3 samples):

Hypotheses:

Null hypothesis - H0.

Alternative hypothesis - H1; Ha; Hr
P-value:
1. Significance level:
  
  Highly significant: p-value < 0.01
  
  Significant: p-value ≤ 0.05
  
  Non significant: p-value > 0.05

Two sample problem - Heuristic

If sub-sample sizes are different then apply the two independent sample approach.
If the same subjects are used for each treatment then apply the paired sample approach.
Apply the two independent sample approach in all other cases except the following, where the paired sample approach applies:
1. Subjects are matched by some characteristic,
  e.g. twins, siblings, skill level.

One & paired two sample problem - Procedure

Assess normality:
1. test of normality and normal plot
If normality is reasonable or the sample size is large:
1. conduct the t-test
2. assess sensitivity
  1. conduct signed rank test
  2. conduct sign test
If normality is not reasonable:
1. assess symmetry
2. if symmetry is reasonable
  1. conduct signed rank test
  2. assess sensitivity
    1. conduct sign test
3. if not
  1. conduct sign test
Look at the stem & leaf plot and box plot for patterns.
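
One way to carry out the steps above in SAS is sketched here for the paired case. It assumes an existing data set called study with one row per subject and two numeric variables, before and after. All of these names are hypothetical.

```sas
/* Hypothetical paired data: one row per subject. */
data paired;
   set study;                       /* "study" is an assumed existing data set */
   diff = after - before;           /* analyse the within-subject difference */
run;

proc univariate data=paired normal plot;
   var diff;                        /* the paired problem becomes a one sample problem on diff */
run;
```

For the variable diff, the proc univariate output reports the test of normality and the plots, together with the t-test of mean zero, the sign test and the signed rank test, so a single run supplies everything the procedure calls for. A one sample problem is handled the same way, with the original variable (minus any hypothesized mean) in place of diff.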

Two sample problem (Independent) - Procedure

Check for outliers
1. inspect box plot of residuals
If no outliers
1. Assess normality
  1. Shapiro-Wilk test of normality and normal plot for residuals
2. If normality is reasonable or sample size large:
  1. conduct the t-test (PROC TTEST)
    1. if equal variances reasonable
      1. check p-value for "equal"
      2. assess sensitivity to equal variance assumption
    2. if equal variances not reasonable
      1. check p-value for "unequal"
    3. assess sensitivity to the assumption of normality
      1. conduct rank-sum test (PROC NPAR1WAY)
3. If normality not reasonable:
  1. conduct rank-sum test
4. if appropriate conduct post-hoc
5. inspect box plot & 'stem & leaf' plot for patterns
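
The sketch below shows one possible way to follow this procedure in SAS. It assumes a data set called trial with a two-level variable group and a numeric response lead. The data set names, the noprint option and the output statement are assumptions not covered in the list of required syntax, and the residuals are taken to be each response minus its own group mean.

```sas
/* Hypothetical data set "trial" with variables "group" (two levels) and "lead". */
proc sort data=trial;
   by group;
run;

proc means data=trial noprint;      /* save the group means instead of printing them */
   by group;
   var lead;
   output out=gmeans mean=gmean;
run;

data resid;
   merge trial gmeans;              /* attach each group mean to its observations */
   by group;
   resid = lead - gmean;            /* residual = response minus group mean */
run;

proc univariate data=resid normal plot;   /* outliers and normality of the residuals */
   var resid;
run;

proc ttest data=trial;              /* reports both the equal and unequal variance results */
   class group;
   var lead;
run;

proc npar1way data=trial wilcoxon;  /* rank-sum test */
   class group;
   var lead;
run;
```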

K independent sample problem (i.e. for 3 samples) - Procedure

Check for outliers and bias
1. inspect box plot of residuals for outliers
2. inspect residual plot for bias
If no outliers and no bias
1. Assess normality
  1. Shapiro-Wilk test of normality and normal plot for residuals
2. If normality is reasonable or sample size large:
  1. check equal variance assumption
    1. check residual plot
  2. if equal variances reasonable
    1. check ANOVA p-value
  3. if equal variances not reasonable
    1. STOP analysis
  4. assess sensitivity to the assumption of normality
    1. conduct rank-sum test (PROC NPAR1WAY)
3. If normality not reasonable:
  1. conduct rank-sum test
4. if appropriate conduct post-hoc
  i.e. DUNCAN for ANOVA; plots for NPAR1WAY
5. inspect box plot & 'stem & leaf' plot for patterns
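
For the testing steps of this procedure, a minimal sketch follows. It assumes a data set called diet with a three-level variable group and a numeric response gain, all hypothetical names. The residual checks are not repeated here.

```sas
/* Hypothetical data set "diet" with variables "group" (three levels) and "gain". */
proc anova data=diet;
   class group;                     /* class: the classification variable */
   model gain = group;              /* model: response = classification variable */
   means group / duncan;            /* post-hoc comparison of the group means */
run;
quit;

proc npar1way data=diet wilcoxon;   /* with three groups this is reported as the Kruskal-Wallis test */
   class group;
   var gain;
run;
```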

Prediction problems:

Correlation

Interpretation:
A measure of the linear association between two variables (-1 ≤ r ≤ 1).

Used in computing estimates for the simple linear regression parameters:

b1 = r(SDy/SDx)

b0 may be computed by recognizing that (xbar, ybar) is on the regression line, so b0 = ybar - b1(xbar).
Regression:
1. Simple Linear Regression
2. PRESS
3. Regression & ANOVA
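
For simple linear regression, the correlation-based estimates described under Correlation can be worked through in a short data step. The five summary values below are made up purely to show the arithmetic; they do not come from any data set in these notes.

```sas
/* All summary values here are hypothetical. */
data _null_;
   r    = 0.8;                      /* sample correlation between x and y */
   sdx  = 2;
   sdy  = 5;
   xbar = 10;
   ybar = 30;
   b1 = r * (sdy / sdx);            /* slope: 0.8 * 2.5 = 2 */
   b0 = ybar - b1 * xbar;           /* intercept: 30 - 2 * 10 = 10 */
   put b1= b0=;                     /* writes the two estimates to the log */
run;
```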

Regression

The regression model is:
y = b0 + b1x1 + b2x2 + … + bkxk + e
where the errors e:

have mean zero for each value of (x1,…,xk)
have constant standard deviation s for each value of (x1,…,xk)
are normally distributed for each value of (x1,…,xk)

and model parameters are:

b0 - intercept
bj - slope parameter for xj, where j = 1,…,k
s - standard deviation about the regression surface for fixed (x1,…,xk)

PRESS

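PRESS (the predicted residual sum of squares) adds up the squared errors made when each observation is predicted from a model fitted without it. When comparing models, smaller PRESS is better.

PRESS is produced by the proc reg model statement when the r option is specified.
PRESS is a special case of a technique known as cross-validation.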
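
As an illustration, the sketch below fits two candidate models so that their PRESS values can be compared. The data set trial and the variables y, x1 and x2 are hypothetical names.

```sas
/* Hypothetical data set "trial" with response y and predictors x1 and x2. */
proc reg data=trial;
   model y = x1 / r;                /* r option: residual analysis, including PRESS */
   model y = x1 x2 / r;             /* second candidate model for comparison */
run;
quit;
```

Under the rule above, the model with the smaller PRESS would be preferred for prediction.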

Regression & ANOVA

Regression is a more general form of ANOVA and therefore we can solve any ANOVA problem, including two sample problems, using regression. However, the following must hold:

We can only solve independent sample problems.
We must convert classification variables to numeric values.

Two sample problem - Interpretation of b0 & b1. Remember that the regression line is the "line of means", so if the sub-population means are indexed by c and t, and x is set to 0 for sub-population c and 1 for sub-population t, then:

E(y|x=0) ≡ μc = b0
E(y|x=1) ≡ μt = b0 + b1

We can rearrange the expression for μt and express b1 in terms of μt and μc:
b1 = μt - b0
b1 = μt - μc
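
A minimal sketch of the conversion to a numeric variable and the regression fit is given below. It assumes a data set called trial with a character variable group taking the values 'c' and 't' and a numeric response lead. The data set and grouping variable names are assumptions.

```sas
/* Hypothetical data set "trial" with "group" ('c' or 't') and "lead". */
data coded;
   set trial;
   if group = 't' then x = 1;       /* x = 1 for sub-population t */
   else x = 0;                      /* x = 0 for sub-population c */
run;

proc reg data=coded;
   model lead = x;                  /* intercept estimates the c mean, slope estimates t minus c */
run;
quit;
```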

Exercise:

For a two independent sample problem you are presented with the proc means output below and asked to complete the proc reg output which follows.

Output - proc means

---------------- Treatment Groups=c ---------------------

N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
10    5.0600000     1.1890239     3.1000000     6.5000000
----------------------------------------------------------


----------------- Treatment Groups=t --------------------

N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
10    8.5600000     1.4713561     6.5000000    10.6000000
----------------------------------------------------------

Output - proc reg

Model: MODEL1
Dependent Variable: LEAD
                      Analysis of Variance
                    Sum of     Mean
Source     DF     Squares     Square   F Value    Prob>F
Model      1     61.25000   61.25000    34.231    0.0001
Error     18     32.20800    1.78933
C Total   19     93.45800

   Root MSE       1.33766     R-square       0.6554
   Dep Mean       6.81000     Adj R-sq       0.6362
   C.V.          19.64258

                       Parameter Estimates
              Parameter    Standard  T for H0:
Variable  DF   Estimate       Error  Parameter=0  Prob > |T|
INTERCEP  1    ________  0.42300512    11.962      0.0001
X         1    ________  0.59821958     5.851      0.0001

Solution:

The "Parameter Estimate" section may be completed by referring to the section on the interpretation of b0 & b1. The INTERCEP estimate is the mean for group c, 5.06, and the X estimate is the difference between the two means, 8.56 - 5.06 = 3.50.
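
The answer can be checked against the rest of the proc reg output, since each value in the "T for H0" column is the estimate divided by its standard error. The data step below is only a sketch of that arithmetic; the variable names are arbitrary.

```sas
/* Figures are taken from the proc means and proc reg output above. */
data _null_;
   mean_c = 5.06;                   /* group c mean */
   mean_t = 8.56;                   /* group t mean */
   b0 = mean_c;                     /* INTERCEP estimate */
   b1 = mean_t - mean_c;            /* X estimate */
   t0 = b0 / 0.42300512;            /* estimate / standard error */
   t1 = b1 / 0.59821958;
   put b0= b1= t0= 8.3 t1= 8.3;     /* writes the results to the log */
run;
```

The two ratios should agree with the printed T values of 11.962 and 5.851.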