Exam Format:
Expect three questions: one SAS question and two Procedures/Theory questions.
The exam will start at 6:00pm and you will have 90 minutes to complete all
questions.
SAS Review
For important sections of SAS text see the readings for SAS.
The following are particularly important:
DATA step:
- Naming convention
- SAS types
- Data step statements:
  - The data statement
  - The set statement
  - The input statement
  - Simple assignment statements; basic arithmetic operators
  - The if statement; comparison operators
  - Comments
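As a minimal sketch of how these statements fit together, the two data steps below read a few values in-line, apply an assignment and an if statement, and then copy the result with set. The data set names (scores, passed), the variable names and the values are all made up for illustration.

```sas
/* Hypothetical example: all names and values are illustrative only */
data scores;                          /* data statement: names the new data set */
   input name $ test1 test2;          /* input statement: $ marks a character variable */
   total = test1 + test2;             /* simple assignment with an arithmetic operator */
   if total >= 100 then grade = 'P';  /* if statement with a comparison operator */
   else grade = 'F';
   datalines;
Ann 55 60
Bob 40 45
Cal 62 51
;
run;

data passed;
   set scores;                        /* set statement: reads an existing data set */
   if grade = 'P';                    /* subsetting if: keeps only the matching rows */
run;
```

Note that name is a character variable and the test scores are numeric, which are the two SAS types.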
PROC step:
You should be aware of the basic syntax and output of:
- proc means
- proc univariate
- proc ttest
- proc anova
- proc npar1way
- proc plot
- proc reg
Remember the format of SAS procedures:
proc <procedure_name> [option];
[procedure_statement];
- The following options are important:
  - data= option for all procedures
  - normal option for proc univariate
  - plot option for proc univariate
  - wilcoxon option for proc npar1way
- You must be familiar with the following procedure statements and know where each is appropriate:
  - var procedure_statement
  - by procedure_statement
  - label procedure_statement
  - model procedure_statement
  - class procedure_statement
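The sketch below shows where these options and statements typically appear. The data set demo, its variables (group, x, y) and its values are hypothetical. The proc sort step is not on the review list; it is included only because by-group processing expects sorted data.

```sas
/* Hypothetical data: group, x and y are made-up names and values */
data demo;
   input group $ x y;
   datalines;
a 1 2.1
a 2 2.9
a 3 4.2
a 4 4.8
b 1 3.0
b 2 4.1
b 3 4.9
b 4 6.2
;
run;

proc sort data=demo;                     /* by statements expect sorted input */
   by group;
run;

proc univariate data=demo normal plot;   /* data=, normal and plot options */
   var y;                                /* var: the analysis variable */
   by group;                             /* by: repeat the analysis for each group */
   label y = 'Response';                 /* label: descriptive text for the output */
run;

proc reg data=demo;
   model y = x;                          /* model: response = predictor(s) */
run;
quit;

proc npar1way data=demo wilcoxon;        /* wilcoxon option */
   class group;                          /* class: the grouping variable */
   var y;
run;
```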
Statistics Review
Review chapters 1-5 of the Knafl notes. In particular, review the
algorithms presented on 1.2.35-1.2.39 and 1.4.72-1.4.77.
Remember that you may be presented with either hypothesis testing or
prediction problems, so you should be comfortable with the following material.
Hypothesis testing problems:
- One-sample, two-sample and ANOVA (i.e. limited to 3 samples) problems:
  - Hypotheses:
    - Null hypothesis - H0.
    - Alternative hypothesis - H1; Ha; Hr
  - P-value:
  - Significance level:
    - Highly significant: p-value < 0.01
    - Significant: p-value ≤ 0.05
    - Not significant: p-value > 0.05
Two sample problem - Heuristic
If sub-sample sizes are different then apply the two independent sample approach.
If the same subjects are used for each treatment then apply the paired sample approach.
Apply the two independent sample approach in all other cases except the following:
- Subjects are matched by some characteristic (e.g. twins, siblings, skill level); in that case apply the paired sample approach.
One & paired two sample problem - Procedure
Assess normality:
- test of normality and normal plot
If normality is reasonable or the sample size is large:
- conduct the t-test
- assess sensitivity
  - conduct signed rank test
  - conduct sign test
If normality is not reasonable:
- assess symmetry
- if symmetry is reasonable
  - conduct signed rank test
  - assess sensitivity
    - conduct sign test
- if not
  - conduct sign test
Look at the stem & leaf plot and box plot for patterns.
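For the paired case, most of this procedure can be read from one proc univariate run on the within-pair differences. The sketch below assumes a hypothetical data set paired with made-up variables before and after. With the normal and plot options, the output includes the test of normality, the stem & leaf, box and normal probability plots, and the t, sign and signed rank statistics for a location of zero.

```sas
/* Hypothetical paired data: names and values are illustrative only */
data paired;
   input before after;
   diff = after - before;      /* analyse the within-pair differences */
   datalines;
12.1 13.4
10.8 11.0
11.5 12.9
13.2 13.1
 9.9 11.2
12.7 13.8
11.1 11.9
10.4 11.6
;
run;

proc univariate data=paired normal plot;
   var diff;                   /* tests whether the mean difference is zero */
run;
```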
Two sample problem (Independent) - Procedure
Check for outliers
- inspect box plot of residuals
If no outliers
- Assess normality
  - Shapiro-Wilk test of normality and normal plot for residuals
- If normality is reasonable or sample size large:
  - conduct the t-test (PROC TTEST)
    - if equal variances reasonable
      - check p-value for "equal"
      - assess sensitivity to equal variance assumption
    - if equal variances not reasonable
      - check p-value for "unequal"
  - assess sensitivity to the assumption of normality
    - conduct rank-sum test (PROC NPAR1WAY)
- If normality not reasonable:
  - conduct rank-sum test
- if appropriate conduct post-hoc
- inspect box plot & 'stem & leaf' plot for patterns
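The two tests named above could be run as sketched below. The data set twogrp, the variables group and y, and the values are hypothetical. proc ttest reports the t statistic under both the equal and the unequal variance assumption, together with a test for equality of variances.

```sas
/* Hypothetical two-group data: names and values are illustrative only */
data twogrp;
   input group $ y;
   datalines;
c 5.1
c 4.7
c 6.0
c 5.5
c 4.9
t 7.8
t 8.4
t 9.1
t 8.0
t 8.7
;
run;

proc ttest data=twogrp;
   class group;                /* the variable defining the two samples */
   var y;
run;

proc npar1way data=twogrp wilcoxon;   /* rank-sum test */
   class group;
   var y;
run;
```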
K independent sample problem (i.e. for 3 samples) - Procedure
Check for outliers and bias
- inspect box plot of residuals for outliers
- inspect residual plot for bias
If no outliers and no bias
- Assess normality
  - Shapiro-Wilk test of normality and normal plot for residuals
- If normality is reasonable or sample size large:
  - check equal variance assumption
    - check residual plot
    - if equal variances reasonable
      - check ANOVA p-value
    - if equal variances not reasonable
      - STOP analysis
  - assess sensitivity to the assumption of normality
    - conduct rank-sum test (PROC NPAR1WAY)
- If normality not reasonable:
  - conduct rank-sum test
- if appropriate conduct post-hoc
  i.e. DUNCAN for ANOVA; plots for NPAR1WAY
- inspect box plot & 'stem & leaf' plot for patterns
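A minimal sketch of the three-sample analysis follows, using a hypothetical data set threegrp with made-up values. The means statement with the duncan option requests the post-hoc comparison mentioned above. With more than two groups, the wilcoxon option of proc npar1way reports the Kruskal-Wallis test, which is the rank-based analogue for this case.

```sas
/* Hypothetical three-group data: names and values are illustrative only */
data threegrp;
   input group $ y;
   datalines;
a 5.1
a 4.7
a 6.0
a 5.5
b 7.8
b 8.4
b 9.1
b 8.0
c 6.2
c 6.9
c 7.1
c 6.5
;
run;

proc anova data=threegrp;
   class group;                /* classification variable */
   model y = group;            /* one-way ANOVA */
   means group / duncan;       /* Duncan's multiple range test (post-hoc) */
run;
quit;

proc npar1way data=threegrp wilcoxon;
   class group;
   var y;
run;
```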
Prediction problems:
- Correlation
  - Interpretation:
    - A measure of the linear association between two variables (-1 ≤ r ≤ 1).
    - Used in computing estimates for simple linear regression parameters.
      - The estimate of b1 is r(SDy/SDx).
      - The estimate of b0 may be computed by recognizing that (xbar, ybar) is on the regression line, so b0 = ybar - b1(xbar).
- Regression:
  - Simple Linear Regression
  - PRESS
  - Regression & ANOVA
Regression
The regression model is:
y = b0 + b1x1 + b2x2 + ... + bkxk + e
where the errors e:
- have mean zero for each value of (x1,..,xk)
- have constant standard deviation s for each value of (x1,..,xk)
- are normally distributed for each value of (x1,..,xk)
and model parameters are:
b0 - intercept
bj - slope parameter, where j = 1,..,k
s - standard deviation about the regression surface for fixed (x1,..,xk)
PRESS
- Smaller PRESS is better.
- PRESS is produced by the model statement when the r option is specified.
- PRESS is a special case of a technique known as cross-validation.
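As a sketch of where the r option goes, the step below fits a simple linear regression to a hypothetical data set xy (the names and values are made up). The option follows a slash on the model statement; the PRESS value should then appear with the residual summary at the end of the proc reg output.

```sas
/* Hypothetical regression data: names and values are illustrative only */
data xy;
   input x y;
   datalines;
1 2.3
2 2.9
3 4.1
4 4.6
5 6.2
6 6.8
;
run;

proc reg data=xy;
   model y = x / r;            /* r option: residual analysis, including PRESS */
run;
quit;
```

When comparing candidate models for the same response, the one with the smaller PRESS is preferred.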
Regression & ANOVA
Regression is a more general form of ANOVA and therefore we can solve any ANOVA problem, including two sample problems, using regression.
However, the following must hold:
- We can only solve independent sample problems
- We must convert classification variables to numeric values.
Two sample problem - Interpretation of b0 & b1.
Remember that the regression line is the "line of means", so if the sub-population means are indexed by c and t, and x is set to 0 for sub-population c and 1 for sub-population t, then:
E(y|x=0) ≡ mc = b0
E(y|x=1) ≡ mt = b0 + b1
We can rearrange the expression for mt and express b1 in terms of mt and mc:
b1 = mt - b0
b1 = mt - mc
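The coding of x can be sketched in a data step as below. The data set coded, the variables group and y, and the values are hypothetical. Under this coding the proc reg intercept estimate should equal the group c mean reported by proc means, and the slope estimate should equal the group t mean minus the group c mean.

```sas
/* Hypothetical two-group data: names and values are illustrative only */
data coded;
   input group $ y;
   if group = 't' then x = 1;  /* convert the classification variable to numeric */
   else x = 0;
   datalines;
c 4.8
c 5.3
c 5.1
t 8.2
t 8.9
t 8.6
;
run;

proc means data=coded;         /* group means, for comparison with the estimates */
   class group;
   var y;
run;

proc reg data=coded;
   model y = x;                /* intercept = mean of c; slope = mean of t - mean of c */
run;
quit;
```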
Exercise:
For a two independent sample problem you are presented with the proc means output below and asked to complete the proc reg output which follows.
Output - proc means
---------------- Treatment Groups=c ---------------------
 N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
10     5.0600000     1.1890239     3.1000000     6.5000000
----------------------------------------------------------
----------------- Treatment Groups=t --------------------
 N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
10     8.5600000     1.4713561     6.5000000    10.6000000
----------------------------------------------------------
Output - proc reg
Model: MODEL1
Dependent Variable: LEAD
Analysis of Variance
                     Sum of        Mean
Source      DF      Squares      Square    F Value    Prob>F
Model        1     61.25000    61.25000     34.231    0.0001
Error       18     32.20800     1.78933
C Total     19     93.45800
Root MSE     1.33766    R-square    0.6554
Dep Mean     6.81000    Adj R-sq    0.6362
C.V.        19.64258
Parameter Estimates
                  Parameter      Standard      T for H0:
Variable    DF     Estimate         Error    Parameter=0    Prob > |T|
INTERCEP     1     ________    0.42300512         11.962        0.0001
X            1     ________    0.59821958          5.851        0.0001
Solution:
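The "Parameter Estimate" section may be completed by referring to the section on the interpretation of b0 & b1. With x = 0 for group c and x = 1 for group t, the INTERCEP estimate is the mean for group c, 5.06, and the X estimate is the difference between the two means, mt - mc = 8.56 - 5.06 = 3.50. As a check, each estimate divided by its standard error reproduces the printed T value (5.06/0.42300512 = 11.962 and 3.50/0.59821958 = 5.851).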