Exam Format:

Expect three questions: one SAS question and two Procedures/Theory questions. The exam will start at 6:00 pm, and you will have 90 minutes to complete all questions.

SAS Review

For the important sections of the SAS text, see the readings for SAS.

The following are particularly important:

DATA step:

  1. Naming convention
  2. SAS types
  3. Data step statements:
    1. The data statement
    2. The set statement
    3. The input statement
    4. Simple assignment statements; basic arithmetic operators
    5. The if statement; comparison operators
    6. Comments
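
The DATA step items above can be seen together in a minimal sketch. The data set names (lead, treated), the variable names, and the values are hypothetical and chosen only for illustration.

```sas
* Statement-style comment: starts with an asterisk and ends with a semicolon;
/* Block-style comment: may span several lines */

data lead;                      /* data statement: names the data set to create */
  input group $ lead;           /* input statement: $ marks a character variable */
  loglead = log(lead);          /* simple assignment statement */
  scaled  = (lead - 5) / 2;     /* basic arithmetic operators */
  if group = 't' then x = 1;    /* if statement with a comparison operator */
  else x = 0;
  datalines;
c 5.1
c 4.8
t 8.9
t 8.2
;
run;

data treated;
  set lead;                     /* set statement: reads an existing SAS data set */
  if lead >= 8;                 /* subsetting if: keeps observations with lead >= 8 */
run;
```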

PROC step:

You should be aware of the basic syntax and output of:

  1. proc means
  2. proc univariate
  3. proc ttest
  4. proc anova
  5. proc npar1way
  6. proc plot
  7. proc reg

Remember the format of SAS procedures:

proc <procedure_name> [option]…;
[procedure_statement];…

  1. The following options are important:
    1. data= option for all procedures
    2. normal option for proc univariate
    3. plot option for proc univariate
    4. wilcoxon option for proc npar1way
  2. You must be familiar with the following procedure statements and know where each is appropriate:
    1. var procedure_statement
    2. by procedure_statement
    3. label procedure_statement
    4. model procedure_statement
    5. class procedure_statement
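
As a sketch of how these options and statements fit the general procedure format, the example below uses a hypothetical data set named lead with made-up values. The proc sort step is not in the list above; it is included on the assumption that the data may not already be sorted, which by-group processing expects. The model statement is used with proc anova and proc reg, so it does not appear here.

```sas
/* Hypothetical data set; names and values are for illustration only */
data lead;
  input group $ lead @@;        /* @@ reads several observations per line */
  datalines;
c 5.1 c 4.8 c 6.0 c 3.9 c 5.5 c 6.2
t 8.9 t 8.2 t 7.1 t 9.8 t 8.0 t 9.4
;
run;

proc sort data=lead;            /* a by statement expects sorted data */
  by group;
run;

proc univariate data=lead normal plot;   /* data=, normal and plot options */
  var lead;                              /* var: variable(s) to analyze */
  by group;                              /* by: separate analysis for each group */
  label lead = 'Lead level';             /* label: descriptive text for the output */
run;

proc npar1way data=lead wilcoxon;        /* wilcoxon option: rank-sum analysis */
  class group;                           /* class: grouping variable */
  var lead;
run;
```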


Statistics Review

Review chapters 1-5 of the Knafl notes. In particular, review the algorithms presented on 1.2.35-1.2.39 and 1.4.72-1.4.77.

Remember that you may be presented with either hypothesis testing or prediction problems and so you should be comfortable with the following material.



Hypothesis testing problems:

  1. One-sample, two-sample and ANOVA problems (ANOVA limited to 3 samples)
  2. Hypotheses:
    1. Null hypothesis - H0.
    2. Alternative hypothesis - H1; Ha; Hr
  3. P-value:
    1. Significance level:
      1. Highly significant: p-value < 0.01
      2. Significant: p-value ≤ 0.05
      3. Not significant: p-value > 0.05


Two sample problem - Heuristic
  1. If sub-sample sizes are different then apply the two independent sample approach.
  2. If the same subjects are used for each treatment then apply the paired sample approach.
  3. Apply the two independent sample approach in all other cases, with one exception:
    1. If subjects are matched by some characteristic (e.g. twins, siblings, skill level), apply the paired sample approach.


One & paired two sample problem - Procedure
  1. Assess normality:
    1. test of normality and normal plot
  2. If normality is reasonable or the sample size is large:
    1. conduct the t-test
    2. assess sensitivity
      1. conduct signed rank test
      2. conduct sign test
  3. If normality is not reasonable:
    1. assess symmetry
    2. if symmetry is reasonable
      1. conduct signed rank test
      2. assess sensitivity
        1. conduct sign test
    3. if not
      1. conduct sign test
  4. look at stem & leaf plot and box plot for patterns
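
For the paired case, the procedure above can be sketched with a single proc univariate run on the within-subject differences. The data set name (paired), the variable names, and the values are hypothetical. With the normal and plot options, the output should include the test of normality, the normal plot, the stem & leaf plot and the box plot, together with the t-test, sign test and signed rank test of a zero mean or center.

```sas
/* Hypothetical paired measurements on the same subjects */
data paired;
  input before after @@;
  diff = after - before;        /* analyze the differences as one sample */
  datalines;
12.1 13.0  11.4 11.9  10.8 12.2  13.5 13.1
12.9 14.0  11.0 12.4  12.3 12.8  10.5 11.7
;
run;

proc univariate data=paired normal plot;
  var diff;
run;
```

For a one sample problem with a hypothesized value other than zero, one option is to subtract that value in an assignment statement first and analyze the result in the same way.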


Two sample problem (Independent) - Procedure
  1. Check for outliers
    1. inspect box plot of residuals
  2. If no outliers
    1. Assess normality
      1. Shapiro-Wilk test of normality and normal plot for residuals
    2. If normality is reasonable or sample size large:
      1. conduct the t-test (PROC TTEST)
        1. if equal variances reasonable
          1. check p-value for "equal"
          2. assess sensitivity to equal variance assumption
        2. if equal variances not reasonable
          1. check p-value for "unequal"
        3. assess sensitivity to the assumption of normality
          1. conduct rank-sum test (PROC NPAR1WAY)
    3. If normality not reasonable:
      1. conduct rank-sum test
    4. if appropriate conduct post-hoc
    5. inspect box plot & 'stem & leaf' plot for patterns
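
A minimal sketch of the two independent sample procedure follows. The data set and values are hypothetical. Obtaining the residuals from proc reg with a 0/1 variable x is an assumption made here for illustration; the notes do not say how the residuals are to be produced.

```sas
/* Hypothetical data: two independent groups, c and t */
data lead;
  input group $ lead @@;
  x = (group = 't');            /* 1 for group t, 0 for group c */
  datalines;
c 5.1 c 4.8 c 6.0 c 3.9 c 5.5 c 6.2
t 8.9 t 8.2 t 7.1 t 9.8 t 8.0 t 9.4
;
run;

/* Residuals = observation minus its own group mean */
proc reg data=lead;
  model lead = x;
  output out=resid r=res;
run;
quit;

/* Outliers and normality: box plot, normal plot, test of normality */
proc univariate data=resid normal plot;
  var res;
run;

/* t-test: output covers both the equal and unequal variance cases */
proc ttest data=lead;
  class group;
  var lead;
run;

/* Sensitivity to normality: rank-sum test */
proc npar1way data=lead wilcoxon;
  class group;
  var lead;
run;
```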


K independent sample problem (i.e. for 3 samples) - Procedure
  1. Check for outliers and bias
    1. inspect box plot of residuals for outliers
    2. inspect residual plot for bias
  2. If no outliers and no bias
    1. Assess normality
      1. Shapiro-Wilk test of normality and normal plot for residuals
    2. If normality is reasonable or sample size large:
      1. check equal variance assumption
        1. check residual plot
      2. if equal variances reasonable
        1. check ANOVA p-value
      3. if equal variances not reasonable
        1. STOP analysis
      4. assess sensitivity to the assumption of normality
        1. conduct Kruskal-Wallis test, the k-sample form of the rank-sum test (PROC NPAR1WAY)
    3. If normality not reasonable:
      1. conduct Kruskal-Wallis test
    4. if appropriate conduct post-hoc
      e.g. DUNCAN for ANOVA; plots for NPAR1WAY
    5. inspect box plot & 'stem & leaf' plot for patterns
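
The main analysis steps for three samples can be sketched as below, using a hypothetical data set (diet) with made-up values. With more than two groups, the wilcoxon option of proc npar1way reports the Kruskal-Wallis test.

```sas
/* Hypothetical data: three independent groups */
data diet;
  input group $ y @@;
  datalines;
a 5.1 a 4.8 a 6.0 a 3.9 a 5.5
b 6.9 b 7.2 b 6.1 b 7.8 b 7.0
c 8.9 c 8.2 c 9.1 c 9.8 c 8.0
;
run;

proc anova data=diet;
  class group;                  /* classification variable */
  model y = group;              /* one-way ANOVA model */
  means group / duncan;         /* post-hoc: Duncan's multiple range test */
run;
quit;

/* Sensitivity to normality: rank-based test across the three groups */
proc npar1way data=diet wilcoxon;
  class group;
  var y;
run;
```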


Prediction problems:

  1. Correlation
    1. Interpretation:
      1. A measure of the linear association between two variables (-1 ≤ r ≤ 1).
    2. Used in computing estimates for simple linear regression parameters.
      1. b1 = r(SDy/SDx)
      2. b0 may be computed by recognizing that (xbar, ybar) is on the regression line: b0 = ybar - b1(xbar).
  2. Regression:
    1. Simple Linear Regression
    2. PRESS
    3. Regression & ANOVA
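
The correlation-based estimates under item 1 can be worked through with a small DATA step. All of the summary values below are hypothetical and chosen only so the arithmetic is easy to follow: the slope is 0.8 × (5/2) = 2 and the intercept is 30 - 2 × 10 = 10.

```sas
/* Hypothetical summary statistics, for illustration only */
data _null_;
  r    = 0.8;                   /* sample correlation */
  sdx  = 2;    sdy  = 5;        /* standard deviations of x and y */
  xbar = 10;   ybar = 30;       /* sample means of x and y */
  b1 = r * (sdy / sdx);         /* slope estimate */
  b0 = ybar - b1 * xbar;        /* line passes through (xbar, ybar) */
  put b1= b0=;                  /* writes the two estimates to the SAS log */
run;
```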



Regression

The regression model is:
y = b0 + b1x1 + b2x2 + … + bkxk + e
where the errors e:

  1. have mean zero for each value of (x1,…,xk)
  2. have constant standard deviation s for each value of (x1,…,xk)
  3. are normally distributed for each value of (x1,…,xk)

and the model parameters are:
  1. b0 - intercept
  2. bj - slope parameter for xj, where j = 1,…,k
  3. s - standard deviation about the regression surface for fixed (x1,…,xk)



PRESS
  1. PRESS is the predicted residual sum of squares; smaller PRESS is better.
  2. In proc reg, PRESS is printed when the r option is specified on the model statement.
  3. PRESS is a special case of a technique known as cross-validation (leave-one-out).
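
A minimal sketch of requesting PRESS is shown below. The data set (fit) and its values are hypothetical. The PRESS value should appear in the summary lines printed after the residual listing.

```sas
/* Hypothetical data for a simple linear regression */
data fit;
  input x y @@;
  datalines;
1 2.1  2 3.9  3 6.2  4 7.8  5 10.1  6 12.2  7 13.8  8 16.1
;
run;

proc reg data=fit;
  model y = x / r;              /* r option: residual analysis, including PRESS */
run;
quit;
```

When comparing candidate models for the same response, the same step can be repeated with a different model statement and the PRESS values compared.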


Regression & ANOVA

Regression is a more general form of ANOVA and therefore we can solve any ANOVA problem, including two sample problems, using regression. However, the following must hold:

  • We can only solve independent sample problems
  • We must convert classification variables to numeric values.


Two sample problem - Interpretation of b0 & b1. Remember that the regression line is the "line of means". If the sub-population means are mc and mt, and x is set to 0 for sub-population c and 1 for sub-population t, then:
mc = b0 + b1(0) = b0
mt = b0 + b1(1) = b0 + b1

We can rearrange the expression for mt and express b1 in terms of mt and mc:
b1 = mt - b0
b1 = mt - mc
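
The interpretation above can be checked with a short sketch: convert the classification variable to a 0/1 numeric variable, then compare the proc means output with the proc reg parameter estimates. The data set and values are hypothetical. The INTERCEP estimate should match the group c mean, and the X estimate should match the group t mean minus the group c mean.

```sas
/* Hypothetical data: two independent groups, c and t */
data lead;
  input group $ lead @@;
  x = (group = 't');            /* comparison yields 1 when true, 0 when false */
  datalines;
c 5.1 c 4.8 c 6.0 c 3.9 c 5.5 c 6.2
t 8.9 t 8.2 t 7.1 t 9.8 t 8.0 t 9.4
;
run;

proc means data=lead;           /* group means for comparison */
  class group;
  var lead;
run;

proc reg data=lead;             /* regression on the 0/1 variable */
  model lead = x;
run;
quit;
```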

Exercise:

For a two independent sample problem, you are presented with the proc means output below and asked to complete the proc reg output which follows.


Output - proc means

---------------- Treatment Groups=c ---------------------

N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
10    5.0600000     1.1890239     3.1000000     6.5000000
----------------------------------------------------------


----------------- Treatment Groups=t --------------------

N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------
10    8.5600000     1.4713561     6.5000000    10.6000000
----------------------------------------------------------




Output - proc reg
Model: MODEL1
Dependent Variable: LEAD
                      Analysis of Variance
                    Sum of     Mean
Source     DF     Squares     Square   F Value    Prob>F
Model      1     61.25000   61.25000    34.231    0.0001
Error     18     32.20800    1.78933
C Total   19     93.45800

   Root MSE       1.33766     R-square       0.6554
   Dep Mean       6.81000     Adj R-sq       0.6362
   C.V.          19.64258

                       Parameter Estimates
              Parameter    Standard  T for H0:
Variable  DF   Estimate       Error  Parameter=0  Prob > |T|
INTERCEP  1    ________  0.42300512    11.962      0.0001
X         1    ________  0.59821958     5.851      0.0001





Solution:

The "Parameter Estimate" section may be completed by referring to the section on the interpretation of b0 & b1. The INTERCEP estimate is therefore the mean for group c, 5.06, and the X estimate is the difference between the two means (group t minus group c), 8.56 - 5.06 = 3.50.
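
As a check on the solution, the estimates can be recomputed from the printed output in a short DATA step. The variable names are hypothetical; the numbers are copied from the proc means and proc reg listings above. Dividing each estimate by its standard error should give values close to the printed T statistics, 11.962 and 5.851.

```sas
data _null_;
  mean_c = 5.06;                /* group c mean, from proc means */
  mean_t = 8.56;                /* group t mean, from proc means */
  se_b0  = 0.42300512;          /* standard error of INTERCEP, from proc reg */
  se_b1  = 0.59821958;          /* standard error of X, from proc reg */
  b0 = mean_c;                  /* intercept = mean of group c */
  b1 = mean_t - mean_c;         /* slope = difference between the means */
  t_b0 = b0 / se_b0;            /* compare with T for H0 on the INTERCEP row */
  t_b1 = b1 / se_b1;            /* compare with T for H0 on the X row */
  put b0= b1= t_b0= t_b1=;      /* writes the results to the SAS log */
run;
```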