Programming Assignment 3
CSC 323 Data Analysis and Statistical Software
Due: Section 801 - 3/16/2004; Section 501 - 3/17/2004
A colleague has developed a new compression algorithm for
compressing documents and is interested in developing a model to
predict Processing Time from Document Size. She has
tested the algorithm on documents of varying sizes and asks
you for help in completing the analysis.
She presents you with the data from her experiment and explains
that each observation consists of the following values:
- Document Size (# of words); columns 1-5
- Processing Time (ms); columns 6-8
- Document ID; columns 9-14
- Readability Index; columns 15-16
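If the ranges after each field are read as column positions in a fixed-width text file (an assumption; the handout does not say so explicitly), the layout maps directly onto SAS column input. The sketch below is one possible reading. The file name `compress.dat` and all data set and variable names are hypothetical, and Document ID is assumed to be a character value.

```sas
/* Sketch only: file name, data set name, and variable names are
   hypothetical; the column ranges come from the field list above. */
data compression;
    /* TRUNCOVER keeps SAS from moving to the next record when a
       line is shorter than 16 columns. */
    infile 'compress.dat' truncover;
    input doc_size     1-5      /* Document Size (# of words)     */
          proc_time    6-8      /* Processing Time (ms)           */
          doc_id     $ 9-14     /* Document ID, assumed character */
          readability  15-16;   /* Readability Index              */
    label doc_size    = 'Document Size (words)'
          proc_time   = 'Processing Time (ms)'
          doc_id      = 'Document ID'
          readability = 'Readability Index';
run;
```

Attaching the labels in the DATA step makes them available to every later procedure, so they only need to be written once.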
- Write a SAS program to analyze this dataset. Your program
should accomplish the following: (50%)
Note: Code and output for the income/gpa problem discussed in
class are available.
- Read your data from an external file.
- Execute the PRINT procedure.
- Produce a scatterplot of the dependent variable
vs. the independent variable.
- Generate estimates of the slope and intercept using
PROC REG.
Note: Do not compute these values by hand.
Note: For PROC PRINT, be sure to use labels rather than variable
names for column headings. Use meaningful names for data sets
and variables. Give the output of each procedure an appropriate
title.
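The requirements above might be met with a sequence of procedure steps like the following. This is a minimal sketch, not the required solution. It assumes a data set named `compression` with labeled variables `doc_size` and `proc_time` already exists (all three names are hypothetical), and it uses PROC PLOT for the scatterplot, which is only one of several plotting procedures SAS offers.

```sas
/* Assumes WORK.COMPRESSION exists with labeled variables
   doc_size and proc_time (hypothetical names). */

title 'Compression Experiment: Raw Data';
proc print data=compression label;   /* LABEL prints labels as column headings */
run;

title 'Processing Time vs. Document Size';
proc plot data=compression;
    plot proc_time*doc_size;         /* vertical*horizontal: dependent vs. independent */
run;
quit;

title 'Regression of Processing Time on Document Size';
proc reg data=compression;
    model proc_time = doc_size;      /* output includes intercept and slope estimates */
run;
quit;
```

In the PROC REG output, the Parameter Estimates table lists the intercept and the slope; the slope is the row for the independent variable.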
- Write a report to summarize your findings.
Your report should address the following. Note that an example
of the format required for your report is provided. (50%)
- State the regression model. Use appropriate symbols for
all of its parameters.
- Provide estimates of the model parameters and
state the regression equation. You must interpret the
coefficients of the regression equation.
- Interpret the correlation coefficient.
- Predict the processing time for a document of 3000 words.
- A residual is the difference between the observed y value of
an observation and the y value predicted by the regression line
for that observation's x value. Given the data used for this
analysis, are the residuals normally distributed? Justify your
answer.
Hint: Use PROC REG to generate residuals for your data, and
refer to the normality hypotheses notes referenced in Project #2
to see how to assess normality.
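As a rough illustration of the hint, PROC REG can write the residuals to a new data set through its OUTPUT statement, and a procedure such as PROC UNIVARIATE can then summarize them. The sketch assumes a data set named `compression` with variables `proc_time` and `doc_size`; those names, and the output names `reg_results`, `residual`, and `predicted`, are hypothetical. The handout does not say which method the Project #2 notes use, so treat the NORMAL option as one possible approach and not as the required one.

```sas
/* Hypothetical names throughout; sketch only. */
proc reg data=compression;
    model proc_time = doc_size;
    /* R= stores observed minus predicted y; P= stores predicted y */
    output out=reg_results r=residual p=predicted;
run;
quit;

title 'Distribution of Regression Residuals';
proc univariate data=reg_results normal plot;
    var residual;   /* NORMAL requests normality tests; PLOT adds a normal probability plot */
run;
```

The test statistics and the probability plot are only evidence; the justification in the report still has to state the hypotheses and the conclusion drawn from them.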