Programming Assignment 3

CSC 323 Data Analysis and Statistical Software

Due: 6/11/2004

 

A colleague has developed a new document-compression algorithm and wants a model that predicts Processing Time from Document Size. She has tested the algorithm on documents of varying sizes and asks for your help in completing the analysis.

She presents you with the data from her experiment and explains that each observation consists of the following values:

  1. Write a SAS program to analyze this dataset (50%). Your program should accomplish the following:

    Note: Code and output for the income/gpa problem discussed in class are available.
    1. Read your data from an external file.
    2. Execute the PRINT procedure.
    3. Produce a scatterplot of the dependent variable vs. the independent variable.
    4. Generate estimates of the slope and intercept using PROC REG.
      Note: Do not compute these values by hand.

    Note: For PROC PRINT, be sure to use labels for column headings rather than variable names. Use meaningful names for data sets and variables. Generate an appropriate title for the output of each procedure.

  2. Write a report to summarize your findings (50%). Your report should address the following items. An example of the required report format is provided.
    1. State the regression model. Use appropriate symbols for all of its parameters.
    2. Provide estimates of the model parameters and state the regression equation. You must interpret the coefficients of the regression equation.
    3. Interpret the correlation coefficient.
    4. Predict the processing time for a document of 2200 words.
    5. A residual is the difference between the observed y value of an observation and the y value that the regression line predicts for that observation's x value. Given the data used for this analysis, are the residuals normally distributed? Justify your answer.
      Hint: Use PROC REG to generate residuals for your data, and refer to the normality hypotheses notes referenced in Project #2 to see how to address normality.
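
The program steps and the residual hint above can be sketched roughly as follows. This is only an outline under stated assumptions, not the required solution: the file name `compression.dat`, the data set names `compression` and `resids`, the variable names `doc_size` and `proc_time`, and the two-column whitespace-delimited layout are all hypothetical, since the actual data layout is not given here. Using PROC UNIVARIATE with the NORMAL option is one possible way to examine the residuals; the Project #2 notes may call for a different approach.

```sas
/* Hypothetical file, data set, and variable names; adjust to the real data. */
data compression;
    infile 'compression.dat';            /* assumed: whitespace-delimited text file */
    input doc_size proc_time;            /* assumed: two numeric columns, in this order */
    label doc_size  = 'Document Size (words)'
          proc_time = 'Processing Time';
run;

title 'Compression Experiment: Raw Data';
proc print data=compression label;       /* LABEL prints labels as column headings */
run;

title 'Processing Time vs. Document Size';
proc plot data=compression;              /* scatterplot: dependent * independent */
    plot proc_time * doc_size;
run;
quit;

title 'Regression of Processing Time on Document Size';
proc reg data=compression;
    model proc_time = doc_size;          /* reports intercept and slope estimates */
    output out=resids r=residual p=predicted;  /* save residuals and fitted values */
run;
quit;

title 'Normality Check on Residuals';
proc univariate data=resids normal plot; /* NORMAL requests tests of normality */
    var residual;
run;
```

The OUTPUT statement in PROC REG writes the residuals to a new data set, so any later procedure can examine them without recomputing the fit.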