Assignment #1

CSC424/334 - Advanced Data Analysis

Due: 4/26/2004

Consider the matrix algebra representation of the general linear regression model discussed in class (see Lecture #2 and Lecture #3). Using this representation, develop a module that reads a dataset of n p-tuples and accomplishes the tasks listed below.
Note: You may assume that p=10 (i.e. one dependent variable and nine independent variables).

  1. Parameter Estimates & Related Statistics:
    1. Compute estimates of the coefficients of the regression equation.
    2. Compute an estimate of the standard deviation about the regression surface (i.e. Root MSE).
    3. Compute the F statistic for the regression model.
    4. Compute estimates of the standard error for each coefficient of the regression equation.
    5. Compute t statistics for each coefficient of the regression equation.
    6. Compute the coefficient of multiple determination.
    7. Compute the adjusted coefficient of multiple determination.
  2. Diagnostics:
    1. Obtain the residuals for each observation.
    2. Obtain the leverage values for each setting of the independent variables (i.e. for each observation).
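
The items above can be sketched in matrix form as follows. This is only an illustration, not the code discussed in class: it assumes the standard least-squares formulas (b = (X'X)^-1 X'y, MSE = SSE/(n-k), leverage = diagonal of X(X'X)^-1 X'), assumes an intercept column is added to X, and uses Python with hand-written matrix helpers. The names `fit_ols`, `transpose`, `matmul`, and `inverse` are hypothetical.

```python
import math


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Return the matrix product a * b."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment a with the identity matrix: [a | I].
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("X'X is singular; check for collinear variables")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def fit_ols(y, x_rows):
    """Fit y = Xb + e by least squares and return the requested statistics.

    Assumption: an intercept is included, so X gets a leading column of ones
    and k = (number of independent variables) + 1 coefficients are estimated.
    Requires n > k and a non-zero error sum of squares.
    """
    n = len(y)
    X = [[1.0] + [float(v) for v in row] for row in x_rows]
    k = len(X[0])
    Xt = transpose(X)
    xtx_inv = inverse(matmul(Xt, X))                      # (X'X)^-1
    xty = matmul(Xt, [[float(v)] for v in y])             # X'y
    coef = [row[0] for row in matmul(xtx_inv, xty)]       # b = (X'X)^-1 X'y

    fitted = [sum(b * v for b, v in zip(coef, row)) for row in X]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    y_bar = sum(y) / n
    sse = sum(e * e for e in residuals)                   # error sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)              # corrected total SS
    ssr = sst - sse                                       # model sum of squares
    mse = sse / (n - k)

    std_err = [math.sqrt(mse * xtx_inv[j][j]) for j in range(k)]

    # Leverage h_ii = x_i' (X'X)^-1 x_i, the i-th diagonal of the hat matrix.
    leverage = []
    for row in X:
        tmp = [sum(row[a] * xtx_inv[a][b] for a in range(k)) for b in range(k)]
        leverage.append(sum(t * v for t, v in zip(tmp, row)))

    return {
        "coef": coef,
        "root_mse": math.sqrt(mse),
        "f": (ssr / (k - 1)) / mse,
        "std_err": std_err,
        "t": [b / s for b, s in zip(coef, std_err)],
        "r2": 1.0 - sse / sst,
        "adj_r2": 1.0 - (sse / (n - k)) / (sst / (n - 1)),
        "residuals": residuals,
        "leverage": leverage,
    }
```

For example, `fit_ols([1, 3, 2, 5, 4], [[1], [2], [3], [4], [5]])` fits a single-predictor model; the leverage values should sum to the number of estimated coefficients, which is a convenient check on any implementation.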

    Your module should essentially be able to generate the extended version of the report produced by the SAS reg procedure (with the exception of p-values) when a full regression model is specified.

    Note that you are not restricted to using SAS to implement this module. You have two broad options: either use the IML procedure available in SAS, or use a third-generation programming language of your choice (e.g. Java or C++).
    Note: If you choose not to use IML, you must confirm with me, before beginning your implementation, that your choice of language is appropriate.

    Refer to the example discussed in class when you are ready to test your module. The code discussed in class is available. Note that the code has embedded data.

    Submission:

    Submit a hard copy of your code, the test data, and the output produced. Do not submit your assignment by email.

    Extra Credit (+10%):

    Develop your module so that it can access an external dataset containing n p-tuples, where p is arbitrary. You may assume that the program can be edited in one (and only one) place before execution. Also assume that the dataset is an ASCII file and that each observation is on a separate physical line (i.e. terminated by CR/LF). To simplify further, assume that the values on each line are delimited by a space and that the single response variable occupies the first position on each physical line.
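
One way to read such a file is sketched below, under the assumptions stated above (ASCII text, one observation per line, space-delimited values, response first). The function name `load_dataset` and the constant `DATA_PATH` are hypothetical; `DATA_PATH` stands in for the single place the program is edited before execution. The value of p is inferred from the first data line rather than hard-coded.

```python
# The one place to edit before execution (hypothetical file name).
DATA_PATH = "regdata.txt"


def load_dataset(path):
    """Read n observations of p space-delimited values each.

    Returns (y, x_rows): y holds the first value on each line (the response)
    and x_rows holds the remaining p - 1 values (the independent variables).
    """
    y, x_rows = [], []
    # Text mode translates CR/LF line endings; split() drops any leftover
    # whitespace and also tolerates repeated spaces.
    with open(path, "r", encoding="ascii") as fh:
        for line_no, line in enumerate(fh, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines, e.g. a trailing empty line
            values = [float(f) for f in fields]
            if x_rows and len(values) - 1 != len(x_rows[0]):
                raise ValueError(
                    f"line {line_no}: expected {len(x_rows[0]) + 1} values, "
                    f"found {len(values)}"
                )
            y.append(values[0])
            x_rows.append(values[1:])
    return y, x_rows
```

Calling `y, x_rows = load_dataset(DATA_PATH)` then yields the response vector and the rows of independent variables in a form that can be passed directly to the regression computations; rejecting lines of inconsistent length catches a malformed file early instead of producing a misshapen design matrix.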