Assignment #2

CSC424/334 - Advanced Data Analysis

Due: 5/17/2004

Review the Principal Components Analysis method discussed in class (see Lecture #4 and Lecture #6). Based on that discussion, develop an IML module that reads a dataset of your choice (e.g. the "European Jobs" dataset below) from an external file and accomplishes the following:

  1. Construct the covariance matrix.
  2. Determine the Spectral Decomposition of the covariance matrix.
  3. Derive the principal components matrix C.
  4. Show that the total variance of the matrix of observations (i.e. the dataset) is equal to the total variance of C.
  5. Confirm your results using the SAS princomp procedure.
    Hint: See this princomp code example and the corresponding output.
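
As a language-neutral illustration of steps 1 through 4 (not a substitute for the required IML module), the computation can be sketched in plain Python. The toy data, the helper names, and the use of cyclic Jacobi rotations for the spectral decomposition are all assumptions made for this sketch; it also assumes that C denotes the matrix of component scores obtained by projecting the centered data onto the eigenvectors, and that the sample covariance uses the divisor n - 1.

```python
import math

# Hypothetical toy data: 5 observations (rows) on 3 variables (columns).
X = [[2.0, 4.0, 1.0],
     [3.0, 6.0, 0.0],
     [5.0, 7.0, 2.0],
     [6.0, 9.0, 1.0],
     [9.0, 14.0, 3.0]]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def covariance(X):
    """Return (centered data, sample covariance matrix with divisor n - 1)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]
    S = [[v / (n - 1) for v in row] for row in matmul(transpose(Xc), Xc)]
    return Xc, S

def jacobi_eigen(S, tol=1e-20, max_sweeps=50):
    """Spectral decomposition S = V diag(lam) V^T of a symmetric matrix,
    computed with cyclic Jacobi rotations. Eigenvalues are returned in
    descending order; eigenvectors are the columns of V."""
    k = len(S)
    A = [row[:] for row in S]
    V = [[float(i == j) for j in range(k)] for i in range(k)]
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off < tol:
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p][q], kept within +/- pi/4.
                y, x = 2.0 * A[p][q], A[q][q] - A[p][p]
                if x < 0.0:
                    y, x = -y, -x
                theta = 0.5 * math.atan2(y, x)
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):      # right-multiply A and V by the rotation
                    for r in M:
                        r[p], r[q] = c * r[p] - s * r[q], s * r[p] + c * r[q]
                for j in range(k):    # left-multiply A by the rotation transposed
                    A[p][j], A[q][j] = (c * A[p][j] - s * A[q][j],
                                        s * A[p][j] + c * A[q][j])
    order = sorted(range(k), key=lambda i: -A[i][i])
    lam = [A[i][i] for i in order]
    V = [[row[i] for i in order] for row in V]
    return lam, V

Xc, S = covariance(X)        # step 1: covariance matrix
lam, V = jacobi_eigen(S)     # step 2: spectral decomposition
C = matmul(Xc, V)            # step 3: principal component scores
_, S_C = covariance(C)       # covariance matrix of the scores

# Step 4: total variance is the trace of the covariance matrix.
total_var_X = sum(S[i][i] for i in range(len(S)))
total_var_C = sum(S_C[i][i] for i in range(len(S_C)))
print(total_var_X, total_var_C, sum(lam))
```

The three printed values should agree up to rounding, because the trace of the covariance matrix equals the sum of its eigenvalues. In IML the decomposition itself would normally come from a built-in routine rather than a hand-written rotation loop.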

European Jobs Dataset:

This dataset (courtesy of the DASL archive) contains employment details for different industries in several European countries. Principal Components Analysis may be used to analyze this data, for example to identify countries with similar employment patterns.

Submission:

Submit a hard copy of your code, the test data, and the output produced. Do not submit your assignment by email.

Extra Credit (+10%):

Discuss the principal components obtained from your analysis. Identify the top two principal components (i.e. the two that explain the most variability). What interpretation, if any, can you give to these principal components?
Note: Before attempting an interpretation, consider whether the covariance matrix or the correlation matrix should be used. If you decide that the correlation matrix is more appropriate, redo the analysis using the correlation matrix. Also, you may want to review the NFL-2000 PCA paper mentioned on the Lecture Notes page.
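
To make the covariance-versus-correlation question concrete, the sketch below rescales a covariance matrix into a correlation matrix and compares each variable's share of the total variance under the two choices. The data and function names are hypothetical and chosen only to exaggerate the effect of measurement scale; this is an illustration in Python, not part of the required IML code.

```python
import math

# Hypothetical data: two variables measured on very different scales.
X = [[1200.0, 3.1],
     [1500.0, 2.9],
     [900.0, 3.6],
     [2000.0, 3.0],
     [1700.0, 2.5]]

def covariance(X):
    """Sample covariance matrix (divisor n - 1) of the columns of X."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]
    p = len(means)
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def correlation(S):
    """Rescale a covariance matrix by the standard deviations."""
    sd = [math.sqrt(S[i][i]) for i in range(len(S))]
    return [[S[i][j] / (sd[i] * sd[j]) for j in range(len(S))]
            for i in range(len(S))]

S = covariance(X)
R = correlation(S)

# Share of the total variance contributed by each variable.
cov_share = [S[i][i] / sum(S[k][k] for k in range(len(S))) for i in range(len(S))]
corr_share = [R[i][i] / sum(R[k][k] for k in range(len(R))) for i in range(len(R))]
print(cov_share, corr_share)
```

Under the covariance matrix the large-scale variable accounts for nearly all of the total variance and would dominate the first principal component, whereas under the correlation matrix every variable contributes equally. Whether that equal weighting is desirable depends on whether the variables share comparable units.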