Assignment #3

Programming Assignment 3

CSC 323 - Data Analysis and Statistical Software

Due: Section 502 - 3/20/2002; Section 503 - 3/19/2002

You are a new hire at a local IT consulting firm and have been assigned to the quality assurance team. You discover that the team is investigating the relationship between software module quality and a variety of metrics that may be obtained by examining software modules. Your manager is particularly intrigued by two of these metrics, complexity and coverage, and has indicated that you will be helping with the data analysis to investigate the relationship between quality and each of these metrics. After some research on software metrics you discover the following:

The complexity of a module is essentially the number of possible paths that could be taken in processing an input. Any software module, written in any programming language, may be represented as a network of nodes and edges. In this representation, a path is simply the sequence of edges traversed in going from the start of the program to the end of the program. The number of possible paths (i.e., the complexity) of a module may be determined as follows:
complexity = # of edges - # of nodes + 1
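
As a minimal sketch of the formula above, the Python snippet below counts the nodes and edges of a small, made-up control-flow graph and applies complexity = edges - nodes + 1. The graph, its node names, and the edge linking the end of the module back to its start are assumptions made for illustration only; with that closing edge included, a single if/else decision yields a complexity of 2, matching its two possible paths.

```python
# Hypothetical control-flow graph for a module with one if/else decision.
# Each key is a node; each value lists the nodes reachable by one edge.
# Assumption: the graph includes an edge from "end" back to "start".
control_flow_graph = {
    "start": ["decision"],
    "decision": ["then_branch", "else_branch"],
    "then_branch": ["end"],
    "else_branch": ["end"],
    "end": ["start"],
}


def count_nodes(graph):
    """Return the number of nodes in the graph."""
    return len(graph)


def count_edges(graph):
    """Return the number of edges, summed over every node's successors."""
    return sum(len(successors) for successors in graph.values())


def complexity(graph):
    """Apply the assignment's formula: edges - nodes + 1."""
    return count_edges(graph) - count_nodes(graph) + 1


print(complexity(control_flow_graph))  # prints 2
```
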
The coverage of a module is measured during the testing process. There are several ways of measuring coverage but, in this case, the metric of interest is simply the proportion of decisions (e.g., if...then...else, do...while) in a module that are executed by the series of test cases used during the testing process.

Your manager explains that a random sample of modules was selected from the software portfolio for evaluation, that incident reports were examined, and that a quality score was assigned to each module. The test logs for each module were also examined to determine coverage, and several additional metrics were recorded. Your manager presents you with the data collected for this experiment. Each observation in the file consists of the following values, listed with their column positions:

Nodes per LOC; 1-8
Edges; 9-10
Size (LOC); 11-13
Quality; 14-16
Coverage; 17-19
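
The layout above can be read as a fixed-width record. The following Python sketch slices one line into its five fields, assuming the ranges are 1-based, inclusive column positions. The field names and the sample line are hypothetical and are not taken from the assignment data.

```python
# Column positions from the layout above, read as 1-based and inclusive
# (an assumption about how the data file is organized).
FIELD_COLUMNS = {
    "nodes_per_loc": (1, 8),
    "edges": (9, 10),
    "size_loc": (11, 13),
    "quality": (14, 16),
    "coverage": (17, 19),
}


def parse_record(line):
    """Slice one fixed-width line into a dict of numeric values."""
    record = {}
    for name, (first, last) in FIELD_COLUMNS.items():
        # Convert the 1-based inclusive range to a Python slice.
        text = line[first - 1:last].strip()
        record[name] = float(text)
    return record


# Made-up line for illustration only.
sample_line = "0.25000012 36 72 85"
print(parse_record(sample_line))
```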

Use simple linear regression methods to analyze these data. Remember that you have been asked to study the relationship between quality and two factors thought to affect it: module complexity and testing coverage. Coverage has been supplied, but complexity must be computed from the data provided.
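
One way to derive complexity from the recorded values is sketched below in Python. It assumes that the number of nodes can be recovered as nodes per LOC multiplied by size in LOC, and then applies complexity = edges - nodes + 1. The function name and the input values are hypothetical.

```python
def module_complexity(nodes_per_loc, edges, size_loc):
    """Compute complexity = edges - nodes + 1 for one module.

    Assumption: nodes = (nodes per LOC) * (size in LOC).
    The result is left unrounded.
    """
    nodes = nodes_per_loc * size_loc
    return edges - nodes + 1


# Made-up values for illustration: 0.25 nodes per LOC, 12 edges, 36 LOC.
print(module_complexity(0.25, 12, 36))  # prints 4.0
```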

Consider quality to be the response variable and coverage and complexity to be candidate explanatory variables.

Your program should accomplish the following: (30%)
Note: Code and output for the income/GPA problem discussed in class are available.
1. Read your data from an external file.
2. Compute complexity.
3. Execute the PRINT procedure.
4. Produce a scatterplot of the response variable vs. each of the candidate explanatory variables.
5. Execute PROC CORR for the response variable and each of the candidate explanatory variables.
6. For the best explanatory variable:
  1. Generate estimates of your slope and intercept using PROC REG.
  2. Execute PROC UNIVARIATE, with appropriate options, to test normality of your residuals.
  Note: For PROC PRINT, be sure to use labels rather than variable names for column headings. Use meaningful names for data sets and variables. You should generate an appropriate title for the output of these procedures.
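
To illustrate the quantities that steps 5 and 6 above report, the Python sketch below computes a Pearson correlation, least-squares slope and intercept, and residuals by hand. It is not SAS code and does not replace the required procedures. The function names and the four data points are made up for illustration.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values; these feed the normality check."""
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Made-up sample, not the assignment data.
coverage = [60, 70, 80, 90]
quality = [50, 58, 61, 71]

r = pearson_r(coverage, quality)
b0, b1 = least_squares(coverage, quality)
print(r, b0, b1, residuals(coverage, quality, b0, b1))
```

When comparing two candidate explanatory variables, the one whose correlation with the response is larger in absolute value explains more of the response's variability in a simple linear regression.
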
Your analysis should address the following. An example of the format required for your analysis is provided. Note that you must complete the section that discusses your reason for selecting the best explanatory variable and you must include a section for worked problems. (70%)
1. Identify the best explanatory variable, that is, the explanatory variable that does the best job of explaining variability in the response variable.
2. For the best explanatory variable:
  1. State the regression model. Use appropriate symbols for all of its parameters.
  2. Provide estimates of the model parameters and state the regression equation. You must interpret the coefficients of the regression equation.
3. Assess normality for the residuals. You must state the appropriate hypotheses to assess this assumption. If normality is reasonable:
  1. Predict the quality score for:
    - a complexity score of 20, if you determine that complexity is the best explanatory variable OR
    - a coverage score of 90, if you determine that coverage is the best explanatory variable
  2. Determine the proportion of modules that you would expect to obtain a quality score greater than 65 given:
    - a complexity score of 20, if you determine that complexity is the best explanatory variable OR
    - a coverage score of 90, if you determine that coverage is the best explanatory variable
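
The two worked problems above can be sketched in Python as follows, assuming the usual simple linear regression model, quality = b0 + b1 * x + error, with normally distributed errors whose standard deviation is estimated by the root mean square error from the regression output. The intercept, slope, and residual standard deviation used here are hypothetical placeholders, not results from the assignment data.

```python
import math


def predict(intercept, slope, x):
    """Predicted mean response at explanatory value x."""
    return intercept + slope * x


def proportion_above(threshold, predicted_mean, residual_sd):
    """Expected proportion of responses above threshold at a given x.

    Assumption: responses at x are normal with the predicted mean and
    a standard deviation equal to the residual standard deviation.
    """
    z = (threshold - predicted_mean) / residual_sd
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical estimates for illustration only.
intercept, slope, residual_sd = 10.0, 0.6, 5.0

y_hat = predict(intercept, slope, 90)
share = proportion_above(65, y_hat, residual_sd)
print(y_hat, round(share, 4))
```

Substitute the estimates from your own PROC REG output and the value of x that matches your chosen explanatory variable.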