Programming Assignment 3
CSC 323 - Data Analysis and Statistical Software
Due: Section 502 - 3/20/2002; Section 503 - 3/19/2002
You are a new hire at a local IT consulting firm and have
been assigned to the quality assurance team. You discover
that the team is investigating the relationship between software module
quality and a variety of metrics that may be obtained by examining
software modules. Your manager is particularly intrigued by two
of these metrics, complexity and coverage, and has
indicated that you will be helping with the data analysis
to investigate the relationship between quality and each of these
metrics. After some research on software metrics you discover the
following:
- The complexity of a module is essentially the number of
possible paths that could be taken in processing an input.
Any software module, written in any programming language,
may be represented as a network of nodes and edges.
In this representation,
a path is simply the sequence of edges that must be traversed in going
from the start of the program to the end of the program. Furthermore, the
number of possible paths (i.e., the complexity) of a module may be determined as follows:
complexity = # of edges - # of nodes + 1
- The coverage of a module is measured during the testing process.
There are several ways of measuring coverage but, in this case,
the metric of interest is simply the
proportion of decisions (e.g., if...then...else, do...while)
in a module that are executed by the series of test cases used
during the testing process.
Your manager explains that a random sample of modules
was selected from the software portfolio
for evaluation. Your manager further explains that incident
reports were examined and a quality score
assigned to each module. Furthermore, the test logs
have been examined for each module and coverage determined. Several
additional metrics were also recorded for each module. Your manager
presents you with the data collected for this experiment.
Each observation in the file consists of the following values, listed with their column positions:
- Nodes per LOC: columns 1-8
- Edges: columns 9-10
- Size (LOC): columns 11-13
- Quality: columns 14-16
- Coverage: columns 17-19
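To make the record layout concrete, here is a minimal Python sketch that splits one fixed-width record into its five values. It reads the numeric ranges in the list above as 1-based, inclusive column positions, which is an assumption. The `parse_record` helper and the sample record are hypothetical and are not taken from the assignment data. The assignment itself must still be completed in SAS.

```python
# Column layout from the list above, read as 1-based inclusive positions
# (an assumption about what the numeric ranges mean).
FIELDS = {
    "nodes_per_loc": (1, 8),
    "edges": (9, 10),
    "size": (11, 13),
    "quality": (14, 16),
    "coverage": (17, 19),
}


def parse_record(line):
    """Split one fixed-width record into a dict of floats."""
    return {
        name: float(line[start - 1:end])
        for name, (start, end) in FIELDS.items()
    }


# Made-up 19-character record for illustration only.
sample = "0.250000" + "30" + "100" + " 72" + " 85"
record = parse_record(sample)
print(record)
```

Each slice converts a 1-based inclusive range to Python's 0-based, end-exclusive indexing, which is why the start position is reduced by one and the end position is used unchanged.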
Use simple linear regression methods to conduct an
analysis of these data. Remember that you
have been asked to study the relationship between quality and
two factors that are thought to affect quality. These factors are
module complexity and testing coverage. Coverage has been supplied but
complexity must be computed from the data provided.
Consider quality to be the response variable and
coverage and complexity to be candidate explanatory variables.
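The file supplies nodes per LOC rather than a node count, so one plausible route to complexity is to recover the node count first. The Python sketch below assumes that nodes = (nodes per LOC) x (size in LOC) and then applies the formula complexity = # of edges - # of nodes + 1 given earlier. The function name and the input values are hypothetical, and whether the node count should be rounded to a whole number is left open.

```python
def complexity(nodes_per_loc, size_loc, edges):
    """Complexity = edges - nodes + 1, with nodes recovered from the file's columns."""
    # Assumption: the node count is nodes-per-LOC times module size in LOC.
    nodes = nodes_per_loc * size_loc
    return edges - nodes + 1


# Hypothetical module: 0.25 nodes per LOC, 100 LOC, 30 edges.
print(complexity(0.25, 100, 30))  # 30 - 25 + 1 = 6.0
```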
- Your program should accomplish the following: (30%)
Note: Code and output for the income/gpa problem discussed in class are available.
- Read your data from an external file.
- Compute complexity.
- Execute the PRINT procedure.
- Produce a scatterplot of the response variable
vs. each of the candidate explanatory variables.
- Execute PROC CORR for the response variable and each of the candidate explanatory variables.
- For the best explanatory variable:
- Generate estimates of your slope and intercept using PROC REG.
- Execute PROC UNIVARIATE, with appropriate options,
to test normality of your residuals.
Note: For PROC PRINT, be sure to use labels for
column headings rather than variable names. Use meaningful names for
data sets and variables. You should
generate an appropriate title for the output of these
procedures.
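As a hand check on the correlation and regression output, the least-squares estimates can be computed directly from their defining sums. The Python sketch below is one way to do that; it is not a substitute for the required SAS program. The `fit_line` helper and all data values are hypothetical. It treats the candidate with the larger squared correlation as the better explanatory variable, since in simple linear regression that is the proportion of variability in the response explained by the line.

```python
from math import sqrt


def fit_line(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


# Made-up values for four modules; not the assignment data.
quality = [50, 58, 61, 71]
candidates = {
    "coverage": [60, 70, 80, 90],
    "complexity": [12, 9, 10, 5],
}

# Squared correlation of each candidate with the response.
r_squared = {name: fit_line(x, quality)[2] ** 2 for name, x in candidates.items()}
best = max(r_squared, key=r_squared.get)
print(r_squared, best)
```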
- Your analysis should address the following: (70%)
An example of the format required for your analysis is provided. Note that
you must complete the section that discusses your reason for selecting the best
explanatory variable, and you must include a section for
worked problems.
- Identify the best explanatory variable, that is, the explanatory variable that does the best job of explaining variability in the response variable.
- For the best explanatory variable:
- State the regression model. Use appropriate symbols for all of its parameters.
- Provide estimates of the model parameters and state the regression equation. You must interpret the
coefficients of the regression equation.
- Assess normality for the residuals. You must state
the appropriate hypotheses to assess this assumption. If
normality is reasonable:
- Predict the quality score for:
- a complexity score of 20, if you determine that complexity is the best explanatory variable, OR
- a coverage score of 90, if you determine that coverage is the best explanatory variable
- Determine the proportion of modules that
you would expect to obtain a quality
score greater than 65 given:
- a complexity score of 20, if you determine that complexity is the best explanatory variable, OR
- a coverage score of 90, if you determine that coverage is the best explanatory variable
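One common way to approach the two worked problems above is sketched below in Python, under stated assumptions. It assumes that, at a given value of the explanatory variable, quality scores are normally distributed around the fitted value with a standard deviation estimated by the root mean squared error from the regression, and it ignores the uncertainty in the estimated coefficients. The coefficient values and the function names are hypothetical; substitute the estimates from your own regression output.

```python
from statistics import NormalDist


def predict(intercept, slope, x0):
    """Fitted quality score at explanatory value x0."""
    return intercept + slope * x0


def proportion_above(threshold, intercept, slope, x0, root_mse):
    """Estimated proportion of modules at x0 scoring above the threshold."""
    # Assumption: quality at x0 is Normal(fitted value, root_mse).
    fitted = predict(intercept, slope, x0)
    return 1.0 - NormalDist(mu=fitted, sigma=root_mse).cdf(threshold)


# Hypothetical estimates for a coverage model; not real results.
b0, b1, s = 10.5, 0.66, 4.0
print(predict(b0, b1, 90))
print(proportion_above(65, b0, b1, 90, s))
```

The same calculation can be done by hand by standardizing: z = (65 - fitted value) / root MSE, then looking up the upper-tail area for z in a standard normal table.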