CSC423/324 - Data Analysis

Introduction/Review I

The following is a review of material that you should have covered in a first class in data analysis or statistics.

Readings

Ott, Chapter 1, section 1.1
Ott, Chapter 3, section 3.3 - 3.5
Ott, Chapter 4, section 4.10 - 4.12

Concepts & Terminology

Measures of Central Tendency:

Given a set of numeric values, the following are two of these measures.

Mean (Average):
The sum of all values divided by the total number of values.
Median:
The middle value when values are arranged from lowest to highest.

Measures of Variability:

Given a set of numeric values, the following are three of these measures.

Range:
The difference between the largest and the smallest values.
Variance:
The sum of squared deviations from the mean divided by n-1.
Standard Deviation:
The positive square root of the variance.
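
The definitions above can be checked on a small data set. The following is a minimal sketch using Python's standard statistics module; the data values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical data set, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)            # sum of all values / number of values
median = statistics.median(values)        # middle value of the sorted data
value_range = max(values) - min(values)   # largest value minus smallest value
variance = statistics.variance(values)    # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(values)        # positive square root of the variance

print(mean, median, value_range, variance, std_dev)
```

Note that statistics.variance and statistics.stdev divide by n-1, which matches the definition of variance given above.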

Population:

The set of all things of interest to the data analyst.

e.g.

The set of Java programs in a portfolio.
The set of database transactions processed by a DBMS in a particular time period.
The set of CTI-2002 graduates.

Since data analysis is a quantitative discipline, we are interested in some characteristic of these things that may be expressed numerically. For example, for the set of Java programs we may be interested in size or quality; for the set of database transactions, in wait time; and for the set of CTI-2002 graduates, in salary or GPA.

Parameter:

Any numeric quantity computed from a population. These quantities are denoted by Greek symbols.

e.g. Mean - μ; Variance - σ²; Correlation - ρ; etc.

Sample:

Any subset selected from a population.
Note: If the population contains N items and the sample n items then 1 < n < N.

Statistic:

Any numeric quantity computed from a sample. These quantities are denoted by English symbols.

e.g. Mean - ȳ (y-bar); Variance - s²; Correlation - r; etc.

For valid statistical inferences, the sample should be selected by a scheme that ensures it is representative of the population (i.e. preserves unbiasedness). Such schemes require some type of random selection procedure.
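
To illustrate the difference between a parameter and a statistic, here is a minimal sketch that draws a simple random sample from a population. The population of wait times is simulated and entirely hypothetical, as are the sizes N = 1000 and n = 50.

```python
import random
import statistics

random.seed(423)  # fixed seed so the sketch is repeatable; any seed would do

# Hypothetical population: wait times (in ms) for N = 1000 database transactions.
population = [random.uniform(5, 200) for _ in range(1000)]

# Simple random sample of n = 50 items, drawn without replacement.
sample = random.sample(population, 50)

mu = statistics.mean(population)   # parameter: computed from the whole population
ybar = statistics.mean(sample)     # statistic: computed from the sample only

print(f"population mean = {mu:.2f}, sample mean = {ybar:.2f}")
```

Because the sample is selected at random, the sample mean will usually be close to, but not exactly equal to, the population mean.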

Distribution:

Consider a population where the measurement of interest is some numeric quantity. We refer to the dispersion of this measurement over the range of measurements as the distribution of the population.

We often refer to the shape of a distribution. You may think of the shape of a distribution as the smoothed curve traced by the tops of the columns of a histogram constructed from the values of the measurement of interest. Many populations in the real world have a mound-shaped distribution. We are particularly interested in two such distributions:

The Normal distribution.
The Student t distribution.

Note: The readings above cover the Normal distribution but not the Student t distribution. Also, you will notice that the presentation in the text is in terms of probabilities, not proportions as presented below. For our purposes, think of these as equivalent notions.

Normal Distribution

Consider a population where the numeric measurement of interest has mean μ and standard deviation σ. If the values of this measurement are normally distributed then the following properties apply:

These values are symmetrically distributed about the mean.

A histogram of these values will be bell shaped.

These values satisfy the empirical rule:

68.27% of the values will be within one standard deviation of the mean.

95.45% of the values will be within two standard deviations of the mean.

99.73% of the values will be within three standard deviations of the mean.
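
The empirical rule percentages can be reproduced numerically. The following is a minimal sketch using statistics.NormalDist from the Python standard library; the mean of 50 and standard deviation of 10 are arbitrary assumptions, and any other choice gives the same percentages.

```python
from statistics import NormalDist

# Arbitrary (hypothetical) mean and standard deviation; the rule holds for any choice.
mu, sigma = 50, 10
dist = NormalDist(mu, sigma)

def percent_within(k):
    """Percentage of values within k standard deviations of the mean."""
    return 100 * (dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma))

empirical_rule = {k: round(percent_within(k), 2) for k in (1, 2, 3)}
print(empirical_rule)
```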

Standard Normal Distribution :
Consider a population where the numeric measurement of interest is normally distributed with μ = 0 and σ = 1. Such a normal distribution is referred to as a Standard Normal Distribution. Tables (see Ott, page 1091) are available for the Standard Normal Distribution which allow us to determine the proportion of values between any two values to a reasonable degree of accuracy.

Problem:

You are told that the measurement of interest for some population has a standard normal distribution. Find the proportion of observations that satisfy the following:

greater than 1.25

less than -0.4

between 0.4 and 1.3

between -1.5 and 1.5

Solutions:

Remember to draw pictures:

greater than 1.25

From the standard normal table (page 1091) we know that 89.44% of observations are less than 1.25. Hence, the desired proportion is 100 - 89.44 = 10.56%.

less than -0.4

From the standard normal table 34.46%

between 0.4 and 1.3

From the standard normal table 90.32 - 65.54 = 24.78%

between -1.5 and 1.5

From the standard normal table 93.32 - 6.68 = 86.64%.
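
The table lookups for these four problems can be cross-checked in software. The following is a minimal sketch using statistics.NormalDist, whose cdf method plays the role of the standard normal table (proportion less than a given z); the variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

greater_than_1_25 = 1 - z.cdf(1.25)               # upper tail beyond 1.25
less_than_neg_0_4 = z.cdf(-0.4)                   # lower tail below -0.4
between_0_4_and_1_3 = z.cdf(1.3) - z.cdf(0.4)     # area between 0.4 and 1.3
between_neg_1_5_and_1_5 = z.cdf(1.5) - z.cdf(-1.5)  # area between -1.5 and 1.5

for label, p in [
    ("greater than 1.25", greater_than_1_25),
    ("less than -0.4", less_than_neg_0_4),
    ("between 0.4 and 1.3", between_0_4_and_1_3),
    ("between -1.5 and 1.5", between_neg_1_5_and_1_5),
]:
    print(f"{label}: {100 * p:.2f}%")
```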

Transformation Rule

Given a set of measurements known to be normally distributed, but for which μ ≠ 0 or σ ≠ 1, the proportion of measurements between two arbitrary values (say x₁ and x₂) may be determined by first transforming the x values to z values and then using the Standard Normal table. The steps are:

Convert x₁ and x₂ to z₁ and z₂ using:
z = (x - μ)/σ
Use the Standard Normal table to determine the proportion between z₁ and z₂.

Problem:

For some population, the measurement of interest is known to be normally distributed with mean 50 and standard deviation 10. Determine the percentage of observations that are between 35 and 70.
Solution:

First, use the transformation rule to standardize 35 and 70:

z₁ = (35 - 50)/10 = -1.5

z₂ = (70 - 50)/10 = 2.0

You now need to determine the area under the standard normal curve between -1.5 and 2.0. That is, 97.72 – 6.68 = 91.04%
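
The transformation rule for this problem can be sketched in code as follows. This is a minimal illustration using statistics.NormalDist; it also computes the same proportion directly from a normal distribution with mean 50 and standard deviation 10, to show that both routes agree.

```python
from statistics import NormalDist

mu, sigma = 50, 10   # population mean and standard deviation from the problem
x1, x2 = 35, 70      # interval of interest

# Step 1: convert the x values to z values with z = (x - mu) / sigma.
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

# Step 2: find the proportion between z1 and z2 under the standard normal curve.
std_normal = NormalDist()
via_z = std_normal.cdf(z2) - std_normal.cdf(z1)

# Cross-check: the same proportion computed without standardizing.
original = NormalDist(mu, sigma)
direct = original.cdf(x2) - original.cdf(x1)

print(f"z1 = {z1}, z2 = {z2}, proportion = {100 * via_z:.2f}%")
```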

Student t Distribution :
Consider a population where the numeric measurement of interest has mean μ and standard deviation σ. If these measurements are Student t distributed then the following properties apply:
Properties:
1. These measurements are symmetrically distributed about the mean.
2. A histogram of these measurements will be bell shaped.
Notes:
1. The empirical rule does not apply. The Student t distribution with mean μ and standard deviation σ has fatter tails than the Normal distribution with the same mean and standard deviation.
2. For mean μ = 0 and standard deviation σ = 1 it is referred to as the Standard Student t distribution.
3. For μ ≠ 0 or σ ≠ 1 we may use the transformation rule: t = (x - μ)/σ
Problem:

You are told that the measurement of interest for some population has a Student t distribution. Solve the following:
1. Let the degrees of freedom be 25. If the mean is zero and the standard deviation is one then determine the proportion greater than 2.
2. Let the degrees of freedom be 12. If the mean is fifty and the standard deviation is ten then determine the proportion less than 85.
Solutions:

Remember to draw pictures.
Note: Remember that you may obtain the answer by scanning the appropriate df row for the desired t value. Note that the column headings are expressed as decimals and represent the proportion greater than t.
1. Since the mean is zero and the standard deviation is one then we may directly use the standard Student t table. Scanning the df=25 row, 2 is between 1.708 and 2.060 and so the desired proportion is between 2.5% and 5%.
2. In this case, since the mean is not zero and the standard deviation is not one then you must first use the transformation rule.
  t = (85 - 50)/10 = 3.5
  So, we may solve by determining the proportion less than 3.5. Scanning the df=12 row, 3.5 is between 3.055 and 3.930 and so the proportion greater than 3.5 is between 0.1% and 0.5%. However, we want the proportion less than 3.5 and so the desired proportion is between 99.5% and 99.9%.
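
The table brackets in both solutions can be narrowed numerically. The Python standard library has no Student t function, so the following minimal sketch integrates the usual Student t density (the one tabulated in standard t tables, which is an assumption about the table used here) with Simpson's rule. The function names t_pdf and t_upper_tail are illustrative, not from any library.

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Proportion greater than t (for t >= 0), using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # By symmetry, half of the distribution lies above zero.
    return 0.5 - area_0_to_t

# Problem 1: df = 25, proportion greater than 2.
p1 = t_upper_tail(2.0, 25)

# Problem 2: df = 12, transform 85 to t = (85 - 50)/10, then take the lower proportion.
t2 = (85 - 50) / 10
p2_upper = t_upper_tail(t2, 12)
p2_lower = 1 - p2_upper

print(f"problem 1: {100 * p1:.2f}%   problem 2: {100 * p2_lower:.2f}%")
```

Both results should fall inside the ranges read from the table: between 2.5% and 5% for the first problem, and between 99.5% and 99.9% for the second.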