Practice Midterm

Practice Midterm -- Summer 2009

Multiple Choice Questions

For each question, show your work or give a reason explaining your answer. 4 points for the reason, 1 point for the correct answer.

How many statistics are needed to parsimoniously describe a univariate normal dataset?
a. 1 b. 2 c. 4 d. 5
Ans: b. 2. The sample mean and SD completely describe a univariate normal dataset.
Which of these variables is an example of a nominal variable?
a. class rank b. height c. income d. occupation
Ans: d. occupation. Nominal variables are categorical or nonnumeric variables.
Find the IQR of this dataset. Use Tukey's hinges method to compute Q1 and Q3.
a. 0.009 b. 0.015 c. 0.032 d. 0.049
Ans: b. Q1, the median of the bottom half of the dataset (the bottom six numbers), is the mean of 0.028 and 0.030, which is 0.029. Q3, the median of the top half (the top six numbers), is the mean of 0.039 and 0.049, which is 0.044. IQR = Q3 - Q1 = 0.044 - 0.029 = 0.015.
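Tukey's hinges can be sketched in Python as shown here. The dataset in this sketch is hypothetical (it is not the dataset from the question), and tukey_hinges is an illustrative helper name, not a library function.

```python
from statistics import median

def tukey_hinges(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When the number of observations is odd, the overall median is
    included in both halves, which is Tukey's convention for hinges.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2      # size of each half (median shared if n is odd)
    lower = xs[:half]
    upper = xs[n - half:]
    return median(lower), median(upper)

# Hypothetical dataset with 12 observations (illustration only)
data = [1, 3, 4, 6, 7, 8, 10, 12, 13, 15, 18, 20]
q1, q3 = tukey_hinges(data)
iqr = q3 - q1
print(q1, q3, iqr)
```

With 12 observations, each half has six numbers, so each hinge is the mean of the two middle numbers of its half, just as in the answer above.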
Which of the following univariate datasets is skewed to the left?

Ans: a. This box plot has a long whisker to the left. b is symmetric, and c and d are skewed to the right.
Horse pregnancies are normally distributed with a mean gestation period of 336 days and an SD of 3 days. What percentage of horse pregnancies last longer than 340 days?
a. 9.2 b. 42.8 c. 90.8 d. 92.0
Ans: a. z = (340 - 336) / 3 = 1.33. Looking up the bin (-∞, 1.33] in the normal table gives an area of 0.9082. The area under the bin [1.33, ∞) is 1 - 0.9082 = 0.0918 = 9.2%.
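The table lookup can be checked with Python's standard library. This is a minimal sketch; the z-score is rounded to two decimals to mimic a printed normal table, so the result matches the table-based answer rather than the unrounded one.

```python
from statistics import NormalDist

mean, sd = 336, 3      # gestation parameters from the question
x = 340

z = round((x - mean) / sd, 2)       # 1.33, rounded as in a printed table
area_below = NormalDist().cdf(z)    # area over (-inf, z]
area_above = 1 - area_below         # area over [z, inf)

print(f"z = {z}, P(X > {x}) = {area_above:.4f}")
```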
What IQ do you need to be in the 80th percentile?
a. 80.0 b. 106.2 c. 112.6 d. 122.8
Ans: c. Look up 0.8000 in the body of the normal table. The closest z-score is 0.84. Since IQ scores are scaled to have a mean of 100 and SD = 15, we have 0.84 = (x - 100) / 15. Solving for x gives 112.6.
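Reading the normal table "backwards" corresponds to the inverse of the cumulative distribution function. A short sketch, using the mean of 100 and SD of 15 stated in the answer:

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)   # IQ scale stated in the answer
z80 = NormalDist().inv_cdf(0.80)    # z-score with 80% of the area below it
x80 = iq.inv_cdf(0.80)              # same percentile on the IQ scale

print(f"z = {z80:.2f}, IQ = {x80:.1f}")
```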
Which of the following is not a good idea with a regression equation?
a. extrapolation b. interpolation c. prediction d. validation
Ans: a. Extrapolation (going past the range of the data for predictions) is never a good idea because the response curve might be nonlinear in an unexpected way.
If the correlation between x and y is 0.85, then what percentage of variation in y can be explained by the variation in x?
a. 15% b. 55% c. 72% d. 85%
Ans: c. The r-squared value is the percentage of variation in y that can be explained by x. R² = 0.85² = 0.7225, or about 72%.
Compute the correlation of x and y. They are already standardized (have mean=0 and SD=1).
Your answer will be different from the correlation computed by SPSS because SPSS uses SD⁺ instead of SD.
a. -0.55 b. -0.35 c. 0.35 d. 0.45
Ans: Compute the average of the products:
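The "average of the products" rule can be sketched as follows. The data here is hypothetical (it is not the dataset from the question), and standardize and correlation are illustrative helper names. The sketch standardizes with the SD that divides by n, as the question assumes.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Convert values to standard units using the SD that divides by n."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Correlation as the average of the products of standardized values."""
    zx, zy = standardize(xs), standardize(ys)
    return fmean(a * b for a, b in zip(zx, zy))

# Hypothetical data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = correlation(x, y)
print(round(r, 2))
```

If x and y are already standardized, as in the question, the standardize step changes nothing and r is just the average of the products x·y.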
Which of these statements is false about Carl Friedrich Gauss?
a. He discovered a summation formula when he was five years old.
b. He discovered the Central Limit Theorem.
c. He so alienated his sons that two of them moved to America from Germany.
d. He was the first to publish the least squares method for obtaining a regression line.
Ans: b is false. De Moivre first stated the Central Limit Theorem.

Short Essay

For full credit, use complete sentences and paragraphs. Give examples if you wish. Your explanation should make sense to someone who does not understand statistics, like your mother.

What do the sample mean and SD tell you about a dataset?
Why is correlation not always the same as causation?

Problems

Show all of your work. You may use a calculator.

Given this table of grouped data for a histogram, do the following:

Bin	Percentage of Observations
[1,3)	30%
[3,4)	40%
[4,5)	10%
[5,7)	10%
[7,11]	10%
1. Give the heights of the histogram bars. The units of the heights are percent per horizontal unit. You do not need to submit your drawing of the histogram.
  Ans: Recall that the area, not the height, of each bar is proportional to the number of observations in that bin. The heights are 30/(3-1) = 15, 40/(4-3) = 40, 10/(5-4) = 10, 10/(7-5) = 5, 10/(11-7) = 2.5.
2. Compute Q1, Q2, Q3 and IQR.
  Ans: The cumulative percentages are 30, 70, 80, 90, 100.
  Q1: 25% is 5/6 of the way from 0 to 30, so Q1 must be 5/6 of the way from 1 to 3: 1 + (5/6)(3-1) = 2.667.
  Q2: 50% is half of the way from 30 to 70, so Q2 is half of the way from 3 to 4: 3.5.
  Q3: 75% is half of the way from 70 to 80, so Q3 is half of the way from 4 to 5: 4.5.
  IQR = Q3 - Q1 = 4.5 - 2.667 = 1.833.
3. Compute the mean using a weighted average.
  Ans: (2 × 30 + 3.5 × 40 + 4.5 × 10 + 6 × 10 + 9 × 10) / (30 + 40 + 10 + 10 + 10) = 3.95.
4. What percentage of the observations are between 4.5 and 7.0?
  Ans: 4.5 to 5 is half of the bin [4,5) and 5 to 7 is the complete bin [5,7), so the percentage is (1/2)10 + 10 = 15%.
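The four answers above can be reproduced with a short Python sketch. It assumes, as the answers do, that observations are spread evenly within each bin. The function names percentile and percent_between are illustrative only.

```python
# Bins (low, high, percentage) from the table in the problem
bins = [(1, 3, 30), (3, 4, 40), (4, 5, 10), (5, 7, 10), (7, 11, 10)]

# Height = percentage / bin width (percent per horizontal unit)
heights = [pct / (hi - lo) for lo, hi, pct in bins]

def percentile(p):
    """Interpolate linearly inside the bin where the cumulative % reaches p."""
    cum = 0
    for lo, hi, pct in bins:
        if cum + pct >= p:
            return lo + (p - cum) / pct * (hi - lo)
        cum += pct
    raise ValueError("p must be between 0 and 100")

def percent_between(a, b):
    """Percentage of observations between a and b (area under the histogram)."""
    total = 0.0
    for (lo, hi, pct), h in zip(bins, heights):
        overlap = max(0, min(b, hi) - max(a, lo))
        total += h * overlap
    return total

q1, q2, q3 = percentile(25), percentile(50), percentile(75)

# Weighted average of the bin midpoints, weighted by percentage
mean = sum((lo + hi) / 2 * pct for lo, hi, pct in bins) / sum(pct for _, _, pct in bins)

print(heights)
print(round(q1, 3), q2, q3, round(q3 - q1, 3))
print(mean, percent_between(4.5, 7.0))
```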
Pick 5 numbers that have a median of 4 and a mean of 7. Show or explain how you chose the numbers of the dataset.
Ans: Pick the numbers 4, 4, 4, 4, x. No matter what x is, the median is 4. Since the mean is 7, solve for x in (4 + 4 + 4 + 4 + x) / 5 = 7 to get x = 19.
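A quick check of this construction in Python, as a sketch (the variable names are illustrative):

```python
from statistics import mean, median

target_mean, n = 7, 5
fixed = [4, 4, 4, 4]                  # four values pinned at the median
x = target_mean * n - sum(fixed)      # the total must be 5 * 7 = 35
data = fixed + [x]

print(data, median(data), mean(data))
```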
Compute the SD⁺ of this dataset by hand:
Ans: 2.32
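The hand computation follows the steps sketched below. The dataset is hypothetical (it is not the dataset from the problem), and the sketch assumes SD⁺ denotes the version of the SD that divides by n - 1, which is the one SPSS reports.

```python
from math import sqrt
from statistics import pstdev, stdev

# Hypothetical dataset (illustration only)
data = [2, 4, 4, 5, 7, 8]

n = len(data)
avg = sum(data) / n
sum_sq_dev = sum((v - avg) ** 2 for v in data)   # sum of squared deviations

sd = sqrt(sum_sq_dev / n)              # SD: divide by n
sd_plus = sqrt(sum_sq_dev / (n - 1))   # SD+: divide by n - 1 (assumed definition)

print(round(sd, 3), round(sd_plus, 3))

# Cross-check against the standard library
print(round(pstdev(data), 3), round(stdev(data), 3))
```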
An analysis shows that the midterm (x-variable) and final scores (y-variable) in a large class are bivariate normal. Here are the summary variables:
1. About what percentage of the students have final exam scores over 80?
  Ans: z = (80 - 65) / 20 = 0.75. The normal table gives 0.7734 for the area over the bin (-∞, 0.75], so the area over [0.75, ∞) is 1 - 0.7734 = 0.2266, or about 23%.
2. Find the regression line for predicting final score from midterm score.
  Ans: The regression line is y = 0.48x + 36.2, where x is the midterm score and y is the predicted final score.
3. What is the predicted final score for a student with a midterm score of 80?
  Ans: Plug x = 80 into the regression equation: 74.6
4. What is the predicted final score for a student with a midterm score of 50?
  Ans: Plug x = 50 into the regression equation: 60.2
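For bivariate normal data, the regression line can be built from summary statistics: the slope is r × SD(y) / SD(x), and the line passes through the point of averages. The sketch below illustrates this. The final exam mean of 65 and SD of 20 come from the answer to part 1; the midterm mean, midterm SD, and correlation are placeholder values, since the summary table is not shown here, and they were picked only so that the output agrees with the predictions in parts 3 and 4.

```python
# Final exam statistics, taken from the answer to part 1
mean_y, sd_y = 65, 20

# Placeholder values (assumptions, not taken from the exam's summary table)
mean_x, sd_x = 60, 25
r = 0.6

slope = r * sd_y / sd_x                 # change in predicted y per unit of x
intercept = mean_y - slope * mean_x     # line passes through (mean_x, mean_y)

def predict(x):
    """Predicted final score for a given midterm score."""
    return slope * x + intercept

print(f"y = {slope:.2f} x + {intercept:.1f}")
print(round(predict(80), 1), round(predict(50), 1))
```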

SPSS Analysis

Perform the following analyses with SPSS. Save your output file as a Word .doc file. Type any interpretation of the output into the output file itself. Questions marked with an asterisk (*) require typed output in addition to the SPSS output.

The dataset lake-michigan-levels.xls contains the variables year and waterLevel in feet above sea level.
1. Create labels for the variables in the dataset.
2. Print the dataset including both year and waterLevel.
3. *Determine these univariate statistics: Q0, Q1, Q2, Q3, Q4, mean, SD, SE(ave).
  Ans: Q0 = 576.8, Q1 = 577.64, Q2 = 578.81, Q3 = 579.76, Q4 = 581.56, mean = 578.86, SD = 1.31, SE(ave) = 0.252.
4. *Determine a 95% confidence interval for the true water level.
  Ans: [578.34, 579.37]
5. *Graph the boxplot. Are there any outliers?
  Ans: No outliers on the boxplot.
6. *Graph the normal plot. What does the normal plot tell you?
  Ans:
7. Plot the water level vs. year. Does the plot appear unbiased and homoscedastic?
  Ans: The plot is relatively homoscedastic, but biased. There is a trend from higher to lower water levels.
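The confidence interval in item 4 can be approximated from the statistics reported in item 3. This is a sketch under two assumptions that are not stated in the exam: the sample size is inferred from SE(ave) = SD / √n, and the critical value 2.056 is read from a t-table for 26 degrees of freedom. Because the inputs are rounded, the endpoints may differ from the SPSS output in the last digit.

```python
mean, sd, se = 578.86, 1.31, 0.252   # values reported in the answer to item 3

n = round((sd / se) ** 2)    # inferred sample size (assumption)
t_crit = 2.056               # 95% t-table value for n - 1 = 26 df (assumption)

lower = mean - t_crit * se
upper = mean + t_crit * se

print(n, round(lower, 2), round(upper, 2))
```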
The dataset carweight-mpg.xls contains the variables model, weight, and cityMPG, for several current models.
1. Create labels for the variables in the dataset.
2. Print the dataset.
3. *Determine the correlation of cityMPG and weight.
  Ans: r = -0.819, a negative relationship
4. *Determine the regression equation, saving the unstandardized predicted values and unstandardized residuals. What is the regression equation?
  Ans: y = -0.007x + 41.607
5. Create the boxplot of the residuals.
6. *Create the residual plot (scatterplot of the residuals vs. unstandardized predicted values). Is the residual plot unbiased and homoscedastic?
  Ans: The residual plot shows that the residuals are relatively unbiased and homoscedastic.
7. *Create the normal plot of the residuals. What does the normal plot tell you?
  Ans: Except for one point on the upper right, the residuals are fairly normal.
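For reference, the regression steps that SPSS performs (fitting the line, then saving the unstandardized predicted values and residuals) can be sketched in Python. The weights and MPG values below are hypothetical and are not taken from carweight-mpg.xls; least_squares is an illustrative helper name.

```python
from statistics import fmean

def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical car weights and city MPG values (illustration only)
weight = [2500, 3000, 3500, 4000, 4500]
city_mpg = [30, 26, 22, 20, 17]

slope, intercept = least_squares(weight, city_mpg)
predicted = [slope * w + intercept for w in weight]       # unstandardized predicted values
residuals = [y - p for y, p in zip(city_mpg, predicted)]  # unstandardized residuals

print(f"y = {slope:.4f} x + {intercept:.3f}")
print([round(e, 2) for e in residuals])
```

Least squares residuals always average to zero, so the residual plot is judged by whether the points scatter evenly around zero across the range of predicted values.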