Study Question: How is the correlation coefficient (\(r\)) related to the association between two quantitative variables?
Example: Consider a sample of 31 observations of the girth (in), height (ft), and volume (cubic feet) of black cherry trees from Allegheny National Forest.
Tree | Girth | Height | Volume |
---|---|---|---|
1 | 8.3 | 70 | 10.3 |
2 | 8.6 | 65 | 10.3 |
3 | 8.8 | 63 | 10.2 |
4 | 10.5 | 72 | 16.4 |
5 | 10.7 | 81 | 18.8 |
6 | 10.8 | 83 | 19.7 |
7 | 11.0 | 66 | 15.6 |
8 | 11.0 | 75 | 18.2 |
9 | 11.1 | 80 | 22.6 |
10 | 11.2 | 75 | 19.9 |
11 | 11.3 | 79 | 24.2 |
12 | 11.4 | 76 | 21.0 |
13 | 11.4 | 76 | 21.4 |
14 | 11.7 | 69 | 21.3 |
15 | 12.0 | 75 | 19.1 |
16 | 12.9 | 74 | 22.2 |
17 | 12.9 | 85 | 33.8 |
18 | 13.3 | 86 | 27.4 |
19 | 13.7 | 71 | 25.7 |
20 | 13.8 | 64 | 24.9 |
21 | 14.0 | 78 | 34.5 |
22 | 14.2 | 80 | 31.7 |
23 | 14.5 | 74 | 36.3 |
24 | 16.0 | 72 | 38.3 |
25 | 16.3 | 77 | 42.6 |
26 | 17.3 | 81 | 55.4 |
27 | 17.5 | 82 | 55.7 |
28 | 17.9 | 80 | 58.3 |
29 | 18.0 | 80 | 51.5 |
30 | 18.0 | 80 | 51.0 |
31 | 20.6 | 87 | 77.0 |
The joint distribution of any two variables can be
visualized using a scatter plot.
Example: The data below record the average outdoor temperature (C) and gas consumption (1000s of cubic feet) to heat a home for a week. But note that between the 26th and 28th weeks cavity-wall insulation was added.
Week | Insulation | Temp | Gas |
---|---|---|---|
1 | Before | 7.4 | 4.2 |
2 | Before | 7.5 | 3.9 |
3 | Before | 4.2 | 5.8 |
4 | Before | 7.6 | 3.5 |
5 | Before | 2.5 | 6.0 |
6 | Before | 8.0 | 4.0 |
7 | Before | 0.4 | 6.4 |
8 | Before | 7.5 | 4.0 |
9 | Before | 6.9 | 3.7 |
10 | Before | 3.9 | 4.7 |
11 | Before | 6.0 | 4.4 |
12 | Before | 6.2 | 4.5 |
13 | Before | 9.1 | 3.1 |
14 | Before | 5.4 | 4.9 |
15 | Before | 3.2 | 5.8 |
16 | Before | 3.6 | 5.6 |
17 | Before | 6.0 | 4.9 |
18 | Before | 6.0 | 4.3 |
19 | Before | 10.2 | 2.6 |
20 | Before | 7.0 | 3.9 |
21 | Before | -0.7 | 6.9 |
22 | Before | 2.9 | 5.8 |
23 | Before | 4.3 | 5.2 |
24 | Before | 8.5 | 3.6 |
25 | Before | -0.8 | 7.2 |
26 | Before | 6.3 | 4.6 |
28 | After | 4.9 | 3.7 |
29 | After | 5.0 | 3.6 |
30 | After | 7.5 | 2.6 |
31 | After | 7.1 | 3.0 |
32 | After | 9.7 | 1.5 |
33 | After | 2.3 | 4.1 |
34 | After | 1.5 | 4.2 |
35 | After | 2.5 | 4.0 |
36 | After | 4.2 | 3.5 |
37 | After | 8.7 | 2.8 |
38 | After | 1.6 | 4.2 |
39 | After | -0.7 | 4.8 |
40 | After | 6.2 | 2.8 |
41 | After | 3.9 | 3.9 |
42 | After | 4.6 | 3.7 |
43 | After | 7.2 | 2.8 |
44 | After | 4.7 | 3.5 |
45 | After | 4.3 | 3.5 |
46 | After | 8.8 | 1.3 |
47 | After | 4.9 | 3.4 |
48 | After | 0.8 | 4.6 |
49 | After | 5.3 | 3.7 |
50 | After | 4.9 | 4.0 |
51 | After | 1.0 | 4.7 |
52 | After | 4.0 | 3.5 |
53 | After | 1.4 | 4.0 |
54 | After | 3.1 | 3.2 |
55 | After | 4.0 | 3.7 |
56 | After | 8.0 | 2.7 |
57 | After | 2.5 | 3.5 |
Again a scatter plot with some color coding to indicate whether the
observation was before or after insulation was added is useful.
The correlation coefficient measures the
direction and strength of the linear
association between two quantitative variables. The correlation
is symbolized by \(r\) when it is a
statistic (i.e., describing a sample of observations) and \(\rho\) when it is a parameter (i.e.,
describing the population distribution of the variables).
Example: Consider the correlations among nine test scores of seventh- and eigth-grade children.
Variable | |||||||||
---|---|---|---|---|---|---|---|---|---|
Variable | ViP | Cub | Loz | PaC | SeC | WoM | SpA | SpC | SpD |
Visual Perception (ViP) | 1.00 | 0.33 | 0.45 | 0.34 | 0.31 | 0.32 | 0.10 | 0.31 | 0.49 |
Cubes (Cub) | 0.33 | 1.00 | 0.42 | 0.23 | 0.16 | 0.19 | 0.07 | 0.17 | 0.25 |
Lozenges (Loz) | 0.45 | 0.42 | 1.00 | 0.33 | 0.29 | 0.35 | 0.07 | 0.24 | 0.37 |
Paragraph Comprehension (PaC) | 0.34 | 0.23 | 0.33 | 1.00 | 0.72 | 0.71 | 0.21 | 0.10 | 0.31 |
Sentence Completion (SeC) | 0.31 | 0.16 | 0.29 | 0.72 | 1.00 | 0.69 | 0.25 | 0.20 | 0.36 |
Word Meaning (WoM) | 0.32 | 0.19 | 0.35 | 0.71 | 0.69 | 1.00 | 0.18 | 0.12 | 0.27 |
Speeded Addition (SpA) | 0.10 | 0.07 | 0.07 | 0.21 | 0.25 | 0.18 | 1.00 | 0.59 | 0.42 |
Speeded Counting (SpC) | 0.31 | 0.17 | 0.24 | 0.10 | 0.20 | 0.12 | 0.59 | 1.00 | 0.53 |
Speeded Discrimination (SpD) | 0.49 | 0.25 | 0.37 | 0.31 | 0.36 | 0.27 | 0.42 | 0.53 | 1.00 |
The relative strengths of such correlations have been used to study intelligence.
Example: Correlation of BMI for monozygotic (i.e.,
identical) and dizygotic (i.e., fraternal) twins.
Falconer’s formula of \(2(r_{mz} - r_{dz}) =
2(0.68 - 0.37) = 0.62\) has been used as a (crude) estimate of
heritability.
Example: Suppose we had a matched-pairs design using
genetically-related individuals (e.g., cousins, siblings, or identical
twins) for a study like that that investigated the relationship between
schizophrenia and left hippocampus volume.
The standard error of \(\bar{x}_d =
\bar{x}_1-\bar{x}_2\) is \[
\sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{n} -
\rho\frac{2\sigma_1\sigma_2}{n}},
\] where \(\rho\) is the
correlation coefficient for the population distribution.
Study Question: How does \(\rho\) affect the standard error of \(\bar{x}_d = \bar{x}_1 - \bar{x}_2\) when using dependent samples?
Study Question: How is the correlation coefficient (\(r\)) involved in using one variable to predict another using linear regression?
Example: Consider the problem of trying to
predict tree volume using its girth.
Perhaps we could use an equation like \[
\hat{y} = a + bx,
\] where \(\hat{y}\) is the
predicted volume, and \(x\) is
girth. Here \(a\) and \(b\) are the parameters of the
equation known as the “intercept” and “slope” of the equation,
respectively. How should we choose \(a\) and \(b\)?
The correlation is related to \(a\) and \(b\). The method of least squares can be used to choose \(a\) and \(b\) such that the \(\hat{y}\)’s and the \(y\)’s are “close” in the sense that \[ \sum (y - \hat{y})^2 = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 \] is as small as possible. The values of \(a\) and \(b\) that achieve this are given by \[ b = r\frac{s_y}{s_x}, \ \ \ a = \bar{y} - b\bar{x}, \] where \(s_y\) and \(s_x\) are the standard deviations of \(y\) (volume) and \(x\) (girth), respectively, and \(\bar{y}\) and \(\bar{x}\) are their means.
The correlation can also be used to compute the standard error of prediction which can be viewed as the average squared difference between \(y\) and \(\hat{y}\). This is \[ s_e = s_y\sqrt{1 - r^2}. \] So larger values of \(r\) (in absolute value) lead to smaller prediction errors (on average).