Correlations are ubiquitous. For example, news articles regularly report that a research paper found no correlation between X and Y. Correlation is also related to (in)dependence, which plays an important role in linear regression. This post explains the Pearson correlation coefficient. The explanation is mainly based on the book by Hogg et al. (2018).
In the context of a book on mathematical statistics, certain variable names make sense. However, in this post some names are changed to make the information more coherent. One convention that is adhered to is that single values are lowercase and multiple values are capitalized. Furthermore, since most empirical research only needs discrete statistics, the continuous versions of the formulas are omitted.
We start by defining some general notions. Let $(X, Y)$ be a pair of random variables where each sample is added exactly once, and the variables have a bivariate distribution. (A bivariate distribution is simply the combination of two distributions. For two normal distributions the three-dimensional frequency plot would look like a mountain.) Denote the means of $X$ and $Y$ respectively by $\mu_X$ and $\mu_Y$. In the situations below the expectation for some random variable $X$ equals the mean, that is, $E(X) = \mu_X$. (The expectation equals the mean when the probabilities for all values in a random variable are the same.)
To understand the correlation coefficient we must first understand covariance. Covariance is defined as

$$\operatorname{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)].$$
Let $A$, $B$ and $C$ be discrete random variables defined by respectively $x + 1$, $0.5x + 3$, and $5$ for $x$ in the range 1 to 7. Let $D$ be the reverse of $A$. The probabilities are chosen such that they are the same for all the values in these random variables.
```julia
using DataFrames

range = 1:7
A = [x + 1 for x in range]
B = [0.5x + 3 for x in range]
C = [5 for x in range]
D = reverse(A)

df = DataFrame(x = range, A = A, B = B, C = C, D = D)
```
We can plot the variables to obtain the following figure. Note that `stack` is needed to prepare the data for Gadfly; see Appendix 1 for its effect.
```julia
using Gadfly

sdf = stack(df, [:A, :B, :C, :D])
p = plot(sdf, x = :x, y = :value, color = :variable)
```
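To get an intuition for the covariance, consider a negative covariance. Rice (2006) states that the covariance will be negative if, when $X$ is larger than its mean, $Y$ tends to be smaller than its mean. For an example of a perfect negative linear relationship, look at $A$ and $D$. When $A$ is larger than its mean, $D$ is smaller than its mean and vice versa. Therefore, $\operatorname{cov}(A, D)$ should be negative. We can manually check this:

$$
\begin{aligned}
\operatorname{cov}(A, D) &= E[(A - \mu_A)(D - \mu_D)] \\
&= \frac{(2 - 5)(8 - 5) + (3 - 5)(7 - 5) + (4 - 5)(6 - 5) + (5 - 5)(5 - 5) + (6 - 5)(4 - 5) + (7 - 5)(3 - 5) + (8 - 5)(2 - 5)}{7} \\
&= \frac{-9 - 4 - 1 + 0 - 1 - 4 - 9}{7} = \frac{-28}{7} = -4.
\end{aligned}
$$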
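In this calculation we have ignored Bessel's correction. With Bessel's correction the result would have been $-28 / 6 \approx -4.67$. It can be observed that the negative result is caused by the fact that for each multiplication in $(A - \mu_A)(D - \mu_D)$ either $A - \mu_A$ is negative or $D - \mu_D$ is negative (or both are zero), so no product is positive and $\operatorname{cov}(A, D)$ is negative. The results for the other covariances when comparing with $A$ are

$$\operatorname{cov}(A, A) = 4, \quad \operatorname{cov}(A, B) = 2, \quad \operatorname{cov}(A, C) = 0,$$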
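as calculated in Appendix 2. The numbers in Example 1 are all integers. In real-world situations that is often not the case, which leads to rounding errors. To minimise the rounding errors the covariance can be rewritten. The rewrite uses the linearity of expectation, that is, $E(aX + bY) = aE(X) + bE(Y)$ for constants $a$ and $b$:

$$
\begin{aligned}
\operatorname{cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
&= E(XY - \mu_Y X - \mu_X Y + \mu_X \mu_Y) \\
&= E(XY) - \mu_Y E(X) - \mu_X E(Y) + \mu_X \mu_Y \\
&= E(XY) - \mu_X \mu_Y.
\end{aligned}
$$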
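To appreciate the efficacy of this rewrite we redo the calculation for $\operatorname{cov}(A, D)$, see Example 2.

$$
\begin{aligned}
\operatorname{cov}(A, D) &= E(AD) - \mu_A \mu_D \\
&= \frac{2 \cdot 8 + 3 \cdot 7 + 4 \cdot 6 + 5 \cdot 5 + 6 \cdot 4 + 7 \cdot 3 + 8 \cdot 2}{7} - 5 \cdot 5 \\
&= \frac{147}{7} - 25 = -4,
\end{aligned}
$$

as was also obtained from the earlier calculation.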
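Both forms of the covariance can also be checked numerically. The following is a minimal sketch, not code from the cited books; the names `cov_definition` and `cov_rewritten` are made up for this illustration, and the vectors repeat the definitions of $A$ and $D$ so that the block runs on its own.

```julia
using Statistics

# Same values as A and D in the running example.
A = [x + 1 for x in 1:7]
D = reverse(A)

# Definition: E[(X - μ_X)(Y - μ_Y)], with equal probabilities for all values.
cov_definition(X, Y) = mean((X .- mean(X)) .* (Y .- mean(Y)))

# Rewritten form: E(XY) - μ_X μ_Y.
cov_rewritten(X, Y) = mean(X .* Y) - mean(X) * mean(Y)

@show cov_definition(A, D)
@show cov_rewritten(A, D)

# Statistics.cov applies Bessel's correction by default (division by n - 1);
# `corrected = false` divides by n, as in the manual calculation.
@show cov(A, D; corrected = false)
@show cov(A, D)
```

The first three calls should agree with the manual result of $-4$, while the last call divides by 6 instead of 7.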
The next step in being able to explain the correlation coefficient is defining the standard deviation, which is defined in terms of the variance. The variance is a "measure of the spread around a center" (Rice (2006)). The standard deviation describes how spread out the values of the random variable are, on average, around its expectation.
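Formally, the variance of $X$ is

$$\sigma^2 = \operatorname{var}(X) = E[(X - \mu_X)^2]$$

and the standard deviation of $X$ is

$$\sigma = \sqrt{\operatorname{var}(X)} = \sqrt{E[(X - \mu_X)^2]},$$

where $\sigma^2$ and $\sigma$ are the common notations for these concepts.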
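As a small sketch of these two definitions, applied to $A$ from the example: the helper names `variance` and `stddev` are our own, and the built-in `var` and `std` from `Statistics` are called with `corrected = false` so that they divide by $n$, like the formulas above.

```julia
using Statistics

A = [x + 1 for x in 1:7]  # 2, 3, ..., 8 with mean 5

# Variance: the mean squared deviation from the mean (no Bessel's correction).
variance(X) = mean((X .- mean(X)) .^ 2)

# Standard deviation: the square root of the variance.
stddev(X) = sqrt(variance(X))

@show variance(A)  # (9 + 4 + 1 + 0 + 1 + 4 + 9) / 7
@show stddev(A)

# The built-in versions, without Bessel's correction, for comparison.
@show var(A; corrected = false)
@show std(A; corrected = false)
```

The variance of $A$ is $28 / 7 = 4$, which equals $\operatorname{cov}(A, A)$, and the standard deviation is 2.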
The covariance can be used to get a sense of how much two variables are associated. However, the size of the result depends not only on the strength of the association, but also on the scale of the data. For example, if the numbers in a variable are spread over a huge range, then the covariance could appear large while in fact the correlation is negligible. The covariance is based on the dispersion of the values of two variables around their expectations. So, to normalize the covariance we can divide it by the product of the two standard deviations.
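The Pearson correlation coefficient between $X$ and $Y$ is defined as

$$\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y},$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$.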
Note that the units cancel out, hence the correlation is dimensionless. For the correlation coefficient it holds that $-1 \le \rho \le 1$, as can be shown by using the Cauchy-Schwarz inequality.
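To make the definition concrete, here is a minimal sketch that computes the coefficient for the example variables. The function name `pearson` is hypothetical and the vectors are redefined so that the block is self-contained; `cor` from `Statistics` is shown for comparison.

```julia
using Statistics

x = 1:7
A = [v + 1 for v in x]
B = [0.5v + 3 for v in x]
C = [5 for v in x]
D = reverse(A)

# ρ = cov(X, Y) / (σ_X σ_Y), using uncorrected (population) moments throughout.
function pearson(X, Y)
    covariance = mean((X .- mean(X)) .* (Y .- mean(Y)))
    covariance / (std(X; corrected = false) * std(Y; corrected = false))
end

@show pearson(A, B)  # perfect positive linear relationship
@show pearson(A, D)  # perfect negative linear relationship
@show pearson(A, C)  # C is constant, so σ_C = 0 and the ratio is 0 / 0
@show cor(A, D)      # built-in Pearson correlation for comparison
```

For $A$ and $B$ this gives $2 / (2 \cdot 1) = 1$ and for $A$ and $D$ it gives $-4 / (2 \cdot 2) = -1$. For the constant $C$ the coefficient is undefined, which shows up as `NaN` in floating point. Bessel's correction does not matter here, because the factor $n - 1$ cancels between the numerator and the denominator.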
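To show that $\rho = 0$ when $X$ and $Y$ are independent, reason as follows. When $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$. We know that $\operatorname{cov}(X, Y) = E(XY) - \mu_X \mu_Y$. Since $E(X) = \mu_X$ and $E(Y) = \mu_Y$, $\operatorname{cov}(X, Y) = 0$, and by that $\rho = 0$.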
For a set of sample data the correlation coefficient is usually denoted by $r$ (Gupta & Guttman (2014)). The association is considered weak, moderate or strong when respectively $|r|$ is lower than 0.3, $|r|$ is in between 0.3 and 0.7, or $|r|$ is higher than 0.7.
The coefficient reduces two sets of values to a number representing their relatedness. As with any reduction you will lose information. In this case the number does not say anything about how linear the relationship is. Instead the correlation coefficient assumes linearity. It can be observed from the calculation in Example 1 that the reported number is meaningless if the variables are not reasonably linear.
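The loss of information can be illustrated with a small sketch. The data below is made up for this purpose and is not part of the running example: `Y` is fully determined by `X` through a quadratic relationship, yet the coefficient does not pick this up.

```julia
using Statistics

# Hypothetical data, not part of the running example: Y = X².
X = collect(-3:3)
Y = X .^ 2

# The products (X - μ_X)(Y - μ_Y) are -15, 0, 3, 0, -3, 0, 15 and cancel out.
@show cov(X, Y; corrected = false)
@show cor(X, Y)
```

Both values are zero here, even though knowing `X` tells us `Y` exactly. A coefficient near zero therefore only says that there is no linear association.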
If the correlation coefficient is -1 or 1 we know that the relationship is perfectly linear. In that case values from $X$ can be used to determine values in $Y$ and vice versa.
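As a sketch of what this means in practice, the values of $B$ can be reconstructed from $A$, since their correlation is 1. The approach below (slope as covariance divided by variance, intercept from the means) is the standard least-squares line, chosen here for illustration; the names `slope`, `intercept` and `B_from_A` are our own.

```julia
using Statistics

x = 1:7
A = [v + 1 for v in x]
B = [0.5v + 3 for v in x]

# Least-squares line through the points (A, B): B ≈ intercept + slope * A.
slope = cov(A, B) / var(A)
intercept = mean(B) - slope * mean(A)

# Because the correlation is exactly 1, the line reproduces B without error.
B_from_A = intercept .+ slope .* A

@show slope
@show intercept
@show B_from_A ≈ B
```

Here the slope is $0.5$ and the intercept is $2.5$, so $B = 0.5 A + 2.5$ holds for every value.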
Finally, it should be noted that correlation does not imply causation, or more clearly: "Causation causes correlation, but not necessarily the converse" (Gupta & Guttman (2014)).
Before stacking, the DataFrame looks like
```
df = 7×5 DataFrame
 Row │ x      A      B        C      D
     │ Int64  Int64  Float64  Int64  Int64
─────┼─────────────────────────────────────
   1 │     1      2      3.5      5      8
   2 │     2      3      4.0      5      7
   3 │     3      4      4.5      5      6
   4 │     4      5      5.0      5      5
   5 │     5      6      5.5      5      4
   6 │     6      7      6.0      5      3
   7 │     7      8      6.5      5      2
```
and after stacking, it has changed to
```
sdf = 28×3 DataFrame
 Row │ x      variable  value
     │ Int64  String    Float64
─────┼──────────────────────────
   1 │     1  A             2.0
   2 │     2  A             3.0
   3 │     3  A             4.0
   4 │     4  A             5.0
   5 │     5  A             6.0
   6 │     6  A             7.0
   7 │     7  A             8.0
   8 │     1  B             3.5
   9 │     2  B             4.0
  10 │     3  B             4.5
  11 │     4  B             5.0
  12 │     5  B             5.5
  13 │     6  B             6.0
  14 │     7  B             6.5
  15 │     1  C             5.0
  16 │     2  C             5.0
  17 │     3  C             5.0
  18 │     4  C             5.0
  19 │     5  C             5.0
  20 │     6  C             5.0
  21 │     7  C             5.0
  22 │     1  D             8.0
  23 │     2  D             7.0
  24 │     3  D             6.0
  25 │     4  D             5.0
  26 │     5  D             4.0
  27 │     6  D             3.0
  28 │     7  D             2.0
```
```julia
using Statistics

# Note that this function does not apply Bessel's correction.
function mycov(X, Y)
    min_mean_x(x)::Float64 = x - mean(X)
    min_mean_y(y)::Float64 = y - mean(Y)
    mean(min_mean_x.(X) .* min_mean_y.(Y))
end

@show mycov(A, A)
@show mycov(A, B)
@show mycov(A, C)
@show mycov(A, D)
```
```
mycov(A, A) = 4.0
mycov(A, B) = 2.0
mycov(A, C) = 0.0
mycov(A, D) = -4.0
```