
Simple and binary regression

2020-03-05

Among the most famous scientific discoveries are Newton's laws of motion. The laws allowed people to make predictions. For example, the acceleration of an object can be predicted given the applied force and the mass of the object. Making predictions remains a popular endeavour. This post explains the simplest way to predict an outcome for a new value, given a set of points.

To explain the concepts, data on apples and pears is generated. The underlying relations for the generated data are known, so they can be compared to the results from the regression.

  1. Generating data
  2. Simple linear regression
  3. Binary logistic regression
  4. References

Generating data

The goal is to predict whether a fruit is an apple or a pear. Say we have a sample of size $n$, and say the sample consists of two properties for each fruit, namely the height and the width.

The properties for the fruit at index $i$ are denoted by $h_i$ and $w_i$ respectively. Prediction will be done for the fruit type, which is denoted by $y_i$. The sample indices can be visualised as follows.

| I | H | W | Y |
| --- | --- | --- | --- |
| 1 | $h_1$ | $w_1$ | $y_1$ |
| 2 | $h_2$ | $w_2$ | $y_2$ |
| ... | ... | ... | ... |
| $n$ | $h_n$ | $w_n$ | $y_n$ |

Let half of the elements be apples and half of the elements be pears. So, $n$ is even. Let

$$ y_{i} = \begin{cases} \text{apple} & \text{if $i$ is odd}, \: \text{and} \\ \text{pear} & \text{if $i$ is even}. \end{cases} $$

Let the height and fruit type be correlated. To this end define

$$ h_{i} = \begin{cases} \text{one sample of size 1 drawn from } N(10, 1) & \text{if $y_i$ is apple}, \: \text{and} \\ \text{one sample of size 1 drawn from } N(12, 1) & \text{if $y_i$ is pear}. \end{cases} $$

Let the height and width also be correlated. Define the width to be 0.6 times the height. Specifically,

$$ w_i = 0.6 h_i. $$

In Julia, this can be defined as

using DataFrames
using Distributions
using Random
using Statistics

# Round to two digits for the tables below.
r_2(x) = round(x; digits=2)
Random.seed!(18)
n = 12
I = 1:n
# `true` denotes an apple (odd index) and `false` a pear (even index).
Y = [i % 2 != 0 for i in I]
H = r_2.([y ? rand(Normal(10, 1)) : rand(Normal(12, 1)) for y in Y])
W = r_2.([0.6h for h in H])

df = DataFrame(I = I, H = H, W = W, Y = Y)

| I | H | W | Y |
| --- | --- | --- | --- |
| 1 | 13.72 | 8.23 | true |
| 2 | 10.95 | 6.57 | false |
| 3 | 12.9 | 7.74 | true |
| 4 | 10.71 | 6.43 | false |
| 5 | 12.94 | 7.76 | true |
| 6 | 11.38 | 6.83 | false |
| 7 | 12.61 | 7.57 | true |
| 8 | 12.56 | 7.54 | false |
| 9 | 12.37 | 7.42 | true |
| 10 | 12.1 | 7.26 | false |
| 11 | 12.05 | 7.23 | true |
| 12 | 11.85 | 7.11 | false |

Simple linear regression

A simple linear regression fits a line through points in two dimensions. It should be able to infer the relation between $H$ and $W$.

using Gadfly

# These two are useful for plotting.
wmin = minimum(W) - 0.2
wmax = maximum(W) + 0.2

plot(df, x = :W, y = :H)

The algorithmic way to fit a line is via the method of least squares. Any straight line can be described by a linear equation of the form $y = p_1 x + p_0$, where the first parameter $p_0$ is the intercept with the $y$-axis and the second parameter $p_1$ is the slope. Adding an error $e_i$ and rewriting gives

$$ \begin{aligned} y_i & = p_0 + p_1 x_i + e_i \\ e_i & = y_i - p_0 - p_1 x_i. \end{aligned} $$

An algorithm could now be defined to naively minimize the sum of all the errors

$$ S'(p_0, p_1) = \sum_{i=1}^n e_i, $$

with respect to the choice of $p_0$ and $p_1$. This would not always result in a well-fitted line because errors might cancel each other out. For example, when $e_1 = 10$ and $e_2 = -10$, then $e_1 + e_2 = 0$. This is solved by squaring the errors. The method of least squares minimizes

$$ S_l(p_0, p_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - p_0 - p_1 x_i)^2 $$

with respect to the choice of $p_0$ and $p_1$ (Rice, 2006). The simplest estimator for the points is the mean. We can plot the mean as a horizontal line and draw a vertical segment for each error.

# Linear and generalized linear models (GLMs).
using GLM

m = mean(H) 
plot(df, x = :W, y = :H,
  Geom.point,
  yintercept = [m], Geom.hline(),
  layer(xend = :W, yend = [m], Geom.segment())
)

We can generalize the sum of squared errors calculation to

$$ S(U, V) = \sum_{i=1}^n (u_i - v_i)^2, $$

for arrays $U$ and $V$.

S(U, V) = sum((U .- V).^2)

Then, the sum of squared errors for this simplest estimator is

r_2(S(H, repeat([mean(H)], length(H))))
8.28
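
Incidentally, this baseline error is by definition equal to $(n - 1)$ times the sample variance of $H$, so the same value can be obtained with `var` from the Statistics standard library (a small cross-check, just for illustration):

# (n - 1) * var(H) equals the sum of squared deviations from the mean.
r_2((length(H) - 1) * var(H))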

This error cannot be compared to other errors, since it is not standardized. A standardized metric would be the Pearson correlation coefficient $r$. See the blog post on correlations for more information.
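
For completeness, this coefficient can be computed with `cor` from the Statistics standard library (a small illustration):

# Pearson correlation coefficient between the widths and the heights.
r_2(cor(W, H))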

m1 = lm(@formula(H ~ W), df)

# This is just a convenience function around `GLM.predict`.
predict_value(model, x) = 
  first(skipmissing(predict(model, DataFrame(W = [x]))))

plot(df, x = :W, y = :H, 
  Geom.point,
  layer(x -> predict_value(m1, x), wmin, wmax)
)

The intercept and slope for the fitted line are

intercept(linear_model) = coef(linear_model)[1]
slope(linear_model) = coef(linear_model)[2]

r_2(intercept(m1))  # -0.03
r_2(slope(m1))      # 1.67
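
As a quick sanity check, the same coefficients can be derived from the closed-form least squares solution for a single predictor (a minimal sketch; `cov` and `var` come from the Statistics standard library loaded earlier, and the names `ols_intercept` and `ols_slope` are only illustrative):

# Closed-form simple linear regression:
# slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x).
ols_slope = cov(W, H) / var(W)
ols_intercept = mean(H) - ols_slope * mean(W)
r_2.([ols_intercept, ols_slope])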

Binary logistic regression

Next, let's try to estimate the relation between $Y$ and $W$. The method of least squares is unable to calculate an error for "apple" and "pear". A work-around is to encode "apple" as 0 and "pear" as 1. A line can now be fitted.

# Encode the fruit type as a number: apple => 0, pear => 1.
digits = [i % 2 != 0 ? 0 : 1 for i in I]
df[:, :Y_digit] = digits
m2 = lm(@formula(Y_digit ~ W), df)

plot(df, x = :W, y = :Y_digit,
  xmin = [wmin], xmax = [wmax],
  Geom.point,
  layer(x -> predict_value(m2, x), wmin, wmax),
  layer(y = predict(m2), Geom.point)
)

As can be observed, the model does not take into account that $Y$ is a binary variable. The model even predicts values outside the expected range, that is, values outside the range 0 to 1. A better fit is the logistic function.
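
For reference, the logistic function squashes any real number into the interval $(0, 1)$, so the logistic regression fitted below models the probability that fruit $i$ is a pear (encoded as 1) as

$$ P(y_i = \text{pear} \mid w_i) = \frac{1}{1 + e^{-(p_0 + p_1 w_i)}}, $$

where $p_0$ and $p_1$ are again an intercept and a slope, now estimated by maximum likelihood instead of least squares.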

m3 = glm(@formula(Y_digit ~ W), df, Binomial(), LogitLink())

plot(df, x = :W, y = :Y_digit, 
  Geom.point,
  layer(x -> predict_value(m3, x), wmin, wmax),
  layer(y = predict(m3), Geom.point)
)

The correlation coefficient $r$ should not be used to compare the models, since the logistic model only predicts the class. In other words, the logistic model is a classifier. Classification accuracy is a better metric:

$$ \text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} . $$

accuracy(trues, pred) = count(trues .== pred) / length(pred)

The threshold is set to 0.5 to get a binary prediction from both models. More precisely: let $\text{pred}(x_i)$ denote the prediction for $y_i$, and let $y_i'$ denote the binary prediction for $y_i$. For each prediction,

$$ y_i' = \begin{cases} 1 & \text{if $0.5 < \text{pred}(x_i)$}, \: \text{and} \\ 0 & \text{if $\text{pred}(x_i) \leq 0.5$}. \end{cases} $$

binary_values(model) = [0.5 < x ? 1 : 0 for x in predict(model)]

The least squares error and accuracy for both models are as follows.

DataFrame(
  model = ["Linear regression", "Logistic regression"],
  S = r_2.([S(digits, predict(m2)), S(digits, predict(m3))]),
  accuracy = r_2.([accuracy(digits, binary_values(m2)), accuracy(digits, binary_values(m3))])
)

| model | S | accuracy |
| --- | --- | --- |
| Linear regression | 1.51 | 0.83 |
| Logistic regression | 1.42 | 0.83 |

As can be seen, the error is lower for the logistic regression. For the accuracy, however, both models score the same in this case. This is because there is only a very small region where the linear and the logistic model make different predictions. When this region becomes larger (for example, when doing regression in multiple dimensions) or when more points lie in it, the accuracy of the logistic regression will be higher than that of the linear regression.
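
This can be illustrated on the current sample by counting how many fruits receive different binary predictions from the two models (a small check using the helpers defined above):

# Number of fruits on which the linear and the logistic model disagree.
count(binary_values(m2) .!= binary_values(m3))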

References

Rice, J. A. (2006). Mathematical statistics and data analysis. Cengage Learning.