
Increasing model accuracy by using foreknowledge

2021-06-16

Typically, when making predictions via a linear model, we fit the model on our data and make predictions from the fitted model. However, this doesn't take much foreknowledge into account. For example, when predicting a person's height given only their weight and gender, we already have an intuition about the size and direction of the effects. Bayesian analysis should be able to incorporate this prior information.

In this blog post, I aim to figure out whether foreknowledge can, in theory, increase model accuracy. To do this, I generate data, fit both a linear model and a Bayesian regression, and then compare the accuracy of the estimated coefficients from the two models.

  1. Data generation
  2. Linear regression
  3. Bayesian regression
  4. Conclusion

Data generation

Let's say that the data generation formula for the grade $g_i$ of some individual $i$, with age $a_i$ and recent grade $r_i$, is

$$ g_i = a_e \cdot a_i + r_e \cdot r_i + \epsilon_i = 1.1 \cdot a_i + 1.05 \cdot r_i + \epsilon_i, $$

where $a_e$ is the coefficient for the age, $r_e$ is the coefficient for the recent grade, and $\epsilon_i$ is some random noise for individual $i$.

We generate data for $n$ individuals via

using DataFrames
using Distributions
using Random

# True coefficients for age and recent grade.
aₑ = 1.1
rₑ = 1.05

function generate_data(i::Int)
  Random.seed!(i)

  n = 120
  I = 1:n
  P = [i % 2 == 0 for i in I]
  r_2(x) = round(x; digits=2)

  # Individuals who pass tend to be older and to have higher recent grades.
  A = r_2.([p ? rand(Normal(aₑ * 18, 1)) : rand(Normal(18, 1)) for p in P])
  R = r_2.([p ? rand(Normal(rₑ * 6, 3)) : rand(Normal(6, 3)) for p in P])
  E = r_2.(rand(Normal(0, 1), n))
  G = r_2.(aₑ .* A .+ rₑ .* R .+ E)

  DataFrame(age=A, recent=R, error=E, grade=G, pass=P)
end

df = generate_data(1)
first(df, 8)

age    recent  error  grade  pass
18.3    2.26  -0.74   21.76  false
20.18   2.31   1.14   25.76  true
17.4    1.92  -0.9    20.26  false
19.79   7.93   0.02   30.12  true
17.16  -1.9   -1.55   15.33  false
20.11   6.75   0.19   29.4   true
20.3    1.02   0.92   24.32  false
17.53  -1.95   0.16   17.4   true

We can see the positive correlations of age with grade and of recent with grade, as well as the differences in densities when splitting the individuals on pass or fail.
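The plots are not reproduced here, but these patterns can also be checked numerically; a minimal sketch, reusing df from above and the mean, std, and cor functions from Statistics:

using Statistics

cor(df.age, df.grade)     # positive correlation between age and grade
cor(df.recent, df.grade)  # positive correlation between recent and grade

# Mean and standard deviation of the grade for the fail and pass groups.
combine(groupby(df, :pass), :grade => mean, :grade => std)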

Linear regression

First, we fit a linear model and verify that the coefficients are estimated reasonably well. Here, the only prior information that we give the model is the structure of the data, that is, a formula.

using GLM

linear_model = lm(@formula(grade ~ age + recent), df)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

grade ~ 1 + age + recent

Coefficients:
───────────────────────────────────────────────────────────────────────
               Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  0.85718   1.44699     0.59    0.5547  -2.00851     3.72287
age          1.0436    0.0750508  13.91    <1e-25   0.894965    1.19223
recent       1.07894   0.0337224  31.99    <1e-58   1.01216     1.14573
───────────────────────────────────────────────────────────────────────
Notice how these estimated coefficients are close to the coefficients that we set for age and recent, namely $a_e = 1.1 \approx 1.0436$ and $r_e = 1.05 \approx 1.07894$, as expected.
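One way to quantify this, and to compute relative errors like the ones reported in the conclusion below, is to pull the fitted coefficients out of linear_model with GLM's coef function; a small sketch, assuming linear_model and the true coefficients aₑ and rₑ are still in scope:

# coef returns the fitted coefficients as [intercept, age, recent].
ests = coef(linear_model)

# Relative error of each estimate, in percent.
age_error = abs(ests[2] - aₑ) / aₑ * 100
recent_error = abs(ests[3] - rₑ) / rₑ * 100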

Bayesian regression

For the Bayesian regression we fit a model via Turing.jl. Now, we give the model information about the structure of the data as well as priors for the sizes of the coefficients. For demonstration purposes, I've set the priors to the correct values. This is reasonable here because the question is whether finding a good prior can have a positive effect on model accuracy.

using LinearAlgebra: I
using Statistics
using Turing


@model function bayesian_model(ages, recents, grades, n)
    # Priors; βₐ and βᵣ are centered on the true coefficients aₑ and rₑ.
    intercept ~ Normal(0, 5)
    βₐ ~ Normal(aₑ, 1)
    βᵣ ~ Normal(rₑ, 3)
    σ ~ truncated(Cauchy(0, 2), 0, Inf)

    # Likelihood: grades are normally distributed around the linear predictor.
    μ = intercept .+ βₐ * ages .+ βᵣ * recents
    grades ~ MvNormal(μ, σ^2 * I)
end

n = nrow(df)
bm = bayesian_model(df.age, df.recent, df.grade, n)
chns = Turing.sample(bm, NUTS(), MCMCThreads(), 10_000, 3)
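
One way to summarize the chains is to take the posterior means of the coefficient samples; a minimal sketch, assuming sampling finished without issues:

# Posterior means of the coefficient samples across all chains.
βₐ_mean = mean(chns[:βₐ])
βᵣ_mean = mean(chns[:βᵣ])

# A fuller overview, including standard deviations and diagnostics.
summarystats(chns)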

Let's compare the coefficient estimates $\beta_a$ and $\beta_r$ from both models with the true values:

coefficient  true value  linear estimate  linear error  bayesian estimate  bayesian error
aₑ           1.1         1.044            5.1 %         1.047              4.8 %
rₑ           1.05        1.079            2.8 %         1.08               2.8 %

Conclusion

After giving the true coefficients to the Bayesian model in the form of priors, it does score better than the linear model. However, the differences aren't very big. This could be due to the particular random noise in this sample E or due to the relatively big sample size. The more samples, the more likely it is that the data will overrule the prior. In any way, there are real-world situations where gathering extra data is more expensive than gathering priors via reading papers. In those cases, the increased accuracy introduced by using priors could have serious benefits.