HUIJZER.XYZ

# Random forest classification in Julia

2021-01-21

Below is example code for fitting and evaluating a linear regression and random forest classifier in Julia. If code in this post doesn't work for you, then check that you're using the right versions. I've used both models to have a baseline for the random forest. The model is evaluated on a mock variable $U$ generated from two distributions, namely

\begin{aligned} d_1 &= \text{Normal}(\mu_1, \sigma) \: \: \text{and} \\ d_2 &= \text{Normal}(\mu_2, \sigma), \end{aligned}

where $\mu_1 = 10$, $\mu_2 = 12$ and $\sigma = 2$. The random variable $V$ is just noise meant to fool the classifier.

This data isn't meant to show that random forests are good classifiers. One way to do that would be to have more variables than observations .

## Data generation

import MLDataUtils

using AlgebraOfGraphics
using CairoMakie
using CategoricalArrays
using DataFrames
using Distributions
using MLJ
using Random

n = 80
μ1 = 10
μ2 = 12
σ = 2

d1 = Normal(μ1, σ)
d2 = Normal(μ2, σ)

Random.seed!(123)
classes = categorical(rand(["A", "B"], n))

df = DataFrame(
class = categorical(classes),
U = [class == "A" ? rand(d1) : rand(d2) for class in classes],
V = rand(Normal(100, 10), n)
)

first(df, 10)
10×3 DataFrame
Row │ class  U         V
│ Cat…   Float64   Float64
─────┼───────────────────────────
1 │ B      11.5802   111.588
2 │ A      10.0062    99.9623
3 │ A      12.719     93.8876
4 │ A      10.6082   110.054
5 │ B       8.8399   108.481
6 │ B      13.2351    99.2827
7 │ A       9.97171   90.8848
8 │ B       8.75458  113.906
9 │ A       8.96639   89.2201
10 │ B      10.3287   113.208
uv = data(df) * mapping(:U, :V, color=:class)

draw(uv) ## Train and test split

Training and evaluating (testing) on the same data is not particulary useful because we want to know how well our model generalizes. For more information, see topics such as overfitting. Instead, we split the data up in a train and test set.

using StableRNGs

rng = StableRNG(123)
train, test = MLJ.partition(eachindex(classes), 0.7; shuffle=true, rng)
length(train) = 56
length(test) = 24


## Model fitting

LinearBinary = @load LinearBinaryClassifier pkg=GLM verbosity=0
logistic_model = LinearBinary();

DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree = DecisionTree()
forest_model = EnsembleModel(atom=tree, n=10);

logistic = machine(logistic_model, (U = df.U, V = df.V), df.class)
fit!(logistic; rows=train)
fitted_params(logistic).coef
2-element Vector{Float64}:
0.7170450596947773
0.04093656719487316

The second coefficient in the linear model is close to zero. This is exactly what the model should do since $V$ is random noise.

forest = machine(forest_model, (U = df.U, V = df.V), df.class)
fit!(forest; rows=train);



## Accuracy

Now that we have fitted the two models, we can compare the accuracies and plot the receiver operating characteristic.

logistic_predictions = predict_mode(logistic, rows=test)
forest_predictions = predict_mode(forest, rows=test)
truths = classes[test]

r3(x) = round(x; sigdigits=3)

accuracy(logistic_predictions, classes[test]) |> r3
0.625
accuracy(forest_predictions, classes[test]) |> r3
0.5
using MLJBase

logistic_predictions = MLJ.predict(logistic, rows=test)
logistic_fprs, logistic_tprs, _ = roc_curve(logistic_predictions, truths)
logistic_aoc = auc(logistic_predictions, truths) |> r3
0.686
forest_predictions = MLJ.predict(forest, rows=test)
forest_fprs, forest_tprs, _ = roc_curve(forest_predictions, truths)
forest_aoc = auc(forest_predictions, truths) |> r3
0.571
logistic_df = DataFrame(
x = logistic_fprs,
y = logistic_tprs,
method = "logistic"
)

forest_df = DataFrame(
x = forest_fprs,
y = forest_tprs,
method = "forest"
)

roc_df = vcat(logistic_df, forest_df)

xy = data(roc_df)
xy *= smooth() + visual(Scatter)
xy *= mapping(
:x => "False positive rate",
:y => "True positive rate",
color=:method)

draw(xy) ## K-fold cross-validation

By doing a train and test split, we basically threw a part of the data away. For small datasets, like the dataset in this example, that is not very efficient. Therefore, we also do a k-fold cross-validation.

Random.seed!(123)
rng = MersenneTwister(123)
indexes = shuffle(rng, eachindex(classes))
folds = MLDataUtils.kfolds(indexes, k = 8)

function fitted_accuracy(model, train, test)
forest = machine(model, (U = df.U, V = df.V), df.class)
fit!(forest; rows=train)
predictions = predict_mode(forest, rows=test)
return accuracy(predictions, classes[test]) |> r3
end

accuracies = [fitted_accuracy(logistic_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
([0.7, 0.7, 0.4, 0.6, 0.7, 0.8, 0.7, 0.6], 0.65)
accuracies = [fitted_accuracy(forest_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
([0.5, 0.6, 0.3, 0.5, 0.5, 0.7, 0.5, 0.4], 0.5)

Biau, G., Scornet, E. (2016). A Random Forest Guided Tour. TEST 25, 197–227 (2016). https://doi.org/10.1007/s11749-016-0481-7