
Random forest classification in Julia

2021-01-21

Below is example code for fitting and evaluating a logistic regression and a random forest classifier in Julia. If the code in this post doesn't work for you, check that you're using the right package versions. I've included the logistic regression as a baseline for the random forest. Both models are evaluated on a mock variable $U$ generated from two distributions, namely

$$
\begin{aligned}
d_1 &= \text{Normal}(\mu_1, \sigma) \quad \text{and} \\
d_2 &= \text{Normal}(\mu_2, \sigma),
\end{aligned}
$$

where $\mu_1 = 10$, $\mu_2 = 12$, and $\sigma = 2$. The random variable $V$ is just noise meant to fool the classifier.

This data isn't meant to show that random forests are good classifiers. One way to do that would be to have more variables than observations (Biau & Scornet, 2016).
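For illustration only, here is a minimal sketch of such a high-dimensional setting; the names n_obs, n_vars, and wide are my own, and this frame is not used in the rest of the post:

using CategoricalArrays
using DataFrames
using Distributions
using Random

Random.seed!(123)
n_obs = 50    # fewer observations ...
n_vars = 200  # ... than variables

# One standard-normal column per variable; only x1 carries any signal.
wide = DataFrame([Symbol("x", i) => rand(Normal(0, 1), n_obs) for i in 1:n_vars])
wide.class = categorical([x > 0 ? "A" : "B" for x in wide.x1])

size(wide)  # (50, 201)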

  1. Data generation
  2. Train and test split
  3. Model fitting
  4. Accuracy
  5. K-fold cross-validation
  6. References

Data generation

import MLDataUtils

using CategoricalArrays
using DataFrames
using Distributions
using Gadfly
using MLJ
using Random

n = 80
μ1 = 10
μ2 = 12
σ = 2

d1 = Normal(μ1, σ)
d2 = Normal(μ2, σ)

Random.seed!(123)
classes = categorical(rand(["A", "B"], n))

df = DataFrame(
    class = classes,
    U = [class == "A" ? rand(d1) : rand(d2) for class in classes],
    V = rand(Normal(100, 10), n)
)

first(df, 10)
10×3 DataFrame
 Row │ class  U         V
     │ Cat…   Float64   Float64
─────┼───────────────────────────
   1 │ B      11.5802   111.588
   2 │ A      10.0062    99.9623
   3 │ A      12.719     93.8876
   4 │ A      10.6082   110.054
   5 │ B       8.8399   108.481
   6 │ B      13.2351    99.2827
   7 │ A       9.97171   90.8848
   8 │ B       8.75458  113.906
   9 │ A       8.96639   89.2201
  10 │ B      10.3287   113.208

plot(df, x = :U, y = :V, color = :class)

Train and test split

Training and evaluating (testing) on the same data is not particularly useful because we want to know how well our model generalizes. For more information, see topics such as overfitting. Instead, we split the data into a train set and a test set.

using StableRNGs

rng = StableRNG(123)
train, test = MLJ.partition(eachindex(classes), 0.7; shuffle=true, rng)
length(train) = 56
length(test) = 24
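As an aside: if the classes had been unbalanced, the split could preserve the class proportions. A minimal sketch, assuming the stratify keyword of MLJ.partition; strat_train and strat_test are illustration names:

# Keep the A/B ratio roughly the same in both subsets.
strat_train, strat_test = MLJ.partition(eachindex(classes), 0.7;
    shuffle=true, stratify=classes, rng)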

Model fitting

LinearBinary = @load LinearBinaryClassifier pkg=GLM verbosity=0
logistic_model = LinearBinary();

DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree = DecisionTree()
forest_model = EnsembleModel(atom=tree, n=10);

logistic = machine(logistic_model, (U = df.U, V = df.V), df.class)
fit!(logistic; rows=train)
fitted_params(logistic).coef
2-element Vector{Float64}:
 0.7170450596947772
 0.040936567194873125

The second coefficient of the logistic model is close to zero. This is exactly what the model should do, since $V$ is random noise.
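These coefficients are on the log-odds scale, so exponentiating makes them easier to read. A quick sketch; which class counts as the positive level depends on the ordering of the categorical levels, so the direction is an assumption here:

# exp turns log-odds into odds ratios: roughly 2.05 for U and 1.04 for V,
# so one extra unit of U about doubles the odds while V barely matters.
exp.(fitted_params(logistic).coef)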

forest = machine(forest_model, (U = df.U, V = df.V), df.class)
fit!(forest; rows=train);

Accuracy

Now that we have fitted the two models, we can compare the accuracies and plot the receiver operating characteristic.

logistic_predictions = predict_mode(logistic, rows=test)
forest_predictions = predict_mode(forest, rows=test)
truths = classes[test]

r3(x) = round(x; sigdigits=3)

accuracy(logistic_predictions, truths) |> r3
0.625
accuracy(forest_predictions, truths) |> r3
0.5
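Accuracy compresses performance into a single number. For a per-class breakdown, the mode predictions can also be cross-tabulated; a minimal sketch using confusion_matrix from MLJ, which may warn that the classes are unordered:

# Cross-tabulate predicted versus observed classes on the test set.
confusion_matrix(forest_predictions, truths)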
using MLJBase

logistic_predictions = MLJ.predict(logistic, rows=test)
logistic_fprs, logistic_tprs, _ = roc_curve(logistic_predictions, truths)
logistic_auc = auc(logistic_predictions, truths) |> r3
0.686
forest_predictions = MLJ.predict(forest, rows=test)
forest_fprs, forest_tprs, _ = roc_curve(forest_predictions, truths)
forest_auc = auc(forest_predictions, truths) |> r3
0.571
plot(x = logistic_fprs, y = logistic_tprs, color = ["logistic"],
    Guide.xlabel("False positive rate"),
    Guide.ylabel("True positive rate estimate"),
    Geom.smooth(method = :loess, smoothing = 0.99),
    layer(
        x = forest_fprs, y = forest_tprs, color = ["forest"],
        Geom.smooth(method = :loess, smoothing = 0.99),
    )
)

K-fold cross-validation

By doing a train and test split, we effectively threw part of the data away. For small datasets, like the one in this example, that is not very efficient. Therefore, we also run a k-fold cross-validation.

rng = MersenneTwister(123)
indexes = shuffle(rng, eachindex(classes))
folds = MLDataUtils.kfolds(indexes, k = 8)

function fitted_accuracy(model, train, test)
    # Fit the model on the train fold and compute accuracy on the test fold.
    mach = machine(model, (U = df.U, V = df.V), df.class)
    fit!(mach; rows=train)
    predictions = predict_mode(mach, rows=test)
    return accuracy(predictions, classes[test]) |> r3
end

accuracies = [fitted_accuracy(logistic_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
([0.7, 0.7, 0.4, 0.6, 0.7, 0.8, 0.7, 0.6], 0.65)
accuracies = [fitted_accuracy(forest_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
([0.5, 0.6, 0.3, 0.5, 0.5, 0.7, 0.5, 0.4], 0.5)
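Instead of looping over the folds by hand, MLJ can also drive the resampling itself via evaluate. A sketch, assuming the CV resampling strategy and measure names of the MLJ version used here:

# Let MLJ fit and score the forest on each of the 8 folds.
evaluate(forest_model, (U = df.U, V = df.V), df.class;
    resampling=CV(nfolds=8, shuffle=true, rng=StableRNG(123)),
    measure=accuracy, verbosity=0)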

References

Biau, G., & Scornet, E. (2016). A random forest guided tour. TEST, 25, 197–227. https://doi.org/10.1007/s11749-016-0481-7