Below is example code for fitting and evaluating a logistic regression and a random forest classifier in Julia. If the code in this post doesn't work for you, check that you're using the right package versions. The logistic regression serves as a baseline for the random forest. Both models are evaluated on a mock variable $U$ generated from two distributions, namely

$\begin{aligned} d_1 &= \text{Normal}(\mu_1, \sigma) \quad \text{and} \\ d_2 &= \text{Normal}(\mu_2, \sigma), \end{aligned}$

where $\mu_1 = 10$, $\mu_2 = 12$ and $\sigma = 2$. The random variable $V$ is just noise meant to fool the classifier.

This data isn't meant to show that random forests are good classifiers. One way to do that would be to have more variables than observations (Biau & Scornet, 2016), as sketched below.
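To give a feel for that setting, here is a minimal sketch of such a wide dataset; the names `wide`, `n_obs`, and `n_vars` are made up for illustration and are not used elsewhere in this post.

```
using CategoricalArrays, DataFrames, Distributions

# Hypothetical wide dataset: 100 variables for only 20 observations.
# Purely illustrative; none of these columns are used below.
n_obs = 20
n_vars = 100
wide = DataFrame(rand(Normal(0, 1), n_obs, n_vars), :auto)
wide.class = categorical(rand(["A", "B"], n_obs))
size(wide)
```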

```
import MLDataUtils

using AlgebraOfGraphics
using CairoMakie
using CategoricalArrays
using DataFrames
using Distributions
using MLJ
using Random

n = 80
μ1 = 10
μ2 = 12
σ = 2
d1 = Normal(μ1, σ)
d2 = Normal(μ2, σ)

Random.seed!(123)
# Assign each observation to class A or B, then draw U from the
# corresponding distribution; V is pure noise.
classes = categorical(rand(["A", "B"], n))
df = DataFrame(
    class = classes,
    U = [class == "A" ? rand(d1) : rand(d2) for class in classes],
    V = rand(Normal(100, 10), n)
)
first(df, 10)
```

```
10×3 DataFrame
 Row │ class  U         V
     │ Cat…   Float64   Float64
─────┼───────────────────────────
   1 │ B      11.5802   111.588
   2 │ A      10.0062    99.9623
   3 │ A      12.719     93.8876
   4 │ A      10.6082   110.054
   5 │ B       8.8399   108.481
   6 │ B      13.2351    99.2827
   7 │ A       9.97171   90.8848
   8 │ B       8.75458  113.906
   9 │ A       8.96639   89.2201
  10 │ B      10.3287   113.208
```

```
# Scatter plot of U and V, colored by class.
uv = data(df) * mapping(:U, :V, color=:class)
draw(uv)
```

Training and evaluating (testing) on the same data is not particularly useful because we want to know how well our model generalizes; see, for example, the literature on overfitting. Instead, we split the data into a *train* and a *test* set.

```
using StableRNGs

rng = StableRNG(123)
# Split the indices 70/30 into train and test rows, shuffling first.
train, test = MLJ.partition(eachindex(classes), 0.7; shuffle=true, rng)
@show length(train)
@show length(test)
```

```
length(train) = 56
length(test) = 24
```

```
# Load a logistic regression (GLM) as the baseline model.
LinearBinary = @load LinearBinaryClassifier pkg=GLM verbosity=0
logistic_model = LinearBinary();

# Load a decision tree and bag 10 of them into an ensemble (the random forest).
DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree = DecisionTree()
forest_model = EnsembleModel(atom=tree, n=10);

logistic = machine(logistic_model, (U = df.U, V = df.V), df.class)
fit!(logistic; rows=train)
fitted_params(logistic).coef
```

```
2-element Vector{Float64}:
 0.7170450596947772
 0.040936567194873125
```

The second coefficient in the logistic model, which belongs to $V$, is close to zero. This is exactly what the model should do, since $V$ is random noise.
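As an aside, since `LinearBinaryClassifier` fits a logistic regression, these coefficients live on the log-odds scale. A minimal sketch to read them as odds ratios instead, using only the `coef` field shown above:

```
# Exponentiate the log-odds coefficients to obtain odds ratios;
# a ratio close to 1 means the variable barely moves the predicted odds.
odds_ratios = exp.(fitted_params(logistic).coef)
round.(odds_ratios; sigdigits=3)
```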

```
# Fit the random forest on the same training rows.
forest = machine(forest_model, (U = df.U, V = df.V), df.class)
fit!(forest; rows=train);
```


Now that we have fitted the two models, we can compare the accuracies and plot the receiver operating characteristic.

```
logistic_predictions = predict_mode(logistic, rows=test)
forest_predictions = predict_mode(forest, rows=test)
truths = classes[test]

# Helper to round to 3 significant digits.
r3(x) = round(x; sigdigits=3)
accuracy(logistic_predictions, truths) |> r3
```

`0.625`

```
accuracy(forest_predictions, truths) |> r3
```

`0.5`
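Accuracy alone doesn't show where the models go wrong. As a quick extra check, which was not in the original post, MLJ's `confusion_matrix` cross-tabulates the test predictions against the truths:

```
# Break the test-set predictions down per class.
confusion_matrix(logistic_predictions, truths)
```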

```
using MLJBase

# Probabilistic predictions are needed for the ROC curve and AUC.
logistic_predictions = MLJ.predict(logistic, rows=test)
logistic_fprs, logistic_tprs, _ = roc_curve(logistic_predictions, truths)
logistic_auc = auc(logistic_predictions, truths) |> r3
```

`0.686`

```
forest_predictions = MLJ.predict(forest, rows=test)
forest_fprs, forest_tprs, _ = roc_curve(forest_predictions, truths)
forest_auc = auc(forest_predictions, truths) |> r3
```

`0.571`

```
logistic_df = DataFrame(
    x = logistic_fprs,
    y = logistic_tprs,
    method = "logistic"
)
forest_df = DataFrame(
    x = forest_fprs,
    y = forest_tprs,
    method = "forest"
)
roc_df = vcat(logistic_df, forest_df)

# Plot both ROC curves with a smoother plus scatter points.
xy = data(roc_df)
xy *= smooth() + visual(Scatter)
xy *= mapping(
    :x => "False positive rate",
    :y => "True positive rate",
    color=:method)
draw(xy)
```

By doing a train-test split, we effectively threw part of the data away. For small datasets, like the one in this example, that is not very efficient. Therefore, we also run a k-fold cross-validation.

```
Random.seed!(123)
rng = MersenneTwister(123)
indexes = shuffle(rng, eachindex(classes))
folds = MLDataUtils.kfolds(indexes, k = 8)

# Fit `model` on the train rows and return the rounded accuracy on the test rows.
function fitted_accuracy(model, train, test)
    mach = machine(model, (U = df.U, V = df.V), df.class)
    fit!(mach; rows=train)
    predictions = predict_mode(mach, rows=test)
    return accuracy(predictions, classes[test]) |> r3
end

accuracies = [fitted_accuracy(logistic_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
```

`([0.7, 0.7, 0.4, 0.6, 0.7, 0.8, 0.7, 0.6], 0.65)`

```
accuracies = [fitted_accuracy(forest_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
```

`([0.5, 0.6, 0.3, 0.5, 0.5, 0.7, 0.5, 0.4], 0.5)`
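For completeness, MLJ can also handle the resampling itself. A minimal sketch using `evaluate!` with a `CV` resampling strategy; this was not part of the original post, so the folds, and therefore the numbers, will differ from the manual ones above:

```
# Let MLJ create the folds, fit the model on each, and report accuracy.
mach = machine(forest_model, (U = df.U, V = df.V), df.class)
evaluate!(mach;
    resampling = CV(nfolds = 8, shuffle = true, rng = StableRNG(123)),
    measure = accuracy,
    verbosity = 0)
```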

Biau, G., & Scornet, E. (2016). A random forest guided tour. TEST, 25, 197–227. https://doi.org/10.1007/s11749-016-0481-7
