Below is example code for fitting and evaluating a linear regression and random forest classifier in Julia. If code in this post doesn't work for you, then check that you're using the right versions. I've used both models to have a baseline for the random forest. The model is evaluated on a mock variable $U$ generated from two distributions, namely

$\begin{aligned} d_1 &= \text{Normal}(\mu_1, \sigma) \: \: \text{and} \\ d_2 &= \text{Normal}(\mu_2, \sigma), \end{aligned}$where $\mu_1 = 10$, $\mu_2 = 12$ and $\sigma = 2$. The random variable $V$ is just noise meant to fool the classifier.

This data isn't meant to show that random forests are good classifiers. One way to do that would be to have more variables than observations (Biau & Scornet, 2016).

```
import MLDataUtils
using CategoricalArrays
using DataFrames
using Distributions
using Gadfly
using MLJ
using Random
n = 80
μ1 = 10
μ2 = 12
σ = 2
d1 = Normal(μ1, σ)
d2 = Normal(μ2, σ)
Random.seed!(123)
classes = categorical(rand(["A", "B"], n))
df = DataFrame(
class = categorical(classes),
U = [class == "A" ? rand(d1) : rand(d2) for class in classes],
V = rand(Normal(100, 10), n)
)
first(df, 10)
```

```
10×3 DataFrame
Row │ class U V
│ Cat… Float64 Float64
─────┼───────────────────────────
1 │ B 11.5802 111.588
2 │ A 10.0062 99.9623
3 │ A 12.719 93.8876
4 │ A 10.6082 110.054
5 │ B 8.8399 108.481
6 │ B 13.2351 99.2827
7 │ A 9.97171 90.8848
8 │ B 8.75458 113.906
9 │ A 8.96639 89.2201
10 │ B 10.3287 113.208
```

`plot(df, x = :U, y = :V, color = :class)`

Training and evaluating (testing) on the same data is not particulary useful because we want to know how well our model generalizes. For more information, see topics such as overfitting. Instead, we split the data up in a *train* and *test* set.

```
using StableRNGs
rng = StableRNG(123)
train, test = MLJ.partition(eachindex(classes), 0.7; shuffle=true, rng)
```

```
length(train) = 56
length(test) = 24
```

```
LinearBinary = @load LinearBinaryClassifier pkg=GLM verbosity=0
logistic_model = LinearBinary();
DecisionTree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
tree = DecisionTree()
forest_model = EnsembleModel(atom=tree, n=10);
logistic = machine(logistic_model, (U = df.U, V = df.V), df.class)
fit!(logistic; rows=train)
fitted_params(logistic).coef
```

```
2-element Vector{Float64}:
0.7170450596947772
0.040936567194873125
```

The second coefficient in the linear model is close to zero. This is exactly what the model should do since $V$ is random noise.

```
forest = machine(forest_model, (U = df.U, V = df.V), df.class)
fit!(forest; rows=train);
```

```
```

Now that we have fitted the two models, we can compare the accuracies and plot the receiver operating characteristic.

```
logistic_predictions = predict_mode(logistic, rows=test)
forest_predictions = predict_mode(forest, rows=test)
truths = classes[test]
r3(x) = round(x; sigdigits=3)
accuracy(logistic_predictions, classes[test]) |> r3
```

`0.625`

`accuracy(forest_predictions, classes[test]) |> r3`

`0.5`

```
using MLJBase
logistic_predictions = MLJ.predict(logistic, rows=test)
logistic_fprs, logistic_tprs, _ = roc_curve(logistic_predictions, truths)
logistic_aoc = auc(logistic_predictions, truths) |> r3
```

`0.686`

```
forest_predictions = MLJ.predict(forest, rows=test)
forest_fprs, forest_tprs, _ = roc_curve(forest_predictions, truths)
forest_aoc = auc(forest_predictions, truths) |> r3
```

`0.571`

```
plot(x = logistic_fprs, y = logistic_tprs, color = ["logistic"],
Guide.xlabel("False positive rate"),
Guide.ylabel("True positive rate estimate"),
Geom.smooth(method = :loess, smoothing = 0.99),
layer(
x = forest_fprs, y = forest_tprs, color = ["forest"],
Geom.smooth(method = :loess, smoothing = 0.99),
)
)
```

By doing a train and test split, we basically threw a part of the data away. For small datasets, like the dataset in this example, that is not very efficient. Therefore, we also do a k-fold cross-validation.

```
Random.seed!(123)
rng = MersenneTwister(123)
indexes = shuffle(rng, eachindex(classes))
folds = MLDataUtils.kfolds(indexes, k = 8)
function fitted_accuracy(model, train, test)
forest = machine(model, (U = df.U, V = df.V), df.class)
fit!(forest; rows=train)
predictions = predict_mode(forest, rows=test)
return accuracy(predictions, classes[test]) |> r3
end
accuracies = [fitted_accuracy(logistic_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
```

`([0.7, 0.7, 0.4, 0.6, 0.7, 0.8, 0.7, 0.6], 0.65)`

```
accuracies = [fitted_accuracy(forest_model, train, test) for (train, test) in folds]
accuracies, mean(accuracies) |> r3
```

`([0.5, 0.6, 0.3, 0.5, 0.5, 0.7, 0.5, 0.4], 0.5)`

Biau, G., Scornet, E. (2016). A Random Forest Guided Tour. TEST 25, 197–227 (2016). https://doi.org/10.1007/s11749-016-0481-7