2022-06-25

Why I still recommend Julia (for Data Science)

Yuri Vishnevsky wrote that he no longer recommends Julia. This caused lengthy discussions at Hacker News, Reddit and the Julia forum. Yuri argues that Julia shouldn't be used in any context where correctness matters. Based on the amount and the ferocity of the comments, it is natural to conclude that Julia as a whole must produce incorrect results and therefore cannot be a productive environment. However, the scope of the blog post and the discussions are narrow. In general, I still recommend Julia for data science applications because it is fundamentally productive and, with care, correct.

Show more

2022-03-19

Optimizing Julia code

I'm lately doing for the first time some optimizations of Julia code and I sort of find it super beautiful.

This is how I started a message on the Julia language Slack in response to a question about why optimising Julia code is so difficult compared to other languages. In the message I argued against that claim. Optimising isn't hard in Julia if you compare it to Python or R where you have to be an expert in Python or R and C/C++. Also, in that message I went through a high-level overview of how I approached optimising. The next day, Frames Catherine White, who is a true Julia veteran, suggested that I write a blog post about my overview, so here we are.

Show more

2022-02-16

Static site authentication

More and more companies start providing functionality for static site hosting. For example, GitHub announced Pages in 2008, Netlify was founded in 2014, GitLab annouced pages in 2016 and Cloudflare launched a Pages beta in 2020. Nowadays, even large general cloud providers, such as Digital Ocean, Google, Microsoft or Amazon, have either dedicated products or dedicated tutorials on static site hosting.

In terms of usability, the products score similarly. Setting up hosting usually involves linking a Git account to a hoster which will allow the hoster to detect changes and update the site based on the latest state of the repository. In terms of speed, the services are also pretty similar.

Show more

2021-12-01

Bayesian Latent Profile Analysis (mixture modeling)

Updated on 2021-12-15: Include ordered constraint.

This post discusses some latent analysis techniques and runs a Bayesian analysis for example data where the outcome is continuous, also known as latent profile analysis (LPA). My aim will be to clearly visualize the analysis so that it can easily be adjusted to different contexts.

In essence, latent analyses about finding hidden groups in data (Oberski, 2016). Specifically, they are called mixture models because the underlying distributions are mixed together.

Show more

2021-11-17

Collinear Bayes

In my post on Shapley values and multicollinearity, I looked into what happens when you fit a complex uninterpretable model on collinear or near-collinear data and try to figure out which features (variables) are important. The results were reasonable but not great. Luckily, there are still more things to try. Gelman et al. (2020) say that Bayesian models can do reasonably well on collinear data because they show high uncertainty in the estimated coefficients. Also, Bayesian models have a chance of fitting the data better as is beautifully shown in the Stan documentation. It can be quite tricky to implement though because a good parameterization is necessary (https://statmodeling.stat.columbia.edu/2019/07/07/collinearity-in-bayesian-models/).

Show more

2021-10-27

Nested cross-validation

Nested cross-validation is said to be an improvement over cross-validation. Unfortunately, I found most explanations quite confusing, so decided to simulate some data and see what happens.

In this post, I simulate two models: one linear model which perfectly fits the data and one which overfits the data. Next, cross-validation and nested cross-validation are plotted. To keep the post short, I've hidden the code to produce the plots.

import MLJLinearModels
import MLJDecisionTreeInterface

using DataFrames: DataFrame, select, Not
using Distributions: Normal
using CairoMakie: Axis, Figure, lines, lines!, scatter, scatter!, current_figure, axislegend, help, linkxaxes!, linkyaxes!, xlabel!, density, density!, hidedecorations!, violin!, boxplot!, hidexdecorations!, hideydecorations!
using MLJ: CV, evaluate, models, matching, @load, machine, fit!, predict, predict_mode, rms
using Random: seed!
using Statistics: mean, std, var, median
using MLJTuning: TunedModel, Explicit
using MLJModelInterface: Probabilistic, Deterministic

Show more

2021-06-16

Increasing model accuracy by using foreknowledge

Typically, when making predictions via a linear model, we fit the model on our data and make predictions from the fitted model. However, this doesn't take much foreknowledge into account. For example, when predicting a person's length given only the weight and gender, we already have an intuition about the effect size and direction. Bayesian analysis should be able to incorporate this prior information.

In this blog post, I aim to figure out whether foreknowledge can, in theory, increase model accuracy. To do this, I generate data and fit a linear model and a Bayesian binary regression. Next, I compare the accuracy of the model parameters from the linear and Bayesian model.

Show more

2021-01-21

Random forest classification in Julia

Below is example code for fitting and evaluating a linear regression and random forest classifier in Julia. I've added the linear regression as a baseline for the random forest. The models are evaluated on a mock variable U generated from two distributions, namely

\begin{aligned}
d_1 &= \text{Normal}(10, 2) \: \: \text{and} \\
d_2 &= \text{Normal}(12, 2),
\end{aligned}

The random variable V is just noise meant to test the classifier, generated via

Show more

2021-01-21

Random forest, Shapley values and multicollinearity

Linear statistical models are great for many use-cases since they are easy to use and easy to interpret. Specifically, linear models can use features (also known as independent variables, predictors or covariates) to predict an outcome (also known as dependent variables).

In a linear model, a higher coefficient for a feature, the more a feature played a role in making a prediction. However, when variables in a regression model are correlated, these conclusions don't hold anymore.

Show more

2020-12-16

GitHub and GitLab commands cheatsheet

Both GitHub and GitLab provide shortcuts for interacting with the layers they have built on top of Git. These shortcuts are a convenient and clean way to interact with things like issues and PRs. For instance, using Fixes #2334 in a commit message will close issue #2334 automatically when the commit is applied to the main branch. However, the layers on top of Git differ between the two, and therefore the commands will differ as well. This document is a cheatsheet for issue closing commands; I plan to add more of these commands over time.

Show more

◀ prev

▶ next