Bayesian Latent Profile Analysis (mixture modeling)

Updated on 2021-12-15: Include ordered constraint.

This post discusses some latent analysis techniques and runs a Bayesian analysis for example data where the outcome is continuous, also known as latent profile analysis (LPA). My aim will be to clearly visualize the analysis so that it can easily be adjusted to different contexts.

In essence, latent analyses are about finding hidden groups in data (Oberski, 2016). Specifically, they are called mixture models because the underlying distributions are mixed together.
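To make "mixed together" concrete, here is a minimal sketch that is not taken from the post itself. It assumes the Distributions and StableRNGs packages; the two components, their weights, the seed and the sample size are made up for illustration.

```julia
using Distributions: MixtureModel, Normal, pdf
using StableRNGs: StableRNG

# Hypothetical example: two hidden groups with different means.
components = [Normal(0, 1), Normal(4, 1)]
weights = [0.6, 0.4]
mixture = MixtureModel(components, weights)

# The mixture density is the weighted sum of the component densities.
x = 1.0
manual = sum(w * pdf(c, x) for (w, c) in zip(weights, components))
@assert pdf(mixture, x) ≈ manual

# Samples come from both groups, but the group labels are not observed.
rng = StableRNG(1)
samples = rand(rng, mixture, 1_000)
```

A latent profile analysis works in the opposite direction: it starts from something like `samples` and tries to recover the components and weights.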

2021-11-17

Collinear Bayes

In my post on Shapley values and multicollinearity, I looked into what happens when you fit a complex uninterpretable model on collinear or near-collinear data and try to figure out which features (variables) are important. The results were reasonable but not great. Luckily, there are still more things to try. Gelman et al. (2020) say that Bayesian models can do reasonably well on collinear data because they show high uncertainty in the estimated coefficients. Also, Bayesian models have a chance of fitting the data better as is beautifully shown in the Stan documentation. It can be quite tricky to implement though because a good parameterization is necessary (https://statmodeling.stat.columbia.edu/2019/07/07/collinearity-in-bayesian-models/).

2021-10-27

Nested cross-validation

Nested cross-validation is said to be an improvement over cross-validation. Unfortunately, I found most explanations quite confusing, so I decided to simulate some data and see what happens.

In this post, I simulate two models: one linear model which perfectly fits the data and one which overfits the data. Next, cross-validation and nested cross-validation are plotted. To keep the post short, I've hidden the code to produce the plots.

import MLJLinearModels
import MLJDecisionTreeInterface

using DataFrames: DataFrame, select, Not
using Distributions: Normal
using CairoMakie: Axis, Figure, lines, lines!, scatter, scatter!, current_figure, axislegend, help, linkxaxes!, linkyaxes!, xlabel!, density, density!, hidedecorations!, violin!, boxplot!, hidexdecorations!, hideydecorations!
using MLJ: CV, evaluate, models, matching, @load, machine, fit!, predict, predict_mode, rms
using Random: seed!
using Statistics: mean, std, var, median
using MLJTuning: TunedModel, Explicit
using MLJModelInterface: Probabilistic, Deterministic

2021-06-16

Increasing model accuracy by using foreknowledge

Typically, when making predictions via a linear model, we fit the model on our data and make predictions from the fitted model. However, this doesn't take much foreknowledge into account. For example, when predicting a person's height given only their weight and gender, we already have an intuition about the effect size and direction. Bayesian analysis should be able to incorporate this prior information.

In this blog post, I aim to figure out whether foreknowledge can, in theory, increase model accuracy. To do this, I generate data and fit a linear model and a Bayesian binary regression. Next, I compare the accuracy of the model parameters from the linear and Bayesian model.

2021-01-21

Random forest classification in Julia

Below is example code for fitting and evaluating a linear regression and random forest classifier in Julia. I've added the linear regression as a baseline for the random forest. The models are evaluated on a mock variable U generated from two distributions, namely

$\begin{aligned} d_1 &= \text{Normal}(10, 2) \: \: \text{and} \\ d_2 &= \text{Normal}(12, 2). \end{aligned}$
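As a rough sketch of how a variable like U could be simulated from these two distributions: the sample size, seed, class labels and the half-and-half split below are assumptions for illustration and are not taken from the post.

```julia
using Distributions: Normal
using Random: seed!

seed!(1)  # hypothetical seed, only for reproducibility

d1 = Normal(10, 2)
d2 = Normal(12, 2)

n = 80  # hypothetical sample size

# Assumed layout: the first half is drawn from d1, the second half from d2.
U = vcat(rand(d1, n ÷ 2), rand(d2, n ÷ 2))

# Hypothetical class labels that record which distribution each value came from.
class = repeat(["a", "b"]; inner = n ÷ 2)
```

Because the two means are only one standard deviation apart, the classes overlap heavily, which is what makes this a meaningful test for a classifier.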

The random variable V is just noise meant to test the classifier, generated via

2021-01-21

Random forest, Shapley values and multicollinearity

Linear statistical models are great for many use-cases since they are easy to use and easy to interpret. Specifically, linear models can use features (also known as independent variables, predictors or covariates) to predict an outcome (also known as dependent variables).

In a linear model, the higher the coefficient for a feature, the more that feature played a role in making a prediction. However, when variables in a regression model are correlated, these conclusions don't hold anymore.

2020-12-16

GitHub and GitLab commands cheatsheet

Both GitHub and GitLab provide shortcuts for interacting with the layers they have built on top of Git. These shortcuts are a convenient and clean way to interact with things like issues and PRs. For instance, using Fixes #2334 in a commit message will close issue #2334 automatically when the commit is applied to the main branch. However, the layers on top of Git differ between the two, and therefore the commands will differ as well. This document is a cheatsheet for issue closing commands; I plan to add more of these commands over time.

2020-11-29

Design cheatsheet

I like to complain that design can distract from the main topic and is therefore not important. However, design is important. If your site, presentation or article looks ugly, then you already are one step behind in convincing the audience. The cheatsheet below can be used to quickly fix design mistakes.

Colors

Surprisingly, you should Never Use Black. Instead, you can use colors that are near black. For example:

Tint	HTML color code	Example text
Pure black	#000000	Lorem ipsum dolor sit amet
Grey	#4D4D4D	Lorem ipsum dolor sit amet
Green	#506455	Lorem ipsum dolor sit amet
Blue	#113654	Lorem ipsum dolor sit amet
Pink	#564556	Lorem ipsum dolor sit amet

2020-11-14

Frequentist and Bayesian coin flipping

To me, it is still unclear what exactly the difference is between frequentist and Bayesian statistics. Most explanations involve terms such as "likelihood", "uncertainty" and "prior probabilities". Here, I'm going to show the difference between the two statistical paradigms by using a coin flipping example. In the examples, the effect of showing more data to both paradigms will be visualised.

Generating data

Let's start by generating some data from a fair coin flip, that is, the probability of heads is 0.5.

import CairoMakie

using AlgebraOfGraphics: Lines, Scatter, data, draw, visual, mapping
using Distributions
using HypothesisTests: OneSampleTTest, confint
using StableRNGs: StableRNG
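A minimal sketch of this data-generating step could look as follows. The seed and the number of flips are assumptions chosen for illustration, not values from the post.

```julia
using Distributions: Bernoulli
using StableRNGs: StableRNG

rng = StableRNG(1)  # hypothetical seed
n = 100             # hypothetical number of flips
p_true = 0.5        # fair coin: probability of heads

# Each draw is one coin flip; true means heads.
flips = rand(rng, Bernoulli(p_true), n)

# Proportion of heads observed after each additional flip.
proportion = cumsum(flips) ./ (1:n)
```

The running proportion is a convenient starting point for showing how an estimate changes as more data comes in.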

2020-11-07

Installing NixOS with encryption on a Lenovo laptop

In this post, I walk through the steps to install NixOS on a Lenovo Yoga 7 with an encrypted root disk. This tutorial is mainly based on the tutorial by Martijn Vermaat and comments by @ahstro and dbwest.

USB preparation

Download NixOS and figure out the location of the USB drive with lsblk. Use the location of the drive and not the partition, so /dev/sdb instead of /dev/sdb1. Then, prepare the USB with