Updated on 2021-12-15: Include ordered constraint.

This post discusses some latent analysis techniques and runs a Bayesian analysis for example data where the outcome is continuous, also known as latent profile analysis (LPA). My aim will be to clearly visualize the analysis so that it can easily be adjusted to different contexts.

In essence, latent analyses about finding hidden groups in data (Oberski, 2016). Specifically, they are called mixture models because the underlying distributions are mixed together.

For example, suppose we had data about dog weights and did not register the breed. If we assume that the sample only consists of Australian Sheperds and American Terriers, that American Terriers are larger on average, and that both groups are normally distributed, then we can take the observed data and estimate the latent distributions. For example, we can generate some data:

begin
    using Bijectors: OrderedBijector, ordered, inv
    using CairoMakie
    using DataFrames
    using StableRNGs: StableRNG
    using Statistics: mean, median, std
    using Turing
end

d1 = Normal(22, 2.4); # Australian Shepherd.

d2 = Normal(28, 2.8); # American Terrier.

combined = let
    rng = StableRNG(1)
    [i % 2 == 1 ? rand(rng, d1) : rand(rng, d2) for i in 1:200]
end;

and visualize it:

Next to LPA, this problem is also known as gaussian (finite) mixture modelling. When the observed variables are discrete, the appropriate model is known as latent class analysis (LCA) or binomial (finite) mixture model. LCA is also known as latent Dirichlet allocation in the machine learning literature.

As a sidenote, Latent Dirichlet allocation is one of the older methods used for natural language processing tasks (Blei et al., 2003). By old, here, I mean that it is one of the methods used before the advent of deep learning around 2013. A typical natural language processing tasks would be to classify documents. In this setting, LDA can be used to interpret text by finding words with a similar "topic". For example, the topic can be "education" with words such as "school", "students", "schools" and "teachers" (Blei et al., 2003). This example also shows one of the main differences between LDA usage in machine learning and LCA usage in science. Both are essentially the same models, but where social science often sticks to a few latent classes to manually interpret, machine learning happily runs the model for 100 latent classes (Blei et al., 2003; p. 1007).

First attempt

Enough background, let's fit a latent profile model via Turing. Based on the Turing tutorial and a great Discourse post, we can fit a nice model which runs in less than a minute:

@model function exchangeable_mixture(k::Int, Y)
    w ~ Dirichlet(k, 1)
    
    μ ~ filldist(Normal(25, 4), 2)
    
    n = length(Y)
    Y ~ filldist(MixtureModel(Normal, μ, w), n)
end;

exchangeable_model = exchangeable_mixture(2, combined);

n_samples = 1_000;

Just to be sure, let's sample from the prior to see how things look:

exchangeable_prior = sample(exchangeable_model, Prior(), n_samples);

This looks as expected. When looking at the observed distribution in the first figure in this blog, it is not unreasonable to set the means for the μ priors around the middle of the data. With that, the sampler has a nice place to start and should be able to estimate the parameters reasonably quickly.

The w specifies the weights of the latent classes. In the combined data, I've drawn half of the samples from d1 and half of the samples from d2, so the weights should be 0.5 and 0.5. This Dirichlet prior is a, so called, unit simplex. To me it just looks like an uniform prior, but I guess there is a good reason for the unit simplex term. This prior is reasonable because it doesn't tell the sampler much about the location of the weights except that they are between 0 and 1.

So, the prior look good. It's time to obtain the posterior. To do this, we need to use HMC and not NUTS. Normally, NUTS is the best sampler to use, but for latent models NUTS is often having problems. Online, people seem to suggest that this is due to the multimodality, that is, because there are two solutions to this problem:

μ1 estimates the mean of Sheperds and μ2 estimates the mean of Terriers OR
μ2 estimates the mean of Teriers and μ2 estimates the mean of Shepers.

This is called the identifiability problem (Casella & Berger, 2002), the label switching problem (Obserki, 2016) or labeling degeneracies (Betancourt, 2017).

So, let's sample 3 chains in parallel with HMC:

exchangable_posterior = let
    rng = StableRNG(1)
    sampler = HMC(0.2, 10)
    sample(rng, exchangeable_model, sampler, MCMCThreads(), n_samples, 3)
end;

Hmm. That didn't work. When the one or more chains don't move at all (a horizontal line in the left plot) for mixture models, then try reducing the leapfrog step size (the first argument to HMC).

exchangable_posterior_smaller_stepsize = let
    rng = StableRNG(1)
    sampler = HMC(0.01, 10)
    sample(rng, exchangeable_model, sampler, MCMCThreads(), n_samples, 4)
end;

That looks much better. However, now we're dealing with the label switching problem. Normally, to get the parameter estimate, we could just take the mean over all the chains. In this case, we couldn't do that and instead should take the mean over only one chain? That would work, but isn't ideal either.

Fixing the label switching

Betancourt (2017) suggests using an ordered prior for μ. Via Bijectors.jl.OrderedBijector this should be possible in Turing.jl too. Unfortunately, I wasn't able to figure it out. (It appears that the Stan model is transforming things to the log scale and that works well together with the ordered prior. I'm too lazy to convert things to the log scale and back again, so that's why I'm not doing that.)

As a workaround, I came up with the idea to enforce the ordering in another way, namely to cutoff the range of possible values via two non-overlapping uniform distributions. This can be thought of as drawing a line through the middle of the two means which will lock both parameters in their own region.

@model function mm(k::Int, Y)
    w ~ Dirichlet(k, 1)
    
    μ1 = Uniform(10, 25)
    μ2 = Uniform(25, 40)
    μ ~ arraydist([μ1, μ2])
    
    n = length(Y)
    Y ~ filldist(MixtureModel(Normal, μ, w), n)
end;

mixture_model = mm(2, combined);

mixture_model_prior = sample(mixture_model, Prior(), n_samples);

After a bit of fiddling and increasing the number of leapfrog steps to use (the second argument to HMC), this shows chains with nice convergence and mixing:

mixture_posterior = let
    rng = StableRNG(1)
    sampler = HMC(0.01, 20)
    sample(rng, mixture_model, sampler, MCMCThreads(), n_samples, 3)
end;

	parameters	mean	std	mcse	ess_bulk	ess_tail	rhat	ess_per_sec
1	Symbol("w[1]")	0.549693	0.0843669	0.0116713	77.4745	34.9351	1.07559	2.11511
2	Symbol("w[2]")	0.450307	0.0843669	0.0116713	77.4745	34.9351	1.07559	2.11511
3	Symbol("μ[1]")	22.1324	1.19495	0.347121	24.4974	18.9747	1.10219	0.668797
4	Symbol("μ[2]")	28.6765	1.36629	0.138542	148.005	30.5804	1.08689	4.04064

And we now have almost correct estimates for all parameter locations, see the mean column.

To make this even better, it is possible to tune the HMC parameters based on how well the chains converge or by setting a slightly more informative prior. In this case, by setting two non-overlapping truncated normals:

@model function normals_mm(k::Int, Y)
    w ~ Dirichlet(k, 1)
    
    μ1 = truncated(Normal(24, 4), -Inf, 25)
    μ2 = truncated(Normal(26, 4), 25, Inf)
    μ ~ arraydist([μ1, μ2])

    n = length(Y)
    Y ~ filldist(MixtureModel(Normal, μ, w), n)
end;

normal_mixture_model = normals_mm(2, combined);

normal_mixture_posterior = let
    rng = StableRNG(1)
    sampler = HMC(0.02, 40)
    sample(rng, mixture_model, sampler, MCMCThreads(), n_samples, 3)
end;

	parameters	mean	std	mcse	ess_bulk	ess_tail	rhat	ess_per_sec
1	Symbol("w[1]")	0.508567	0.169839	0.0664015	7.33405	6.09653	2.52923	0.335547
2	Symbol("w[2]")	0.491433	0.169839	0.0664015	7.33405	6.22694	2.52923	0.335547
3	Symbol("μ[1]")	18.4868	2.32791	0.938956	6.65775	6.0	3.19946	0.304605
4	Symbol("μ[2]")	34.0229	3.97056	1.61304	6.64	NaN	3.20655	0.303793

Finally, this seems to properly converge. Alas, the model is now quite sensitive to choice of prior and choice of sampler parameters.

Variational inference

Some say that variational inference (VI) can deal much better with mixed models than Markov chain Monte Carlo. (I forgot the reference but read it in some paper while trying to debug models.) Let's put that claim to the test.

VI doesn't have such a nice interface as the Monte carlo based models, but we can run multithreaded sampling with only a few lines of code. The outcomes are put in a DataFrame here to allow for easier plotting:

function sample_vi(model; samples_per_step=10, max_iters=1_000)
    n_chains = 3
    dfs = Vector{DataFrame}(undef, n_chains)
    colnames = names(mixture_model_prior, :parameters)
    Threads.@threads for i in 1:n_chains
        q = vi(model, ADVI(samples_per_step, max_iters))
        M = rand(q, n_samples)::Matrix{Float64}
        df = DataFrame(transpose(M), colnames)
        df[!, :chain] = fill(i, nrow(df))
        df[!, :iteration] = 1:nrow(df)
        dfs[i] = df
    end
    vcat(dfs...)
end;

vi_posterior = sample_vi(mixture_model);

	parameters	mean	std
1	Symbol("w[1]")	0.53518	0.0356584
2	Symbol("w[2]")	0.46482	0.0356584
3	Symbol("μ[1]")	21.6457	0.107201
4	Symbol("μ[2]")	28.6288	0.122096

This outcome is

closer to the correct outcome than the Monte Carlo based posteriors,
is easier to get right (less fiddling with sampler parameters) and
runs about 4 times as fast at the time of writing (about 20 seconds for Monte Carlo samplers above versus 5 seconds for VI).

One caveat though: don't run only one VI chain on the exchangeable model above! It will happily give a completely incorrect outcome without showing sampling problems! To avoid that, run multiple VI chains like shown here.

Ordered constraint

The drawback of the earlier solutions to the label switching problem is that an estimate should be available for the location of the distributions. This isn't always the case, especially when the problem would involve more latent distributions than presented here. A better solution would be to enforce an ordering on the means μ, that is, to enforce that μ1 ≤ μ2 ≤ ... ≤ μn for n means. With such an ordering, it is impossible for the labels to switch.

After lots and lots of fiddling, I did manage to use an ordered prior in Turing.jl. Thanks to help by Tor Fjelde in a GitHub issue. The trick is to use the Bijectors.OrderedBijector, put the desired values through the inverse of the bijector and put these outcomes in an ordered(arraydist(...)).

Also, for technical reasons, the numbers put through the inverse cannot be the same, so that's why the second number is slightly larger. I've fiddled a bit with how much difference there is between the numbers and a smaller difference shows a better prior plot, but worse HMC posterior. Very strange.

inv_ordered(X::Vector) = Bijectors.inverse(Bijectors.OrderedBijector())(X);

M = inv_ordered([25, 25.01])

2-element Vector{Float64}:
 25.0
 -4.6051701859879355

@model function ordered_mixture(k::Int, Y)
    w ~ Dirichlet(k, 1)

    μ ~ ordered(arraydist([Normal(m, 4) for m in M]))

    n = length(Y)
    Y ~ filldist(MixtureModel(Normal, μ, w), n)
end;

ordered_model = ordered_mixture(2, combined);

In the end, the HMC sampler keeps being the most robust. I've tried NUTS(1_000, 20) too and it did work albeit taking minutes to finish and giving erratic w estimates. Also, I've tried VI and that just didn't work with the ordering constraint and no amount of sampler parameter tuning seemed to solve the problem.

So, there we go, let's see the best model for the data:

ordered_hmc_posterior = let
    rng = StableRNG(1)
    sampler = HMC(0.001, 100)
    sample(rng, ordered_model, sampler, MCMCThreads(), 2 * n_samples, 3)
end;

One last thing. Convergence takes a while, so let's throw away the first 700 samples.

ordered_warmed_posterior = let
    rng = StableRNG(1)
    sampler = HMC(0.001, 100)
    discard_initial = 1000
    sample(rng, ordered_model, sampler, MCMCThreads(), 2 * n_samples, 3; discard_initial)
end;

	parameters	mean	std	mcse	ess_bulk	ess_tail	rhat	ess_per_sec
1	Symbol("w[1]")	0.536398	0.0356455	0.00129203	761.0	1651.15	1.0021	4.12846
2	Symbol("w[2]")	0.463602	0.0356455	0.00129203	761.0	1651.15	1.0021	4.12846
3	Symbol("μ[1]")	21.6523	0.109502	0.00282875	1498.73	2775.08	1.00107	8.13068
4	Symbol("μ[2]")	28.6743	0.110521	0.0117081	89.0494	230.993	1.0255	0.483098

Awesome.

Conclusion

It is very tricky to get mixture modeling right with Markov chain Monte Carlo. When sampling from the posterior, the sampler couldn't deal well with the label switching problem. Fiddling like this with Bayesian samplers has benefits too, namely that the problematic sampling did indicate a problem in the model specification.

There were two solutions: One solution is to define very strong location priors so that μ[1] and μ[2] have difficulty switching places. This works with VI and, in turn, reduces the running time and is less sensitive to the choice of sampler parameters. The drawback is that much information is required to know where the location should be or what is a good cutoff point. Another solution is to use the ordered constraint. Unfortunately, I couldn't get this to work with VI nor with NUTS. Therefore, some manual tuning of sampler parameters is required to get a good outcome. The running time is reasonable with about 20 seconds for the last model in this post.

Overall, it took about six days to write this blog which is a bit more than I would have guessed. The main reason why it took so long was that Turing.jl provides a lot of flexibility when defining the models. In other words, there are many ways in which you can shoot yourself in the foot. At the same time, it's really great to know that tuning models to specific use-cases is possible unlike frequentist models. Hopefully, this post will provide a helpful foundation for more complex Bayesian mixed models.

References

Betancourt, M. Identifying Bayesian Mixture Models. Stan documentation. https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. the Journal of machine Learning research, 3, 993-1022. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.

Casella, G. & Berger, R. L. (2002). Statistical Inference. Second edition. Cengage learning.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. Chapman and Hall/CRC. https://doi.org/10.1201/9780429258411.

Oberski D. L. (2016) Mixture Models: Latent Profile and Latent Class Analysis. In: Robertson J., Kaptein M. (eds) Modern Statistical Methods for HCI. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-319-26633-6_12.

Built with Julia 1.10.4 and

Bijectors 0.13.13
CairoMakie 0.11.11
DataFrames 1.6.1
StableRNGs 1.0.2
Statistics 1.10.0
Turing 0.30.9