
Why I still recommend Julia (for Data Science)

2022-06-25

Yuri Vishnevsky wrote that he no longer recommends Julia. This caused lengthy discussions on Hacker News, Reddit and the Julia forum. Yuri argues that Julia shouldn't be used in any context where correctness matters. Based on the number and the ferocity of the comments, it would be natural to conclude that Julia as a whole must produce incorrect results and therefore cannot be a productive environment. However, the scope of the blog post and of the discussions is narrow. In general, I still recommend Julia for data science applications because it is fundamentally productive and, with care, correct.

Classless

To understand Julia, we first have to understand Julia's lack of classes. In object-oriented programming, classes allow objects and the methods on those objects to be put in one place. The idea is that when you are, for example, writing business logic around books, you want to neatly put the operations on a book near the definition of the book. One such operation can be to sell. The sell function can do things like marking the book object as sold. This can work fine. You can even add new classes such as bicycles and spoons, and add a sell method to them, as long as they all inherit from the same superclass, say, Product. This Product class would define an empty sell method which needs to be extended by all subclasses; this allows the compiler and static linter tools to know that any Product subclass has a sell method, which is great for the discoverability of those methods. So, as an object-oriented programmer, a lot of thought is put into finding the right inheritance hierarchy so that methods can be reused between different classes. Also, class and object variables hold the data and are coupled to the methods too. In summary, methods are coupled to other methods and to data.

This coupling runs counter to the idea of modularity. Modular artifacts are easier to produce and maintain. For example, mechanical artifacts such as cars, bridges and bikes are not glued together but instead use fasteners such as nuts and bolts. If these products were all glued together, then the tiniest mistake in the production of one of these artifacts would mean that it has to be thrown away immediately. The same holds for maintenance. In most cases, repairing a glued-together artifact is more expensive than replacing it with a new one. And this is exactly the problem with object-oriented programming. In essence, the whole system is glued together, since not only are methods glued to their class and data, but classes are also glued to each other.

This brings us to the description of Julia's class system: Julia doesn't have one. Instead of linking functions to classes, all functions are separate objects which can be located via the global method table.
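For example, the data and the behavior that a class-based language would put together in one Book class live in two separate places in Julia. As a minimal sketch (this Book type and sell function are hypothetical, not from an actual package):

struct Book
    title::String
    sold::Bool
end

# Behavior lives in a standalone function, outside any class.
# A new Book is returned because plain structs are immutable.
sell(book::Book) = Book(book.title, true);

Any module can later add its own sell methods for its own types, without inheriting from a shared superclass. Before we can dive into the method table, we have to start by describing how function calls work in Julia.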

Multiple dispatch

Instead of a hierarchy on objects, Julia implements a hierarchy on types. For example, the code

plus_one(x::Number) = x + 1;

defines a function plus_one with one method. In this blog post, the semicolon denotes that the output shouldn't be printed. The method takes one object x of type Number. We can extend this function by adding another method:

plus_one(x::String) = "$(x) 1";

Now, calling the function with a number will increase the number by one and calling it with a string will add a 1 at the end:

plus_one(0)
1
plus_one("my text")
"my text 1"

Important to note is that Julia will always call the most specialized method for the given types. For instance, we could add another method for Float64:

plus_one(x::Float64) = error("Float64 is not supported");

to cause the function call to error for a 64-bit floating-point number. We can also add a fallback with

plus_one(x) = error("Type $(typeof(x)) is not supported");

to get

plus_one(Symbol("my text"))
ErrorException("Type Symbol is not supported")

Each addition of a method modifies the method table. For each function, there is a table which allows Julia's compiler to convert each function call to the right method call. The table also handles dispatch on the types of multiple arguments at once, which is much more involved, hence the name multiple dispatch.
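To make the "multiple" concrete, consider a hypothetical combine function where the chosen method depends on the types of both arguments:

combine(x::Number, y::Number) = x + y;
combine(x::String, y::Number) = "$(x) $(y)";

combine(1, 2)
3
combine("value:", 2)
"value: 2"

This is what enables Julia's composability.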

Composability

Due to these method tables, Julia allows arbitrary combinations of packages to work together. For example, the Measurements.jl package can be used to propagate uncertainties caused by physical measurements. We can define such a measurement as follows:

using Measurements: measurement
u = measurement(4.5, 0.1)

$$4.5 \pm 0.1$$

and pass the measurement into plus_one:

plus_one(u)

$$5.5 \pm 0.1$$

Even though plus_one was written without having Measurements in mind, it still works. The reason is that plus_one calls the function + on its input argument x and 1. Julia looks this up in the method table for + and realizes that it has to call a + method defined by Measurements, which handles the operation. And this is what surprises me a bit about the aforementioned critique of Julia. The critique was that many packages don't always produce the right results when used together. Basically, the critique takes the standpoint that there is a huge number of package combinations that can be misused together, and so the language is inherently broken. Instead, I'd say that there is a huge number of package combinations that can be used together, providing the language with an enormous amount of possibilities. It is up to the programmer to verify the interaction before use and, preferably, add tests to one of the packages. In other languages, the programmer would need to manually write glue code to make packages work together. As argued above, glued-together code is hard to create and maintain.
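You can inspect this lookup yourself. The @which macro from the InteractiveUtils standard library (loaded by default in the REPL) reports which method a call dispatches to; for u + 1 it points at a + method inside Measurements rather than one from Base (the exact signature and file location depend on the Measurements version):

using InteractiveUtils: @which

# Reports the method that this call dispatches to.
@which u + 1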

Performance

As mentioned before, much happens via the method tables. However, such a lookup isn't quick. Imagine having a loop over thousands of numbers and doing a table lookup in each iteration. Julia's trick is to resolve the lookups during the compilation phase and to generate highly optimized LLVM code. This allows Julia programs to be quick if desired. In a nutshell, if you want to optimize Julia code, the goals are to reduce allocations, improve type stability, and improve inferability. The language provides introspection tools to work towards these targets, such as @code_warntype and @code_llvm, which show, respectively, the generated typed code and LLVM code. For more information, see my blog post on optimizing Julia code. Therefore, in comparison to many other high-level languages, Julia is one of the few where writing fast code doesn't require expertise in a low-level and a high-level language at the same time, and one of the few where you don't have to glue your high-level code to low-level code.
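As a minimal sketch of this workflow, both macros can be called on the plus_one function from above (they live in the InteractiveUtils standard library, which the REPL loads by default):

using InteractiveUtils: @code_warntype, @code_llvm

# Shows the code with inferred types; the REPL highlights unstable ones.
@code_warntype plus_one(1)

# Shows the LLVM code that the compiler generated for plus_one(::Int).
@code_llvm plus_one(1)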

Compilation time

A valid critique of Julia is the compilation time or, more precisely, the time to first X (TTFX), where X can be anything such as an HTTP response, a plot, or a prediction from a model. Julia's dynamic nature, the expressiveness that the compiler allows, and the decision to aggressively optimize the LLVM code cause much time to be spent in compilation. In practice, loading a few libraries and calling a few functions can easily take a few minutes, whereas the second call to the same function can take less than a second. Restarting the Julia process triggers compilation from scratch again. The mitigation is to keep the process alive and to live-reload code, with tools such as Revise.jl or Pluto.jl. With this workflow, recompilation is faster than in ahead-of-time compiled languages, because only the methods that changed are recompiled. Hence, the development workflow is to start the process, get a cup of coffee while it loads for the first time, and keep the process alive for the rest of the day.
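For example, a typical Revise.jl session looks roughly like this ("analysis.jl" is a hypothetical script name):

using Revise

# includet tracks the file; after editing and saving it, only the
# methods that changed are recompiled, without restarting Julia.
includet("analysis.jl")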

As a side note, there is hope that the compilation time will be reduced further. There are various complementary angles of attack. For example, Julia 1.8 reduces compilation time by up to 50%, and this could be reduced further by caching binary code in Julia 1.9. In many cases, the compilation time can also be reduced by writing easy-to-compile code, which is often synonymous with easy-to-read code. Finally, it is also likely that CPUs will become faster over time and that tooling around reducing TTFX will improve.

Consistency

One factor influencing how easy it is to learn a language is consistency. The more consistent a language is, the less rote memorization is needed for edge cases. This is something that has to be done right from the start of a language, since breaking changes are hard to sell to users. Julia has spent much effort on getting the APIs right for Julia 1.0. One such example is code vectorization. In many languages, writing fast code for arrays means passing the array into a function as a whole; only then can the function apply optimizations such as parallelization or SIMD. However, when writing a function that takes an array, what should it return? Logical reasoning would say an array. Some language communities say an array, a dataframe, a vector of strings containing the text and the numbers or, well, anything. As a user, you're required to learn these idiosyncrasies by heart. In Julia, the idea is to instead let users decide how to call functions at the call site and to optimize that. For example, users can easily write SIMD loops via LoopVectorization.jl or use the built-in broadcasting, which generates specialized code for the loops. For instance, we can apply broadcasting via the dot operator to the plus_one function:

plus_one.([1, 2, 3])
3-element Vector{Int64}:
 2
 3
 4

which will be optimized to a SIMD loop, as can be verified in the REPL via @code_typed plus_one.([1, 2, 3]) or @code_llvm plus_one.([1, 2, 3]).

Operating system independence

Another reason that I still recommend Julia is that it handles platform independence well. By this, I mean that packages as a whole are easy to move between operating systems, not just the language itself. To achieve this, Julia has solved the problems of package management and of distributing third-party binaries.

Package management is handled by the built-in package manager Pkg.jl. Simply put, instead of spending effort on multiple package managers, one is maintained by the core language team and used by everyone. The manager supports environments, like Python's virtualenv, and a manifest specifying all direct and indirect dependencies for reproducibility. The environments allow installing multiple versions of the same package at the same time, and the manager supports private registries for corporate usage. In practice, Pkg.jl works flawlessly and is easy to use.
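A minimal sketch of this workflow ("myproject" is a hypothetical directory name):

using Pkg

# Create or activate a project-local environment, like a virtualenv.
Pkg.activate("myproject")

# Add a dependency; Project.toml and Manifest.toml now record the exact
# versions of all direct and indirect dependencies.
Pkg.add("Measurements")

# On another machine, install exactly the versions from the manifest.
Pkg.instantiate()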

At the same time, third-party binaries, such as the LaTeX engine Tectonic or the GR Framework for visualization, can be provided via the Artifacts.jl system. Binaries are specified by a URL and a hash; the system then automatically downloads the right binaries during package installation and caches them. On top of this, Yggdrasil goes one step further and bundles third-party binaries. It does this via an automated build system comparable to nixpkgs. The system compiles many binaries from scratch to support as many platforms as possible. Now a crazy fact: often, Yggdrasil's binaries are available for more platforms than the original binaries! Like nixpkgs, the binaries can be dynamically linked to other binaries already existing in the system, say, upstream binaries. This avoids having to statically compile everything and, hence, avoids creating large binaries. For the interested reader, it works by wrapping the environment in a temporary environment with links to the upstream binaries.
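As a rough sketch, an entry in a package's Artifacts.toml looks like this (the name, URL, and hashes below are placeholders, not real values):

# Artifacts.toml
[example_binary]
git-tree-sha1 = "0000000000000000000000000000000000000000"
lazy = true

[[example_binary.download]]
url = "https://example.com/example_binary.x86_64-linux-gnu.tar.gz"
sha256 = "0000000000000000000000000000000000000000000000000000000000000000"

During package installation, the tarball is downloaded from the URL, verified against the sha256 hash, and cached.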

In practice, these systems mean that you almost never need to configure your operating system. In other words, you only have to install Julia or grab the default Julia Docker image. You need little system configuration and no virtualization for CI or production environments. Docker or apt-get are mostly restricted to cases where packages aren't yet available in Julia.

Expressiveness and conclusion

To finally answer why I still recommend Julia, I have to talk about expressiveness. Expressiveness, in my opinion, is synonymous with productivity. Expressiveness is about how concisely and readily an idea can be expressed in a language (Farmer, 2010). It is similar to Dijkstra's idea that programs should be elegant. Put in more modern terms, I'd say that a more expressive language requires fewer characters to implement a solution to a problem while maintaining human readability.

From this definition, it becomes clear that a language's expressiveness relates to the idea or problem at hand. At this moment, I wouldn't call Julia expressive for problems where the core packages are missing, static compilation is a must-have, garbage collection is undesirable, or low compilation times are essential. Here, by "core packages", I mean foundational packages for things like HTTP, dataframes, images, or neural networks. However, the situation is different when the core packages are available, a dynamic language is acceptable, and garbage collection and compilation time are acceptable. In that sweet spot, Julia shines, because packages compose, code can be made fast without gluing it to a low-level language, and the language is consistent and easy to move between operating systems.

Looking from this perspective, I don't see any other language coming even close to Julia in terms of expressiveness. Even more so, as the language reduces its time to first X and as more packages become available, I expect its expressiveness to take the lead in many more domains. This means that Julia is already a very productive language and that it will become more and more productive for data science applications over time.

That is why I still recommend Julia.

Acknowledgements

Thanks to Jose Storopoli, Scott Paul Jones, and Magnus Lie Hetland for providing suggestions for this text.

Built with Julia 1.10.5 and Measurements 2.11.0.