Optimizing Julia code

I'm lately doing for the first time some optimizations of Julia code and I sort of find it super beautiful.

This is how I started a message on the Julia language Slack in response to a question about why optimising Julia code is so difficult compared to other languages. In the message I argued against that claim. Optimising isn't hard in Julia if you compare it to Python or R where you have to be an expert in Python or R and C/C++. Also, in that message I went through a high-level overview of how I approached optimising. The next day, Frames Catherine White, who is a true Julia veteran, suggested that I write a blog post about my overview, so here we are.

In this blog post, I'll describe what type stability is and why it is important for performance. Unlike most other posts, I'll discuss it in the context of performance (raw throughput) and in the context of time to first X (TTFX). Julia is sort of notorious for having really bad TTFX in certain cases. For example, creating a plot with the Makie.jl package takes 40 seconds at the time of writing. On the second call, it takes about 0.001 seconds. This blog post explains the workflow that you can use to reduce running time and TTFX.

Type stability

Let's first talk about that type stability thing that everyone keeps talking about. Why is it important? To show this, let's write naive Julia code. Specifically, for this example, we write code which can hide the type from the compiler, that is, we need to add some kind of indirection so that the compiler cannot infer the types. This can be done via a dictionary. Note that our dictionary returns different types, namely an Float32 and a Float64:

numbers = Dict(:one => 1f0, :two => 2.0);

function double(mapping, key::Symbol)
    return 2 * mapping[key]
end;

This code works, we can pass :one or :two and the number will be doubled:

double(numbers, :one)

2.0f0

double(numbers, :two)

4.0

Let's look at the optimized LLVM code via @code_warntype. Here, you can ignore the with_terminal; it's only needed because this blog post is running in a Pluto.jl notebook.

MethodInstance for double(::Dict{Symbol, AbstractFloat}, ::Symbol)
  from double(mapping, key::Symbol) @ Main REPL[2]:1
Arguments
  #self#::Core.Const(Main.double)
  mapping::Dict{Symbol, AbstractFloat}
  key::Symbol
Body::Any
1 ─ %1 = Main.:*::Core.Const(*)
│   %2 = Base.getindex(mapping, key)::AbstractFloat
│   %3 = (%1)(2, %2)::Any
└──      return %3

When looking at this in a terminal, the Any type at line %3 and the AbstractFloat at line %2 are highlighted in red. Also, the Any behind Body is highlighted in red.

Ouch. The optimized code looks quite good with one Base.getindex and a 2 * %1, but we do get some big red warnings about the output type which is an Any. That color indicates that something is wrong. What is wrong is that an Any type cannot easily be put into a memory spot. For a concrete type such as Float64, we know how much space we need so we don't need a pointer and we can even put the number nearer to the CPU so that it can quickly be accessed. To see whether a type is concrete, we can use isconcretetype:

isconcretetype(Float64)

true

isconcretetype(AbstractFloat)

false

To make matters worse, Julia does a lot of optimizing, but it cannot do much for abstract types. For example, let's write two very simple functions:

function use_double(mapping, x)
    doubled = 2 * double(mapping, x)
    string(doubled)
end;

use_double(numbers, :one)

"4.0"

This is how the @code_warntype looks:

@code_warntype use_double(numbers, :one)

MethodInstance for use_double(::Dict{Symbol, AbstractFloat}, ::Symbol)
  from use_double(mapping, x) @ Main REPL[4]:1
Arguments
  #self#::Core.Const(Main.use_double)
  mapping::Dict{Symbol, AbstractFloat}
  x::Symbol
Locals
  doubled::Any
Body::Any
1 ─ %1 = Main.:*::Core.Const(*)
│   %2 = Main.double(mapping, x)::Any
│        (doubled = (%1)(2, %2))
│   %4 = Main.string::Core.Const(string)
│   %5 = doubled::Any
│   %6 = (%4)(%5)::Any
└──      return %6

Here, the Any type at line %2, %5, and %6 are abstract types so in red when viewed in a terminal. Also, the Any behind Body is again highlighted in red. And the doubled local variable is red too.

The Any type propagated. Now, also the use_naive_double function has an Any output type. And, the type of the variable doubled isn't known when the function is compiled meaning that the call string(doubled) ends up being a runtime dispatch. This means that Julia has to lookup the right method during running time in the method lookup table. If the type was known, Julia would just hardcode the link to the right method and thus avoid a method table lookup or it would just copy-paste the content of the function to avoid jumping at all. This is called inlining.

To see that in action, let's go on a little digression and take a look at optimised code for the case when the types are known. For this, consider two simple functions:

inner(x) = 2 * x;

outer(x) = 3 * inner(x);

We can now call this for, say an Int and get an output:

outer(2)

Let's look at the LLVM code for this function:

@code_llvm outer(2)

; Function Signature: outer(Int64)
;  @ REPL[8]:1 within `outer`
define i64 @julia_outer_2413(i64 signext %"x::Int64") #0 {
top:
; ┌ @ int.jl:88 within `*`
   %0 = mul i64 %"x::Int64", 6
   ret i64 %0
; └
}

Hopefully, you're now thinking "WOW!". The compiler figured out that inner is just 2 * x so there is no need to step into that function, we can just calculate 2 * x directly. But then, it figures out that 2 * 3 * x = 6 * x, so we can get the answer in one LLVM instruction.

On the other hand, what if we add a Base.inferencebarrier to block inference inside the outer function:

blocked_outer(x) = 3 * inner(Base.inferencebarrier(x));

@code_llvm blocked_outer(2)

; Function Signature: blocked_outer(Int64)
;  @ REPL[10]:1 within `blocked_outer`
define nonnull ptr @julia_blocked_outer_2429(i64 signext %"x::Int64") #0 {
top:
  %jlcallframe1 = alloca [2 x ptr], align 8
  %gcframe2 = alloca [3 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 24, i1 true)
  %pgcstack = call ptr inttoptr (i64 6655373676 to ptr)(i64 261) #8
  store i64 4, ptr %gcframe2, align 16
  %task.gcstack = load ptr, ptr %pgcstack, align 8
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %pgcstack, align 8
  %box_Int64 = call nonnull align 8 dereferenceable(8) ptr @ijl_box_int64(i64 signext %"x::Int64") #2
  %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
  store ptr %box_Int64, ptr %gc_slot_addr_0, align 16
  store ptr %box_Int64, ptr %jlcallframe1, align 8
  %0 = call nonnull ptr @ijl_apply_generic(ptr nonnull @"jl_global#2433.jit", ptr nonnull %jlcallframe1, i32 1)
  store ptr %0, ptr %gc_slot_addr_0, align 16
  store ptr @"jl_global#2436.jit", ptr %jlcallframe1, align 8
  %1 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
  store ptr %0, ptr %1, align 8
  %2 = call nonnull ptr @ijl_apply_generic(ptr nonnull @"jl_global#2435.jit", ptr nonnull %jlcallframe1, i32 2)
  %frame.prev7 = load ptr, ptr %frame.prev, align 8
  store ptr %frame.prev7, ptr %pgcstack, align 8
  ret ptr %2
}

To see the difference in running time, we can compare the output @benchmark for both:

using BenchmarkTools: @benchmark

@benchmark outer(2)

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.840 ns … 39.890 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.070 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.123 ns ±  1.198 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                               ▄█▄▇▃
  ▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▃▂▃▃▃▄▅▅▇▆█████▆▆▃▃▃▂▂▂ ▃
  1.84 ns        Histogram: frequency by time        3.36 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark blocked_outer(2)

BenchmarkTools.Trial: 10000 samples with 988 evaluations.
 Range (min … max):  46.660 ns … 195.516 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     57.996 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.208 ns ±   5.836 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▅▆██▇▇▄▁
  ▁▁▁▁▁▁▁▂▂▂▂▃▃▄▅▇█████████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▁▂▂▂▂▂▂▂▁▁ ▃
  46.7 ns         Histogram: frequency by time         79.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So, even though benchmarks below 1 ns aren't reliable, we can see that the inferable function (outer) is much faster. Next, we'll show that this is not all due to having the extra call to Base.inferencebarrier.

We've seen that knowing the types is important for the compiler, so let's improve the type inference for the function above. We could fix it in a few ways. We could add a type hint at the function. For example, a type hint could look like this:

function with_type_hint(x)
    Base.inferrencebarrier(x)::Int
end;

With this, the output type of the function body is known:

@code_warntype with_type_hint(1)

MethodInstance for with_type_hint(::Int64)
  from with_type_hint(x) @ Main REPL[12]:1
Arguments
  #self#::Core.Const(Main.with_type_hint)
  x::Int64
Body::Int64
1 ─ %1 = Base.inferrencebarrier::Any
│   %2 = (%1)(x)::Any
│   %3 = Main.Int::Core.Const(Int64)
│   %4 = Core.typeassert(%2, %3)::Int64
└──      return %4

which solves further inference problems if we use this method, but it is a bit risky. The Core.typeassert will assert the type and throw an error if the type turns out to be wrong. This hinders writing generic code. Also, it takes the system a little bit of time to actually assert the type.

So, instead it would be better to go to the root of the problem. Above, we had a dictionary numbers:

numbers

Dict{Symbol, AbstractFloat} with 2 entries:
  :two => 2.0
  :one => 1.0

The type is:

typeof(numbers)

Dict{Symbol, AbstractFloat}

Where AbstractFloat is a abstract (non-concrete) type meaning that it cannot have direct instance values, and more importantly meaning that we cannot say with certainty which method should be called for an object of such a type.

We can make this type concrete by manually specifying the type of the dictionary. Now, Julia will automatically convert our Float32 to a Float64:

typednumbers = Dict{Symbol, Float64}(:one => 1f0, :two => 2.0);

Let's look again to the @code_warntype:

@code_warntype use_double(typednumbers, :one)

MethodInstance for use_double(::Dict{Symbol, Float64}, ::Symbol)
  from use_double(mapping, x) @ Main REPL[4]:1
Arguments
  #self#::Core.Const(Main.use_double)
  mapping::Dict{Symbol, Float64}
  x::Symbol
Locals
  doubled::Float64
Body::String
1 ─ %1 = Main.:*::Core.Const(*)
│   %2 = Main.double(mapping, x)::Float64
│        (doubled = (%1)(2, %2))
│   %4 = Main.string::Core.Const(string)
│   %5 = doubled::Float64
│   %6 = (%4)(%5)::String
└──      return %6

Great! None of the types are red (abstract) when shown in a terminal. So, this is now exactly the same function as above, but all the types are concrete and the compiler is happy.

Let's run the benchmarks for both numbers and typednumbers:

@benchmark use_double(numbers, :one)

BenchmarkTools.Trial: 10000 samples with 434 evaluations.
 Range (min … max):  207.373 ns …  34.702 μs  ┊ GC (min … max): 0.00% … 98.86%
 Time  (median):     274.493 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   299.830 ns ± 478.759 ns  ┊ GC (mean ± σ):  5.18% ±  3.92%

                   ▂▄▆█▆▂
  ▁▃▂▂▂▂▂▂▂▂▂▃▃▄▅▇▇██████▅▃▃▃▃▃▃▃▄▄▄▄▄▄▃▃▃▂▂▂▂▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  207 ns           Histogram: frequency by time          404 ns <

 Memory estimate: 168 bytes, allocs estimate: 5.

@benchmark use_double(typednumbers, :one)

BenchmarkTools.Trial: 10000 samples with 867 evaluations.
 Range (min … max):  125.767 ns …  16.412 μs  ┊ GC (min … max):  0.00% … 98.36%
 Time  (median):     168.858 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   209.523 ns ± 328.130 ns  ┊ GC (mean ± σ):  15.29% ± 11.09%

  ▅█▅▁  ▂▂                                                      ▁
  ████▇███▆▄▃▁▃▅▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▅▄▃▅▄▄▃▄▅▄▄▄▄▄▅▃▅▅ █
  126 ns        Histogram: log(frequency) by time       1.97 μs <

 Memory estimate: 432 bytes, allocs estimate: 3.

So, that's a reduction in running time which we basically got for free. The only thing we needed to do was look through our naive code and help out the compiler a bit by adding more information.

And this is exactly what I find so beautiful about the Julia language. You have this high-level language where you can be very expressive, write in whatever style you want and don't have to bother about putting type annotations on all your functions. Instead, you first focus on your proof of concept and get your code working and only then you start digging into optimizing your code. To do this, you can often get pretty far already by looking at @code_warntype.

But, what if your code contains more than a few functions? Let's take a look at some of the available tooling.

Tooling

The most common tool for improving performance is a profiler. Julia has a profiler in the standard library:

using Profile

This is a sampling-based profiler meaning that it takes samples to estimate how much time is spent in each function.

@profile foreach(x -> blocked_outer(2), 1:100)

We can now call Profile.print() to see the output and how many samples were taken in each function. However, in most cases we want to have a nice plot. Here, I use ProfileSVG.jl, but other options are also listed in the Julia Profiling documentation. See especially PProf.jl since that viewer can show graphs as well as flame graphs.

using ProfileSVG: @profview

@profview foreach(x -> blocked_outer(2), 1:10_000_000)

The idea in this plot is that the wider a block, the more time is spent on it. Also, blocks which lay on top of other block indicate that they were called inside the outer block. As can be seen, the profiler is very useful to get an idea of which function takes the most time to run. In this case, most of the time is actually spent in Julia's eval logic. When going up to see where the time is spent, we see that it's mostly spent inside blocked_outer. Given that foreach is not much wider, for example, we can see that the logic for the loop itself is not taking much time.

However, this doesn't tell us what is happening exactly. For that, we need to dive deeper and look critically at the source code of the function which takes long. Sometimes, that already provides enough information to see what can be optimized. In other cases, the problem isn't so obvious. Probably, there is a type inference problem because that can make huge differences as is shown in the section above. One way would then be to go to the function which takes the most time to run and see how the type inference looks via @code_warntype. Unfortunately, this can be a bit tricky. Consider, for example, a function with keyword arguments:

with_keyword_arguments(a; b=3) = a + b;

@code_warntype with_keyword_arguments(1)

MethodInstance for with_keyword_arguments(::Int64)
  from with_keyword_arguments(a; b) @ Main REPL[9]:1
Arguments
  #self#::Core.Const(Main.with_keyword_arguments)
  a::Int64
Body::Int64
1 ─ %1 = Main.:(var"#with_keyword_arguments#3")::Core.Const(Main.var"#with_keyword_arguments#3")
│   %2 = (%1)(3, #self#, a)::Int64
└──      return %2

Here, we don't see the a + b as we would expect, but instead see that the with_keyword_arguments calls another function without keyword arguments. Now, we would need to manually call this nested function with a generated name var"#with_keyword_arguments#1" with exactly the right inputs to see what @code_warntype does exactly inside this function. Even worse, imagine that you have a function which calls a function which calls a function...

To solve this, there is Cthulhu.jl. With Cthulhu, it is possible to @descend into a function and see the code warntype. Next, the arrow keys and enter can be used to step into a function and see the code warntype for that. By continuously stepping into and out of functions, it is much easier to see what code is calling what and where exactly the type inference starts to fail. Often, by solving a type inference problem at exactly the right spot, inference problems for a whole bunch of functions can be fixed. For more information about Cthulhu, see the GitHub page linked above.

A complementary tool to find the root of type problems is JET.jl. Basically, this tool can automate the process described above. It relies on Julia's compiler and can point to the root of type inference problems. Let's do a demo. Here, we use the optimization analysis:

using JET: @report_opt

@report_opt blocked_outer(2)

═════ 2 possible errors found ═════
┌ blocked_outer(x::Int64) @ Main ./REPL[4]:1
│ runtime dispatch detected: inner(%1::Any)::Any
└────────────────────
┌ blocked_outer(x::Int64) @ Main ./REPL[4]:1
│ runtime dispatch detected: (3 * %2::Any)::Any
└────────────────────

In this case, the tool points out exactly the problem we've had. Namely, because the function definition is 3 * inner(Base.inferencebarrier(x)), the inner function call cannot be optimized because the type is unknown at that point. Also, the output of inner(Base.inferencebarrier(x)) is unkown and we have another runtime dispatch.

For extremely long outputs, it can be useful to print the output of JET to a file to easily navigate through the output.

These are the most important tools to improve performance. If this is all you care about, then feel free to stop reading here. In the next section, let's take a look at how to reduce the time to first X.

Precompilation

As described above, Julia does lots of optimizations on your code. For example, it removes unnecessary function calls and hardcodes method calls if possible. This takes time and that is a problem. Like said above, Makie runs extremely quick after the first time that you have created a plot going from 40 seconds to something like 0.001 seconds. And, we need to wait all these seconds every time that we restart Julia. Of course, Julia developers don't develop by changing their plotting code and wait 40 seconds to see the output. We use tools such as Pluto.jl or Revise.jl to use code changes without restarting Julia. Still, sometimes it is necessary to restart Julia, so what can we do to reduce the compilation time?

Well, we can reduce the compilation time by shouting I am the compiler now! and write optimized code manually. For example, this is done in OrdinaryDiffEq.jl#1465. In some cases, this can be a great last-resort solution to make some compilation time disappear.

However, it is quite laborious and not suitable in all cases. A very nice alternative idea is to move the compilation time into the precompilation stage. Precompilation occurs right after package installation or when loading a package after it has been changed. The results of this compilation are retained even after restarting the Julia instance. So, instead of having to compile things for each restart, we just compile it only when changing the package! Sounds like a good deal.

It is a good deal. Except, we have to note that we're working with the Julia language. Not all functions have typed arguments let alone concretely typed arguments, so the precompile phase cannot always know what it should compile. Even more, Julia by default doesn't compile all functions with concretely typed arguments. It just assumes that some function will probably not be used, so no need to precompile it. This is on purpose, to avoid developers putting concrete types everywhere which would make Julia packages less composable which is a very fair argument.

Anyway, we can fix this by adding precompile directives ourselves. For example, we can create a new function, call precompile on it for integers and look at the existing method specializations:

add_one(x) = x + 1
precompile(add_one, (Int,))
methods(add_one)[1].specializations

MethodInstance for Main.var"workspace#7".add_one(::Int64)

A method specialization is just another way of saying a compiled instance for a method. So, a specialization is always for some concrete types. This method specialization shows that add_one is compiled even though we haven't called add_one yet. The function is completely ready for use for integers. If we pass another type, the function would still need to compile.

What is nice about this is that the precompile will compile everything recursively. So, say, we have a large codebase handling some kind of notebooks and the package has some kind of open function with concrete types such as a ServerSession to open the notebook into and a String with the path for the notebook location, then we can add a precompile on that function as follows:

precompile(open, (ServerSession, String))

Inside this large codebase. Since the open function is calling many other functions, the precompile will compile many functions and can reduce the time to first X by a lot. This is what happened in Pluto.jl#1934. We've added one line of code to reduce the time to first open a notebook from 11 to 8 seconds. That is a 30% reduction in running time by adding one line of code. To figure out where you need to add precompile directives exactly, you can use SnoopCompile.jl.

Alas, now you probably wonder why we didn't have a 100% reduction. The answer is type inference. precompile will go through all the functions recursively but once the type becomes non-concrete, it cannot know what to compile. To fix this, we can use the tools presented above to fix type inference problems.

In conclusion, this is what I find so beautiful about the language. You can hack your proof-of-concept together in very naive ways and then throw on a few precompiles if you want to reduce the TTFX. Then, once you need performance, you can pinpoint what method takes the most time, look at the generated LLVM code and start fixing problems such as type inference. Improving the inferability will often make code more readable, it will reduce running time and it will reduce time to first X; all at the same time.

Acknowledgements

Thanks to Michael Helton, Rafael Fourquet and Guillaume Dalle for providing feedback on this blog post.

Appendix

This post was built with Julia 1.11.3 and

BenchmarkTools 1.5.0
JET 0.9.12
PlutoUI 0.7.60
Profile 1.11.0
ProfileSVG 0.2.2