There are a few overlooked areas when trying to find and murder slow code, besides the regular offenders: accidental reflection, IO/logging, and doing too much work. Here are a few lessons. But first:
A brief obligatory warning
Complexity is the perpetual wolf outside the door. Performant code can easily obscure intent and harm clarity. Premature optimization… what’s that quote?
Try not to guess or assume things about your code’s performance, as the modern stack is complex. Always measure before you start fixing the perceived bottleneck. I use YourKit, which my employer generously purchased. It does a whole lot more than CPU and memory profiling, and I’ve barely scratched the surface of it’s capabilities. (I’m interested if anyone has any good Clojure-specific filtersets for it.)
Having said that, some observations…
Do you really need state there?
My colleague is working on a genetic algorithm simulator for robotic vacuums. Basically each trial run simulates a single robot exploring the room, picking up dust. To model it, the room should obviously be an indexable grid of cells, either a two-dimensional vector or an unrolled single vector. But should this be a collection of simple maps, or atoms of simple maps? Since the vacuum is the only agent of change in the room, and there aren’t multiple vacuums or processes, it is easier and faster to run the simulation without doing atom-bashing. Just return the whole state of the room as a fn of the previous state. Thanks to structural sharing, it’s no more memory allocation. There are non-trivial benefits to doing this as well. Replace (reduce) with (reductions) and you get a seq of states of the room that you can inspect – or animate!
(time) is not a profiler
Benchmarking the JVM is hard, benchmarking a LISP in the JVM is really hard. criterium is a library that does better, scientific benchmarking. So can YourKit or jvisualvm.
Transients
If there are rules about #= club and macros, then the rule for transient club should be:
Don’t use transients before you’ve profiled.
But have no qualms using (into), as it accumulates over transients internally. The transient lives and dies within mutation, never enjoying adulthood. Less experienced Clojure users shouldn’t know these exist, otherwise they will inadvertently reimplement Java collections.
Lesser known data structures
If you need a map sorted by key, (sorted-map). Ditto, but sorted by value, (clojure.data.priority-map/priority-map). Look for calls to (apply min), or (apply min-key).
Cond you help me out?
In a hot loop with a multi-clause cond statement, put your clauses in descending order of likelihood. If you put a test condition that is truthy 99% of the time at the end of the clauses, or as the :else clause, execution has to walk through all the previous clauses, failing. Put the most likely escape routes first, without harming clarity.
1 2 3 4 5 6 7 8 9 10 | |
(case) is a little-known variant for branching over compile-time constants.
To vec or not to vec
The use of transients underscores a more general question about using appropriate types. Let’s say that you have a vector, and a bunch of fns related to processes around it. Remember that because core/map and filter return sequential structures, it’s really easy for a vector to accidentally downgrade into a sequential, slowing down all calls to nth. Watch for this at function boundaries, and any point that you see ->>. To prevent this issue, here are some questions to ask:
- Are you unnecessarily specifying representation? If so, stop and rethink.
- Do I really need a vector within these functions internally?
- What are the entry and exit points of this process, and do they need vectors?
- If so, who should ensure the appropriate structure?
The new “framework” clojure.core.reducers doesn’t suffer from the conflation of transformation and representation.
Pretty-printing
Watch for pprint calls. This falls under IO/logging, and it’s never an efficient call in any language.
Division
This is not Clojure-specific, but division is slow, even with integers. I was using modulus to loop over slots in a ring buffer, thinking it was straight-forward. Nope, huge bottleneck. The LMAX team knows this very well. They implemented a crazy fast trading system over a ring buffer in the JVM, not C++. Instead of dividing, which takes many more cycles than addition, they did made the size of the ring buffer a power of two and used bit-shifting instead. Much faster! Maybe I’ll talk about primitive/numeric hinting in a future post… Daniel Solano Gómez did a great talk on numerics.
Complected code
The performance of intertwined or tightly-coupled code is undefined. If you can’t reason over it, can you measure it?
For more real-world discussion, Craig Andera gave a great talk about real-world performance, not just about Clojure, but a more holistic approach.