Denali: A Goal-directed Superoptimizer
This blog post was written for Dr. James Bornholt’s CS 395T: Systems Verification and Synthesis, Spring 2021. It summarizes the context and contributions of the paper Denali: A Goal-directed Superoptimizer, written by Dr. Rajeev Joshi, Dr. Greg Nelson, and Dr. Keith Randall.
None of the technical ideas discussed in this blog are my own; they are summaries and explanations based on the referenced works.
Today’s paper is Denali: A Goal-directed Superoptimizer. At the time of its publication (2002), it was one of the first superoptimizers: a code generator which seeks to find truly optimal code. This is a dramatically different approach from traditional compiler optimizations, and is usually applied to efficiency-critical straight-line kernels written at the assembly level.
Background
What is Super🦸optimization?
Plenty of compilers are optimizing compilers. However, in the strictest sense of the word, they don’t really find an optimal translation. They just find one that, according to some heuristics, ought to improve upon a naive translation. Why? Finding optimal translations is, in general, undecidable. Even for simplified, decidable versions of the problem, it is prohibitively time consuming to insert into any mortal programmer’s build-run-debug development cycle.
However, sometimes it is worth the effort to find a truly optimal solution. To disambiguate between these two “optimization” procedures, we use the term superoptimization when we are seeking a “truly optimal” solution. Superoptimization is an offline procedure and typically targets straight-line sequences of machine code inside critical loops.
With a few simplifying assumptions, the shortest straight-line code is the fastest. Consequently, we seek the shortest program.
Super🦸optimization: The Pre-Denali Era (Beginning of Time – 2002)
Alexia Massalin coined the term “superoptimization” in her 1987 paper Superoptimizer: A Look at the Smallest Program. Massalin used a (pruned) exhaustive search to find the shortest implementation of various straight-line computations in the 68000 instruction set. For instance, she found the shortest programs to compute the signum function, absolute value, max/min, and others. Her goal was to identify unintuitive idioms in these shortest programs so that performance engineers could use them in practice.
While Massalin’s technique was powerful, it did not scale well (the shortest programs were at most 13 instructions long in Massalin’s paper). Moreover, the output programs were not automatically verified to be equivalent to the input programs. Instead, they were highly likely to be equivalent, and had to be verified by hand.
Granlund & Kenner followed up on Massalin’s work in 1992 with the GNU Superoptimizer. They integrated a variation of Massalin’s superoptimizer into GCC to eliminate branching.
Until 2002, research in superoptimizers seemed to stall. Judging by citations during that period, most researchers considered Massalin’s work to fit inside the field of optimizing compilers. These researchers viewed superoptimization as a useful engineering tool, but of little theoretical interest or scalability. Rather, superoptimization was seen as an interesting application of brute-force search. Massalin and the GNU Superoptimizer seemed to become a token citation in the optimizing compiler literature.
The Denali Approach
Goal-Directed vs. Brute Force
Massalin’s superoptimizer relies on brute-force search: enumerate candidate programs until you find the desired program. Given the massive size of any modern instruction set, this does not scale well. However, since we want the shortest program, we have to rely on some kind of brute-force search. Denali’s insight is that Massalin’s search algorithm was enumerating all candidate programs, instead of only enumerating relevant candidate programs.
Denali users specify their desired program as a set of (memory location, expression to evaluate) pairs. For instance, (%rdi, 2 * %rdi) is the program which doubles the value of %rdi.
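To make the specification format concrete, here is a rough sketch in Python of how such a pair might be written down. The tuple syntax and names are purely illustrative; this is not Denali’s actual input language.

```python
# Hypothetical rendering of a Denali-style specification: each entry is a
# (memory location, expression) pair describing what that location should
# hold when the generated code finishes.  Syntax is illustrative only.
spec = [
    ("%rdi", ("*", 2, "%rdi")),   # on exit, %rdi holds twice its input value
]
```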
Denali’s algorithm only “enumerates” candidate programs which it can prove are equivalent to the desired program. For efficiency, it stores this enumeration in a compact graph structure called an E-Graph, then searches the E-Graph using a SAT solver.
E-Graphs
What is an E-Graph?
An E-Graph is used to represent expressions. For instance, a literal 4 or a register value %rdi is represented as a node with no children.
The expression %rdi * 4 is represented as a node ‘*’ whose first child represents %rdi and whose second child represents 4.
Bigger expressions are represented just like you would think. For instance, the expression %rdi * 4 + 1 would be represented as a ‘+’ node whose first child is the sub-graph for %rdi * 4 and whose second child is the node for 1.
So far, this just looks like an Abstract Syntax Tree. E-Graphs are distinguished from ASTs by the ability to represent multiple equivalent expressions. Suppose we wish to add the equivalence 4 = 2**2 to our E-Graph. We do this by adding nodes for 2**2 and connecting them to the node for 4 with a special equivalence edge.
Since there is no machine exponentiation instruction, this does not look useful at first. However, now we can add a further equivalence edge based on the fact that %rdi << 2 = %rdi * 2**2 = %rdi * 4.
Since E-Graphs represent A=B by keeping both A and B around, they can become quite massive.
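To make the structure concrete, here is a minimal Python sketch of an E-Graph, with equivalence classes maintained by a small union-find. The class and method names are my own invention for illustration; Denali’s actual implementation is more sophisticated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ENode:
    op: str          # operator or leaf, e.g. "*", "+", "4", "%rdi"
    children: tuple  # e-class ids of the operands (empty for leaves)

class EGraph:
    """A toy E-Graph: e-nodes grouped into equivalence classes (illustrative only)."""
    def __init__(self):
        self.parent = []    # union-find over e-class ids
        self.classes = {}   # canonical e-class id -> set of ENodes

    def find(self, c):
        while self.parent[c] != c:
            c = self.parent[c]
        return c

    def add(self, op, children=()):
        node = ENode(op, tuple(self.find(c) for c in children))
        for cid, nodes in self.classes.items():
            if node in nodes:          # node already present: reuse its class
                return self.find(cid)
        cid = len(self.parent)
        self.parent.append(cid)
        self.classes[cid] = {node}
        return cid

    def merge(self, a, b):
        """Add an 'equivalence edge': the two e-classes denote equal values.
        (Real E-Graphs also re-canonicalize child pointers after merges.)"""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
            self.classes[ra] |= self.classes.pop(rb)
        return ra

# Build %rdi * 4 and record the fact 4 = 2**2.
g = EGraph()
rdi   = g.add("%rdi")
four  = g.add("4")
two   = g.add("2")
pow22 = g.add("**", (two, two))
mul   = g.add("*", (rdi, four))
g.merge(four, pow22)   # 4 and 2**2 now live in the same e-class
```

Calling merge(four, pow22) records 4 = 2**2 without discarding either representation, which is exactly why E-Graphs can grow so large.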
How do we build E-Graphs?
We can use proof rules to repeatedly grow the E-Graph and/or add equivalence edges. If we keep applying our proof rules until our graph stops changing, then we’ve deduced all the ways we can provably compute our expression (relative to our proof rules). For instance, in the previous example we had only three proof rules:
- 4 = 2**2
- x * 2**n = x << n
- If a = b, and b = c, then a = c
If we add more proof rules, we may be able to deduce faster ways to compute our expression.
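The driver loop for this process is simple. Below is a sketch that continues the toy E-Graph above; the pattern matching each rule performs against the axioms is abstracted away, since a real matcher (like Denali’s) is far more involved.

```python
def saturate(egraph, rules, max_iters=20):
    """Apply rewrite rules until the e-graph stops changing (or we give up).

    Each rule is a function that inspects the e-graph and returns a list of
    (class_a, class_b) pairs it can prove equal; how a rule matches its
    pattern against the graph is omitted here.
    """
    for _ in range(max_iters):
        changed = False
        for rule in rules:
            for a, b in rule(egraph):
                if egraph.find(a) != egraph.find(b):
                    egraph.merge(a, b)
                    changed = True
        if not changed:
            break   # fixpoint: every equality these rules can prove is recorded
    return egraph
```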
Other uses of E-Graphs
An early variant of E-Graphs is described in the Ph.D. thesis of Greg Nelson (one of the Denali authors). Nelson used them in the automated theorem prover Simplify for equational reasoning. Since then, search over E-Graphs via the congruence closure algorithm has been used by many modern SMT solvers for reasoning about equality of uninterpreted functions (it is even taught in Dr. Isil Dillig’s CS 389L course here at UT!). For example, the Z3 SMT solver implements an E-graph, and the CVC4 solver implements an incremental congruence closure.
Search over E-Graphs
Nodes in an E-Graph that are connected by an equivalence edge represent expressions that are equivalent according to the proof rules. Therefore, we only need to evaluate one of the nodes. Denali uses a SAT solver to figure out the optimal choice of nodes, and the encoding is not too complicated.
The basic idea of the encoding is as follows:
For each machine instruction node T and each cycle i, introduce a Boolean variable L(i, T) that is 1 if T starts executing on cycle i and 0 otherwise.
Then, all we have to do is add constraints so that
- Exactly one instruction starts executing per cycle.
- Each instruction’s arguments are available when the instruction gets executed.
- Some node equivalent to the root node gets computed.
Now we can find the shortest program encoded in our E-Graph by constraining the SAT solver to look for a program of length 1, then length 2, then length 3, and so on, until we find a solution.
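Here is a rough sketch of both the constraints and the iterative-deepening loop. It follows the simplified constraints listed above, not the paper’s full encoding (which models the target machine’s timing in more detail), and it uses a hypothetical `sat_solve` helper in place of a real SAT solver. All names are invented for illustration.

```python
from itertools import product

def find_shortest_program(nodes, deps, equiv, root_class, sat_solve, max_len=8):
    """Iterative deepening over program length (a rough sketch).

    nodes:      candidate machine-instruction e-nodes
    deps[n]:    the e-classes whose values node n needs as arguments
                (leaves such as registers and literals are assumed available
                from the start and are excluded)
    equiv[c]:   the nodes belonging to e-class c
    root_class: the e-class holding the desired result
    sat_solve:  assumed helper that takes CNF clauses (lists of signed ints)
                and returns a satisfying assignment, or None if unsatisfiable
    """
    for k in range(1, max_len + 1):
        var = {}
        def L(i, n):
            # Boolean variable "node n starts executing on cycle i"
            return var.setdefault((i, n), len(var) + 1)

        clauses = []
        for i in range(k):
            # Exactly one instruction starts per cycle.
            clauses.append([L(i, n) for n in nodes])
            for a, b in product(nodes, nodes):
                if a is not b:
                    clauses.append([-L(i, a), -L(i, b)])
            # Arguments must be available: if n runs on cycle i, some node
            # computing each argument class c must run on an earlier cycle.
            for n in nodes:
                for c in deps[n]:
                    clauses.append([-L(i, n)] +
                                   [L(j, m) for j in range(i) for m in equiv[c]])
        # Some node equivalent to the root must be computed on some cycle.
        clauses.append([L(i, m) for i in range(k) for m in equiv[root_class]])

        model = sat_solve(clauses)
        if model is not None:
            return k, model   # k is the shortest length this E-Graph admits
    return None
```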
Impact of the Denali Superoptimizer
Preliminary Results
The Denali paper presents several preliminary results. For the Alpha instruction set architecture, they are able to generate some programs of length up to 31 instructions. For comparison, the GNU superoptimizer is unable to generate (near) optimal instruction sequences of length greater than 5.
However, in addition to Denali’s built-in architectural axioms, the programmers specify program-specific axioms in their examples. This trades off automation for the ability to generate longer (near) optimal instruction sequences.
Super🦸optimization: The Post-Denali Era (2002 – Present Day)
Denali demonstrated that, for small programs, it is possible to generate provably equivalent, (nearly) optimal code. Since then, there has been a lot of interest in superoptimization. Here are some projects/papers that have popped up since Denali.
- Souper is an open-source project that extracts straight-line code from the LLVM IR and applies superoptimization. It uses caching so that it can be run online (2017).
- slumps is based on Souper and targets WebAssembly (2020).
- STOKE is a superoptimizer for x86 (2013) (which we will be reading on March 22 after spring break)
- Uses stochastic enumerative search.
- Is still maintained and open source at StanfordPL/stoke.
- Embecosm, a compiler research group, is developing GSO 2.0 (2015)
- There has been research into automatic peephole optimizations (2006).
However, while there is active industry and research interest in the problem that Denali presented (finding a provably equivalent, near-optimal translation), most modern approaches (e.g. Souper) rely on SMT-based synthesis techniques. Denali’s methods of superoptimization seem to have largely fallen by the wayside. Part of this is because Denali’s provably (almost) optimal program relies on a set of user-specified axioms, and is only optimal with respect to those axioms. Part of the appeal of an SMT solver is standardized theories for certain objects and operations.
Both enumerative search (e.g. STOKE) and goal-directed search (e.g. Souper) are used today. In addition, Denali’s general notion of specification (a set of symbolic input-output pairs) is still used, with various project-specific modifications. Projects still rely on (often hand-written) heuristics to measure the cost/cycle-count of candidate programs.
Discussion Questions
- As computation times are increasingly memory-bound, does superoptimization run into concerns with Amdahl’s law?
- SMT solvers are powerful tools, but incredibly general-purpose. What types of computations are likely to be compute-bound, and can we use that domain-specific knowledge to make superoptimization faster?
- Superoptimization seems naturally related to automated vectorization. However, people seem to treat the two problems as separate. Is there any reason automated vectorization might make superoptimization much more difficult?
- BLIS is a framework for creating BLAS implementations for new architectures by only implementing small, finite computation kernels. Can BLIS be combined with superoptimization to automatically generate BLAS libraries for specific architectures?
References
All graphs were built using graphviz. The example E-Graph in section What is an E-Graph? is based on the example in today’s paper.