The life and times of an Abstract Syntax Tree

MalBot · May 2, 2024, 2:05pm

By Francesco Bertolaccini

You’ve reached computer programming nirvana. Your journey has led you down many paths, including believing that God wrote the universe in LISP, but now the truth is clear in your mind: every problem can be solved by writing one more compiler.

It’s true. Even our soon-to-be artificially intelligent overlords are nothing but compilers, just as the legends foretold. That smart contract you’ve been writing for your revolutionary DeFi platform? It’s going through a compiler at some point.

Now that we’ve established that every program should contain at least one compiler if it doesn’t already, let’s talk about how one should go about writing one. As it turns out, this is a pretty vast topic, and it’s unlikely I’d be able to fit a thorough disquisition on the subject in the margin of this blog post. Instead, I’m going to concentrate on the topic of Abstract Syntax Trees (ASTs).

In the past, I’ve worked on a decompiler that turns LLVM bitcode into Clang ASTs, and that has made me into someone with opinions about them. These are opinions on the things they don’t teach you in school, like: what should the API for an AST look like? And how should it be laid out in memory? When designing a component from scratch, we must consider those aspects that go beyond its mere functionality—I guess you could call these aspects “pragmatics.” Let’s go over a few of them so that if you ever find yourself working with ASTs in the future, you may skip the more head-scratching bits and go straight to solving more cogent problems!

What are ASTs?

On their own, ASTs are not a very interesting part of a compiler. They are mostly there to translate the dreadful stream of characters we receive as input into a more palatable format for further compiler shenanigans. Yet the way ASTs are designed can make a difference when working on a compiler. Let’s investigate how.

Managing the unmanageable

If you’re working in a managed language like C# or Java, one with a garbage collector and a very OOP type system, your AST nodes are most likely going to look something like this:

class Expr {}
class IntConstant : Expr {
    int value;
}
class BinExpr : Expr {
    public Expr lhs;
    public Expr rhs;
}

This is fine—it serves the purpose well, and the model is clear: since all of the memory is managed by the runtime, ownership of the nodes is not really that important. At the end of the day, those nodes are not going anywhere until everyone is done with them and the GC determines that they are no longer reachable.

(As an aside, I’ll be making these kinds of examples throughout the post; they are not meant to be compilable, only to provide the general idea of what I’m talking about.)

I typically don’t use C# or Java when working on compilers, though. I’m a C++ troglodyte, meaning I like keeping my footguns cocked and loaded at all times: since there is no garbage collector around to clean up after the mess I leave behind, I need to think deeply about who owns each and every one of those nodes.

Let’s try and mimic what was happening in the managed case.

The naive approach

#include <memory>

struct Expr {
    virtual ~Expr() = default;
};
struct IntConstant : Expr {
    int value;
};
struct BinExpr : Expr {
    // Shared, reference-counted ownership of both children.
    std::shared_ptr<Expr> lhs;
    std::shared_ptr<Expr> rhs;
};

Shared pointers in C++ use reference counting (which one could argue is a form of automatic garbage collection), which means that the end result is similar to what we had in Java and C#: each node is guaranteed to stay valid at least as long as some object is holding a reference to it.

That “at least” in the previous sentence is key: if this were an Abstract Syntax Graph instead of an Abstract Syntax Tree, we’d quickly find ourselves in a situation where nodes would get stuck in a limbo of life detached from material reality, a series of nodes pointing at each other in a circle, forever waiting for someone else to die before they can finally find their eternal rest as well.

Again, this is a purely academic possibility since a tree is by definition acyclic, but it’s still something to keep in mind.
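To make the limbo concrete, here is a minimal sketch of what happens once a cycle sneaks in. The Node type and the build function are hypothetical and exist only for this illustration; they are not part of the AST above.

```cpp
#include <cassert>
#include <memory>

// Hypothetical graph node (an assumption for this sketch, not an AST type).
struct Node {
    std::shared_ptr<Node> next;
};

// Builds two nodes and, if requested, makes them point at each other.
// Returns a weak_ptr that lets the caller observe whether the first
// node is still alive once the local owners have gone out of scope.
std::weak_ptr<Node> build(bool make_cycle) {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b;
    if (make_cycle) {
        b->next = a;  // a and b now keep each other alive
    }
    // With a cycle, `a` has two owners: the local variable and b->next.
    assert(a.use_count() == (make_cycle ? 2 : 1));
    return std::weak_ptr<Node>(a);
}
```

Without the cycle, the returned weak_ptr reports expired() as soon as build returns. With the cycle, both nodes stay alive until someone manually breaks the loop, for example by resetting one of the next pointers.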

I don’t know Rust that well, but it is my understanding that a layout roughly equivalent to the one above would be written like this:

enum Expr {
    IntConstant(i32),
    BinExpr(Arc<Expr>, Arc<Expr>)
}

When using this representation, your compiler will typically hold a reference to a root node that causes the whole pyramid of nodes to keep standing. Once that reference is gone, the rest of the nodes follow suit.

Unfortunately, each pointer introduces additional computation and memory consumption due to its usage of an atomic reference counter. Technically, one could avoid the “atomic” part in the Rust example by using Rc instead of Arc, but there’s no equivalent of that in C++ and my example would not work as well. In my experience, it’s quite easy to do away with the ceremony of making each node hold a reference count altogether, and instead decide on a more disciplined approach to ownership.

The “reverse pyramid” approach

#include <memory>

struct Expr {
    virtual ~Expr() = default;
};
struct IntConstant : Expr {
    int value;
};
struct BinExpr : Expr {
    // Each parent is the sole owner of its children.
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
};

Using unique pointers frees us from the responsibility of keeping track of when to free memory without adding the overhead of reference counting. While it’s not possible for multiple nodes to have an owning reference to the same node, it’s still possible to express cyclic data structures by dereferencing the unique pointer and storing a reference instead. This is (very) roughly equivalent to using std::weak_ptr with shared pointers.

Just like in the naive approach, destroying the root node of the AST will cause all of the other nodes to be destroyed with it. The difference is that in this case we are guaranteed that this will happen, because every child node is owned by its parent and no other owning reference is possible.
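A small sketch of that guarantee follows. The live_nodes counter and the build_tree helper are instrumentation I'm assuming purely for the demonstration; a real AST would not carry them.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical instrumentation: number of nodes currently alive.
int live_nodes = 0;

struct Expr {
    Expr() { ++live_nodes; }
    Expr(const Expr&) = delete;
    Expr& operator=(const Expr&) = delete;
    virtual ~Expr() { --live_nodes; }
};

struct IntConstant : Expr {
    int value;
    explicit IntConstant(int v) : value(v) {}
};

struct BinExpr : Expr {
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
    BinExpr(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs);  // a binary node always has two children
    }
};

// Builds a five-node tree: two BinExpr nodes and three IntConstant leaves.
std::unique_ptr<Expr> build_tree() {
    auto inner = std::make_unique<BinExpr>(std::make_unique<IntConstant>(1),
                                           std::make_unique<IntConstant>(2));
    return std::make_unique<BinExpr>(std::move(inner),
                                     std::make_unique<IntConstant>(3));
}
```

Holding the result of build_tree keeps all five nodes alive; letting it go out of scope brings live_nodes back to zero without any explicit cleanup code.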

I believe this representation is roughly equivalent to this Rust snippet:

enum Expr {
    IntConstant(i32),
    BinExpr(Box<Expr>, Box<Expr>)
}

Excursus: improving the API

We are getting pretty close to what I’d call the ideal representation, but one thing I like to do is to make my data structures as immutable as possible.

BinExpr would probably look like this if I were to implement it in an actual codebase:

class BinExpr : public Expr {
    std::unique_ptr<Expr> lhs, rhs;
public:
    // Takes ownership of both children; they cannot be replaced afterwards.
    BinExpr(std::unique_ptr<Expr> lhs, std::unique_ptr<Expr> rhs)
        : lhs(std::move(lhs))
        , rhs(std::move(rhs)) {}
    const Expr& get_lhs() const { return *lhs; }
    const Expr& get_rhs() const { return *rhs; }
};

This to me signals a few things:

Nodes are immutable.
Nodes can’t be null.
Nodes can’t be moved; their owner is fixed.
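Here is a hedged sketch of how client code might consume such an API. The IntConstant accessor, the eval function, and the assumption that every BinExpr means addition are all mine; the post never says what operation a BinExpr stands for.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Expr {
    virtual ~Expr() = default;
};

class IntConstant : public Expr {
    int value;
public:
    explicit IntConstant(int value) : value(value) {}
    int get_value() const { return value; }
};

class BinExpr : public Expr {
    std::unique_ptr<Expr> lhs, rhs;
public:
    BinExpr(std::unique_ptr<Expr> lhs_in, std::unique_ptr<Expr> rhs_in)
        : lhs(std::move(lhs_in)), rhs(std::move(rhs_in)) {
        assert(lhs && rhs);  // enforces "nodes can't be null" in debug builds
    }
    const Expr& get_lhs() const { return *lhs; }
    const Expr& get_rhs() const { return *rhs; }
};

// Walks the tree through the read-only accessors.
// Assumption: every BinExpr is an addition.
int eval(const Expr& e) {
    if (const auto* c = dynamic_cast<const IntConstant*>(&e)) {
        return c->get_value();
    }
    const auto& b = dynamic_cast<const BinExpr&>(e);
    return eval(b.get_lhs()) + eval(b.get_rhs());
}
```

Because the accessors only hand out const references, a consumer like eval can read the tree but has no way to rewire it or to steal a child from its parent.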

Removing the safeguards

The next step is to see how we can improve things by removing some of the safeguards that we’ve used so far, without completely shooting ourselves in the foot. I will not provide snippets on how to implement these approaches in Rust because last time I asked how to do that in my company’s Slack channel, the responses I received were something like “don’t” and “why would you do that?” and “someone please call security.” It should not have been a surprise, as an AST is basically a linked list with extra steps, and Rust hates linked lists.

Up until now, the general idea has been that nodes own other nodes. This makes it quite easy to handle the AST safely because the nodes are self-contained.

What if we decided to transfer the ownership of the nodes to some other entity? It is, after all, quite reasonable to have some sort of ASTContext object we can assume to handle the lifetime of our nodes, similar to what happens in Clang.

Let’s start by changing the appearance of our Expr nodes:

struct BinExpr : Expr {
    const Expr& lhs;
    const Expr& rhs;
};

Now we create a storage for all of our nodes:

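vector<unique_ptr<Expr>> node_storage;
// Bind to the heap-allocated nodes, not to the vector's elements:
// the elements may move when the vector grows, the nodes never do.
Expr &lhs = *node_storage.emplace_back(make_unique<IntConstant>(...));
Expr &rhs = *node_storage.emplace_back(make_unique<IntConstant>(...));
Expr &binexp = *node_storage.emplace_back(make_unique<BinExpr>(lhs, rhs));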

Nice! node_storage is now the owner of all the nodes, and we can iterate over them without having to do a tree visit. In fact, go watch this talk about the design of the Carbon compiler, about 37 minutes in: if you keep your pattern of creating nodes predictable, you end up with a storage container that’s already sorted in, e.g., post-visit order!
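As a rough sketch of what that buys you, the following evaluates an expression with a single linear pass over the storage and no recursion. The ASTContext class, its create method, and eval_linear are hypothetical names, and I'm again assuming that BinExpr means addition and that children are always created before their parents.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

struct Expr {
    virtual ~Expr() = default;
};
struct IntConstant : Expr {
    int value;
    explicit IntConstant(int v) : value(v) {}
};
struct BinExpr : Expr {
    const Expr& lhs;
    const Expr& rhs;
    BinExpr(const Expr& l, const Expr& r) : lhs(l), rhs(r) {}
};

// Hypothetical context object that owns every node.
class ASTContext {
    std::vector<std::unique_ptr<Expr>> node_storage;
public:
    template <typename T, typename... Args>
    const T& create(Args&&... args) {
        auto node = std::make_unique<T>(std::forward<Args>(args)...);
        const T& ref = *node;  // the heap object stays put when the vector grows
        node_storage.push_back(std::move(node));
        return ref;
    }
    const std::vector<std::unique_ptr<Expr>>& nodes() const {
        return node_storage;
    }
};

// Single forward pass: by the time we reach a BinExpr, both of its
// children have already been evaluated (creation order is post-order).
int eval_linear(const ASTContext& ctx) {
    std::unordered_map<const Expr*, int> values;
    int last = 0;
    for (const auto& node : ctx.nodes()) {
        if (const auto* c = dynamic_cast<const IntConstant*>(node.get())) {
            last = c->value;
        } else {
            const auto& b = dynamic_cast<const BinExpr&>(*node);
            last = values.at(&b.lhs) + values.at(&b.rhs);
        }
        values[node.get()] = last;
    }
    return last;  // value of the most recently created node
}
```

The side table keyed by node address is just one way to carry intermediate results; a compiler pass that only needs to look at each node once would not need it at all.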

Variants on a theme

Let’s now borrow a trick from Rust’s book: the Expr class I’ve been using up until this point is an old-school case of polymorphism via inheritance. While I do believe inheritance has its place and in many cases should be the preferred solution, I do think that ASTs are one of the places where discriminated unions are the way to go.

Rust calls discriminated unions enum, whereas C++17 calls them std::variant. While the substance is the same, the ergonomics are not: Rust has first class support for them in its syntax, whereas C++ makes its users do template metaprogramming tricks in order to use them, even though they do not necessarily realize it.

The one feature I’m most interested in for going with variant instead of inheritance is that it turns our AST objects into “value types,” allowing us to store Expr objects directly instead of having to go through an indirection via a reference or pointer. This will be important in a moment.

The other feature that this model unlocks is that we get the Visitor pattern implemented for free, and we can figure out exactly what kind of node a certain value is holding without having to invent our own dynamic type casting system. Looking at you, LLVM. And Clang. And MLIR.
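For a taste of what that looks like, here is a minimal sketch using std::visit and std::holds_alternative. The overloaded helper is the usual C++17 idiom for bundling lambdas, and eval (with BinExpr treated as addition) and is_constant are names I made up for the example.

```cpp
#include <cassert>
#include <variant>

struct IntConstant;
struct BinExpr;
using Expr = std::variant<IntConstant, BinExpr>;

struct IntConstant {
    int value;
};
struct BinExpr {
    const Expr& lhs;
    const Expr& rhs;
};

// Common C++17 helper: merges several lambdas into one overload set.
template <typename... Fs>
struct overloaded : Fs... {
    using Fs::operator()...;
};
template <typename... Fs>
overloaded(Fs...) -> overloaded<Fs...>;

// The "visitor": one lambda per node kind, dispatched by std::visit.
// Assumption: every BinExpr is an addition.
int eval(const Expr& e) {
    return std::visit(overloaded{
        [](const IntConstant& c) { return c.value; },
        [](const BinExpr& b) { return eval(b.lhs) + eval(b.rhs); },
    }, e);
}

// Type queries come for free, no hand-rolled RTTI required.
bool is_constant(const Expr& e) {
    return std::holds_alternative<IntConstant>(e);
}
```

A nice side effect is that std::visit refuses to compile if the overload set can't handle one of the alternatives, so adding a new node kind points you at every visitor you need to update.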

Going off the rails

Let’s take a look back at an example I made earlier:

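vector<unique_ptr<Expr>> node_storage;
// References bind to the heap-allocated nodes, which never move.
Expr &lhs = *node_storage.emplace_back(make_unique<IntConstant>(...));
Expr &rhs = *node_storage.emplace_back(make_unique<IntConstant>(...));
Expr &binexp = *node_storage.emplace_back(make_unique<BinExpr>(lhs, rhs));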

There are two things that bother me about this: double indirection and noncontiguous memory allocation. Think of what the memory layout for this storage mechanism looks like: the vector will have a contiguous chunk of memory allocated for storing pointers to all of the nodes, then each pointer will have an associated chunk of memory the size of a node which, as mentioned earlier, varies for each kind of node.

What this means is that our nodes, even if allocated sequentially, have the potential to end up scattered all over the place. They say premature optimization is the root of all evil, but for the sake of exhausting all of the tricks I have up my sleeve, I’ll go ahead and show a way to avoid this.

Let’s start by doing what I said I’d do earlier, and use variant for our nodes:

struct IntConstant;
struct BinExpr;
using Expr = std::variant<IntConstant, BinExpr>;

struct IntConstant {
    int value;
};

struct BinExpr {
    Expr &lhs;
    Expr &rhs;
};

Now that each and every node has the same size, we can finally store them contiguously in memory:

std::vector<Expr> node_storage;
node_storage.reserve(max_num_nodes);
auto &lhs = node_storage.emplace_back(IntConstant{3});
auto &rhs = node_storage.emplace_back(IntConstant{4});
auto &binexp = node_storage.emplace_back(BinExpr{lhs, rhs});

You see that node_storage.reserve call? That’s not an optimization—that is an absolutely load-bearing part of this mechanism.

I want to make it absolutely clear that what’s happening here is the kind of thing C++ gets hate for. This is a proverbial gun that, should you choose to use it, will be strapped at your hip pointed at your foot, fully loaded and ready to blow your leg off if at any point you forget it’s there.

The reason we’re using reserve in this case is that we want to make sure that all of the memory we will potentially use for storing our nodes is allocated ahead of time, so that when we use emplace_back to place a node inside of it, we are guaranteed that that chunk of memory will not get reallocated and change address. (If that were to happen, any of our nodes that contain references to other nodes would end up pointing to garbage, and demons would start flying out of your nose.)
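One way to make the gun a little harder to fire by accident is to wrap the storage in a type that refuses to grow past what was reserved. This is only a sketch: NodeArena and its create method are hypothetical names, and throwing std::length_error is just one possible policy for a full arena.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <variant>
#include <vector>

struct IntConstant;
struct BinExpr;
using Expr = std::variant<IntConstant, BinExpr>;

struct IntConstant {
    int value;
};
struct BinExpr {
    Expr& lhs;
    Expr& rhs;
};

// Hypothetical wrapper around the reserve-then-emplace pattern.
class NodeArena {
    std::vector<Expr> node_storage;
    std::size_t limit;
public:
    explicit NodeArena(std::size_t max_num_nodes) : limit(max_num_nodes) {
        node_storage.reserve(max_num_nodes);  // load-bearing!
    }
    // Copying would produce nodes that refer back into the original arena.
    NodeArena(const NodeArena&) = delete;
    NodeArena& operator=(const NodeArena&) = delete;

    template <typename T>
    Expr& create(T node) {
        // Never let emplace_back reallocate: that would invalidate
        // every reference handed out so far.
        if (node_storage.size() == limit) {
            throw std::length_error("NodeArena is full");
        }
        return node_storage.emplace_back(std::move(node));
    }
    std::size_t size() const { return node_storage.size(); }
};
```

The check costs one comparison per node, which seems like a fair price for turning silent memory corruption into a loud failure.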

Using vector and reserve is of course not the only way to do this: using an std::array is also valid if the maximum number of nodes you are going to use is known at compile time.

Ah yes, max_num_nodes. How do you compute what that is going to be? There’s no single good answer to this question, but you can find decent heuristics for it. For example, let’s say you are parsing C: the smallest statement I can think of would probably look something like a;, or even more extremely, just a. We can deduce that, if we want to be extremely safe, we could allocate storage for a number of nodes equal to the number of characters in the source code we’re parsing. Considering that most programs will not be anywhere close to this level of pathological behavior, it’s reasonable to expect that most of that memory will be wasted. Unfortunately, we can’t easily reclaim that wasted memory with a simple call to shrink_to_fit, as that can cause a reallocation.

The technique you can use in that case, or in the case where you absolutely cannot avoid allocating additional memory, is to actually do a deep clone of the AST, visiting each node and painstakingly creating a new counterpart for it in the new container.
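A sketch of such a clone might look like the following. The function names (count_nodes, clone_into, compact) are mine, and the code assumes the structure really is a tree: a subtree reachable from two parents would simply be copied twice.

```cpp
#include <cassert>
#include <cstddef>
#include <variant>
#include <vector>

struct IntConstant;
struct BinExpr;
using Expr = std::variant<IntConstant, BinExpr>;

struct IntConstant {
    int value;
};
struct BinExpr {
    Expr& lhs;
    Expr& rhs;
};

// Counts the nodes in the subtree rooted at `node`.
std::size_t count_nodes(const Expr& node) {
    if (const auto* b = std::get_if<BinExpr>(&node)) {
        return 1 + count_nodes(b->lhs) + count_nodes(b->rhs);
    }
    return 1;
}

// Copies the subtree rooted at `node` into `out`, children first, and
// returns the copy. `out` must already have enough capacity reserved.
Expr& clone_into(const Expr& node, std::vector<Expr>& out) {
    if (const auto* b = std::get_if<BinExpr>(&node)) {
        Expr& lhs = clone_into(b->lhs, out);
        Expr& rhs = clone_into(b->rhs, out);
        assert(out.size() < out.capacity());
        return out.emplace_back(BinExpr{lhs, rhs});
    }
    assert(out.size() < out.capacity());
    return out.emplace_back(std::get<IntConstant>(node));
}

// Returns tightly sized storage; the root is the last element.
// Moving the returned vector is fine (the buffer is handed over as is),
// but copying it would leave the copy referring to the original.
std::vector<Expr> compact(const Expr& root) {
    std::vector<Expr> out;
    out.reserve(count_nodes(root));
    clone_into(root, out);
    return out;
}
```

Once the compacted copy exists, the oversized original can be dropped in one go, which is usually the whole point of the exercise.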

One thing to keep in mind, when storing your AST nodes like this, is that the size of each node will now be equal to the size of the largest representable node. I don’t think that this matters that much, since you should try and keep all of your nodes as small as possible anyway, but it’s still worth thinking about.

Of course, it might be the case that you don’t actually need to extract the last drop of performance and memory efficiency out of your AST, and you may be willing to trade some of those in exchange for some ease of use. I can think of three ways of achieving this:

Use std::list.
Use std::deque.
Use indices instead of raw pointers.

Let’s go through each of these options one at a time.

Use std::list instead of std::vector

Don’t. ‘Nuff said.

Alright, fine. I’ll elaborate.

Linked lists were fine in the time when the “random access” part of RAM was not a lie yet and memory access patterns didn’t matter. Using a linked list for storing your nodes is just undoing all of the effort we’ve gone through to optimize our layout.

Use std::deque instead of std::vector

This method is already better! Since we’ll mostly just append nodes to the end of our node storage container, and since a double-ended queue guarantees that doing so is possible without invalidating the addresses of any existing contents, this looks like a very good compromise.
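Here is a small sketch of the deque variant, with no reserve call and no upper bound on the number of nodes. build_sum and eval are hypothetical helpers, and BinExpr is once more assumed to mean addition.

```cpp
#include <cassert>
#include <deque>
#include <variant>

struct IntConstant;
struct BinExpr;
using Expr = std::variant<IntConstant, BinExpr>;

struct IntConstant {
    int value;
};
struct BinExpr {
    Expr& lhs;
    Expr& rhs;
};

// Builds ((1 + 2) + 3) + ... + n, appending nodes as it goes.
// Appending to a deque never invalidates references to existing
// elements, so the references stored in BinExpr stay valid.
Expr& build_sum(std::deque<Expr>& node_storage, int n) {
    assert(n >= 1);
    Expr* acc = &node_storage.emplace_back(IntConstant{1});
    for (int i = 2; i <= n; ++i) {
        Expr& next = node_storage.emplace_back(IntConstant{i});
        acc = &node_storage.emplace_back(BinExpr{*acc, next});
    }
    return *acc;
}

// Assumption: every BinExpr is an addition.
int eval(const Expr& e) {
    if (const auto* c = std::get_if<IntConstant>(&e)) {
        return c->value;
    }
    const auto& b = std::get<BinExpr>(e);
    return eval(b.lhs) + eval(b.rhs);
}
```

Note that the guarantee only covers references and pointers to elements; deque iterators are still invalidated by insertions at either end.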

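Unfortunately the memory layout won’t be completely contiguous anymore, but you may not care about that. If you are using Microsoft’s STL, though, you have even bigger issues ahead of you: its std::deque allocates storage in very small blocks, so any node larger than a few bytes ends up in an allocation of its own and you are back to a scattered layout.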

Use indices instead of raw pointers

The idea is that instead of storing a pointer to a child node, you store the index of that node inside of the vector. This adds a layer of indirection back into the picture, and you now also have to figure out which vector a given index refers to. Do you store a reference to the vector inside each node? That’s a bit of a waste. Do you store it globally? That’s a bit icky, if you ask me.
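One possible middle ground, sketched below, is to keep the vector inside a small class and only ever resolve indices through it. The names NodeId and Ast are assumptions of mine, as is the choice of a 32-bit index and of addition as the meaning of BinExpr.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical index type: half the size of a pointer on 64-bit targets.
using NodeId = std::uint32_t;

struct IntConstant {
    int value;
};
struct BinExpr {
    NodeId lhs;
    NodeId rhs;
};
using Expr = std::variant<IntConstant, BinExpr>;

// Owns the nodes and is the only place where indices get resolved.
class Ast {
    std::vector<Expr> nodes;
public:
    NodeId add(Expr node) {
        nodes.push_back(node);  // reallocation is harmless: indices stay valid
        return static_cast<NodeId>(nodes.size() - 1);
    }
    const Expr& get(NodeId id) const {
        assert(id < nodes.size());
        return nodes[id];
    }
    // Assumption: every BinExpr is an addition.
    int eval(NodeId id) const {
        const Expr& e = get(id);
        if (const auto* c = std::get_if<IntConstant>(&e)) {
            return c->value;
        }
        const auto& b = std::get<BinExpr>(e);
        return eval(b.lhs) + eval(b.rhs);
    }
};
```

With this layout there is no reserve call to forget, and the whole Ast can be copied or shrunk freely because nothing inside it holds an address. The price is that every child access goes through the owning object.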

Parting thoughts

I’ve already written a lot and I’ve barely scratched the surface of the kind of decisions a designer will have to make when writing a compiler. I’ve talked about how you could store your AST in memory, but I’ve said nothing about what you want to store in your AST.

The overarching theme in this exhilarating overview is that there’s a lot about compilers that goes beyond parsing, and all of the abstract ideas needed to build a compiler need concretizing at some point, and the details on how you go about doing that matter. I also feel obligated to mention two maxims one should keep in mind when playing this sort of game: premature optimization is the root of all evil, and always profile your code—it’s likely that your codebase contains lower-hanging fruit you can pick before deciding to fine-tune your AST storage.

It’s interesting that most of the techniques I’ve shown in this article are not easily accessible with managed languages. Does this mean that all of this doesn’t really matter, or do compilers written in those languages (I’m thinking of, e.g., Roslyn) leave performance on the table? If so, what’s the significance of that performance?

Finally, I wanted this post to start a discussion about the internals of compilers and compiler-like tools: what do these often highly complex pieces of software hide beneath their surface? It’s easy to find material about the general ideas regarding compilation—tokenization, parsing, register allocation—but less so about the clever ideas people come up with when writing programs that need to deal with huge codebases in a fast and memory-efficient manner. If anyone has war stories to share, I want to hear them!

Article Link: The life and times of an Abstract Syntax Tree | Trail of Bits Blog