The future of Clang-based tooling

MalBot · July 28, 2023, 11:55am

By Peter Goodman

Clang is a marvelous compiler; it’s a compiler’s compiler! But it isn’t a toolsmith’s compiler. As a toolsmith, my ideal compiler would be an open book, allowing me to get to everywhere from anywhere. The data on which my ideal compiler would operate (files, macros, tokens), their eventual interpretation (declarations, statements, types), and their relations (data flow, control flow) would all be connected.

On its own, Clang does not do these things. libClang looks like an off-the-shelf, ready-to-use solution to your C, C++, and Objective-C parsing problems, but it’s not. In this post, I’ll investigate the factors that drive Clang’s popularity, why its tooling capabilities are surprisingly lacking despite those factors, and the new solutions that make Clang’s future bright.

What lies behind Clang’s success?

Clang is the name of the “compiler front end” that generates an intermediate representation (IR) from your C, C++, and Objective-C source code. That generated IR is subsequently taken as input by the LLVM compiler back end, which converts the IR into machine code. Readers of this blog will know LLVM by the trail of our lifting tools.

I adopted Clang as my primary compiler over a decade ago because of its actionable (and pretty!) diagnostic messages. However, Clang has only recently become one of the most popular production-quality compilers. I believe this because it has, over time, accumulated the following factors that drive compiler popularity:

Fast compile times: Developers don’t want to wait ages for their code to compile.
Generated machine code runs quickly: Everyone wants their code to run faster, and for some users, a small-percentage performance improvement can translate to millions of dollars in cost savings (so cloud spend can go further!).
End-to-end correctness: Developers need to trust that the compiler will almost always (because bugs do happen) translate their source code into semantically equivalent machine code.
Quality of diagnostic messages: Developers want actionable messages that point to errors in their code, and ideally recommend solutions.
Generates debuggable machine code: The machine code must work with yesterday’s debugger formats.
Backing and momentum: People with lots of time (those in academia) or money (those in the industry) need to push forward the compiler’s development so that it is always improving on the above metrics.

However, one important factor is missing from this list: tooling. Despite many improvements over the past few years, Clang’s tooling story still has a long way to go. The goal of this blog post is to present a reality check about the current state of Clang-based tooling, so let’s dive in!

The Clang AST is a lie

Clang’s abstract syntax tree (AST) is the primary abstraction upon which all tooling is based. ASTs capture essential information from source code and act as scaffolding for semantic analysis (e.g., type checking) and code generation.

But what about when things aren’t in the source code? In C++, for example, one generally does not explicitly invoke class destructor methods. Instead, those methods are implicitly invoked at the end of an object’s lifetime. C++ is full of these implicit behaviors, and almost none of them are actually explicitly represented in the Clang AST. This is a big blind spot for tools operating on the Clang AST.

The Clang CFG is a (pretty good) lie

I complained above that it was a shame that the wealth of information available to compilers is basically left on the table in favor of ad-hoc solutions. To be fair, this is simplistic; Clang is not ideally engineered for interactivity within an IDE, for example. But also, there are some really fantastic Clang-based tools out there that are actively used and developed, such as the Clang Static Analyzer.

Because the Clang Static Analyzer is “built on Clang,” one might assume that its analyses are performed on a representation that is faithful to both the Clang AST and the generated LLVM IR. Yet just above, I revealed to you that the Clang AST is a lie—it’s missing quite a bit, such as implicit C++ destructor calls. The Clang Static Analyzer apparently side-steps this issue by operating on a data structure called the CFG.

The Clang CFG, short for control-flow graph, represents how a theoretical computer would execute the statements encoded in the AST. The accuracy of analysis results hinges on the accuracy of the CFG. Yet the CFG isn’t actually used during Clang’s codegen process, which produces LLVM IR containing—you guessed it—control-flow information. The Clang CFG is actually just a very good approximation of the implementation that actually matters. As a toolsmith, I care about accuracy; I don’t want to have to guess about where the abstraction leaks.

LLVM IR as the one true IR is a lie

Clang’s intermediate representation, LLVM IR, is produced directly from the Clang AST. LLVM IR is superficially machine code independent. The closer you look, the easier it is to spot the machine-dependent parts, such as intrinsics, target triples, and data layouts. However, these parts are not expected to be retargetable because they are explicitly specific to the target architecture.

What makes LLVM IR fall short of being a practically retargetable IR actually has very little to do with LLVM IR itself, and more to do with how it is produced by Clang. Clang doesn’t produce identical-looking LLVM IR when compiling the same code for different architectures. Trivial examples of this are that LLVM IR contains constant values where the source code contained expressions like sizeof(void *). But those are the known knowns; the things that developers can reasonably predict will differ. The unreasonable differences happen when Clang over-eagerly chooses type, function parameter, and function return value representations that will “fit” well with the target application binary interface (ABI). In practice, this means that your std::pair<int, int> function parameter might be represented as a single i64, two i32s, an array of two i32s, or even as a pointer to a structure… but never a structure. Hilariously, LLVM’s back end handles structure-typed parameters just fine and correctly performs target-specific ABI lowering. I bet there are bugs lurking between these two completely different systems for ABI lowering. Reminds you of the CFG situation a bit, right?

The takeaway here is that the Clang AST is missing information that is invented by the LLVM IR code generator, but LLVM IR is also missing information that is destroyed by said code generator. And if you want to bridge that gap, you need to rely on an approximation: the Clang CFG.

Encore: the lib in libClang is a lie

Libraries are meant to be embedded into larger programs; therefore, they should strive not to trigger aborts that would tear down those program processes! Especially not when performing read-only, non-state-mutating operations. I say the “lib” in libClang is a lie because the “Clang API” isn’t really intended as an external API; it’s an internal API for the rest of Clang. When Clang is using itself incorrectly, it makes sense to trigger an assertion and abort execution—it’s probably a sign of a bug. But it just so happens that a significant portion of Clang’s API is exposed in library form, so here we are today with libClang, which pretends to be a library but is not engineered as such.

Encore the second: compile_commands.json is a lie

The accepted way to run Clang-based tooling on a whole program or project is a JSON format aptly named compile_commands.json. This JSON format embeds the invocation of compilers in command form (either as a string – yuck!, or as a list of arguments), the directory in which the compiler operated, and the primary source file being compiled.

Unfortunately, this format is missing environment variables (those pesky things!). Yes, environment variables materially affect the operation and behavior of compilers. Better-known variables like CPATH, C_INCLUDE_PATH, and CPLUS_INCLUDE_PATH affect how the compiler resolves #include directives. But did you know about CCC_OVERRIDE_OPTIONS? If not, guess what: neither does compile_commands.json!

Okay, so maybe these environment variables are not that frequently used. Another environment variable, PATH, is always used. When one types clang at the command line, the PATH variable is partially responsible for figuring out to which Clang binary the variable will be executed. Depending on your system and setup, this might mean Apple Clang, Homebrew Clang, vcpkg Clang, one of the many Clangs available in Debian’s package manager, or maybe a custom-built one. This matters because the clang executable is introspective. Clang uses its own binary’s path to discover, among other things, the location of the resource directory containing header files like stdarg.h.

As a toolsmith, I want to be able to faithfully reproduce the original build, but I can’t do that with the compile_commands.json format as it exists today.

Final encore: Compilers textbooks are lying to you (sort of)

I promise this is my last rant, but this one cuts to the crux of the problem. Compilers neatly fit the pipeline architecture: Source code files are lexed into tokens, which are then structured into AST by parsers. The ASTs are then analyzed for semantic correctness by type checkers before being converted into IRs for generic optimizations. Finally, the IR is targeted and lowered into a specific machine code by the back end.

This theoretical pipeline architecture has many nice properties. Pipeline architectures potentially enable third-party tools to be introduced between any two stages, so long as the tool consumes the right input format and produces the right output format. In fact, it is this pipeline nature that makes the LLVM back end excel at optimization. LLVM optimizers are “passes” that logically consume and produce LLVM IR.

The truth is that in Clang, lexing, parsing, and semantic analysis are a fractal of colluding components that cannot easily be teased apart. The semantic analyzer drives the pre-processor, which co-routines with the lexer to identify, annotate, and then discard tokens as soon as they aren’t needed. Clang keeps just enough information around to report pretty diagnostics and to handle parsing ambiguities in languages like C++, and throws away the rest in order to be as fast and memory-efficient as possible.

What this means in practice is that, surprisingly, Clang’s preprocessor can’t actually operate correctly on a pre-lexed token stream. And there are more subtle consequences; for example, interposing on the preprocessor to capture macro expansions appears to be supported, but is barely usable in practice. This support is implemented via a callback mechanism. Unfortunately, the callbacks often lack sufficient context or are called at the wrong time. From the stream of callbacks alone, one can’t distinguish between scenarios like macro expansion of macro arguments vs. expansion that occurs before a function-like macro invocation, or macro expansions before vs. inside of a conditional directive. This matters for tools that want to present both the source and the macro expansion tree. There’s a reason why Clang-based tools like the excellent Woboq Code Browser invoke a second preprocessor inside of the callbacks; there’s just no other way to see what actually happens.

At the end of the day, the mental model of a traditional compiler pipeline neatly described by compiler textbooks is simplistic and does not represent the way Clang actually works. Preprocessing is a remarkably complex problem, and reality often demands complex solutions to such problems.

The future of Clang-based tooling is on its way

If you agree with my rant, check out PASTA, a C++ and Python wrapper around a large percentage of Clang’s API surface area. It does things big and small. Among small things, it provides a disciplined and consistent naming scheme for all API methods, automatic memory management of all underlying data structures, and proper management of compile commands. Among the big, it provides bi-directional mappings between lexed tokens from files and AST nodes, and it makes API methods conventionally safe to use even if you shouldn’t use them (because Clang doesn’t document when things assert and tear down your process).

PASTA isn’t a panacea for all of my complaints. But—lucky for you, aspiring Clang toolsmith or reader—DARPA is generously funding the future of compiler research. As part of the DARPA V-SPELLS program, Trail of Bits is developing VAST, a new MLIR-based middle-end to Clang which we introduced in our VAST-checker blog post. VAST converts Clang ASTs into a high-level, information-rich MLIR dialect that simultaneously maintains provenance with the AST and contains explicit control- and data-flow information. VAST progressively lowers this MLIR, eventually reaching all the way down to LLVM IR. Maybe those textbooks weren’t lying after all, because this sounds like a pipeline connecting Clang’s AST to LLVM IR.

That’s right: we’re not throwing the baby out with the bathwater. Despite my long rant, Clang is still a great C, C++, and Objective-C front end, and LLVM is a great optimizer and back end. The needs of the time conspired to fit these two gems together in a less-than-ideal setting, and we’re working to develop the crown jewel. Watch this spot because we will be releasing a tool combining PASTA and VAST in the near future under a permissive open-source license.

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Distribution Statement A – Approved for Public Release, Distribution Unlimited

Article Link: The future of Clang-based tooling | Trail of Bits Blog