A Winter’s Tale: Improving messages and types in GDB’s Python API

By Matheus Branco Borella, University of São Paulo

As a winter associate at Trail of Bits, my goal was to make two improvements to the GNU Project Debugger (GDB): make it run faster and improve its Python API to better support tools that rely on it, like Pwndbg. The performance goal was to run symbol parsing in parallel and make better use of all available CPU cores. I ultimately implemented three changes that enhance GDB’s Python API.

Beyond the actual code, I also learned about upstreaming patches in GDB. This process can take a while, has a bit of a learning curve, and involves a lot of back and forth with the project’s maintainers. I’ll discuss this in the post, and you can also follow along as my work is still being debated in the GDB patches mailing list.

Why make GDB faster?

GDB has three ways to load DWARF symbols from a program:

  1. Partial symbol table loader: The index loader is responsible for loading in symbol names and connecting them to their respective compilation units (CUs), leaving the parsing and building of their symbol tables to the full loader. Parsing will be done later only when full information about the symbol is required.
  2. Full symbol table loader: Finishes the work the index loader has left for later by parsing the CUs and building their symbol tables as needed. This loader fully parses the DWARF information in the file and stores it in memory.
  3. Index parser: ELF files can have a special .gdb_index section, added either with the --gdb-index linker flag or with the gdb-add-index tool provided by GDB. This section stores a prebuilt index of the internal symbol table, which allows GDB to skip the index construction pass, significantly reducing the time required to load the binary (see the example just after this list).
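For example, the index can be produced at link time (the --gdb-index flag needs a linker that supports it, such as gold or lld) or added to an existing binary after the fact. A quick sketch, with hypothetical file names:

$ gcc -g -fuse-ld=gold -Wl,--gdb-index -o demo demo.c   # index at link time
$ gdb-add-index ./demo                                  # or add it afterwards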

The original idea was to port the parallel parsing approach from drgn, Meta’s open-source debugger, to GDB. Parallel parsing had already been implemented for the index loader, leaving the full loader and the index parser as the next candidates for parallelization.

You might think of GDB’s parsing routines as naturally splitting into concurrent tasks on a per-CU basis, since they’re already invoked sequentially once per CU. However, this view has a major issue: despite the ostensible separation of the data, it is not cleanly divided into portions that are fully read-write, partially read-write with implicit synchronization, and read-only. The parsing subroutines expect all of these data structures to be read-write, at least to some degree.

While solving most of these conflicts is a simple matter of giving each thread its own read-write copy of the values, things like the registries, the caches, and particularly the obstacks are much harder to move to a concurrent model.

What’s an obstack?

General-purpose allocators, like malloc(), are relatively slow. They can be inefficient when a program needs to allocate many small objects as quickly as possible, since they store metadata alongside each allocation.

Enter stack-based allocators: each new object is allocated at the top of a stack and freed from the top, in order. The GNU Obstack, an implementation of such an allocator, is used heavily in GDB. Each reasonably long-lived container object, including objfile and gdbarch, has its own obstack, which holds the objects the container references; they are all freed at once, together with the container itself.
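As a rough mental model, here’s a toy Python sketch of that ownership pattern (purely illustrative; the real obstack is a C library managing raw memory chunks):

# Toy model of an obstack: allocations pile up on a stack, and
# everything is freed at once when the owning container goes away.
class Obstack:
    def __init__(self):
        self._objects = []  # stands in for the contiguous chunk memory

    def alloc(self, obj):
        # New objects always go on top; no per-object metadata needed.
        self._objects.append(obj)
        return obj

    def free_all(self):
        # Objects are only ever freed together, from the top down.
        self._objects.clear()

# A container like GDB's objfile owns an obstack, so everything
# allocated on it lives exactly as long as the container does.
class Objfile:
    def __init__(self):
        self.obstack = Obstack()

    def close(self):
        self.obstack.free_all()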

If you’re familiar with object lifetime tracking, be it dynamic (as with std::shared_ptr) or static (as with references in Rust), this pattern will sound familiar. Judging by how obstack allocations are used in GDB, one might assume there is a way to guarantee that objects will live exactly as long as the container that owns them.

After discussing this on IRC and the mailing list, I reached two conclusions: investigating it would take a considerable amount of time, and I was better off prioritizing the Python API so that I had a chance of completing those improvements on time. Ultimately, I spent most of my time on those more attainable goals.

__repr__() methods for GDB objects

The first change is fairly simple. It adds __repr__() implementations to a handful of types in the GDB Python API. This change makes the messages we get from inspecting objects in the Python REPL more informative about what those objects represent.

Previously, we would get something like this, which is hardly helpful (note: pi is the GDB command to run the Python REPL):

(gdb) pi
>>> gdb.lookup_type("char")
<gdb.Type object at 0x7ff8e01aef20>

Now, we can get the following, which tells us what kind of type this is, as well as its name, rather than where the object is located in memory:

(gdb) pi
>>> gdb.lookup_type("char")
<gdb.Type code=TYPE_CODE_INT name=char>

This also applies to gdb.Architecture, gdb.Block, gdb.Breakpoint, gdb.BreakpointLocation, and gdb.Symbol.
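In pure-Python terms, the new output is roughly what you’d get from a method like the sketch below (the real implementations are written against the Python C API, and TYPE_CODE_NAMES is a hypothetical lookup table included only to keep the sketch self-contained):

# Sketch of the formatting the patch produces for gdb.Type objects;
# `code` and `name` mirror the real gdb.Type attributes.
TYPE_CODE_NAMES = {gdb.TYPE_CODE_INT: "TYPE_CODE_INT"}  # hypothetical table

def type_repr(t):
    return f"<gdb.Type code={TYPE_CODE_NAMES[t.code]} name={t.name}>"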

Implementing these methods helped me understand how GDB interfaces with Python and how the Python C API generally works, which allowed me to add my own functions and types later.

Types ahoy!

The second change adds the ability to create types from the Python API, where previously, you could only query for existing types using gdb.lookup_type(). Now you can directly create any primitive type supported by GDB, which can be pretty handy if you’re working on code but don’t have the symbols for it, or if you’re writing plugins to help people work with that sort of code. Types from weird extra binaries need not apply!

GDB supports a fairly large number of types. All of them can be created directly using gdb.init_type or one of the specialized gdb.init_*_type functions, which let you specify parameters relevant to the type being created. Most of them work similarly, except for gdb.init_float_type, which has its own new gdb.FloatFormat type to go along with it. This lets you specify how the floating point type you’re trying to create is laid out in memory.

An extra consideration that comes with this change is where exactly the memory for these new types comes from. Since these functions wrap functions already available internally in GDB, which allocate from a given objfile’s obstack, the obstack is the memory source for these allocations. This has one big advantage: objects that reference these types and belong to the same objfile are guaranteed never to outlive them.

You may already have realized a significant drawback to this method: any type allocated on the obstack has a high chance of not being at the top of the stack when the Python runtime frees it. So regardless of their real lifetime requirements, types can be freed only along with the objfile that owns them. The main implication is that unreachable types will leak their memory for the lifetime of the objfile, as the sketch below illustrates.
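For instance, in this hypothetical session, all 1,000 temporary types survive in the objfile’s obstack even after the Python references to them are gone:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>> for i in range(1000):
...     t = gdb.init_type(objfile, gdb.TYPE_CODE_INT, 8, f"tmp{i}")
...
>>> # The Python wrappers can be collected, but the underlying type
>>> # objects persist in the obstack until the objfile is destroyed.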

Tracking each type’s lifetime by hand would require a deeper change to the existing type object infrastructure, which is too ambitious for a first patch.

Here are a few examples of this method in action:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>>
>>> # Time to standardize integer extensions. :^)
>>> gdb.init_type(objfile, gdb.TYPE_CODE_INT, 24, "long short int")
<gdb.Type code=TYPE_CODE_INT name=long short int>

This creates a new 24-bit integer type named “long short int”. The next example builds a custom floating-point type:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>>
>>> ff = gdb.FloatFormat()
>>> ff.totalsize = 32
>>> ff.sign_start = 0
>>> ff.exp_start = 1
>>> ff.exp_len = 8
>>> ff.man_start = 9
>>> ff.man_len = 23
>>> ff.intbit = False
>>>
>>> gdb.init_float_type(objfile, ff, "floatier")
<gdb.Type code=TYPE_CODE_FLOAT name=floatier>

This creates a new floating point type laid out like the IEEE 754 single-precision format found on standard x86 machines.

What about the symbols?

The third change adds the ability to register three kinds of symbols: types, goto labels, and statics. This makes it much easier to add new symbols, which is especially useful if you’re reverse engineering and don’t have the original symbols. Without this patch, the main way to add new symbols involves writing them in a separate file, compiling that file for the target architecture, and loading the result into GDB with the add-symbol-file command after the base program is loaded. A sketch of that older workflow follows.
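Roughly, the old workflow looks like this (file and program names are hypothetical, and older GDB versions also require an explicit load address for add-symbol-file):

$ gcc -g -c fake_syms.c -o fake_syms.o
$ gdb ./target
(gdb) add-symbol-file fake_syms.o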

GDB’s internal symbol infrastructure is mostly not meant for on-the-fly additions. Let’s look at how GDB creates, stores, and looks up symbols.

Symbols in GDB are found through pointers deep inside structures called compunit_symtab. These structures are set up through a builder that allows symbols to be added to the table as it’s being built. The builder is later responsible for registering the new structure with its owner, which in the case of this patch is an objfile. In the objfile case, these tables are stored in a list that is traversed during lookup (disregarding the symbol lookup cache) until a symbol matching the given requirements is found in one of the tables.

Currently, tables aren’t set up so that symbols can be added at will after they’ve been built. So if we don’t want to make deep changes to GDB before the first patch, we must find a way around this limitation. What I landed on was building a new single-symbol table for every added symbol and stringing it onto the end of the list. Although this is a rather inefficient approach, it’s sufficient to get the feature working; the sketch below models the idea.
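Conceptually, the lookup path and the workaround behave like this Python model (illustrative only; the real implementation lives in GDB’s C++ internals):

# Conceptual model of per-symbol tables strung onto the lookup list.
class SymbolTable:
    def __init__(self, symbols):
        self.symbols = dict(symbols)  # name -> symbol

class Objfile:
    def __init__(self):
        self.symtabs = []  # traversed in order on every lookup

    def lookup(self, name):
        # Walk each table until one contains the requested symbol.
        for table in self.symtabs:
            if name in table.symbols:
                return table.symbols[name]
        return None

    def add_symbol(self, name, symbol):
        # The workaround: build a brand-new one-entry table for each
        # added symbol and append it to the end of the list.
        self.symtabs.append(SymbolTable({name: symbol}))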

As this patch continues to be upstreamed, I aim to iron out and improve the mechanism by which this functionality is implemented.

Lastly, I’d like to show an example of a new type being created and registered as a symbol for future lookup:

(gdb) pi
>>> objfile = gdb.lookup_objfile("servo")
>>> type = gdb.init_type(objfile, gdb.TYPE_CODE_INT, 24, "long short int")
>>> objfile.add_type_symbol("long short int", type)
>>> gdb.lookup_type("long short int")
<gdb.Type code=TYPE_CODE_INT name=long short int>

Getting it all merged

Overall, this winter at Trail of Bits produced more informative messages, the ability to create types, and the ability to register new symbols in GDB’s Python API, all of which help when you don’t have symbols for the code you’re working on.

GDB is old school regarding how it handles contributions: patches are submitted, tested, and commented on over email before being upstreamed. This generally means there’s a very rigid etiquette to follow when submitting a patch.

As someone who had never dealt with email-based projects, my first attempt to submit a patch went poorly. I cobbled together a text file from the output of git diff, wrote the entire message by hand, and sent it through a client that handled non-UNIX line endings badly. The result was a mess that, understandably, none of the maintainers on the list were inclined to apply and test. Still, they were nice enough to tell me I should’ve used Git’s built-in email functionality, git send-email, directly.

After that particular episode, I put in the time to split my changes into proper branches and to rebase them so that each major change was condensed into a single commit with a coherent, descriptive message, which is much better suited for use with git send-email. Since then, things have been rolling pretty smoothly, though there has been a lot of back and forth getting all of my changes in.

All three changes have already been submitted; the one implementing __repr__() is further along the pipeline, while the other two are still awaiting review. Keep an eye out for them!
