How we applied advanced fuzzing techniques to cURL

MalBot · March 1, 2024, 3:55pm

By Shaun Mirani

Near the end of 2022, Trail of Bits was hired by the Open Source Technology Improvement Fund (OSTIF) to perform a security assessment of the cURL file transfer command-line utility and its library, libcurl. The scope of our engagement included a code review, a threat model, and the subject of this blog post: an engineering effort to analyze and improve cURL’s fuzzing code.

We’ll discuss several elements of this process, including how we identified important areas of the codebase lacking coverage, and then modified the fuzzing code to hit these missed areas. For example, by setting certain libcurl options during fuzzer initialization and introducing new seed files, we doubled the line coverage of the HTTP Strict Transport Security (HSTS) handling code and nearly quintupled it for the Alt-Svc header. We also expanded the set of fuzzed protocols to include WebSocket and enabled the fuzzing of many new libcurl options. We’ll conclude this post by explaining some more sophisticated fuzzing techniques the cURL team could adopt to increase coverage even further, bring fuzzing to the cURL command line, and reduce inefficiencies intrinsic to the current test case format.

How is cURL fuzzed?

OSS-Fuzz, a free service provided by Google for open-source projects, serves as the continuous fuzzing infrastructure for cURL. It supports C/C++, Rust, Go, Python, and Java codebases, and uses the coverage-guided libFuzzer, AFL++, and Honggfuzz fuzzing engines. OSS-Fuzz adopted cURL on July 1, 2017, and the incorporated code lives in the curl-fuzzer repository on GitHub, which was our focus for this part of the engagement.

The repository contains the code (setup scripts, test case generators, harnesses, etc.) and corpora (the sets of initial test cases) needed to fuzz cURL and libcurl. It’s designed to fuzz individual targets, which are protocols supported by libcurl, such as HTTP(S), WebSocket, and FTP. curl-fuzzer downloads the latest copy of cURL and its dependencies, compiles them, and builds binaries for these targets against them.

Each target takes a specially structured input file, processes it using the appropriate calls to libcurl, and exits. Associated with each target is a corpus directory that contains interesting seed files for the protocol to be fuzzed. These files are structured using a custom type-length-value (TLV) format that encodes not only the raw protocol data, but also specific fields and metadata for the protocol. For example, the fuzzer for the HTTP protocol includes options for the version of the protocol, custom headers, and whether libcurl should follow redirects.
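
To make the TLV layout concrete, here is a minimal Python sketch of how a seed file could be assembled. The 2-byte type and 4-byte length widths follow the format described later in this post; the big-endian byte order, the numeric type IDs, and the function names are assumptions chosen for illustration, not values taken from curl-fuzzer.

```python
import struct

# Hypothetical record type IDs, for illustration only; curl-fuzzer
# defines its own numbering.
TLV_TYPE_URL = 1
TLV_TYPE_RESPONSE = 2
TLV_TYPE_HEADER = 3


def pack_tlv(record_type: int, value: bytes) -> bytes:
    """Encode one record: 2-byte type, 4-byte length, then the value.

    Big-endian byte order is an assumption of this sketch.
    """
    return struct.pack(">HI", record_type, len(value)) + value


def build_seed(records) -> bytes:
    """Concatenate (type, value) pairs into a single seed file body."""
    return b"".join(pack_tlv(t, v) for t, v in records)


# A seed that carries a URL, one custom request header, and a canned
# server response for the harness to feed back to libcurl.
seed = build_seed([
    (TLV_TYPE_URL, b"http://example.com/"),
    (TLV_TYPE_HEADER, b"X-Custom: 1"),
    (TLV_TYPE_RESPONSE, b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"),
])
```

Each record costs six bytes of framing on top of its payload, which is what lets a single file carry both protocol data and per-run options.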

First impressions: HSTS and Alt-Svc

We’d been tasked with analyzing and improving the fuzzer’s coverage of libcurl, the library providing curl’s internals. The obvious first question that came to mind was: what does the current coverage look like? To answer this, we wanted to peek at the latest coverage data given in the reports periodically generated by OSS-Fuzz. After some poking around at the URL for the publicly accessible oss-fuzz-coverage Google Cloud Storage bucket, we were able to find the coverage reports for cURL (for future reference, you can get there through the OSS-Fuzz introspector page). Here’s a report from September 28, 2022, at the start of our engagement.

Reading the report, we quickly noticed that several source files were receiving almost no coverage, including some files that implemented security features or were responsible for handling untrusted data. For instance, hsts.c, which provides functions for parsing and handling the Strict-Transport-Security response header, had only 4.46% line coverage, 18.75% function coverage, and 2.56% region coverage after over five years on OSS-Fuzz.

The file responsible for processing the Alt-Svc response header, altsvc.c, was similarly coverage-deficient.

An investigation of the fuzzing code revealed why these numbers were so low. The first problem was that the corpora directory was missing test cases that included the Strict-Transport-Security and Alt-Svc headers, which meant there was no way for the fuzzer to quickly jump into testing these regions of the codebase for bugs; it would have to use coverage feedback to construct these test cases by itself, which is usually a slower process.

The second issue was that the fuzzer never set the CURLOPT_HSTS option, which instructs libcurl to use an HSTS cache file. As a result, HSTS was never enabled during runs of the fuzzer, and most code paths in hsts.c were never hit.

The final impediment to achieving good coverage of HSTS was an issue with its specification, which tells user agents to ignore the Strict-Transport-Security header when sent over unencrypted HTTP. However, this creates a problem in the context of fuzzing: from the perspective of our fuzzing target, which never stood up an actual TLS connection, every connection was unencrypted, and Strict-Transport-Security was always ignored. For Alt-Svc, libcurl already included a workaround to relax the HTTPS requirement for debug builds when a certain environment variable was set (although curl-fuzzer did not set this variable). So, resolving this issue was just a matter of adding a similar feature for HSTS to libcurl and ensuring that curl-fuzzer set all necessary environment variables.
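
The effect of the HTTPS requirement, and of a debug-only escape hatch, can be modeled in a few lines of Python. This is a simplified sketch of the decision described above, not libcurl's actual code: the function name and its parameters are hypothetical, and only the CURL_HSTS_HTTP variable name comes from the change we made.

```python
import os


def should_process_hsts_header(scheme: str, debug_build: bool,
                               environ=os.environ) -> bool:
    """Decide whether a Strict-Transport-Security header is honored.

    Hypothetical model: the header is honored over HTTPS, or in a debug
    build when the CURL_HSTS_HTTP environment variable is set.
    """
    if scheme.lower() == "https":
        return True
    return debug_build and "CURL_HSTS_HTTP" in environ
```

In this model, a fuzzing harness that only ever sees plain HTTP reaches the header parser only when both conditions of the second branch hold, which is why the fuzzer has to set the variable itself.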

Our changes to address these issues were as follows:

We added seed files for Strict-Transport-Security and Alt-Svc to curl-fuzzer (ee7fad2).
We enabled CURLOPT_HSTS in curl-fuzzer (0dc42e4).
We added a check to allow debug builds of libcurl to bypass the HTTPS restriction for HSTS when the CURL_HSTS_HTTP environment variable is set, and we set the CURL_HSTS_HTTP and CURL_ALTSVC_HTTP environment variables in curl-fuzzer (6efb6b1 and 937597c).

The day after our changes were merged upstream, OSS-Fuzz reported a significant bump in coverage for both files.

A little over a year of fuzzing later (on January 29, 2024), our three fixes had doubled the line coverage for hsts.c and nearly quintupled it for altsvc.c.

Sowing the seeds of bugs

Exploring curl-fuzzer further, we saw a number of other opportunities to boost coverage. One low-hanging fruit we spotted was the set of seed files found in the corpora directory. While libcurl supports numerous protocols (some of which surprised us!) and features, not all of them were represented as seed files in the corpora. This is important: as we alluded to earlier, a comprehensive set of initial test cases, touching on as much major functionality as possible, acts as a shortcut to attaining coverage and significantly cuts down on the time spent fuzzing before bugs are found.

The functionality we created new seed files for, with the hope of promoting new coverage, included (ee7fad2):

CURLOPT_LOGIN_OPTIONS: Sets protocol-specific login options for IMAP, LDAP, POP3, and SMTP
CURLOPT_XOAUTH2_BEARER: Specifies an OAuth 2.0 Bearer Access Token to use with HTTP, IMAP, LDAP, POP3, and SMTP servers
CURLOPT_USERPWD: Specifies a username and password to use for authentication
CURLOPT_USERAGENT: Specifies the value of the User-Agent header
CURLOPT_SSH_HOST_PUBLIC_KEY_SHA256: Sets the expected SHA256 hash of the remote server’s public key for an SSH connection
CURLOPT_HTTPPOST: Sets POST request data. curl-fuzzer had been using only the CURLOPT_MIMEPOST option for this, leaving the similar but deprecated CURLOPT_HTTPPOST option unexercised, so we added support for the older method as well.

Certain other CURLOPTs, as with CURLOPT_HSTS in the previous section, made more sense to set globally in the fuzzer’s initialization function. These included:

CURLOPT_COOKIEFILE: Points to a filename to read cookies from. It also enables fuzzing of the cookie engine, which parses cookies from responses and includes them in future requests.
CURLOPT_COOKIEJAR: Allows fuzzing the code responsible for saving in-memory cookies to a file
CURLOPT_CRLFILE: Specifies the certificate revocation list file to read for TLS connections

Where to go from here

As we started to understand more about curl-fuzzer’s internals, we drew up several strategic recommendations to improve the fuzzer’s efficacy that the timeline of our engagement didn’t allow us to implement ourselves. We presented these recommendations to the cURL team in our final report, and expand on a few of them below.

Dictionaries

Dictionaries are a feature of libFuzzer that can be especially useful for the text-based protocols spoken by libcurl. The dictionary for a protocol is a file enumerating the strings that are interesting in the context of the protocol, such as keywords, delimiters, and escape characters. Providing a dictionary to libFuzzer may increase its search speed and lead to the faster discovery of new bugs.

curl-fuzzer already takes advantage of this feature for the HTTP target, but currently supplies no dictionaries for the numerous other protocols supported by libcurl. We recommend that the cURL team create dictionaries for these protocols to boost the fuzzer’s speed. This may be a good use case for an LLM; ChatGPT can generate a starting point dictionary in response to the following prompt (replace <PROTOCOL> with the name of the target protocol):

A dictionary can be used to guide the fuzzer. A dictionary is passed as a file to the fuzzer. The simplest input accepted by libFuzzer is an ASCII text file where each line consists of a quoted string. Strings can contain escaped byte sequences like "\xF7\xF8". Optionally, a key-value pair can be used like hex_value="\xF7\xF8" for documentation purposes. Comments are supported by starting a line with #. Write me an example dictionary file for a <PROTOCOL> parser.
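
A dictionary can also be generated from a hand-picked token list. The Python sketch below writes entries in the quoted-string format the prompt describes, escaping quotes, backslashes, and non-printable bytes. The helper names and the FTP token selection are our own illustrative choices, not part of curl-fuzzer.

```python
def escape_token(token: bytes) -> str:
    """Escape a token for use inside a quoted dictionary entry."""
    out = []
    for byte in token:
        ch = chr(byte)
        if ch in ('"', "\\"):
            out.append("\\" + ch)          # escape quote and backslash
        elif 0x20 <= byte < 0x7F:
            out.append(ch)                 # printable ASCII stays as-is
        else:
            out.append(f"\\x{byte:02X}")   # everything else as \xNN
    return "".join(out)


def make_dictionary(tokens, comment=None) -> str:
    """Render (name, token) pairs; a name of None yields a bare string."""
    lines = []
    if comment:
        lines.append(f"# {comment}")
    for name, token in tokens:
        quoted = f'"{escape_token(token)}"'
        lines.append(f"{name}={quoted}" if name else quoted)
    return "\n".join(lines) + "\n"


# Illustrative selection of FTP commands, a reply code, and a delimiter.
ftp_dictionary = make_dictionary(
    [
        ("cmd_user", b"USER "),
        ("cmd_pasv", b"PASV"),
        ("cmd_retr", b"RETR "),
        ("reply_passive", b"227 "),
        ("crlf", b"\r\n"),
    ],
    comment="Illustrative FTP tokens",
)
```

The resulting text can be saved to a file and passed to libFuzzer with its `-dict=` option. Whatever the source of the token list, it is worth reviewing by hand before it goes into continuous fuzzing.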

argv fuzzing

During our first engagement with curl, one of us joked, “Have we tried curl AAAAAAAAAA… yet?” There turned out to be a lot of wisdom behind this quip; it spurred us to fuzz curl’s command-line interface (CLI), which yielded multiple vulnerabilities (see our blog post, cURL audit: How a joke led to significant findings).

This CLI fuzzing was performed using AFL++’s argv-fuzz-inl.h header file. The header defines macros that allow a target program to build the argv array containing command-line arguments from fuzzer-provided data on standard input. We recommend that the cURL team use this feature from AFL++ to continuously fuzz cURL’s CLI (implementation details can be found in the blog post linked above).
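
The core idea, turning a flat byte stream into an argument vector, can be sketched in Python. This is a simplified model that assumes NUL-delimited arguments with an empty argument ending the list; the real header is a set of C macros and handles details this sketch omits. The function name and the max_args limit are hypothetical.

```python
def argv_from_fuzz_data(data: bytes, program: str = "curl",
                        max_args: int = 64) -> list:
    """Split fuzzer-provided bytes into an argv-style list.

    Arguments are NUL-delimited; an empty argument (two consecutive
    NULs, or the end of the data) terminates the list.
    """
    argv = [program]
    for chunk in data.split(b"\x00"):
        if not chunk or len(argv) > max_args:
            break
        # latin-1 maps every byte to a character, so no input is rejected.
        argv.append(chunk.decode("latin-1"))
    return argv
```

With this encoding, the fuzzer's ordinary byte-level mutations translate directly into added, removed, or altered command-line flags.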

Structure-aware fuzzing

One of curl-fuzzer’s weaknesses is intrinsic to the way it currently structures its inputs: a custom type-length-value (TLV) format. A TLV scheme (or something similar) can be useful for fuzzing a project like libcurl, which supports a wealth of global and protocol-specific options and parameters that need to be encoded in test cases.

However, the brittleness of this binary format makes the fuzzer inefficient. This is because libFuzzer has no idea about the structure that inputs are supposed to adhere to. curl-fuzzer expects input data in a strict format: a 2-byte field for the record type (of which only 52 were valid at the time of our engagement), a 4-byte field for the length of the data, and finally the data itself. Because libFuzzer doesn’t take this format into account, most of the mutations it generates wind up being invalid at the TLV-unpacking stage and have to be thrown out. Google’s fuzzing guidance warns about using TLV inputs for this reason.

As a result, the coverage feedback used to guide mutations toward interesting code paths performs much worse than it would if we dealt only with raw data. In fact, libcurl may contain bugs that will never be found with the current naive TLV strategy.
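
To get a feel for how fragile the framing is, the Python sketch below implements a strict parser for the layout described above and counts how many single-byte mutations of a valid input it rejects. The big-endian byte order, the numbering of the 52 valid types as 1 through 52, and the rejection rules are assumptions of this sketch; libFuzzer's real mutators are also more varied than the three modeled here.

```python
import random
import struct

# Assumption: the 52 valid record types are numbered 1..52.
VALID_TYPES = set(range(1, 53))


def unpack_tlvs(data: bytes):
    """Parse bytes into (type, value) records, or raise ValueError."""
    records = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 6:
            raise ValueError("truncated header")
        record_type, length = struct.unpack_from(">HI", data, offset)
        offset += 6
        if record_type not in VALID_TYPES:
            raise ValueError("unknown record type")
        if length > len(data) - offset:
            raise ValueError("length exceeds remaining data")
        records.append((record_type, data[offset:offset + length]))
        offset += length
    return records


def naive_mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite, insert, or delete one byte, ignoring the TLV structure."""
    buf = bytearray(data)
    choice = rng.randrange(3)
    if choice == 0:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    elif choice == 1:
        buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
    else:
        del buf[rng.randrange(len(buf))]
    return bytes(buf)


def count_rejected(seed: bytes, trials: int = 1000, rng_seed: int = 0) -> int:
    """Count how many naive mutations of `seed` fail to parse."""
    rng = random.Random(rng_seed)
    rejected = 0
    for _ in range(trials):
        try:
            unpack_tlvs(naive_mutate(seed, rng))
        except ValueError:
            rejected += 1
    return rejected


seed = struct.pack(">HI", 1, 4) + b"http" + struct.pack(">HI", 2, 2) + b"OK"
```

Calling `count_rejected(seed)` reports how many of the mutated inputs the parser throws away. The exact fraction depends on the seed and the mutation mix, but any insertion or deletion shifts every later header, so structure-blind edits are rejected far more often than their size would suggest.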

So, how can the cURL team address this issue while keeping the flexibility of a TLV format? Enter structure-aware fuzzing.

The idea with structure-aware fuzzing is to assist libFuzzer by writing a custom mutator. At a high level, the custom mutator’s job comprises just three steps:

1. Try to unpack the input data coming from libFuzzer as a TLV.
2. If the data can’t be parsed into a valid TLV, instead of throwing it away, return a syntactically correct dummy TLV. This can be anything, as long as it can be successfully unpacked.
3. If the data does constitute a valid TLV, mutate the fields parsed out in step 1 by calling the LLVMFuzzerMutate function. Then, serialize the mutated fields and return the resultant TLV.

With this approach, no time is wasted discarding inputs because every input is valid; the mutator only ever creates correctly structured TLVs. Performing mutations at the level of the decoded data (rather than at the level of the encoding scheme) allows better coverage feedback, which leads to a faster and more effective fuzzer.
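
The three steps above can be sketched in Python. A real implementation would be written in C or C++ as libFuzzer's LLVMFuzzerCustomMutator hook and would call LLVMFuzzerMutate on each field; here, mutate_bytes is a hypothetical stand-in for that call, and the byte order, type numbering, and dummy record are assumptions of this sketch.

```python
import random
import struct

VALID_TYPES = set(range(1, 53))          # assumption: types numbered 1..52
DUMMY_TLV = struct.pack(">HI", 1, 0)     # valid record with an empty value


def unpack_tlvs(data: bytes):
    """Return a list of (type, value) records, or None if malformed."""
    records, offset = [], 0
    while offset < len(data):
        if len(data) - offset < 6:
            return None
        record_type, length = struct.unpack_from(">HI", data, offset)
        offset += 6
        if record_type not in VALID_TYPES or length > len(data) - offset:
            return None
        records.append((record_type, data[offset:offset + length]))
        offset += length
    return records


def pack_tlvs(records) -> bytes:
    """Serialize records, recomputing every length field."""
    return b"".join(struct.pack(">HI", t, len(v)) + v for t, v in records)


def mutate_bytes(value: bytes, rng: random.Random) -> bytes:
    """Hypothetical stand-in for LLVMFuzzerMutate on a single field."""
    buf = bytearray(value)
    choice = rng.randrange(3)
    if choice == 0 or not buf:
        buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
    elif choice == 1:
        del buf[rng.randrange(len(buf))]
    else:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def custom_mutator(data: bytes, max_size: int, seed: int) -> bytes:
    rng = random.Random(seed)
    records = unpack_tlvs(data)                  # step 1: try to unpack
    if not records:                              # step 2: fall back to dummy
        return DUMMY_TLV
    index = rng.randrange(len(records))          # step 3: mutate one field
    record_type, value = records[index]
    records[index] = (record_type, mutate_bytes(value, rng))
    mutated = pack_tlvs(records)
    # Keep the (already valid) input if the result would be too large.
    return mutated if len(mutated) <= max_size else data
```

Because the length fields are recomputed during serialization, a mutation that grows or shrinks a value can never desynchronize the framing, which is the property the byte-level approach lacks.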

An open issue on curl-fuzzer proposes several changes, including an implementation of structure-aware fuzzing, but there hasn’t been any movement on it since 2019. We strongly recommend that the cURL team revisit the subject, as it has the potential to significantly improve the fuzzer’s ability to find bugs.

Our 2023 follow-up

At the end of 2023, we had the chance to revisit cURL and its fuzzing code in another audit supported by OSTIF. Stay tuned for the highlights of our follow-up work in a future blog post.

Article Link: How we applied advanced fuzzing techniques to cURL | Trail of Bits Blog