Introducing DIFFER, a new tool for testing and validating transformed programs

By Michael Brown

We recently released a new differential testing tool, called DIFFER, for finding bugs and soundness violations in transformed programs. DIFFER combines elements from differential, regression, and fuzz testing to help users find bugs in programs that have been altered by software rewriting, debloating, and hardening tools. We used DIFFER to evaluate 10 software debloating tools, and it discovered debloating failures or soundness violations in 71% of the transformed programs produced by these tools.

DIFFER fills a critical need in post-transformation software validation. Program transformation tools usually leave this task entirely to users, who typically have few (if any) tools beyond regression testing via existing unit/integration tests and fuzzers. These approaches do not naturally support testing transformed programs against their original versions, which can allow subtle and novel bugs to find their way into the modified programs.

We’ll provide some background research that motivated us to create DIFFER, describe how it works in more detail, and discuss its future.

If you prefer to go straight to the code, check out DIFFER on GitHub.

Background

Software transformation has been a hot research area over the past decade and has primarily been motivated by the need to secure legacy software. In many cases, this must be done without the software’s source code (binary only) because it has been lost, is vendor-locked, or cannot be rebuilt due to an obsolete build chain. Among the more popular research topics that have emerged in this area are binary lifting, recompiling, rewriting, patching, hardening, and debloating.

While tools built to accomplish these goals have demonstrated some successes, they carry significant risks. When compilers lower source code to binaries, they discard contextual information (such as type information and variable names) once it is no longer needed. This information is necessary to modify the program safely, but it generally cannot be fully recovered from the binary. As a result, tools that modify program binaries directly may inadvertently break them and introduce new bugs and vulnerabilities.

While DIFFER is application-agnostic, we originally built this tool to help us find bugs in programs that have had unnecessary features removed with a debloating tool (e.g., Carve, Trimmer, Razor). In general, software debloaters try to minimize a program’s attack surface by removing unnecessary code that may contain latent vulnerabilities or be reused by an attacker in code-reuse exploits. Debloating tools typically perform an analysis pass over the program to map features to the code necessary to execute them. These mappings are then used to cut code that corresponds to features the user doesn’t want. However, these cuts are often imprecise because generating the mappings relies on approximate analysis steps like binary recovery. As a result, new bugs and vulnerabilities can be introduced into debloated programs during cutting, which is exactly what we have designed DIFFER to detect.

How does DIFFER work?

At a high level, DIFFER (shown in figure 1) tests an unmodified version of a program against one or more modified variants of that program. Users specify seed inputs that exercise both unmodified and modified program behaviors and features. DIFFER then runs the original program and the transformed variants with these inputs and compares the outputs. Additionally, DIFFER supports template-based mutation fuzzing of these seed inputs. When users provide mutation templates, DIFFER can cover more of the input space and is less likely to miss bugs (i.e., false negatives).
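To make the idea of template-based mutation concrete, here is a minimal sketch in Python. It is not DIFFER's actual template format or API: the `mutate` function, the `$width` and `$fmt` placeholders, and the candidate values are all hypothetical and chosen only for illustration. The sketch fills placeholders in a seed command line with values drawn from user-supplied lists.

```python
import random
from string import Template


def mutate(template: str, choices: dict[str, list[str]], count: int, seed: int = 0) -> list[str]:
    """Fill each $placeholder in `template` with a randomly chosen value.

    Hypothetical helper for illustration; not part of DIFFER.
    """
    rng = random.Random(seed)  # a fixed seed keeps runs reproducible
    results = []
    for _ in range(count):
        values = {name: rng.choice(options) for name, options in choices.items()}
        results.append(Template(template).substitute(values))
    return results


# Seed argument string with two mutable fields (assumed example target options).
seed_args = "--width $width --format $fmt input.txt"
choices = {
    "width": ["0", "1", "80", "65536", "-1"],  # boundary and invalid values
    "fmt": ["json", "csv", ""],
}
variants = mutate(seed_args, choices, count=5)
for args in variants:
    print(args)
```

Each generated argument string would then be passed to both the original program and its transformed variants, so a single seed input expands into several related test cases.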

DIFFER expects to see the same outputs for the original and variant programs when given inputs that correspond to unmodified features. Conversely, it expects to see different outputs when it executes the programs with inputs corresponding to modified features. If DIFFER detects unexpected matching, differing, or crashing outputs, it reports them to the user. These reports help the user identify errors in the modified program resulting from the transformation process or its configuration.
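The expectation logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not DIFFER's implementation: the `run` and `check` functions and the report strings are hypothetical, only return codes and console output are compared, and crash detection relies on the POSIX convention that a negative return code means the process was killed by a signal. Small Python one-liners stand in for the original and transformed binaries.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Result:
    returncode: int
    stdout: str


def run(cmd: list[str], timeout: float = 10.0) -> Result:
    """Run one program and capture the outputs we want to compare."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return Result(proc.returncode, proc.stdout)


def check(original: list[str], variant: list[str], expect_same: bool) -> str:
    """Compare a variant against the original for one input (hypothetical helper)."""
    a, b = run(original), run(variant)
    if b.returncode < 0:
        return "crash"  # on POSIX, the variant was terminated by a signal
    same = a == b  # dataclass equality compares return code and stdout
    if same and not expect_same:
        return "unexpected match"  # a feature that should be gone still works
    if not same and expect_same:
        return "unexpected difference"  # a retained feature changed behavior
    return "ok"


# Stand-ins for real binaries: the "broken" variant prints different text.
original = [sys.executable, "-c", "print('hello')"]
broken = [sys.executable, "-c", "print('hullo')"]

print(check(original, original, expect_same=True))
print(check(original, broken, expect_same=True))
print(check(original, original, expect_same=False))
```

The `expect_same` flag is what separates this from plain regression testing: the same comparison reports a failure when a removed feature still behaves like the original, and when a retained feature no longer does.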

Figure 1: Overview of DIFFER

When configuring DIFFER, the user selects one or more comparators to use when comparing outputs. While DIFFER provides many built-in comparators that check basic outputs such as return codes, console text, and output files, more advanced comparators are often needed. For this purpose, DIFFER allows users to add custom comparators for complex outputs like packet captures. Custom comparators are also useful for reducing false-positive reports by defining allowable differences in outputs (such as timestamps in console output). Our open-source release of DIFFER contains many useful comparator implementations to help users easily write their own comparators.
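As an illustration of how a comparator can allow expected differences, the sketch below masks timestamps before comparing console output. It does not use DIFFER's comparator interface: the `normalize` and `outputs_match` functions are hypothetical, and the timestamp pattern assumes an ISO-like `YYYY-MM-DD HH:MM:SS` format.

```python
import re

# Assumed timestamp format: 2024-01-31 10:00:00 or 2024-01-31T10:00:00
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def normalize(text: str) -> str:
    """Replace every timestamp with a fixed token so it cannot cause a mismatch."""
    return TIMESTAMP.sub("<TIMESTAMP>", text)


def outputs_match(original: str, variant: str) -> bool:
    """Compare console output while ignoring timestamp differences."""
    return normalize(original) == normalize(variant)


log_a = "2024-01-31 10:00:00 server started"
log_b = "2024-01-31 10:00:05 server started"
log_c = "2024-01-31 10:00:05 server failed"

print(outputs_match(log_a, log_b))
print(outputs_match(log_a, log_c))
```

Normalizing both outputs before comparison keeps the allowable differences explicit and narrow, so a real behavioral change such as a different log message is still reported.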

DIFFER has an important limitation: it does not and cannot provide formal guarantees of soundness in transformation tools or the modified programs they produce. Like other dynamic analysis testing approaches, DIFFER cannot exhaustively test the input space for complex programs in the general case.

Use case: evaluating software debloaters

In a recent research study we conducted in collaboration with our friends at GrammaTech, we used DIFFER to evaluate debloated programs created by 10 different software debloating tools. We used these tools to remove unnecessary features from 20 different programs of varying size, complexity, and purpose. Collectively, the tools created 90 debloated variant programs that we then validated with DIFFER. DIFFER discovered that 39 (~43%) of these variants still had features that debloating tools failed to remove. Even worse, DIFFER found that 25 (~28%) of the variants either crashed or produced incorrect outputs in retained features after debloating.

By discovering these failures, DIFFER has proven itself as a useful post-transformation validation tool. Although this study was focused on debloating transformations, we want to emphasize that DIFFER is general enough to test other transformation tools such as those used for software hardening (e.g., CFI, stack protections), translation (e.g., C-to-Rust transformers), and surrogacy (e.g., ML surrogate generators).

What’s next?

With DIFFER now available as open-source software, we invite the security research community to use, extend, and help maintain DIFFER via pull requests. We have several specific improvements planned as we continue to research and develop DIFFER, including the following:

  • Support running binaries in Docker containers to reduce environmental burdens.
  • Add new built-in comparators.
  • Add support for targets that require superuser privileges.
  • Support monitoring multiple processes that make up distributed systems.
  • Add runtime comparators (via instrumentation, etc.) for “deep” equivalence checks.

Acknowledgements

This material is based on work supported by the Office of Naval Research (ONR) under Contract No. N00014-21-C-1032. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the ONR.

Article Link: Introducing DIFFER, a new tool for testing and validating transformed programs | Trail of Bits Blog