Finding bugs in C code with Multi-Level IR and VAST

Intermediate representations (IRs) are what reverse engineers and vulnerability researchers use to see the forest for the trees. IRs present a program at different abstraction layers, so that an analysis can understand both low-level code aberrations and higher-level logic flaws. The drawback is that bug-finding tools are usually pigeonholed into a single IR, even though bugs do not exist uniformly across abstraction levels.

We developed a new tool called VAST that solves this problem by providing a “tower of IRs,” allowing a program analysis to start at the best-fit representation for the analysis goal, then work upwards or downwards as needed. For instance, an analyst may want to do one of three things with a stack-based buffer overflow. (1) Identify it. (2) Classify it. (3) Remediate it.

Now comes choosing the right IR. Some bug properties are only apparent at certain abstraction levels. A buffer overflow is easily identified in LLVM IR, because stack buffers in LLVM IR are highly characteristic (i.e., created via the alloca instruction). This is the “best-fit” IR for identification.

For classification, a buffer overflow can go from a common bug to a security threat if the buffer sits near sensitive data in program memory. This only becomes clear below the LLVM IR level, near or at the machine code level, where buffers are fused together with other sensitive information, forming a “stack frame.”

The last part of the story is communication and remediation. The reason why the buffer overflowed in the first place can be a side-effect of a type conversion on a buffer index that was self-evident in the program’s abstract syntax tree (AST), the highest level IR. Connecting these facts together used to be impossible, but VAST’s tower of IRs is changing this. Bugs span the semantic gap, and so should analyses.

VAST’s tower of intermediate representations

How can a single system cross this so-called semantic gap? The key to what makes VAST work is MLIR: the Multi-Level Intermediate Representation project. MLIR is an LLVM-related infrastructure project that makes the development of domain-specific languages and IRs easier. It provides a framework to efficiently describe operations and types and group them into “dialects.” Dialects are like embedded languages and can be mixed and matched. Imagine if LLVM let you add new instructions. That is the power of dialects! MLIR also provides utilities for rule-based dialect conversions, pattern matching, and other features.

VAST uses MLIR to build a “Tower of IRs,” where each tower level is an MLIR dialect that corresponds to an abstraction level in the C/C++ compilation process. Our goal is to make next-generation program analysis, but at the end of the day, VAST is just a new compiler middle-end for Clang. It consumes Clang abstract syntax trees (ASTs) and produces LLVM IR. As it develops further, we can use it as a replacement for Clang and test it live.

We will demonstrate VAST and MLIR’s capabilities by writing a simple checker for the Sequoia bug using VAST’s high-level (hl) dialect. The bug is caused by an overflowed integer value being used to determine a buffer’s size. The integer overflow happens when an unsigned integer is implicitly cast to a signed integer before a function call. The way we find the bug is going to be modeled after a CodeQL query featured in Jordy Zomer’s Variant analysis of the ‘Sequoia’ bug article.

Writing a VAST-based bug checker

If you want to try this example yourself, we’ve made the code available.

We’ll start with the code that contains the bug. In one particular variant of the Sequoia bug found in the Linux kernel, function seq_buf_path calls d_path, passing an unsigned size_t size value into a signed argument int buflen. The int buflen argument is then used to compute the size of struct prepend_buffer __name declared via the DECLARE_BUFFER macro.

#define DECLARE_BUFFER(__name, __buf, __len) \
	struct prepend_buffer __name = {.buf = __buf + __len, .len = __len}

char *d_path(const struct path *path, char *buf, int buflen)
{
	DECLARE_BUFFER(b, buf, buflen);

}

int seq_buf_path(struct seq_buf *s, const struct path *path, const char *esc)
{
	char *buf;
	size_t size = seq_buf_get_buf(s, &buf);

	if (size) {
		char *p = d_path(path, buf, size);

	}

}
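To make the conversion at the d_path call site concrete, here is a minimal, self-contained sketch of what happens when a size_t value is narrowed to int. The helper name changes_when_narrowed is hypothetical and not part of the kernel or VAST. The sketch assumes a typical two's-complement platform, where out-of-range values wrap (guaranteed since C++20, implementation-defined before that).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper: reports whether `size` changes its value when it is
// converted to int, which is the conversion that happens implicitly when a
// size_t is passed to a parameter declared as `int buflen`.
bool changes_when_narrowed(std::size_t size)
{
    // Explicit version of the implicit unsigned-to-signed conversion.
    int narrowed = static_cast<int>(size);

    // A negative result, or a value that does not round-trip, means the
    // callee sees a different length than the caller intended.
    return narrowed < 0 || static_cast<std::size_t>(narrowed) != size;
}
```

Small sizes pass through unchanged, while anything above INT_MAX reaches the callee as a different (typically negative) length, which is then used in the `__buf + __len` pointer arithmetic of DECLARE_BUFFER.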

Coming from a background of writing LLVM tools, the best place to start is a simple MLIR analysis pass. VAST comes with vast-opt, an analog to LLVM’s opt tool, which allows running passes over MLIR code in a .mlir file. So we copy the main function of vast-opt and remove everything we do not need.

auto main(int argc, char** argv) -> int
{
  // register dialects
  mlir::DialectRegistry registry;
  vast::registerAllDialects(registry);
  mlir::registerAllDialects(registry);

  register_sequoia_checker_pass();

  return mlir::failed(
      mlir::MlirOptMain(argc, argv, "VAST Sequoia Bug Checker\n", registry));
}

Next, we create a simple “Hello World” pass based on the MLIR pass infrastructure documentation.

struct sequoia_checker_pass
    : public mlir::PassWrapper<sequoia_checker_pass,
             mlir::OperationPass<mlir::ModuleOp>>
{
  auto getArgument() const -> llvm::StringRef final { return "sequoia"; }

  auto getDescription() const -> llvm::StringRef final
  {
    return "Checks for the sequoia bug in VAST hl dialect code";
  }

  void runOnOperation() override
  {
    llvm::errs() << "Hello World!" << '\n';
  }
};

void register_sequoia_checker_pass()
{
  mlir::PassRegistration<sequoia_checker_pass>();
}

The next step is getting an input .mlir file. Luckily, VAST also comes with vast-front, a C/C++ frontend for VAST and its dialects. Extract the buggy Linux code into extract.c, run vast-front, and you get extract.hl.mlir, which feeds into vast-checker. Opt-like tools typically print the code that comes out of their pipeline regardless of whether it changed. That output is not interesting here, so it can be piped to /dev/null. Now the tool pipeline is set up.

$ vast-front -vast-emit-mlir=hl -o extract.hl.mlir extract.c
$ cat extract.hl.mlir
...
hl.func external @seq_buf_path (%arg0: !hl.ptr<...> ...) -> !hl.int {
  %4 = hl.var "buf" : !hl.lvalue<...>
  %5 = hl.var "size" : !hl.lvalue<...> = {
    %14 = hl.ref %arg0 : !hl.ptr<...>
    %15 = hl.implicit_cast %14 LValueToRValue : !hl.lvalue<...> -> !hl.ptr<...>
    %16 = hl.ref %4 : !hl.lvalue<...>
    %17 = hl.addressof %16 : !hl.lvalue<...> -> !hl.ptr<...>
    %18 = hl.call @seq_buf_get_buf(%15, %17) : (!hl.ptr<...>, ...) -> ...
    hl.value.yield %18 : !hl.typedef<"size_t">
  } ...
}
...
$ vast-checker -sequoia extract.hl.mlir > /dev/null
Hello World!

The hl dialect MLIR code closely follows the structure of the Clang AST. This is by design so that VAST seamlessly blends into the compilation process of Clang. Unlike the Clang AST, however, MLIR has a static single assignment (SSA) structure, which makes iterating over use-define chains easy and simplifies data-flow analysis.

LLVM IR is in SSA form as well. Unlike LLVM IR, however, MLIR code is very generic: the semantics of every operation and type are defined by the dialect author, and everything that holds any semantic value is either an operation or a type. This is true even for MLIR modules and functions, and it ties into how pass managers run MLIR passes.

To get the pass going, we could make it operate on any instance of vast::hl::FuncOp, which roughly corresponds to a C function. Trying to be more efficient, we first tried restricting the pass to instances of vast::hl::CallOp, which correspond to function calls. This would also mimic how the CodeQL query works.

But before the pass will run, we have to recognize that the MLIR pass manager in vast-checker only runs passes on operations at the top level of nesting, but the Sequoia MLIR code only contains hl.typedef, hl.struct, and hl.func operations on that level. Because of this emphasis on operation nesting, the MLIR pass manager only allows a pass to run on operations whose position in the nesting structure is known beforehand. Calls are not such operations as they can be arbitrarily nested in loops and if conditions.

So, in the end, the pass is run on FuncOp, and in runOnOperation we walk the operations nested in the FuncOp. The callback provided to the walk function gets triggered on encountering a CallOp.

struct sequoia_checker_pass
    : public mlir::PassWrapper<sequoia_checker_pass,
             mlir::OperationPass<vast::hl::FuncOp>> {
  ...
  void runOnOperation() override
  {
    using vast::vast_module;
    using vast::hl::CallOp;

    auto fop = getOperation();

    auto check_for_sequoia = [&](CallOp call) { ... };

    fop.walk(check_for_sequoia);
  }

};
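The idea of anchoring the pass at a known nesting level and then walking down to arbitrarily nested calls can be illustrated without MLIR. The following is a toy sketch only: ToyNode, toy_walk, and count_calls are hypothetical names, the node kinds are plain strings, and the traversal order is a simple pre-order that does not claim to match what MLIR's walk does.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for a nested operation; this is not an MLIR type.
struct ToyNode {
    std::string kind;              // e.g. "func", "if", "call"
    std::vector<ToyNode> children; // operations nested inside this one
};

// Visits `node` and everything nested below it, invoking `on_call` for
// every node of kind "call", no matter how deeply it is nested.
void toy_walk(const ToyNode& node,
              const std::function<void(const ToyNode&)>& on_call)
{
    if (node.kind == "call") {
        on_call(node);
    }
    for (const ToyNode& child : node.children) {
        toy_walk(child, on_call);
    }
}

// Counts the calls reachable from `root` by using the walk with a callback.
int count_calls(const ToyNode& root)
{
    int calls = 0;
    toy_walk(root, [&](const ToyNode&) { ++calls; });
    return calls;
}
```

Starting the walk at a function-level node finds calls both directly in the body and inside nested conditions, which mirrors why the pass is registered on the function and the callback does the filtering.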

Looking at the CodeQL query, the first order of business after locating a call is to check whether any of its arguments are the result of an unsigned-to-signed cast. These casts could overflow and cause trouble in the callee function.

auto is_unsigned_to_signed_cast(mlir::Operation* opr) -> bool
{
  using vast::vast_module;
  using vast::hl::CastKind;
  using vast::hl::CStyleCastOp;
  using vast::hl::ImplicitCastOp;
  using vast::hl::TypedefType;
  using vast::hl::strip_elaborated;
  using vast::hl::getBottomTypedefType;
  using vast::hl::isSigned;
  using vast::hl::isUnsigned;

  auto check_cast = [&](auto cast) -> bool
  {
    if (cast.getKind() == CastKind::IntegralCast) {
      auto from_ty = strip_elaborated(cast.getValue().getType());
      if (auto typedef_ty = from_ty.template dyn_cast<TypedefType>()) {
        auto mod = mlir::cast<vast_module>(getOperation()->getParentOp());
        from_ty  = getBottomTypedefType(typedef_ty, mod);
      }
      return isUnsigned(from_ty) && isSigned(cast.getType());
    }
    return false;
  };

  return llvm::TypeSwitch<mlir::Operation*, bool>(opr)
      .Case<ImplicitCastOp, CStyleCastOp>(check_cast)
      .Default(/*defaultResult=*/false);
}

The VAST API does most of the work for us here. We isolate cast operations based on their class, then isolate integer casts based on the CastKind attribute. Finally, we test the source and destination types for signedness. Even typedef usage is covered by the API.
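The same signedness test can be written against plain C++ types, which may help when reasoning about which conversions the checker should flag. This is only an analogy under stated assumptions: is_unsigned_to_signed is a hypothetical name, it works on compile-time types rather than on MLIR operations, and the compiler resolves typedefs such as size_t on its own, which is roughly the role getBottomTypedefType plays above.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical compile-time analog of the check: true when a conversion
// from `From` to `To` goes from an unsigned to a signed integer type.
template <typename From, typename To>
constexpr bool is_unsigned_to_signed =
    std::is_integral<From>::value && std::is_integral<To>::value &&
    std::is_unsigned<From>::value && std::is_signed<To>::value;

// The Sequoia pattern: size_t (a typedef of an unsigned type) into int.
static_assert(is_unsigned_to_signed<std::size_t, int>,
              "size_t to int is an unsigned-to-signed conversion");
```

Conversions that stay signed, stay unsigned, or involve non-integer types are not flagged by this predicate.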

After a call is found to be using a potentially overflowing cast, it’s time to check the callee function body for pointer arithmetic. First, we write a small helper function to get the callee function from CallOp. After that, has_ptr_arith_use does the dataflow part of the CodeQL query: it checks whether the function parameter is involved in pointer arithmetic, which would indicate a potential vulnerability. To do this check, I follow the aforementioned use-define chains recursively, looking for any arithmetic over pointer-typed operands.

static auto is_arith_op(mlir::Operation* opr) -> bool
{
  using vast::hl::AddIOp;
  using vast::hl::SubIOp;

  return llvm::TypeSwitch<mlir::Operation*, bool>(opr)
      .Case<AddIOp, SubIOp>([](auto) { return true; })
      .Default(/*defaultResult=*/false);
}

static auto has_ptr_operand(mlir::Operation* opr) -> bool
{
  using vast::hl::PointerType;

  auto is_ptr_type = [](mlir::Value val) -> bool
  { return val.getType().isa<PointerType>(); };

  return llvm::any_of(opr->getOperands(), is_ptr_type);
}

static auto has_ptr_arith_use(mlir::Operation* opr) -> bool
{
  if (opr == nullptr) {
    return false;
  }

  // An addition or subtraction with a pointer operand is pointer arithmetic.
  if (is_arith_op(opr) && has_ptr_operand(opr)) {
    return true;
  }

  // Otherwise, keep following the value through its users.
  return llvm::any_of(opr->getUsers(), has_ptr_arith_use);
}
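The recursion in has_ptr_arith_use is easier to see on a toy model. The sketch below is not MLIR code: ToyOp and toy_has_ptr_arith_use are hypothetical names, the two boolean flags stand in for the is_arith_op and has_ptr_operand checks, and the user graph is assumed to be acyclic.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an SSA operation; this is not an MLIR type.
struct ToyOp {
    bool is_arith = false;           // models an add or subtract operation
    bool has_ptr_operand = false;    // models an operand of pointer type
    std::vector<const ToyOp*> users; // operations that consume this result
};

// Returns true if `op`, or anything that transitively uses its result,
// is an arithmetic operation with a pointer operand.
bool toy_has_ptr_arith_use(const ToyOp* op)
{
    if (op == nullptr) {
        return false;
    }
    if (op->is_arith && op->has_ptr_operand) {
        return true;
    }
    // Follow the value forward through every user (assumes no cycles).
    for (const ToyOp* user : op->users) {
        if (toy_has_ptr_arith_use(user)) {
            return true;
        }
    }
    return false;
}
```

A parameter whose value flows through a reference and a cast into a pointer addition is reported even though the addition is several steps away from the parameter itself.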

With everything in place, I added a simple print that reports results.

  void runOnOperation() override
  {
    ...
    auto check_for_sequoia = [&](CallOp call)
    {
      for (const auto& arg : llvm::enumerate(call.getArgOperands())) {
        if (is_unsigned_to_signed_cast(arg.value().getDefiningOp())) {
          auto mod    = mlir::cast<vast_module>(getOperation()->getParentOp());
          auto callee = get_callee(call, mod);
          auto param  = callee.getArgument(arg.index());
          if (llvm::any_of(param.getUsers(), has_ptr_arith_use)) {
            llvm::errs()
                << "Call to `" << callee.getSymName() << "` in `"
                << fop.getSymName()
                << "` passes an unsigned value to a signed argument (index `"
                << arg.index()
                << "`) and then uses it in pointer arithmetic.\n";
          }
        }
      }
    };
   ...
  }

Then I ran the checker as before, this time with more interesting results.

$ vast-checker -sequoia extract.hl.mlir > /dev/null
Call to `d_path` in `seq_buf_path` passes an unsigned value to a signed argument (index `2`) and then uses it in pointer arithmetic.

And here we have the detected Sequoia bug variant we started with.

Search for bugs high, low, and in between with VAST

Bugs are more easily discovered at some abstraction layers than others, which is why we see so much potential in this ongoing research. With VAST, tool developers can select the IR that fits their program analysis and move to other abstraction layers as needed. We invite you to follow along with our example analyzing the Sequoia bug and let us know if you are using VAST for your reverse engineering project.

Article Link: Finding bugs in C code with Multi-Level IR and VAST | Trail of Bits Blog