Four years ago, we released McSema, our x86 to LLVM bitcode binary translator. Since then, it has stretched and flexed; we added x86-64 support, put it on a performance-focused diet, and improved its usability and documentation.
McSema wasn’t the only thing improving these past years, though. At the same time, programs were increasingly adopting modern x86 features like the advanced vector extensions (AVX) instructions, which operate on 256-bit wide vector registers. Adjusting to these changes was back-breaking but achievable work. Then our lifting goals expanded to include AArch64, the architecture used by modern smartphones. That’s when we realized that we needed to step back and strengthen McSema’s core. This change in focus paid off; now McSema can transpile AArch64 binaries into x86-64! Keep reading for more details.
Enter the dragon
Today we are announcing the official release of McSema 2.0! This release greatly advances McSema’s core and brings several exciting new developments to our binary lifter:
- Remill. Instruction semantics are now completely separated into Remill, their own library. McSema is a client that uses the library for binary lifting. To borrow an analogy, McSema is to Remill as Clang is to LLVM. Look out for future projects using Remill.
- Simplified semantics. The separation of McSema and Remill makes it easier to add support for new instructions. In Remill, instruction semantics can be expressed directly in C++ and are automatically compiled by Clang into LLVM bitcode.
- AArch64 (64-bit ARMv8). The switch to using Remill as a semantics backend means that McSema 2 supports multiple architectures from the start. Not only does it work on x86 and x86-64 binaries, but it also supports lifting 64-bit ARMv8 programs.
- SSE3/4 and AVX support. McSema now supports lifting programs that utilize advanced vector instruction sets.
- Better CFG recovery. A common source of lifting errors is poor control flow recovery. We improved the control flow recovery process to make it simpler, faster, and more accurate. McSema’s CFG recovery is also beginning to incorporate advanced features, like lifting global variables and stack variables.
- Binary Ninja support. McSema now has beta support for recovering program control flow via Binary Ninja.
McSema 2.0 is under active development and is rapidly improving and gaining features. We hope to make both using and hacking on McSema easier and more accessible than ever.
See it soar: Using McSema 2
The biggest change to McSema is the switch to using Remill for instruction semantics, and the subsequent support for AArch64. A good demonstration of this improvement is to show that McSema can disassemble an AArch64 binary, lift it to bitcode, and then recompile that bitcode to x86-64 machine code. Let’s get to it then!
Getting McSema
The first step is to download and install the code. For now, Linux is the primary platform supported by McSema; however, we are working toward macOS and Windows build support. If your goal is to lift Windows binaries, then no sweat! Linux builds of McSema will happily analyze Windows binaries.
The above linked instructions give more details that you should follow (e.g. getting dependencies, resolving common errors, etc.), but the essential steps to downloading and installing McSema are as follows:
mkdir ~/data
cd ~/data
git clone git@github.com:trailofbits/remill.git
cd ~/data/remill/tools
git clone git@github.com:trailofbits/mcsema.git
cd ~/data
~/data/remill/scripts/build.sh --llvm-version 3.9
cd ~/data/remill-build
~/data/remill/scripts/setup.sh
sudo make install
These commands will clone Remill and McSema, invoke a common build script that compiles both projects in the ~/data/remill-build directory, and then install the projects onto the system.
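Before moving on, it can help to confirm that the install step actually put the tools on your PATH. The following is a minimal sketch, not part of McSema itself: missing_tools is a hypothetical helper, and the tool names are the ones used later in this post (the 3.9 suffix assumes the LLVM 3.9 build shown above).

```shell
# Print each named command that is not on PATH; return non-zero if any is missing.
# missing_tools is a hypothetical helper for this post, not a McSema command.
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names as used in this post; adjust the suffix if you built against another LLVM.
if missing_tools mcsema-disass mcsema-lift-3.9 remill-clang-3.9; then
    echo "all McSema tools found"
else
    echo "the tools listed above are missing; revisit the install step"
fi
```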
Disassembling our first binary
Using McSema is usually a two- or three-step process. The first step is always to disassemble a binary into a “control flow graph” file using the mcsema-disass command-line tool. This file contains all of the program binary’s original code and data, but organized into logical groupings, like variables, functions, blocks of instructions, and references therebetween.
We’ll use Felipe Manzano’s maze, compiled as an AArch64 program binary, as our running example. It’s an interactive, command-line game that asks the user to solve a maze. Precompiled binaries for the maze can be found in McSema’s examples/Maze/bin directory.
cd ~/data/remill/tools/mcsema/examples/Maze/bin
mcsema-disass --arch aarch64 \
    --os linux \
    --binary maze.aarch64 \
    --output /tmp/maze.aarch64.cfg \
    --log_file /tmp/maze.aarch64.log \
    --entrypoint main \
    --disassembler /opt/ida-6.9/idal64
The above steps will produce a control flow graph (CFG) file from the maze program, saving the CFG file to /tmp/maze.aarch64.cfg. If you’re following along at home and don’t have a licensed version of IDA Pro, but do have a Binary Ninja license, then you can change the value passed to the --disassembler option to point to the Binary Ninja executable instead (i.e. --disassembler /opt/binaryninja/binaryninja). Finally, if you are one of those radare2 holdouts, then no sweat: we have CFG files for the maze binary already made.
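If you script this step, one option is to probe for whichever disassembler is installed and pass that path to --disassembler. Here is a rough sketch under stated assumptions: pick_disassembler is a hypothetical helper, and the two paths are just the example install locations used in this post, so yours will likely differ.

```shell
# Print the first candidate path that is an executable file; fail if none is.
# pick_disassembler is a hypothetical helper, not part of McSema.
pick_disassembler() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example locations from this post; replace them with your own install paths.
if disassembler=$(pick_disassembler /opt/ida-6.9/idal64 /opt/binaryninja/binaryninja); then
    echo "would pass: --disassembler $disassembler"
else
    echo "no disassembler found; use the prebuilt CFG files instead"
fi
```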
Lifting to bitcode
The second step is to lift the CFG file into LLVM bitcode using the mcsema-lift-3.9 command-line tool. The 3.9 in this case isn’t the McSema version; it’s the LLVM toolchain version. LLVM is a fast-evolving project, which sometimes means that interesting projects (e.g. KLEE) are left behind and only work with older LLVM versions. We’ve tried to make it as simple as possible for users to reap the benefits of using McSema — that’s why McSema works using LLVM versions 3.5 and up. In fact, with McSema 2, you can now have multiple versions of McSema installed on your system, each targeting distinct LLVM versions. Enough about that, time to lift some weights!
mcsema-lift-3.9 --arch aarch64 \
    --os linux \
    --cfg /tmp/maze.aarch64.cfg \
    --output /tmp/maze.aarch64.bc \
    --explicit_args
The above command instructs McSema to save the lifted bitcode to the file /tmp/maze.aarch64.bc. The --explicit_args command-line flag is a new feature of McSema 2 that emulates the original behavior of McSema 1. If your goal is to perform static analysis or symbolic execution of lifted bitcode, then you will want to employ this option. Similarly, if you are compiling bitcode lifted from one architecture (e.g. AArch64) into machine code of another architecture (e.g. x86-64), then you also want this option. On the other hand, if your goal is to compile the lifted bitcode back into an executable program for the same architecture (as is the case for the Cyber Fault-tolerance Attack Recovery program), then you should not use --explicit_args.
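The rule of thumb for --explicit_args can be written down as a tiny decision helper. This is only a sketch of the guidance in the paragraph above: explicit_args_flag is a hypothetical function, and the architecture arguments are plain labels compared for equality, not necessarily values that --arch accepts.

```shell
# Print "--explicit_args" when the guidance above calls for it, nothing otherwise.
#   $1: architecture label of the original binary (e.g. aarch64)
#   $2: architecture label the bitcode will be compiled for (e.g. x86-64)
#   $3: "analysis" for static analysis or symbolic execution, "recompile" otherwise
# explicit_args_flag is a hypothetical helper, not part of McSema.
explicit_args_flag() {
    if [ "$3" = "analysis" ] || [ "$1" != "$2" ]; then
        printf '%s\n' "--explicit_args"
    fi
}

explicit_args_flag aarch64 x86-64 recompile    # cross-architecture: prints the flag
explicit_args_flag aarch64 aarch64 analysis    # analysis: prints the flag
explicit_args_flag aarch64 aarch64 recompile   # same architecture: prints nothing
```

Because the helper prints nothing in the same-architecture case, its output can be spliced into a lift command with an unquoted command substitution.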
Compiling bitcode back to runnable programs
It’s finally time to make the magic happen: we’re going to take bitcode from an AArch64 program, and make it run on x86-64. We have conveniently ensured that a Clang compiler is installed alongside McSema, and in such a way that it does not clash with any other compilers that you already have installed. Here’s how to use that Clang to compile the lifted bitcode into an executable named /tmp/maze.aarch64.lifted.
remill-clang-3.9 -o /tmp/maze.aarch64.lifted /tmp/maze.aarch64.bc
Note: If for some reason remill-clang-3.9 does not work for you, then you can also use ~/data/remill-build/libraries/llvm/bin/clang.
Solving the maze
We’ve now successfully transpiled an AArch64 program binary into an x86-64 program binary. Wait, what? Yes, we really did that. Running the transpiled version shows us the correct output, prompting us with instructions on how to play the game.
$ /tmp/maze.aarch64.lifted
Maze dimensions: 11x7
Player position: 1x1
Iteration no. 0
Program the player moves with a sequence of 'w', 's', 'a' and 'd'
Try to reach the price(#)!
+-+---+---+
|X|     |#|
| | --+ | |
| |   | | |
| +-- | | |
|     |   |
+-----+---+
But what if — try as we might — we’re not able to solve the maze? That won’t be a problem, because we can always use the KLEE symbolic executor to solve the maze for us.
Your new workout routine
We’ve practiced all the moves and your new workout routine is ready. Day 1 in your routine is to disassemble a binary and make a CFG file.
mcsema-disass --arch aarch64 \
    --os linux \
    --binary ~/data/remill/tools/mcsema/examples/Maze/bin/maze.aarch64 \
    --output /tmp/maze.aarch64.cfg \
    --log_file /tmp/maze.aarch64.log \
    --entrypoint main \
    --disassembler /opt/ida-6.9/idal64
Day 2 is your lift day, where we lift the CFG file into LLVM bitcode.
mcsema-lift-3.9 --arch aarch64 \
    --os linux \
    --cfg /tmp/maze.aarch64.cfg \
    --output /tmp/maze.aarch64.bc \
    --explicit_args
Day 3 ends your week with some intense compiling, producing a new machine code executable from the lifted bitcode.
remill-clang-3.9 -o /tmp/maze.aarch64.lifted /tmp/maze.aarch64.bc
Finally, don’t forget your stretches. We want to make sure those muscles still work.
echo ssssddddwwaawwddddssssddwwww | /tmp/maze.aarch64.lifted
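If you’d like to sanity-check a move string before feeding it to the lifted binary, you can walk it through the maze by hand. The sketch below is a simplified model and not the game’s own logic: it assumes the 11x7 layout printed by the game (with row spacing reconstructed from those dimensions), a start at position 1x1, and that stepping onto anything other than an open cell or the # ends the attempt. MAZE, cell_at, and check_moves are hypothetical names.

```shell
# Assumed maze layout (11 columns by 7 rows); X marks the start, # the goal.
MAZE='+-+---+---+
|X|     |#|
| | --+ | |
| |   | | |
| +-- | | |
|     |   |
+-----+---+'

# Print the maze character at column $1, row $2 (both zero-based).
cell_at() {
    printf '%s\n' "$MAZE" | sed -n "$(( $2 + 1 ))p" | cut -c "$(( $1 + 1 ))"
}

# Walk the moves in $1 from the start cell (1,1).
# Prints "solved", "wall", "unfinished", or "invalid" (unknown move character).
check_moves() {
    moves=$1 x=1 y=1
    while [ -n "$moves" ]; do
        rest=${moves#?}            # everything after the first character
        step=${moves%"$rest"}      # the first character itself
        moves=$rest
        case $step in
            w) y=$(( y - 1 )) ;;
            s) y=$(( y + 1 )) ;;
            a) x=$(( x - 1 )) ;;
            d) x=$(( x + 1 )) ;;
            *) echo invalid; return 0 ;;
        esac
        case "$(cell_at "$x" "$y")" in
            '#') echo solved; return 0 ;;
            ' '|X) ;;              # open cell: keep walking
            *) echo wall; return 0 ;;
        esac
    done
    echo unfinished
}

check_moves ssssddddwwaawwddddssssddwwww
```

Under these assumptions the move string from this post reaches the #, while a single d from the start walks straight into a wall.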
Come with me if you want to lift
The Maze transpiling and symbolic execution demos scratch the surface of what you can do with McSema 2. The ultimate goal has always been to enable binaries to be treated like source code. With the numerous improvements in McSema 2, we are getting closer to that ideal. In the coming months we’ll talk more about other exciting features of McSema 2 (like stack and global variable recovery) and how Trail of Bits and others are using McSema.
We’d love to talk to you about McSema and how it can solve your binary analysis and transformation problems. We’re always available at the Empire Hacking Slack and via our contact page.
For now though, put your belt on — it’s time for some heavy lifting. McSema version 2 is ready for your binaries.
Article Link: https://blog.trailofbits.com/2018/01/23/heavy-lifting-with-mcsema-2-0/