Part 2: Improving crypto code in Rust using LLVM’s optnone

By Henrik Brodin

Let’s implement crypto!

Welcome to the second part of our posts on the challenges of implementing constant-time Rust code. Part 1 discussed challenges with constant-time implementations in Rust and WebAssembly and how optimization barriers can mitigate risk. The Rust crypto community has responded with several approaches, and in this post, we will explore one such approach: the implementation of a feature in the Rust compiler (rustc) that provides users greater control over generated code. We will also explore the intermediate representation (IR) generated when this feature is implemented and consider problems that arise in later phases of code generation, such as instruction selection.

Revisiting constant time

A constant-time cryptographic algorithm always executes in the same amount of time regardless of the input. Not all of the operations need to execute in the same amount of time, but timing variations must not depend on secret data. If they did, an adversary could draw conclusions about the secret data. However, keeping secret data secret when compiling constant-time cryptographic code can be difficult. The compiler is required to preserve the “observable behavior” of the program, but since it has no notion of time (except for specialized compilers), it is also free to change the constant-time properties of code.

In practice, to prevent the compiler from altering carefully implemented constant-time code, we have to lie to the compiler. We have to tell it that we know things about the code that it cannot know—that we will read or write memory in ways that it cannot see, similar to multithreaded code. The option we are about to explore will instead tell the compiler not to use all its skills to generate efficient code.

We will start with the example in part 1: a choice, or a function that chooses either parameter a or b, depending on the choice parameter. Here it is without considering constant-time execution.

#[inline(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    if choice { a } else { b }
}

By adding the #[inline(never)] attribute, it’s easier to see the effects of our change. First, let’s compile the code and generate LLVM-IR from it.

rustc  --emit llvm-ir,link -C opt-level=0  test.rs

From this, a test.ll file is generated, in which we’ll find the IR for conditional_select.

; Function Attrs: noinline uwtable
define internal i32 @_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 zeroext %choice) unnamed_addr #1 {
start:
  %0 = alloca i32, align 4
  br i1 %choice, label %bb1, label %bb2

bb2: ; preds = %start
store i32 %b, i32* %0, align 4
br label %bb3

bb1: ; preds = %start
store i32 %a, i32* %0, align 4
br label %bb3

bb3: ; preds = %bb2, %bb1
%1 = load i32, i32* %0, align 4
ret i32 %1
}

We can see there is a branch that depends on the choice variable. We need to avoid branches to keep information about the value of choice secret. Now let’s reimplement the function in a way that removes branches.

Constant-time implementation

#[inline(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    let mask = -(choice as i32) as u32;
    b ^ (mask & (a ^ b))
}

According to Rust’s documentation, “The bool represents a value, which could only be either true or false. If you cast a bool into an integer, true will be 1 and false will be 0.” Here, we use this property, converting our bool to an integer.

The variable mask will now be 0xffffff or 0, depending on whether choice is true or false.

The xor and and operations will ensure that the return value is a if choice is true and b if choice is false.

Perfect! Let’s see whether we were able to generate the code without branches:

define internal i32 @_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 zeroext %choice) unnamed_addr #1 {
start:
  %0 = icmp ule i1 %choice, true
  call void @llvm.assume(i1 %0)
  %_6 = zext i1 %choice to i32
  %_8 = icmp eq i32 %_6, -2147483648
  %1 = call i1 @llvm.expect.i1(i1 %_8, i1 false)
  br i1 %1, label %panic, label %bb1

bb1: ; preds = %start
%_5 = sub i32 0, %_6
%_12 = xor i32 %a, %b
%_10 = and i32 %_5, %_12
%2 = xor i32 %b, %_10
ret i32 %2

panic: ; preds = %start
; call core::panicking::panic
call void @_ZN4core9panicking5panic17h7dfe23182f4d1104E([0 x i8]*
nonnull align 1 bitcast ([31 x i8]* @str.4 to [0 x i8]), i64 31, %
“core::panic::location::Location”
align 8 dereferenceable(24) bitcast
(<{ i8*, [16 x i8] }>* @alloc56 to %“core::panic::location::Location”*))
#9
unreachable
}

Yikes! What happened? The branches were removed from the core operations (see bb1). That’s a success, but there is more going on here. Now there is a conditional branch to panic, depending on the value of choice. This conditional branch is added by rustc in debug builds to detect signed integer overflows. In a production build, the signed integer overflow check will not be emitted and the branch will not exist.

However, one concerning detail is that there is a call to @llvm.assume, which depends on the value of choice. According to LLVM’s documentation,

The intrinsic allows the optimizer to assume that the provided condition is always true whenever the control flow reaches the intrinsic call. No code is generated for this intrinsic, and instructions that contribute only to the provided condition are not used for code generation. If the condition is violated during execution, the behavior is undefined.

Code generation could be influenced by the value of choice. In examining the condition more closely, the condition asserts that the range of values for choice is assumed to be [0,1]. What a relief! There is no leakage of secret information since it reveals only the range of choice (information that is already known), and not its specific value.

It seems that we’ve reached our goal. Let’s ensure that things still look OK in an optimized build.

rustc  --emit llvm-ir,link -C opt-level=3  test.rs:
define internal fastcc i32 
@_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 
zeroext %choice) unnamed_addr #5 {
start:
  %0 = select i1 %choice, i32 %a, i32 %b
  ret i32 %0
}

Depending on the target architecture, the compiler may lower the select statement to different instructions. On x86, it could be lowered to a cmov-instruction, while on other architectures, it becomes a conditional branch. What’s worse, if you were to compile the non-constant-time version we started out with, you would get the exact same IR. All that work for nothing!

We can see that as long as the code is not compiled with optimizations enabled, the end result is what we would expect. On the other hand, enabling optimizations could break the constant-time properties of the code. This leads us to the question, Can we influence the compiler to not optimize the conditional_select function? Cargo and rustc accept parameters that disable optimizations globally, but doing so on the full system is not typically possible. One possible solution could be to prevent optimizations for a specific function. (This has previously been suggested as a way to improve the situation.)

Fighting Helping the compiler

Now that we have the desired IR code in debug builds, let’s explore how an LLVM attribute, optnone, can be used to disable optimizations at the function level. The LLVM documentation states the following:

This function attribute indicates that most optimization passes will skip this function, with the exception of interprocedural optimization passes. Code generation defaults to the “fast” instruction selector. This attribute cannot be used together with the alwaysinline attribute; this attribute is also incompatible with the minsize attribute and the optsize attribute.

This attribute requires the noinline attribute to be specified on the function as well, so the function is never inlined into any caller. Only functions with the alwaysinline attribute are valid candidates for inlining into the body of this function.

Our next goal is to mark the conditional_select function with the optnone attribute. According to the documentation, the function also requires the noinline attribute. As it happens, we already marked the function with that attribute with #[inline(never)].

We will implement an attribute in Rust that, when compiled, will generate the optnone and noinline attributes for the function.

Building a Rust compiler

To build and run the Rust compiler, refer to this guide. From this point on, we will assume that the command used to compile is rustc +stage1. To verify that the custom compiler is used, invoke it with the additional -vV flag. You should see output similar to the following:

rustc 1.57.0-dev
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-apple-darwin
release: 1.57.0-dev
LLVM version: 13.0.0

Note the -dev version string, indicating a custom build.

Implementing the optnone attribute

There is already work done in this area; the Rust optimize attribute is already implemented to optimize for either the speed or size of a program. We are aiming to implement a “never” option for the optimize attribute. The goal is to write the conditional_select like this. (There are discussions about naming the “never” attribute. Naming is important, but for our purposes, we don’t need to focus on it.)

#[optimize(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
    let mask = -(choice as i32) as u32;
    b ^ (mask & (a ^ b))
}

Annotating the function with the attribute in a non-optimized build would have no effect. In an optimized build, it would ensure that the optimizer does not touch the function.

To implement such an option, the first step is to extend the OptimizeAttr attribute with the Never member. We will use this value as an information carrier, from parsing to code generation.

#[derive(Clone, Encodable, Decodable, Debug, HashStable_Generic)]
pub enum OptimizeAttr {
    None,
    Never,
    Speed,
    Size,
}

When the symbol never is found in the optimize attribute, we should add the following lines to codegen_fn_attr to emit the OptimizeAttr::Never member previously added:

} else if list_contains_name(&items[..], sym::never) {
                    OptimizeAttr::Never

At this point, we can annotate a function internally in the Rust compiler with OptimizeAttr::Never. What remains is to ensure it is applied to the LLVM IR as well.

To do so, we add the following to from_fn_attrs. This code is what actually marks the LLVM function with the desired attributes when rustc discovers a function with the #[optimize(never)] attribute.

OptimizeAttr::Never => {
            llvm::Attribute::MinSize.unapply_llfn(Function, llfn);
            llvm::Attribute::OptimizeForSize.unapply_llfn(Function, llfn);
            llvm::Attribute::OptimizeNone.apply_llfn(Function, llfn);
            llvm::Attribute::NoInline.apply_llfn(Function, llfn); 
// noopt requires noinline
        }

Now, we can add the optnone and noinline attributes to the LLVM IR from an #[optimize(never)] Rust attribute. Still, there remains some bookkeeping to do.

We need to update the feature gate to include information about the never option in the optimize attribute.

// RFC 2412
    gated!(
        optimize, Normal, template!(List: "size|speed|never"), 
optimize_attribute,
        experimental!(optimize),
    ),

We can build a stage1 compiler to test our changes.

./x.py build -i library/std
rustup toolchain link stage1 build/x86_64-apple-darwin/stage1

Results

Finally, we are ready to test the new attribute. Let’s mark the conditional_select function with the #[optimize(never)] attribute and compile for opt-level=3. To enable the optimize attribute, we add #![feature(optimize_attribute)] to the test.rs file.

rustc  +stage1 --emit llvm-ir,link -C opt-level=3  test.rs:

#[optimize(never)]
fn conditional_select(a: u32, b: u32, choice: bool) -> u32 {
let mask = -(choice as i32) as u32;
b ^ (mask & (a ^ b))
}

You’ll find that the corresponding IR is now:

; test::conditional_select
; Function Attrs: noinline optnone uwtable
define internal fastcc i32 @_ZN4test18conditional_select17h01ca56cd2cc74a72E(i32 %a, i32 %b, i1 zeroext %choice) unnamed_addr #5 {
start:
  %0 = icmp ule i1 %choice, true
  call void @llvm.assume(i1 %0)
  %_6 = zext i1 %choice to i32
  %_5 = sub i32 0, %_6
  %_11 = xor i32 %a, %b
  %_9 = and i32 %_5, %_11
  %1 = xor i32 %b, %_9
  ret i32 %1
}

Success! The optnone and noinline attributes are in use and the IR instructions are as desired. Are we done now? Just create the pull request and merge? Hold your horses! Before doing so, we should of course implement tests (the interested reader can find them here).

But we will leave that aside for now. Instead, let’s turn to a different aspect of what we’ve just accomplished: the instruction-selection phase of code generation.

There is always a ‘but’

It seems we’ve made great progress or even solved the problem of generating constant-time code. This is partly true, but as is common with cryptography (and with compilers), it is not that simple. Although we’ve prevented the optimizer from rewriting the function, there is still an instruction-selection phase during code generation. During this phase, the compiler back end chooses any target instruction(s) it sees fit. This is an aspect that we addressed briefly. We implicitly assumed that an instruction in LLVM IR, such as an xor, would become an equivalent instruction in the target instruction set, such as the x86 xor instruction. While it is likely that an IR xor instruction would be implemented as xor in the target architecture, there’s no guarantee that it will. Code generation could also evolve over time, and what once became a specific instruction could change with a different version of the compiler.

To make things worse, there are optimizations in the machine code generation process. An example for x86 is the X86CmovConverterPass that will convert cmov into conditional branches in certain circumstances. This essentially translates a constant-time operation (cmov) to a non-constant-time conditional branch, which could re-enable timing-based side-channel attacks.

It doesn’t stop there. Once we reach the actual target-specific operations, there could still be data-dependent timing, such as executing a div on AMD:

The hardware integer divider unit has a typical latency of 8 cycles plus 1 cycle for every 9 bits of quotient. The divider allows limited overlap between two consecutive independent divide operations. “Typical” 64-bit divides allow a throughput of one divide per 8 cycles (where the actual throughput is data dependent).

Summary

Claims of constant-time executing code become weak when written in a high-level language such as Rust. This holds true for languages like C and C++ as well. There are too many factors that we cannot control.

Does this mean that all is lost? Is every crypto implementation not written in target-specific assembly language broken? Probably not, but these implementations have to rely on tricks and hopes of reasonable code generation.

There is almost always a trade-off, as is true in many areas—size versus speed, time to market versus quality, etc. There are large gains in implementing crypto in a memory-safe, modern language with strong analysis tooling available. However, hand-written, target-specific assembly language can make stronger claims about constant-time properties, with the drawback of potentially introducing memory safety issues.

To be able to make such claims for code written in Rust, there needs to be strong support from the compiler, from the front end, and all the way through to the target machine code generation in the back end. We probably need constant time to be a property that the compiler is aware of in order for it to preserve it. This is a major undertaking, and there are several ongoing discussions and proposals to get us there.

For now, we have to rely on what we have. A small step forward could be incorporating the never optimize option to help.

Article Link: Part 2: Improving crypto code in Rust using LLVM’s optnone | Trail of Bits Blog