While AMD has historically enjoyed relative respite from side-channel attack publications, this past disparity was largely due to Intel’s processors being a more attractive research target, with a greater depth of information available around engineering features (e.g. red unlock) and internals (e.g. microcode structure), and a greater share of the server market at the time. In the five years since Meltdown and Spectre, researchers have been busy closing the knowledge gap around AMD’s processors, making it easier to discover impactful security issues.
The Zenbleed vulnerability exploits incorrect recovery behaviour after a branch misprediction involving optimised vector instructions, resulting in information within floating point unit (FPU) registers being leaked. Vectorisation is frequently utilised in common library functions (e.g. `strlen`) for performance reasons, making this a very wide-reaching vulnerability in terms of the types of data that can be extracted.
To understand Zenbleed, we need to dig into modern processor design. Modern x86_64 processors do not simply execute one instruction after the next. Instead, they operate in a superscalar manner, essentially executing multiple instructions at once using techniques such as instruction-level parallelism (ILP) and out-of-order execution. While the processor outwardly appears to have a small number of general purpose registers (e.g. `r12`, etc.) and a bank of SIMD registers (e.g. `ymm3`, etc.), each processor core actually has a far larger number of internal registers. The named registers aren’t uniquely represented by a single physical hardware register each, but are rather dynamically allocated in a register file. This enables some very important optimisations.
For example, if you were to execute the instruction `xchg rax, rcx`, the processor almost certainly doesn’t move any values between physical hardware registers within the register file. Instead, it performs a register rename, essentially swapping the labels on the register file entries. This also happens with SIMD registers, allowing for complex behaviours and optimisations relating to the “nesting” of registers (e.g. `xmm1` being one half of `ymm1`, which in turn is one half of `zmm1`).
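To make the renaming idea concrete, here is a minimal Python sketch of a register alias table. It is a toy model under stated assumptions, not a description of AMD's hardware: the `ToyRenamer` class and its method names are invented for illustration, and a real core has many more physical entries than architectural names.

```python
class ToyRenamer:
    """Toy register alias table: architectural names map to physical entries."""

    def __init__(self, names):
        # Start with one physical entry per architectural name (an assumption
        # made for brevity; real register files are much larger).
        self.register_file = [0] * len(names)
        self.alias = {name: index for index, name in enumerate(names)}

    def write(self, name, value):
        self.register_file[self.alias[name]] = value

    def read(self, name):
        return self.register_file[self.alias[name]]

    def xchg(self, a, b):
        # Swap the labels only; no value moves inside the register file.
        self.alias[a], self.alias[b] = self.alias[b], self.alias[a]


renamer = ToyRenamer(["rax", "rcx"])
renamer.write("rax", 1)
renamer.write("rcx", 2)
file_before = list(renamer.register_file)
renamer.xchg("rax", "rcx")
```

After the exchange, reads through the names see swapped values, while the contents of the register file are unchanged.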
When we think of a classical processor design, we typically think of it having an instruction decoder, an arithmetic logic unit (ALU), a floating point unit (FPU), etc. However, a superscalar processor actually has several of these per core, and uses a complex scheduling system to execute many operations at the same time. By tracking data dependencies between instructions, the processor can identify cases where later instructions do not depend upon the results of earlier ones, allowing it to execute them at the same time.
For example, consider the following sequence of instructions:
```assembly
mov rcx, [rbp+0x8]
lea rcx, [rcx*0x4]
sub rax, 0x8
add rcx, rax
xor rax, rax
mov [rbp+0x8], rax
mov [rbp+0x10], rcx
```
Rather than executing the first instruction, stalling while waiting for the memory fetch to complete, then working on the next instructions, the processor can instead look ahead and see that `sub rax, 0x8` does not depend upon the results of the first two instructions and choose to execute it simultaneously. It may also recognise that `xor rax, rax` sets `rax` to zero, thus not depending on the value of `rax` before that time, allowing it to start working on further instructions too, as long as memory accesses are correctly ordered. Not only this, but if the processor’s register allocation scheme keeps track of which entries in the register file are zero, then it does not need to explicitly zero a register to represent `rax`, but can simply reuse an already-zeroed entry.
By carefully accounting for data dependencies and memory access ordering, the processor can parallelise operations across multiple physical ALUs and other units at the same time, re-ordering operations to try to ensure maximum utilisation of parallel units at all times. This also occurs with SIMD instructions, with special accounting for the upper and lower halves of the SIMD registers (`zmm*`) to help identify data dependencies when independent pieces of data are simultaneously processed in a vectorised manner.
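The dependency tracking described above can be sketched in a few lines of Python. This is a simplified model, not how a real scheduler works: it assumes register renaming removes everything except read-after-write dependencies, it conservatively keeps memory operations in program order, and the `issue_levels` function and its tuple format are invented for this example.

```python
def issue_levels(program):
    """Return the earliest 'wave' in which each instruction could start.

    program is a list of (text, reads, writes, touches_memory) tuples.
    Only read-after-write register dependencies are tracked, and memory
    operations are kept in program order (both are simplifying assumptions).
    """
    last_writer = {}       # register name -> index of the producing instruction
    last_memory_op = None  # index of the most recent memory operation
    levels = []
    for index, (_text, reads, writes, touches_memory) in enumerate(program):
        producers = [last_writer[r] for r in reads if r in last_writer]
        if touches_memory and last_memory_op is not None:
            producers.append(last_memory_op)
        # An instruction starts one wave after its latest producer.
        levels.append(1 + max((levels[p] for p in producers), default=-1))
        for register in writes:
            last_writer[register] = index
        if touches_memory:
            last_memory_op = index
    return levels


program = [
    ("mov rcx, [rbp+0x8]",  ["rbp"],        ["rcx"], True),
    ("lea rcx, [rcx*0x4]",  ["rcx"],        ["rcx"], False),
    ("sub rax, 0x8",        ["rax"],        ["rax"], False),
    ("add rcx, rax",        ["rcx", "rax"], ["rcx"], False),
    ("xor rax, rax",        [],             ["rax"], False),  # zeroing idiom
    ("mov [rbp+0x8], rax",  ["rbp", "rax"], [],      True),
    ("mov [rbp+0x10], rcx", ["rbp", "rcx"], [],      True),
]
levels = issue_levels(program)
```

In this model the load, `sub rax, 0x8` and `xor rax, rax` all land in wave 0, which matches the intuition in the text that they have no outstanding inputs.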
This behaviour also interacts with speculative execution, where the processor tries to guess what the result of a branch instruction will be and continues execution as if the guess was correct, then rolls back to the previous state if the guess was incorrect. For example:
```assembly
cmp rax, [rcx]
je skip
add rcx, 4
lea rax, [rcx*2+8]
mov [rcx], rax
skip:
add rcx, 8
```
When the processor hits `je skip`, the memory fetch from the first instruction is still in flight, so it doesn’t yet know whether the branch will be taken or not. Without speculative execution this results in a pipeline stall while the memory fetch completes. To avoid this stall, the processor makes a branch prediction (i.e. an informed guess based on various metadata and prior observations) and saves a checkpoint. It then continues execution as if its prediction was correct (i.e. either after the branch or at the branch target, depending on what the prediction was) and either commits or rolls back its state depending on whether its prediction later turns out to be correct.
Let’s say that the processor guesses that the branch is not taken. It executes the code immediately after the branch (i.e. `add rcx, 4`, …) and continues until it hits the write hazard at `mov [rcx], rax`. It may also look ahead and see that it would execute `add rcx, 8`, which is not dependent on the write hazard, and execute that too. ILP also applies here, so some of these operations can be done in parallel.
When the memory fetch issued by `cmp rax, [rcx]` comes back, the processor now knows whether or not its prediction was correct. If it was, it commits the speculatively executed state and carries on. If it wasn’t, it has to roll back the state to an earlier checkpoint.
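The checkpoint-and-rollback behaviour can be mimicked with a small Python model. This is a sketch under simplifying assumptions: the `SpeculativeCore` class is hypothetical, it keeps a single checkpoint of the register values, and it leaves out the speculative store entirely.

```python
class SpeculativeCore:
    """Toy model: register state plus a single saved checkpoint."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.checkpoint = None

    def predict_and_checkpoint(self):
        # Save the state as it was when the branch was reached.
        self.checkpoint = dict(self.registers)

    def resolve(self, prediction_correct):
        if not prediction_correct:
            self.registers = self.checkpoint  # roll back speculative work
        self.checkpoint = None                # speculation ends either way


core = SpeculativeCore({"rax": 7, "rcx": 0x1000})
core.predict_and_checkpoint()                            # guess: branch not taken
core.registers["rcx"] += 4                               # add rcx, 4
core.registers["rax"] = core.registers["rcx"] * 2 + 8    # lea rax, [rcx*2+8]
core.resolve(prediction_correct=False)                   # branch was taken after all
core.registers["rcx"] += 8                               # skip: add rcx, 8
```

After the misprediction is resolved, the speculative changes to `rax` and `rcx` are gone and only the instruction at `skip` has taken effect.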
The Zenbleed vulnerability arises from faulty behaviour when a branch misprediction rollback occurs immediately after a special SIMD register optimisation and register rename occur.
The optimisation in question is called the XMM Register Merge Optimization. AMD Zen 2 processors keep track of SIMD registers whose upper halves have been zeroed, using a z-bit in their Register Allocation Table (RAT). When an instruction writes non-zero data to the upper half of a register, the z-bit is cleared, indicating that there is data present and that any subsequent instructions that might be affected by that data cannot be executed until the data dependency is resolved. However, if the upper half is zeroed, instructions that do not modify that upper half can proceed without waiting, avoiding the data dependency and the resulting pipeline stall.
Tavis Ormandy’s writeup of Zenbleed demonstrates this optimisation using the AVX2 optimised `strlen` function from glibc:
```assembly
vpxor xmm0, xmm0, xmm0      ; xor xmm0 with xmm0 and store it in xmm0 (extends to ymm0)
vpcmpeqb ymm1, ymm0, [rdi]  ; compare the memory at rdi to ymm0, store result in ymm1
vpmovmskb eax, ymm1         ; set eax to a 32-bit bitmap of null bytes in the ymm1 register
tzcnt eax, eax              ; count the trailing zeroes
vzeroupper                  ; zero the upper 128 bits of ymm0-ymm15
```
The first instruction zeroes the 128-bit SIMD register `xmm0` (similar to `xor rax, rax`) and, in the process, also zeroes the 256-bit SIMD register `ymm0` which encompasses it, since `xmm0` is the lower half of `ymm0`.
The second instruction, `vpcmpeqb` (vector compare equal bytes), treats the `ymm0` register as 32 packed bytes and compares those to the 32 bytes of memory pointed to by `rdi`. Bytes that are equal produce a corresponding byte of all 1s in the `ymm1` destination register, whereas bytes that are not equal produce a corresponding byte of all 0s.
The third instruction, `vpmovmskb` (vector move byte mask), takes the most significant bit of each packed byte in the `ymm1` register and writes it to the corresponding bit in `eax`. This results in MSBs from 32 separate bytes in `ymm1` being packed into a single 32-bit general purpose register.
The fourth instruction, `tzcnt`, counts the trailing zero bits in `eax`. Since each set bit in `eax` now marks a byte in the source memory that was zero, the count is the index of the first `\0` byte, i.e. the number of characters that precede the terminator within the 32-byte chunk.
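The compare, mask and count steps can be emulated in plain Python to see how they combine. This is an illustrative sketch only: the helper functions are named after the instructions they imitate, they operate on `bytes` objects rather than real SIMD registers, and the sample chunk is made up.

```python
def vpcmpeqb(a, b):
    """Byte-wise equality: 0xFF where bytes match, 0x00 where they differ."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))


def vpmovmskb(v):
    """Collect the most significant bit of each byte into an integer mask."""
    mask = 0
    for i, byte in enumerate(v):
        mask |= (byte >> 7) << i
    return mask


def tzcnt32(x):
    """Count trailing zero bits of a 32-bit value (32 if the value is zero)."""
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1


chunk = b"hello\x00".ljust(32, b"\x00")   # 32 bytes read from the string pointer
ymm0 = bytes(32)                          # the zeroed register from vpxor
mask = vpmovmskb(vpcmpeqb(ymm0, chunk))   # one bit per byte that equals zero
length = tzcnt32(mask)                    # index of the first zero byte
```

For this chunk the five bytes of `hello` leave the low five mask bits clear, so the trailing-zero count is the string length.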
The fifth instruction, `vzeroupper`, is not functionally required – the code has already finished calculating its result – but its presence is important for performance. The instruction zeroes the upper halves of all `ymm` registers (and `zmm` registers too) – or, rather, what this actually does is set the corresponding z-bits in the RAT to indicate that the upper halves of each register are zero, without actually zeroing any underlying entries in the register file. The lower half of the `ymm` register (accessible via `xmm*`) is still allocated in the register file, but it is merged with an upper half that is unallocated and marked as zero via its z-bit.
This is why the `vzeroupper` instruction helps prevent the processor from falsely assuming data dependencies in subsequent instructions that use the `ymm` registers. The XMM Register Merge Optimization allows the processor to identify instructions which do not write to the upper portion of the register, thus letting them execute without treating the upper (zero) portion of the register as a data dependency. This uncouples the data dependency between overlapping `xmm` and `ymm` registers.
Unfortunately, it seems that AMD Zen 2 processors do not correctly handle the case when a `vzeroupper` instruction is speculatively executed and then rolled back due to branch misprediction. The scenario is as follows:
- SIMD instructions that support the XMM Register Merge Optimization are executed, using
- A register rename is triggered on the overlapping `ymm` operand, e.g. by the
- A branch is reached and the CPU speculatively executes past it.
- A `vzeroupper` instruction is speculatively executed, which sets the z-bit on the upper halves of all `ymm` registers and deallocates their respective entries in the register file.
- The branch condition is resolved and misprediction is detected.
- The processor rolls back the `vzeroupper` instruction by clearing the z-bits and re-allocating the entries.
- Execution continues from the correct branch path.
However, when the rollback occurs, the processor resets the z-bit to zero, leaving the register in an undefined state, with the upper half of the `ymm` register pointing at an uninitialised entry in the register file. This is comparable to a use-after-free bug, but in the processor’s register file instead of system memory.
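The use-after-free analogy can be illustrated with a deliberately simplified Python model. Everything here is an assumption made for the sake of the sketch: the `ToyRegisterFile` and `ToyUpperHalf` classes, the free-list policy and the "victim" allocation are invented, and the real hardware behaviour is far more involved than this.

```python
class ToyRegisterFile:
    """Physical entries with a free list; freeing does not erase the value."""

    def __init__(self, size):
        self.values = [0] * size
        self.free = list(range(size))

    def allocate(self, value):
        entry = self.free.pop(0)
        self.values[entry] = value
        return entry

    def release(self, entry):
        # The entry becomes reusable, but its old contents are left in place.
        self.free.insert(0, entry)


class ToyUpperHalf:
    """Upper half of one ymm register, as tracked by a hypothetical RAT."""

    def __init__(self, register_file, value):
        self.rf = register_file
        self.entry = register_file.allocate(value)
        self.z_bit = False

    def read(self):
        return 0 if self.z_bit else self.rf.values[self.entry]

    def speculative_vzeroupper(self):
        self.z_bit = True             # upper half now reads as zero
        self.rf.release(self.entry)   # entry handed back to the free list

    def buggy_rollback(self):
        # Models the flaw: the z-bit is cleared, but the mapping still
        # points at an entry this register no longer owns.
        self.z_bit = False


rf = ToyRegisterFile(4)
attacker_upper = ToyUpperHalf(rf, 0x1111)
attacker_upper.speculative_vzeroupper()
victim_entry = rf.allocate(0xC0FFEE)   # another thread reuses the freed entry
attacker_upper.buggy_rollback()
leaked = attacker_upper.read()
```

In this model the attacker's register ends up reading whatever the other thread stored in the recycled entry, which is the essence of the leak described above.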
Since the register file is shared between the SMT threads of a core, this can be used to snoop on data in the SIMD registers across hyperthreads. This isn’t the only attack scenario, though – the same attack can be leveraged for privilege escalation.
While it might initially seem like SIMD registers aren’t particularly interesting, they are used in optimised versions of almost all string and memory manipulation functions in standard libraries. This means they are constantly handling sensitive data like passwords, keys, configuration files, etc., making all this data vulnerable to leakage.
There is a PoC exploit for Zenbleed on GitHub which is capable of dumping data across hyperthreads. The code is also nicely commented and quite easy to follow.
AMD released Bulletin AMD-SB-7008 “Cross-Process Information Leak” to track the issue. They also released a microcode patch to address the issue on Family 17h Model 31h (EPYC 7002 series) and Family 17h Model 0Ah (Sabrina SoCs). So far there are no microcode updates for consumer products, meaning that AMD’s desktop, mobile, HEDT, and workstation (Threadripper) processors remain vulnerable. AGESA firmware updates are scheduled for release in October and December 2023, which should contain new microcode for those products. It seems that the coordinated disclosure process for Zenbleed went a little off the rails, possibly due to AMD accidentally publishing information several months ahead of the agreed embargo date, resulting in the bug being disclosed 3-4 months ahead of patch availability.
On systems where the microcode or firmware updates cannot be applied, a workaround is possible using a chicken bit in the `DE_CFG` register at MSR 0xC0011029. Setting bit 9 in this register enables a backup fix, but has additional performance impact compared to the microcode update. Linux’s name for this workaround bit is `MSR_AMD64_DE_CFG_ZEN2_FP_BACKUP_FIX_BIT`, and the kernel should automatically set it on affected platforms when no microcode update is present. The bit can be set manually on Linux using `msr-tools`, or on FreeBSD with
At the time of writing, Microsoft do not appear to have a security update that applies the `DE_CFG` chicken bit workaround. You can modify MSRs using RWEverything on Windows, although that comes with its own risks and is probably not a sensible thing to do in production.
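As a rough illustration of the bit manipulation behind the `DE_CFG` workaround, the following Python sketch computes and checks the chicken bit. The helper names are invented for this example. The `read_msr` function assumes Linux's `msr` driver is loaded and that the script runs as root; it is shown for context and reads only, so it does not apply the workaround itself.

```python
import os
import struct

DE_CFG_MSR = 0xC0011029       # MSR address given in the text
FP_BACKUP_FIX_BIT = 1 << 9    # bit 9, as described in the text


def with_backup_fix(value):
    """Return the DE_CFG value with the workaround bit set."""
    return value | FP_BACKUP_FIX_BIT


def has_backup_fix(value):
    """Report whether the workaround bit is set in a DE_CFG value."""
    return bool(value & FP_BACKUP_FIX_BIT)


def read_msr(cpu, msr):
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (assumes root and the
    Linux msr kernel module); the file offset selects the MSR number."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

On a suitable Linux system, `has_backup_fix(read_msr(0, DE_CFG_MSR))` would report whether the bit is currently set on CPU 0. The check would need repeating for every logical CPU, since MSRs are per-core.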
It is possible to query which version of microcode has been applied, to test whether an updated version has been applied, although the method is OS specific. On Windows, the microcode version information is found in the following registry key:
The `Update Revision` value describes the microcode version that has been loaded into the processor, and the `Previous Update Revision` value describes the microcode version that was loaded into the processor by the system firmware (UEFI / BIOS) at boot.
On Linux, `/proc/cpuinfo` will list the microcode version alongside other processor details:
```
processor       : 127
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD EPYC 7601 32-Core Processor
stepping        : 2
microcode       : 0x8001206
```
The same info can also usually be found in the kernel boot log.
For Zen 2 architecture EPYC processors, a microcode version of `0x0830107a` or higher indicates that a fix was applied. For Zen 2 architecture Sabrina SoCs, a microcode version of `0x08a00008` or higher indicates that a fix was applied. As noted above, all other processor families, including desktop Ryzen processors, are yet to receive a microcode update with a patch, so we don’t yet know what the fixed microcode versions will be.
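The version comparison can be scripted. The sketch below parses `/proc/cpuinfo`-style text and compares a revision against the two thresholds quoted above. The function and constant names are made up for this example. It assumes you already know which threshold applies to your processor, since comparing revisions across different processor models is not meaningful.

```python
import re

# Minimum fixed revisions quoted in the text above.
FIXED_EPYC_ZEN2 = 0x0830107A
FIXED_SABRINA = 0x08A00008


def microcode_revisions(cpuinfo_text):
    """Return every 'microcode' value found in /proc/cpuinfo-style text."""
    pattern = r"^microcode\s*:\s*(0x[0-9a-fA-F]+)"
    return [int(m, 16) for m in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def is_fixed(revision, minimum):
    """True if the loaded revision is at or above the fixed revision."""
    return revision >= minimum


sample = (
    "processor\t: 127\n"
    "vendor_id\t: AuthenticAMD\n"
    "cpu family\t: 23\n"
    "model\t\t: 1\n"
    "microcode\t: 0x8001206\n"
)
revisions = microcode_revisions(sample)
```

On a live system the same function could be fed the contents of `/proc/cpuinfo`, which yields one revision per logical CPU.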
In the interim, Linux should automatically apply software mitigations for Zenbleed. You can query the status of these mitigations through the `sysfs` interface, under the following directory:
If you’re running a server with a Zen 2 EPYC processor, you should update your firmware and install all OS patches to help ensure that Zenbleed is patched. If your system vendor has yet to release firmware updates to address this issue, it is possible that your OS will still load the new microcode blobs at boot, so make sure to check that first before trying to implement any manual workarounds. As always, refer to vendor guidance for good practice mitigation strategies.
The post Zenbleed – AMD Side-Channel Attack Targets Vectorised Functions appeared first on LRQA Nettitude Labs.