Zenbleed – AMD Side-Channel Attack Targets Vectorised Functions

MalBot · August 30, 2023, 12:15pm

While AMD has historically enjoyed relative respite from side-channel attack publications, this past disparity was largely due to Intel’s processors being a more attractive research target, with a greater depth of information available around engineering features (e.g. red unlock) and internals (e.g. microcode structure), and a greater share of the server market at the time. In the five years since Meltdown and Spectre, researchers have been busy closing the knowledge gap around AMD’s processors, making it easier to discover impactful security issues.

Zenbleed was recently discovered by Tavis Ormandy and reported to AMD on 15 May 2023. This vulnerability affects all AMD Zen 2 processors, and has been assigned CVE-2023-20593. The vulnerability is of particular concern for shared hosting providers, virtualisation platforms, and other shared-tenant systems. However, any scenario where a malicious actor can execute code potentially poses a threat, including in contexts such as privilege escalation, sandbox escape, and possibly even malicious JavaScript executing in a web browser.

The Zenbleed vulnerability exploits incorrect recovery behaviour after a branch misprediction involving optimised vector instructions, resulting in information within floating point unit (FPU) registers being leaked. Vectorisation is frequently utilised in common library functions (e.g. memcpy, memcmp, strlen) for performance reasons, making this a very wide-reaching vulnerability in terms of the types of data that can be extracted.

To understand Zenbleed, we need to dig into modern processor design. Modern x86_64 processors do not simply execute one instruction after the next. Instead, they operate in a superscalar manner, essentially executing multiple instructions at once using techniques such as instruction-level parallelism (ILP) and out-of-order execution. While the processor outwardly appears to have a small number of general purpose registers (e.g. rax, rbx, r12, etc.) and a bank of SIMD registers (e.g. xmm0, ymm3, etc.), each processor core actually has a far larger number of internal registers. The named registers aren’t uniquely represented by a single physical hardware register each, but are rather dynamically allocated in a register file. This enables some very important optimisations.

For example, if you were to execute the instruction xchg rax, rcx, the processor almost certainly doesn’t move any values between physical hardware registers within the register file. Instead, it performs a register rename, essentially swapping the labels on the register file entries. This also happens with SIMD registers, allowing for complex behaviours and optimisations relating to the “nesting” of registers (e.g. xmm1 being one half of ymm1, which in turn is one half of zmm1).
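The renaming idea can be sketched as a small Python model. This is a toy illustration, not a description of AMD's actual hardware: the `RenameTable` class and its methods are hypothetical names, and real processors usually allocate a fresh physical entry on each write rather than overwriting in place.

```python
class RenameTable:
    """Toy register alias table: architectural names -> physical register file slots."""

    def __init__(self, names, physical_slots):
        if physical_slots < len(names):
            raise ValueError("need at least one physical slot per architectural register")
        self.regfile = [0] * physical_slots                  # physical storage
        self.mapping = {name: i for i, name in enumerate(names)}

    def read(self, name):
        return self.regfile[self.mapping[name]]

    def write(self, name, value):
        self.regfile[self.mapping[name]] = value

    def xchg(self, a, b):
        # Swap the labels only; no data moves within the register file.
        self.mapping[a], self.mapping[b] = self.mapping[b], self.mapping[a]


rat = RenameTable(["rax", "rcx"], physical_slots=8)
rat.write("rax", 1)
rat.write("rcx", 2)
before = list(rat.regfile)   # snapshot of the physical storage
rat.xchg("rax", "rcx")
```

After the exchange, `rat.read("rax")` returns 2 and `rat.read("rcx")` returns 1, while `rat.regfile` is identical to the `before` snapshot.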

When we think of a classical processor design, we typically think of it having an instruction decoder, an arithmetic logic unit (ALU), a floating point unit (FPU), etc. However, a superscalar processor actually has several of these per core, and uses a complex scheduling system to execute many operations at the same time. By identifying data dependencies between instructions, the processor can identify cases where later instructions do not depend upon the results of previous instructions, allowing it to execute those instructions at the same time.

For example, consider the following sequence of instructions:

```assembly
mov rcx, [rbp+0x8]
lea rcx, [rcx*0x4]
sub rax, 0x8
add rcx, rax
xor rax, rax
mov [rbp+0x8], rax
mov [rbp+0x10], rcx
```

Rather than executing the first instruction, stalling while waiting for the memory fetch to complete, then working on the next instructions, the processor can instead look ahead and see that sub rax, 0x8 does not depend upon the results of the first two instructions and choose to execute it simultaneously. It may also recognise that xor rax, rax sets rax to zero, thus not depending on the value of rax before that time, allowing it to start working on further instructions too, as long as memory accesses are correctly ordered. Not only this, but if the processor’s register allocation scheme keeps track of which entries in the register file are zero, then it does not need to explicitly zero a register to represent rax, but can simply reuse an already-zeroed entry.
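To make the dependency analysis concrete, here is a minimal Python sketch that assigns each instruction from the sequence above to the earliest cycle in which its inputs are available. It rests on simplifying assumptions that are not from any real processor: every instruction takes one cycle, register renaming removes all but read-after-write dependencies, execution units are unlimited, and memory operations are kept strictly in program order. The `PROGRAM` table and `schedule` function are hypothetical names.

```python
# (instruction text, registers read, registers written, touches memory?)
PROGRAM = [
    ("mov rcx, [rbp+0x8]",  {"rbp"},        {"rcx"}, True),
    ("lea rcx, [rcx*0x4]",  {"rcx"},        {"rcx"}, False),
    ("sub rax, 0x8",        {"rax"},        {"rax"}, False),
    ("add rcx, rax",        {"rcx", "rax"}, {"rcx"}, False),
    ("xor rax, rax",        set(),          {"rax"}, False),  # zeroing idiom: no real input
    ("mov [rbp+0x8], rax",  {"rbp", "rax"}, set(),   True),
    ("mov [rbp+0x10], rcx", {"rbp", "rcx"}, set(),   True),
]


def schedule(program):
    """Return the earliest cycle (1-based) in which each instruction can execute."""
    ready = {}      # register -> cycle in which its most recent value is produced
    last_mem = 0    # cycle of the most recent memory operation
    cycles = []
    for _text, reads, writes, is_mem in program:
        # Wait for every source register (read-after-write dependencies only).
        start = max((ready.get(r, 0) for r in reads), default=0)
        if is_mem:
            start = max(start, last_mem)   # keep memory accesses in order
        cycle = start + 1
        for w in writes:
            ready[w] = cycle
        if is_mem:
            last_mem = cycle
        cycles.append(cycle)
    return cycles


cycles = schedule(PROGRAM)
```

Under these assumptions `cycles` is `[1, 2, 1, 3, 1, 2, 4]`: the load, `sub rax, 0x8` and `xor rax, rax` all issue in the first cycle, and the seven instructions complete in four cycles instead of seven.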

By carefully accounting for data dependencies and memory access ordering, the processor can parallelise operations across multiple physical ALUs and other units at the same time, re-ordering operations to try to ensure maximum utilisation of parallel units at all times. This also occurs with SIMD instructions, with special accounting for the upper and lower halves of the SIMD registers (xmm*, ymm*, zmm*) to help identify data dependencies when independent pieces of data are simultaneously processed in a vectorised manner.

This behaviour also interacts with speculative execution, where the processor tries to guess what the result of a branch instruction will be and continues execution as if the guess was correct, then rolls back to the previous state if the guess was incorrect. For example:

```assembly
cmp rax, [rcx]
je skip
add rcx, 4
lea rax, [rcx*2+8]
mov [rcx], rax
skip:
add rcx, 8
```

When the processor hits je skip, the memory fetch from the first instruction is still in flight, so it doesn’t yet know whether the branch will be taken or not. Without speculative execution this results in a pipeline stall while the memory fetch completes. To avoid this stall, the processor makes a branch prediction (i.e. an informed guess based on various metadata and prior observations) and saves a checkpoint. It then continues execution as if its prediction was correct (i.e. either after the branch or at the branch target, depending on what the prediction was) and either commits or rolls back its state depending on whether its prediction later turns out to be correct.

Let’s say that the processor guesses that the branch is not taken. It executes the code immediately after the branch (i.e. add rcx, 4, …) and continues until it hits the write hazard at mov [rcx], rax. It may also look ahead and see that it would execute add rcx, 8, which is not dependent on the write hazard, and execute that too. ILP also applies here, so some of these operations can be done in parallel.

When the memory fetch issued by cmp rax, [rcx] comes back, the processor now knows whether or not its prediction was correct. If it was, it commits the speculatively executed state and carries on. If it wasn’t, it has to roll back the state to an earlier checkpoint.
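The checkpoint, speculate, then commit-or-roll-back cycle can be sketched in Python for the snippet above. This is only a behavioural model under assumed simplifications: the whole speculative path runs before the branch resolves, stores are buffered until commit, and `run_snippet` and its parameters are hypothetical names.

```python
import copy


def run_snippet(regs, memory, predict_taken):
    """Toy model of speculating past 'je skip' and rolling back on a misprediction."""
    state = {"regs": dict(regs), "stores": {}}
    checkpoint = copy.deepcopy(state)      # saved when the prediction is made

    def execute(st, taken):
        r = st["regs"]
        if not taken:                      # fall-through path
            r["rcx"] += 4
            r["rax"] = r["rcx"] * 2 + 8
            st["stores"][r["rcx"]] = r["rax"]   # buffered, not yet visible in memory
        r["rcx"] += 8                      # skip:

    execute(state, predict_taken)          # speculative work

    # The slow load finally completes, so the comparison can be resolved.
    actually_taken = regs["rax"] == memory[regs["rcx"]]
    mispredicted = predict_taken != actually_taken
    if mispredicted:
        state = checkpoint                 # discard all speculative state
        execute(state, actually_taken)     # re-run down the correct path
    memory.update(state["stores"])         # commit buffered stores
    return state["regs"], mispredicted


mem = {100: 7}
final_regs, rolled_back = run_snippet({"rax": 7, "rcx": 100}, mem, predict_taken=False)
```

Here the values are equal, so the branch is really taken. The wrong guess is rolled back and the result is `rcx == 108` with memory untouched, exactly as if the prediction had been right in the first place.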

The Zenbleed vulnerability arises from faulty behaviour when a branch misprediction rollback occurs immediately after a special SIMD register optimisation and register rename occur.

The optimisation in question is called the XMM Register Merge Optimization. AMD Zen 2 processors keep track of SIMD registers whose upper halves have been zeroed, using a z-bit in their Register Allocation Table (RAT). When an instruction writes non-zero data to the upper half of a register, the z-bit is cleared, indicating that there is data present and any subsequent instructions that might be affected by that data cannot be executed until the data dependency is resolved. However, if the upper half is zeroed, instructions that also do not modify that upper half can proceed without waiting, avoiding the data dependency and resulting pipeline stall.

Tavis Ormandy’s writeup of Zenbleed demonstrates this optimisation using the AVX2 optimised strlen function from glibc:

```assembly
vpxor xmm0, xmm0, xmm0 ; xor xmm0 with xmm0 and store it in xmm0 (extends to ymm0)
vpcmpeqb ymm1, ymm0, [rdi] ; compare the memory at rdi to ymm0, store result in ymm1
vpmovmskb eax, ymm1 ; set eax to a 32-bit bitmap of null bytes in the ymm1 register
tzcnt eax, eax ; count the trailing zeroes
vzeroupper ; zero the upper 128 bits of ymm0-ymm15
```

The first instruction zeroes the 128-bit SIMD register xmm0 (similar to xor rax, rax) and, in the process, also zeroes the upper half of the 256-bit SIMD register ymm0 which encompasses it, since VEX-encoded instructions that write an xmm register clear the upper bits of the corresponding ymm register.

The second instruction, vpcmpeqb (vector compare equal bytes), treats the ymm0 register as 32 packed bytes and compares those to the 32 bytes of memory pointed to by rdi. Bytes that are equal produce a corresponding byte of all 1s in the ymm1 destination register, whereas bytes that are not equal produce a corresponding byte of all 0s.

The third instruction, vpmovmskb (vector move byte mask), takes the most significant bit of each packed byte in the ymm1 register and writes it to the corresponding bit in eax. This results in MSBs from 32 separate bytes in ymm1 being packed into a single 32-bit general purpose register.

The fourth instruction counts the trailing zero bits in eax. Since each set bit in eax now represents a byte in the source memory that was zero, the number of zero bits below the lowest set bit is the index of the first \0 byte, i.e. the length of the string within this 32-byte chunk.
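The first four instructions can be modelled byte by byte in Python. This is a sketch for illustration rather than glibc's implementation: the helper functions are named after the instructions they imitate, and `strlen_chunk` is a hypothetical wrapper that only handles a single 32-byte chunk.

```python
def vpcmpeqb(a: bytes, b: bytes) -> bytes:
    """Per-byte equality: 0xFF where the bytes match, 0x00 where they differ."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))


def vpmovmskb(v: bytes) -> int:
    """Pack the most significant bit of each byte into one integer bitmap."""
    mask = 0
    for i, byte in enumerate(v):
        mask |= (byte >> 7) << i
    return mask


def tzcnt32(x: int) -> int:
    """Count trailing zero bits of a 32-bit value (32 if the value is zero)."""
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1


def strlen_chunk(chunk: bytes) -> int:
    """Index of the first null byte in a 32-byte chunk, or 32 if there is none."""
    if len(chunk) != 32:
        raise ValueError("expected exactly 32 bytes")
    ymm0 = bytes(32)                 # vpxor xmm0, xmm0, xmm0
    ymm1 = vpcmpeqb(ymm0, chunk)     # vpcmpeqb ymm1, ymm0, [rdi]
    eax = vpmovmskb(ymm1)            # vpmovmskb eax, ymm1
    return tzcnt32(eax)              # tzcnt eax, eax


length = strlen_chunk(b"hello" + bytes(27))
```

For the chunk `b"hello"` followed by 27 null bytes, the bitmap is `0xFFFFFFE0` and `length` is 5. A chunk with no null byte at all yields 32, which tells the caller to move on to the next chunk.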

The fifth instruction, vzeroupper, is not functionally required, since the code has already finished calculating the string length, but its presence is important for performance. The instruction zeroes the upper halves of all ymm registers (and zmm registers too). More precisely, it sets the corresponding z-bits in the RAT to indicate that the upper halves of each register are zero, without actually zeroing any underlying entries in the register file. The lower half of the ymm register (accessible via xmm*) is still allocated in the register file, but it is merged with an upper half that is unallocated and marked as zero via its z-bit.

This is why the vzeroupper instruction helps prevent the processor from falsely assuming data dependencies in subsequent instructions that use the ymm registers. The XMM Register Merge Optimization allows the processor to identify instructions which do not write to the upper portion of the register, thus letting them execute without treating the upper (zero) portion of the register as a data dependency. This uncouples the data dependency between overlapping xmm and ymm registers.

Unfortunately it seems that AMD Zen 2 processors do not correctly handle the case when a vzeroupper instruction is speculatively executed and then rolled back due to branch misprediction. The scenario is as follows:

1. SIMD instructions that support the XMM Register Merge Optimization are executed, using xmm operands.
2. A register rename is triggered on the overlapping ymm operand, e.g. by the vmovdqa instruction.
3. A branch is reached and the CPU speculatively executes past it.
4. A vzeroupper instruction is speculatively executed, which sets the z-bit on the upper halves of all ymm registers and deallocates their respective entries in the register file.
5. The branch condition is resolved and misprediction is detected.
6. The processor rolls back the vzeroupper instruction by clearing the z-bits and re-allocating the entries.
7. Execution continues from the correct branch path.

However, the register file entries released by vzeroupper may already have been reused by the time the rollback occurs. Resetting the z-bit to zero then leaves the register in an undefined state, with the upper half of the ymm register pointing at a register file entry that no longer belongs to it and may hold data from another operation. This is comparable to a use-after-free bug, but in the processor’s register file instead of system memory.
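The use-after-free analogy can be illustrated with a deliberately simplified Python model. Everything here is an assumption made for illustration: the `ToyRegisterFile` and `ToyUpperHalf` classes are hypothetical, and the real allocation and rollback logic inside Zen 2 is not public and is certainly more involved.

```python
class ToyRegisterFile:
    """Physical storage with a free list; released entries are not wiped."""

    def __init__(self, size):
        self.entries = [0] * size
        self.free = list(range(size))

    def alloc(self, value=0):
        idx = self.free.pop(0)
        self.entries[idx] = value
        return idx

    def release(self, idx):
        self.free.insert(0, idx)   # entry becomes reusable, contents stay in place


class ToyUpperHalf:
    """Upper half of one ymm register: a z-bit plus a pointer into the register file."""

    def __init__(self, regfile, value):
        self.regfile = regfile
        self.zbit = False
        self.entry = regfile.alloc(value)

    def read(self):
        return 0 if self.zbit else self.regfile.entries[self.entry]

    def vzeroupper(self):
        self.zbit = True                   # reads as zero from now on
        self.regfile.release(self.entry)   # storage handed back for reuse

    def buggy_rollback(self):
        self.zbit = False   # z-bit restored, but the entry was never reclaimed


rf = ToyRegisterFile(4)
attacker = ToyUpperHalf(rf, value=0x1111)
attacker.vzeroupper()                 # speculatively executed
rf.alloc(0x5EC2E7)                    # someone else reuses the freed entry
attacker.buggy_rollback()             # misprediction detected
leaked = attacker.read()
```

In this model `leaked` ends up as `0x5EC2E7`, the value written by the other user of the register file, rather than the attacker's original `0x1111`.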

Since the register file is shared between the SMT threads of a physical core, this can be used to snoop on data in the SIMD registers across hyperthreads. This isn’t the only attack scenario, though: the same attack can be leveraged for privilege escalation.

While it might initially seem like SIMD registers aren’t particularly interesting, they are used in optimised versions of almost all string and memory manipulation functions in standard libraries. This means they are constantly handling sensitive data like passwords, keys, and configuration files, making all this data vulnerable to leakage.

There is a PoC exploit for Zenbleed on GitHub which is capable of dumping data across hyperthreads. The code is also nicely commented and quite easy to follow.

AMD released Bulletin AMD-SB-7008 “Cross-Process Information Leak” to track the issue. They also released a microcode patch to address the issue on Family 17h Model 31h (EPYC 7002 series) and Family 17h Model A0h (Sabrina SoCs). So far there are no microcode updates for consumer products, meaning that AMD’s desktop, mobile, HEDT, and workstation (Threadripper) processors remain vulnerable. AGESA firmware updates are scheduled for release in October and December 2023, which should contain new microcode for those products. It seems that the coordinated disclosure process for Zenbleed went a little off the rails, possibly due to AMD accidentally publishing information several months ahead of the agreed embargo date, resulting in the bug being disclosed 3-4 months ahead of patch availability.

On systems where the microcode or firmware updates cannot be applied, a workaround is possible using a chicken bit in the DE_CFG register at MSR 0xC0011029. Setting bit 9 in this register enables a backup fix, but has additional performance impact compared to the microcode update. Linux’s name for this workaround bit is MSR_AMD64_DE_CFG_ZEN2_FP_BACKUP_FIX_BIT, which it should automatically apply on affected platforms when no microcode update is present. The bit can manually be set on Linux using msr-tools, or on FreeBSD with cpucontrol.
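As a rough sketch of what the workaround amounts to, the following Python computes the DE_CFG value with bit 9 set and shows one way to read the register on Linux. The constant and function names are hypothetical. `read_msr` assumes a Linux system with the `msr` kernel module loaded and root privileges, and is not called here. Writing the value back is intentionally left out, since tools such as msr-tools already cover that.

```python
import os
import struct

MSR_AMD64_DE_CFG = 0xC0011029      # MSR address quoted above
ZEN2_FP_BACKUP_FIX_BIT = 9         # chicken bit quoted above


def with_backup_fix(de_cfg: int) -> int:
    """Return the DE_CFG value with the backup fix bit set."""
    return de_cfg | (1 << ZEN2_FP_BACKUP_FIX_BIT)


def backup_fix_enabled(de_cfg: int) -> bool:
    """Report whether the backup fix bit is set in a DE_CFG value."""
    return bool((de_cfg >> ZEN2_FP_BACKUP_FIX_BIT) & 1)


def read_msr(cpu: int, msr: int) -> int:
    """Read a 64-bit MSR through /dev/cpu/N/msr (assumes Linux, root, msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The MSR number is used as the file offset; each MSR is 8 bytes wide.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


patched = with_backup_fix(0x0)
```

Starting from a register value of zero, `patched` is `0x200`. On a real system the current value would come from `read_msr(cpu, MSR_AMD64_DE_CFG)` for each logical CPU, since MSRs are per-core state.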

At the time of writing, Microsoft do not appear to have a security update that applies the DE_CFG[9] chicken bit workaround. You can modify MSRs using RWEverything on Windows, although that comes with its own risks and is probably not a sensible thing to do in production.

You can query the loaded microcode version to check whether an update has been applied, although the method is OS-specific. On Windows, the microcode version information is found in the following registry key:

HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0

The Update Revision value describes the microcode version that has been loaded into the processor, and the Previous Update Revision describes the microcode version that was loaded into the processor by the system firmware (UEFI / BIOS) at boot.

On Linux, /proc/cpuinfo will list the microcode version alongside other processor details:

processor : 127
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7601 32-Core Processor
stepping : 2
microcode : 0x8001206

The same info can also usually be found in the kernel boot log.

For Zen 2 architecture EPYC processors, a microcode version of 0x0830107a or higher indicates that a fix was applied. For Zen 2 architecture Sabrina SoCs, a microcode version of 0x08a00008 or higher indicates that a fix was applied. As noted above, all other processor families, including desktop Ryzen processors, are yet to receive a microcode update with a patch, so we don’t yet know what the fixed microcode versions will be.
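The version check can be automated with a few lines of Python. This is a minimal sketch: the dictionary keys, function names and the `SAMPLE` text are hypothetical, and the comparison is only meaningful when the revision is checked against the entry for the same processor family and model.

```python
import re

# Fixed revisions quoted in this article. Compare only against the entry
# that matches the processor actually installed.
FIXED_MICROCODE = {
    "Zen 2 EPYC": 0x0830107A,
    "Zen 2 Sabrina SoC": 0x08A00008,
}


def microcode_revisions(cpuinfo_text):
    """Return the microcode revision of each logical processor as integers."""
    pattern = r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$"
    return [int(m.group(1), 16)
            for m in re.finditer(pattern, cpuinfo_text, re.MULTILINE)]


def is_fixed(revision, product):
    """True if the revision is at or above the fixed revision for that product."""
    return revision >= FIXED_MICROCODE[product]


# Hypothetical /proc/cpuinfo excerpt for a patched Zen 2 EPYC system.
SAMPLE = """\
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 49
microcode : 0x830107a
"""

revisions = microcode_revisions(SAMPLE)
all_fixed = all(is_fixed(r, "Zen 2 EPYC") for r in revisions)
```

On a live Linux system the text would come from reading `/proc/cpuinfo` instead of `SAMPLE`.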

In the interim, Linux should automatically apply software mitigations for Zenbleed. You can query the status of these mitigations through the sysfs interface, under the following directory:

/sys/devices/system/cpu/vulnerabilities/
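A short Python helper can dump whatever that directory reports. This is a sketch with a hypothetical function name. Which entries exist depends on the kernel version, and no assumption is made here that a Zenbleed-specific entry is present.

```python
from pathlib import Path


def read_vulnerabilities(directory="/sys/devices/system/cpu/vulnerabilities"):
    """Map each entry in the sysfs vulnerabilities directory to its status text."""
    root = Path(directory)
    if not root.is_dir():
        return {}                  # not Linux, or sysfs not mounted
    status = {}
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


report = read_vulnerabilities()
```

Each key in `report` is a vulnerability name as the kernel labels it, and each value is the kernel's one-line status string for that issue.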

If you’re running a server with a Zen 2 EPYC processor, you should update your firmware and install all OS patches to help ensure that Zenbleed is patched. If your system vendor has yet to release firmware updates to address this issue, it is possible that your OS will still load the new microcode blobs at boot, so make sure to check that first before trying to implement any manual workarounds. As always, refer to vendor guidance for good practice mitigation strategies.

The post Zenbleed – AMD Side-Channel Attack Targets Vectorised Functions appeared first on LRQA Nettitude Labs.

Article Link: Zenbleed - AMD Side-Channel Attack Targets Vectorised Functions - LRQA Nettitude Labs