A deep dive into Linux’s new mseal syscall

By Alan Cao

If you love exploit mitigations, you may have heard of a new system call named mseal landing in the Linux kernel’s 6.10 release, providing a protection called “memory sealing.” Beyond notes from the authors, very little information about this mitigation exists. In this blog post, we’ll explain what this syscall is, including how it’s different from prior memory protection schemes and how it works in the kernel to protect virtual memory. We’ll also describe the particular exploit scenarios that mseal helps stop in Linux userspace, such as malicious permissions tampering and memory unmapping attacks.

What mseal is (and isn’t)

Memory sealing allows developers to make memory regions immutable from illicit modifications during program runtime. When a virtual memory area (VMA) is sealed, an attacker with a code execution primitive cannot perform subsequent virtual memory operations to change the VMA’s permissions or modify how it is laid out for their benefit.

If you’re like me and followed the spicy discourse surrounding this syscall in the kernel mailing lists, you may have observed that Chrome’s Security team introduced it to support their V8 CFI strategy, initially for Linux-based ChromeOS. After some lengthy deliberation and several rewrites, it finally landed in the kernel, with plans to expand its use case beyond browsers with its integration into glibc, possibly in version 2.41.

mseal’s security guarantees are unlike Linux’s memfd_create and its memfd_secret variant, which provide file sealing. memfd_create and memfd_secret allow one to create RAM-backed anonymous files as an alternative to storing content to tmpfs, with memfd_secret taking it a step further by ensuring that the region of memory is accessible only to the process holding the file descriptor. This lets developers create “secure enclave”-style userspace mappings that can guard sensitive in-memory data.

mseal diverges from prior memory protection schemes on Linux because it is a syscall tailored specifically for exploit mitigation against remote attackers seeking code execution, rather than against potentially local ones looking to exfiltrate sensitive in-memory secrets.

To understand mseal’s security mitigations, we must first study its implementation to understand how it operates. Luckily, mseal is simple to understand, so let’s look at how it works in the kernel!

A look under the hood

mseal has a simple function signature:

int mseal(unsigned long start, size_t len, unsigned long flags)
  • start and len describe the address range that we want to seal, which must lie entirely within valid VMAs. start must be page-aligned; the kernel rounds len up to a page boundary.
  • flags are unused at the time of writing and must be set to 0.
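To make the argument rules concrete, here is a minimal userspace sketch of the checks that do_mseal performs before it touches any VMA, based on our reading of mm/mseal.c. The function name model_mseal_args and its page_size parameter are our own assumptions for illustration; this is a model, not kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the argument checks in do_mseal.
 * Returns 0 and stores the exclusive end address in *end on success,
 * or -EINVAL when the arguments would be rejected.
 * page_size must be a power of two.
 */
static int model_mseal_args(uintptr_t start, size_t len, unsigned long flags,
                            size_t page_size, uintptr_t *end)
{
    size_t aligned_len;

    if (flags != 0)                    /* flags are reserved for now */
        return -EINVAL;
    if (start & (page_size - 1))       /* start must be page-aligned */
        return -EINVAL;

    /* len is rounded up to the next page boundary */
    aligned_len = (len + page_size - 1) & ~(page_size - 1);
    if (len != 0 && aligned_len == 0)  /* rounding wrapped around to zero */
        return -EINVAL;
    if (start + aligned_len < start)   /* range wraps the address space */
        return -EINVAL;

    *end = start + aligned_len;
    return 0;
}
```

With 4096-byte pages, a request for 100 bytes at 0x10000 is widened to the whole page ending at 0x11000, while a start of 0x10001 or a nonzero flags value is rejected.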

In the 6.12 kernel, its syscall definition calls do_mseal:

static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
{
    size_t len;
    int ret = 0;
    unsigned long end;
    struct mm_struct *mm = current->mm;     // [1]

    // ... Check flags == 0, check page alignment, and compute `end`

    if (mmap_write_lock_killable(mm))          // [2]
        return -EINTR;

    /*
     * First pass, this helps to avoid
     * partial sealing in case of error in input address range,
     * e.g. ENOMEM error.
     */
    ret = check_mm_seal(start, end);            // [3]
    if (ret)
        goto out;

    /*
     * Second pass, this should success, unless there are errors
     * from vma_modify_flags, e.g. merge/split error, or process
     * reaching the max supported VMAs, however, those cases shall
     * be rare.
     */
    ret = apply_mm_seal(start, end);            // [4]

out:
    mmap_write_unlock(current->mm);
    return ret;
}

do_mseal first computes an end address from the provided length and takes the write lock on the process’s memory map [2], which prevents concurrent changes to the address space while sealing is in progress. The global current at [1] represents the currently executing task_struct (i.e., the process invoking mseal). The referenced field is the mm_struct representing the task’s entire virtual address space. The critical data in mm_struct on which this syscall operates is its collection of vm_area_struct values, which recent kernels keep in a maple tree rather than the older mmap linked list. Each vm_area_struct represents a single contiguous memory region, such as one created by mmap, the stack, or the vDSO.
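From userspace, the closest view of these per-process memory regions is /proc/self/maps, where each line describes one mapping. As a rough sketch, the helper below (find_mapping is a hypothetical name of ours) scans that file for the mapping containing a given address. It assumes /proc is mounted and that every line starts with a hexadecimal "start-end" pair.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Looks up the mapping that contains addr by scanning /proc/self/maps.
 * Returns 0 and fills *start / *end on success, -1 if no mapping matches
 * or the file cannot be read. find_mapping is a hypothetical helper name.
 */
static int find_mapping(uintptr_t addr, uintptr_t *start, uintptr_t *end)
{
    char line[512];
    FILE *maps = fopen("/proc/self/maps", "r");

    if (maps == NULL)
        return -1;

    while (fgets(line, sizeof(line), maps) != NULL) {
        uintptr_t lo, hi;

        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) != 2)
            continue;
        if (addr >= lo && addr < hi) {
            *start = lo;
            *end = hi;
            fclose(maps);
            return 0;
        }
    }

    fclose(maps);
    return -1;
}
```

Passing the address of a local variable should report the bounds of the stack mapping, which is the kind of range that mseal operates on.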

The check_mm_seal call at [3] ensures that the targeted memory map for sealing is a valid range by iterating over each VMA from current->mm to test boundary correctness.

static int check_mm_seal(unsigned long start, unsigned long end)
{
    struct vm_area_struct *vma;
    unsigned long nstart = start;
    VMA_ITERATOR(vmi, current->mm, start);

    /* going through each vma to check. */
    for_each_vma_range(vmi, vma, end) {
        if (vma->vm_start > nstart)
            /* unallocated memory found. */
            return -ENOMEM;

        if (vma->vm_end >= end)
            return 0;

        nstart = vma->vm_end;
    }

    return -ENOMEM;
}
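To see the gap check in isolation, here is a minimal userspace sketch that replays the same loop over a sorted array of address ranges instead of the kernel’s VMA iterator. struct range and model_check_seal are hypothetical names of ours, and the explicit skip and break conditions stand in for what for_each_vma_range does when it restricts the walk to the requested range.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/end fields of a vm_area_struct. */
struct range {
    uintptr_t start; /* inclusive */
    uintptr_t end;   /* exclusive */
};

/*
 * Userspace model of the first pass. The ranges must be sorted by address
 * and must not overlap. Returns 0 when [start, end) is fully covered with
 * no gaps, -ENOMEM otherwise.
 */
static int model_check_seal(const struct range *vmas, size_t n,
                            uintptr_t start, uintptr_t end)
{
    uintptr_t nstart = start;

    for (size_t i = 0; i < n; i++) {
        if (vmas[i].end <= start)   /* entirely below the request */
            continue;
        if (vmas[i].start >= end)   /* entirely above the request */
            break;
        if (vmas[i].start > nstart) /* unmapped gap found */
            return -ENOMEM;
        if (vmas[i].end >= end)     /* request fully covered */
            return 0;
        nstart = vmas[i].end;
    }
    return -ENOMEM;
}
```

A request that spans two adjacent ranges passes, while one that crosses a hole between ranges fails with -ENOMEM before any sealing is attempted.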

The magic happens in the apply_mm_seal call [4], which walks over each VMA again and arranges for the targeted region to have an additional VM_SEALED flag through the mseal_fixup call:

static int apply_mm_seal(unsigned long start, unsigned long end)
{
    // ...
    nstart = start;
    for_each_vma_range(vmi, vma, end) {
        int error;
        unsigned long tmp;
        vm_flags_t newflags;

        newflags = vma->vm_flags | VM_SEALED;
        tmp = vma->vm_end;
        if (tmp > end)
            tmp = end;
        error = mseal_fixup(vmi, vma, &prev, nstart, tmp, newflags);
        if (error)
            return error;
        nstart = vma_iter_end(&vmi);
    }

    return 0;
}

To ensure that unwanted memory operations respect this new flag, the mseal patchset adds VM_SEALED checks to the following files:

 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   31 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  307 ++++

For instance, mprotect and pkey_mprotect enforce this check when they eventually invoke mprotect_fixup:

int
mprotect_fixup(..., struct vm_area_struct *vma, ...)
{
    // ...
    if (!can_modify_vma(vma))
        return -EPERM;
    // ...
}

To determine whether the syscall should continue, can_modify_vma—defined in mm/vma.h—will test for the existence of VM_SEALED in the specified vm_area_struct:

static inline bool vma_is_sealed(struct vm_area_struct *vma)
{
    return (vma->vm_flags & VM_SEALED);
}

/*
 * check if a vma is sealed for modification.
 * return true, if modification is allowed.
 */
static inline bool can_modify_vma(struct vm_area_struct *vma)
{
    if (unlikely(vma_is_sealed(vma)))
        return false;

    return true;
}

From the changes in other memory-management syscalls, we can determine the operations that are not permitted on a VMA after it is sealed:

  • Changing permission bits with mprotect and pkey_mprotect
  • Unmapping with munmap
  • Replacement of a sealed map with mmap(MAP_FIXED) with another one that is mutable/unsealed
  • Expanding or shrinking its size with mremap. Shrinking to zero could create a refillable hole for a new mapping with no sealing, as it triggers an unmap altogether.
  • Migrating to a new destination with mremap(MREMAP_MAYMOVE | MREMAP_FIXED). Note that sealing checks are imposed on both the source and destination VMAs. Also, the source VMA will be unmapped if MREMAP_DONTUNMAP is not supplied, but the munmap sealing check will still apply.
  • Calling madvise with certain destructive flags, such as MADV_DONTNEED

For now, one can invoke mseal on a 6.10+ kernel through a direct syscall invocation. Here’s a basic wrapper implementation to help you get started:

#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MSEAL_SYSCALL 462

long mseal(unsigned long start, size_t len)
{
    uintptr_t page_size;
    uintptr_t page_aligned_start;

    /* how large a page should be on our system (default: 4096 bytes) */
    page_size = (uintptr_t)getpagesize();

    /* page align the start of the VMA range we want to seal */
    page_aligned_start = start & ~(page_size - 1);
    return syscall(MSEAL_SYSCALL, page_aligned_start, len, 0UL);
}

What exploit techniques does mseal help mitigate?

From the disallowed operations, we can discern two particular exploit scenarios that memory sealing will prevent:

  • Tampering with a VMA’s permissions. Notably, not allowing executable permissions to be set can stop the revival of shellcode-based attacks.
  • “Hole-punching” through arbitrary unmapping/remapping of a memory region, mitigating data-only exploits that take advantage of refilling memory regions with attacker-controlled data.

Let’s examine these scenarios in more detail, and the defense-in-depth strategies developers can employ in their software implementations.

Hardening NX

Even with the continued existence of code reuse techniques like ROP, attackers may prefer to gain shellcoding capability during exploitation; this can provide a stable and “easy win,” especially if constraints are imposed on the gadget chain. Here is a potential workflow to achieve this:

  • Through some target functionality, spray shellcode onto a non-executable stack/heap region.
  • Exploit the target’s bug to kick off an initial ROP chain to call mprotect with PROT_EXEC to target the region holding the shellcode and turn off the NX bit.
  • Jump to it to revive old-school shellcoding!

The exploit for CVE-2018-7445 targeting MikroTik RouterOS’s SMB daemon is a notable example. Socket-based shellcode is sprayed onto the non-executable heap, and the crafted ROP chain from a stack overflow modifies heap memory permissions before executing the shellcode.

The most straightforward use case for memory sealing is disallowing VMA permission modification; once a region is sealed, exploits that rely on traditional shellcode can no longer flip it to executable.

As mentioned, mseal will be introduced in glibc 2.41+, where the dynamic loader will apply sealing across a predetermined set of VMAs. However, at the time of writing, this will not be done automatically for the stack or heap.

This is expected because these regions can grow and shrink during runtime. For instance, a heap allocator that wants to reclaim space will invoke the brk syscall, which could call arch_unmap and eventually do_vmi_unmap to perform shrinking. Of course, this would be disallowed under sealing and thus break dynamic memory allocation for the application altogether.

So, for now, the software developer is responsible for protecting these regions, as they have the context to determine when and where sealing should be applied appropriately.

Let’s use mseal to enhance the stack’s old-school NX (non-executable) protection. Here’s a simple example that emulates the scenario mentioned above:

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* represents the stack that now contains /bin/sh shellcode (AArch64) we somehow sprayed */
    unsigned char exec_shellcode[] =
        "\xe1\x45\x8c\xd2\x21\xcd\xad\xf2\xe1\x65\xce\xf2\x01\x0d\xe0\xf2"
        "\xe1\x8f\x1f\xf8\xe1\x03\x1f\xaa\xe2\x03\x1f\xaa\xe0\x63\x21\x8b"
        "\xa8\x1b\x80\xd2\xe1\x66\x02\xd4";

    // vulnerability triggered, hijacked instruction pointer

    /* ======= what our ROP chain would do: ======= */

    /* compute the start of the page for the shellcode */
    void (*exec_ptr)() = (void(*)())&exec_shellcode;
    void *exec_offset = (void *)((uintptr_t) exec_ptr & ~((uintptr_t) getpagesize() - 1));

    mprotect(exec_offset, getpagesize(), PROT_READ|PROT_WRITE|PROT_EXEC);

    /* this now works! */
    exec_ptr();
    return 0;
}

As we’d expect, setting PROT_EXEC on the VMA permits exec_shellcode to become executable again:

~ gcc stack_no_sealing.c -o stack_no_sealing
~ ./stack_no_sealing
$

Let’s introduce memory sealing on the stack-based exec_offset VMA range:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* assumed error helper, in the style of the man-pages examples */
#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

/* mseal() is the syscall wrapper defined earlier */

int main(void)
{
    /* represents the stack that now contains /bin/sh shellcode (AArch64) we somehow sprayed */
    unsigned char exec_shellcode[] =
        "\xe1\x45\x8c\xd2\x21\xcd\xad\xf2\xe1\x65\xce\xf2\x01\x0d\xe0\xf2"
        "\xe1\x8f\x1f\xf8\xe1\x03\x1f\xaa\xe2\x03\x1f\xaa\xe0\x63\x21\x8b"
        "\xa8\x1b\x80\xd2\xe1\x66\x02\xd4";

    /* compute the start of the page for the shellcode */
    void (*exec_ptr)() = (void(*)())&exec_shellcode;
    void *exec_offset = (void *)((uintptr_t) exec_ptr & ~((uintptr_t) getpagesize() - 1));

    /* seal the stack page containing the shellcode! */
    if (mseal((unsigned long) exec_offset, getpagesize()) < 0)
        handle_error("mseal");

    // vulnerability triggered, hijacked instruction pointer

    /* ======= what our ROP chain would do: ======= */

    mprotect(exec_offset, getpagesize(), PROT_READ|PROT_WRITE|PROT_EXEC);

    /* segfault now, as no permission change actually occurred */
    exec_ptr();
    return 0;
}

The aforementioned can_modify_vma check kicks in when mprotect is called, preventing the permission change from ever happening, and the attempt to shellcode now fails:

~ gcc stack_with_sealing.c -o stack_with_sealing
~ ./stack_with_sealing
[1]    48771 segmentation fault (core dumped)  ./stack_with_sealing

A simple strategy to accommodate real-world software could involve sparingly introducing a macro-ized version of the mseal code snippet and iteratively sealing pages in select stack frames where untrusted data could reside for exploitation:

#define SIMPLE_HARDEN_NX_SINGLE_PAGE(frame) \
  do { \
    void *frame_offset = (void *)((uintptr_t) &frame & ~((uintptr_t) getpagesize() - 1)); \
    if (mseal((unsigned long) frame_offset, getpagesize()) == -1) { \
      handle_error("mseal"); \
    } \
  } while(0)

int frame_2(void)
{
    int frame_start = 0;
    unsigned char another_untrusted_buffer[1024] = { 0 };

    SIMPLE_HARDEN_NX_SINGLE_PAGE(frame_start);
    return 0;
}

int frame_1(void)
{
    unsigned char untrusted_buffer[1024] = { 0 };

    SIMPLE_HARDEN_NX_SINGLE_PAGE(untrusted_buffer);
    return frame_2();
}

Even if a sealed VMA is reused as a frame for another function with sealing logic, invoking mseal again would be considered a no-op, so no errors would emerge. Of course, developers should be mindful of edge cases like automatic stack expansion from aggressive usage or bespoke features like stack splitting.
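One edge case worth sketching: a 1024-byte buffer can straddle a page boundary, in which case sealing a single page leaves part of it unprotected. The helper below computes the page-aligned range that covers a whole object. page_span is a hypothetical name of ours and not part of any mseal API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Computes the page-aligned range [*start, *start + *len) that covers the
 * object at addr with the given size. page_size must be a power of two.
 * page_span is a hypothetical helper, not part of any mseal API.
 */
static void page_span(uintptr_t addr, size_t size, size_t page_size,
                      uintptr_t *start, size_t *len)
{
    uintptr_t mask = (uintptr_t)page_size - 1;
    uintptr_t first = addr & ~mask;                /* round down */
    uintptr_t last = (addr + size + mask) & ~mask; /* round up   */

    *start = first;
    *len = (size_t)(last - first);
}
```

The resulting start and len could then be handed to the mseal wrapper shown earlier. Keep in mind that the covered pages may also hold other stack frames, which become sealed along with the buffer.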

Hopefully, as the integration of mseal into glibc continues, we’ll see tunables emerge that do not require any manual use of the syscall for the stack. Commenters on LWN have asked for automatic sealing that can be toggled for simpler applications.

And with all this said, if an attacker doesn’t want to fully ROP and insists on bringing back shellcode nostalgia, they could always use their initial code reuse technique to mmap a fresh region that is executable. However, this is pretty laborious, as it now involves copying the exploit payload from a readable region to this new mapping.

Mitigating unmapping-based, data-only exploitation

Disallowing mprotect also prevents a sealed region from becoming writable, which is valuable if there are data variables that, when modified, could enhance an exploit primitive. However, during the inception of mseal, Chrome maintainers reasoned that attackers have an easier and more powerful technique available, with the added benefit of circumventing CFI (control-flow integrity). They determined that if an attacker can pass a corrupted pointer to unmapping/remapping syscalls, they can “punch a hole” in memory that could be refilled with attacker-controlled data. This would not violate CFI guarantees, as forward- and backward-edge CFI would cover only tampered control-flow transitions (e.g., stack return addresses and function pointers).

This is incredibly enticing for a browser implementing a JIT compiler. V8’s Turbofan can create regions that switch between RW and RX, aiding the refill process and changing permissions. Thus, an attacker can take advantage of the JIT compilation process by emitting executable code from hot-path JavaScript into the unmapped region to overwrite critical data and then leverage modifications to yield code execution.

We argue this is a data-only exploitation technique, as it doesn’t involve directly hijacking control flow or requiring leaked pointers but rather tampering with particular data in memory that influences control flow to the attacker’s liking. In an era of mitigations like CFI, this has emerged as a pretty potent technique during exploitation. Thus, memory sealing can prevent these particular data-only techniques by disallowing hole-punching scenarios.

This particular data-only technique isn’t just for browsers with JIT compilers! A similar technique would be the House of Muney for userspace heap exploitation. As Max Dulin points out in his post, Qualys used this technique to perform a real-world exploit for an ancient bug in Qmail.

This technique relies on the fact that for huge allocated chunks (greater than the M_MMAP_THRESHOLD tunable), malloc and free will directly invoke mmap and munmap, respectively, with no intermediate freelists that cache any freed chunks (which helps greatly simplify exploitation). Since size metadata exists at the top of allocated chunks, tampering with it to set a different page-aligned size and then freeing the chunk causes a munmap on memory regions adjacent to the chunk. Dulin used the arbitrary munmap to target the .gnu.hash and .dynsym regions and, after refilling them with another larger mmap chunk, enabled the overwriting of a single, yet-to-be-resolved PLT entry, reviving a GOT overwrite-style attack!

Dulin has published a well-annotated PoC for this attack. Here’s an abridged version that goes up to the point where the unmapping and refill occur:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <malloc.h>

// With this allocation size,
// malloc is now equivalent to mmap
// free is now equivalent to munmap
#define THRESHOLD_SIZE 0x100000

int main() {
    long long *bottom, *top, *refill;

    bottom = malloc(THRESHOLD_SIZE);
    memset(bottom, 'B', THRESHOLD_SIZE);

    // [1] Allocation that we write into out-of-bounds from a prior chunk
    top = malloc(THRESHOLD_SIZE);
    memset(top, 'A', THRESHOLD_SIZE);

    // [2] Corrupts size field, ensuring page alignment + mmap bit is set
    // size to unmap = top + bottom + large arbitrary size
    int unmap_size = (0xfffffffd & top[-1]) + (0xfffffffd & bottom[-1]) + 0x14000;
    top[-1] = (unmap_size | 2);

    // Trigger munmap with corrupted chunk
    free(top);

    // [3] Refill with new and larger mmap chunk
    refill = malloc(0x5F0000);
    memset(refill, 'X', 0x5F0000);
    return 0;
}

By the time we finish [1], we can see that the top and bottom chunks now exist in a separate mapping below the heap, separated by 4096-byte padding. Note the adjacent libc mapping at 0xfffff7df0000:

At [2], we corrupt the size field of the chunk to a much larger page-aligned size and ensure that the mmap bit is set. When we break on the munmap triggered by free(top), just before the refill at [3], the size argument passed has been changed, allowing an unmap into the adjacent region!
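The arithmetic at [2] is easy to check by hand. The sketch below mirrors it in a small helper (forged_size_field is a hypothetical name of ours). It clears only the mmap flag bit rather than applying the PoC’s 0xfffffffd mask, and the example values assume a system with 4 KiB pages, where a 0x100000-byte request typically ends up as an mmapped chunk whose size field reads 0x101002.

```c
#include <stddef.h>

#define IS_MMAPPED 0x2UL /* glibc's mmap flag bit in the chunk size field */

/*
 * Mirrors step [2] of the PoC: strip the flag bit from both size fields,
 * add them together with an extra span that reaches into the neighbouring
 * mapping, and set the mmap bit again.
 */
static size_t forged_size_field(size_t top_field, size_t bottom_field,
                                size_t extra)
{
    size_t total = (top_field & ~IS_MMAPPED) +
                   (bottom_field & ~IS_MMAPPED) + extra;

    return total | IS_MMAPPED;
}
```

Under those assumptions, forged_size_field(0x101002, 0x101002, 0x14000) yields 0x216002: a page-aligned length of 0x216000 bytes with the mmap bit set, which is what free then hands to munmap.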

After [3], this can be confirmed by examining the contents of the previous libc mapping at 0xfffff7df0000, now partially overwritten with Xs:

This is a pretty nifty data-only technique that can operate even in the presence of CFI and does not require a prerequisite ASLR leak!

Luckily, the aforementioned set of VMAs in mseal’s glibc integration is expected to automatically mitigate this without any developer intervention, as mapped binary code and dynamic libraries become sealed from any remap/unmapping tricks like this. For additional hardening, a developer can selectively seal mmap allocations that they know will never expand or become unmapped during the lifetime of their program. This will have the added benefit of preventing the previous exploit scenario if attacker-controlled data can be expected to be written into the mmap chunks and may become writable/executable.

Build stronger software with mseal

There are likely many other use cases and scenarios that we didn’t cover. After all, mseal is the newest kid on the block in the Linux kernel! As the glibc integration completes and matures, we expect to see improved iterations for the syscall to meet particular demands, including fleshing out the ultimate use of the flags parameter.

Hardening software is complex, and weighing the risk and reward of a new security mitigation can be challenging. If this blog post is interesting to you, check out some of our escapades into other security mitigations. If you’re seeking guidance in integrating mseal or any other modern mitigations into your software, contact us!

Article Link: A deep dive into Linux’s new mseal syscall | Trail of Bits Blog