Missing checks in kernel mount helpers

2025-01-09

Overview

This post assumes familiarity with Linux namespaces (see background reading for more).

User namespaces let you gain CAP_SYS_ADMIN without being privileged on the host. You're root inside your namespace, but that shouldn't let you touch host mounts, devices, or other namespaces' resources. This note looks at three places in the kernel's VFS mount layer where that boundary wasn't enforced:

do_change_type() - mount propagation changes (CVE-2025-38498)
clone_private_mount() - internal bind mounts for overlayfs (CVE-2025-38499)
The new mount API - cross-namespace filesystem setup

1. `do_change_type()`

do_change_type() handles propagation changes: mount --make-shared, --make-slave, --make-private, --make-unbindable. These control whether mount events ripple across namespace boundaries.

The function checked for CAP_SYS_ADMIN in the caller's namespace, but never verified that the caller's namespace actually owned the mount:

static int do_change_type(struct path *path, int ms_flags)
{
    struct mount *mnt = real_mount(path->mnt);

    /* may_mount() is checked earlier: verifies CAP_SYS_ADMIN over the caller's mount namespace */

    /* But never checks: is the mount in the same namespace as the caller? */

    change_mnt_propagation(mnt, ms_flags);
}

The missing check was check_mnt(mnt), a one-liner that verifies the mount belongs to the caller's mount namespace. Without it, an unprivileged user could hold an fd to a host mount, enter a new user namespace, and modify the mount's propagation flags:

# as unprivileged user, open an fd to a mount, then enter namespaces
$ exec 99</mnt
$ unshare --user --mount --map-root-user

# now mark the host's mount unbindable via the held fd
$ mount --make-unbindable /proc/self/fd/99
# succeeds (shouldn't)

This example uses unbindable, but any propagation setting could be changed. In containerized environments, this could disrupt mount propagation or affect other containers.

The fix (12f147d) adds the missing ownership check:

--- a/fs/namespace.c
+++ b/fs/namespace.c
 static int do_change_type(struct path *path, int ms_flags)
 {
+    if (!check_mnt(mnt)) {
+        err = -EINVAL;
+        goto out_unlock;
+    }
     change_mnt_propagation(mnt, ms_flags);
 }
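To make the gap concrete, here is a small userspace model of the two checks. It is only a sketch: the structs are toy stand-ins for the kernel's, every model_* name is made up for this post, and the capability logic is reduced to "you are privileged over a mount namespace only if your user namespace owns it". The point is that the old path never looks at the target mount at all.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins for kernel objects; only the pointers that matter are modelled. */
struct user_namespace { int unused; };
struct mnt_namespace  { struct user_namespace *user_ns; };  /* owning userns */
struct mount          { struct mnt_namespace *mnt_ns; };    /* where it is mounted */
struct task {
    struct mnt_namespace  *mnt_ns;   /* caller's mount namespace */
    struct user_namespace *user_ns;  /* userns in which the caller is root */
};

/* Models may_mount(): is the caller privileged over its OWN mount namespace?
 * Simplification: yes only if its userns owns that mount namespace. */
static bool model_may_mount(const struct task *t)
{
    return t->mnt_ns->user_ns == t->user_ns;
}

/* Models check_mnt(): does the mount live in the caller's mount namespace? */
static bool model_check_mnt(const struct task *t, const struct mount *m)
{
    return m->mnt_ns == t->mnt_ns;
}

/* Pre-fix behaviour: only the caller-side capability check. */
static int model_change_type_old(const struct task *t, const struct mount *m)
{
    (void)m;  /* the target mount is never inspected */
    return model_may_mount(t) ? 0 : -EPERM;
}

/* Post-fix behaviour: the mount must also be in the caller's namespace. */
static int model_change_type_fixed(const struct task *t, const struct mount *m)
{
    if (!model_may_mount(t))
        return -EPERM;
    if (!model_check_mnt(t, m))
        return -EINVAL;
    return 0;
}
```

Feed it a task that is root in a fresh user and mount namespace plus a mount that lives in the host's mount namespace: the old variant returns 0, the fixed one returns -EINVAL.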

The function was written in 2006, but the missing check only became exploitable by unprivileged users when user namespaces landed in 2012.

However, this fix broke CRIU (Checkpoint/Restore in Userspace), which modifies mounts from non-current namespaces during restore. The CRIU maintainers asked if the check could be relaxed to just require CAP_SYS_ADMIN in the target mount's user namespace. Al Viro's response:

"Not enough, both in terms of permissions and in terms of 'thou shalt not bugger the kernel data structures'."

The original (pre-fix) behavior was a security problem and could also corrupt internal mount propagation state, so simply reverting wasn't an option. A follow-up patch replaced check_mnt() with may_change_propagation(), which checks that (1) the mount is actually mounted somewhere, and (2) the caller has CAP_SYS_ADMIN in the userns that owns the mount's namespace.
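The two conditions can be sketched as a userspace model. This is not the kernel function: the structs are toys, the model_* names are invented for this post, the error codes are my choice for the model, and ns_capable() is approximated as "a capability held in a user namespace also covers its descendants".

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects. */
struct user_namespace { struct user_namespace *parent; };  /* NULL for the initial userns */
struct mnt_namespace  { struct user_namespace *user_ns; }; /* owning userns */
struct mount          { struct mnt_namespace *mnt_ns; };   /* NULL when not mounted */
struct task {
    struct user_namespace *user_ns;
    bool has_sys_admin;  /* CAP_SYS_ADMIN in user_ns */
};

/* Rough model of ns_capable(target, CAP_SYS_ADMIN): walk up from the target
 * userns and see whether we reach the userns the caller is privileged in. */
static bool model_ns_capable(const struct task *t, const struct user_namespace *target)
{
    for (const struct user_namespace *ns = target; ns != NULL; ns = ns->parent)
        if (ns == t->user_ns)
            return t->has_sys_admin;
    return false;
}

/* Model of the relaxed check: the mount may be in another mount namespace,
 * as long as it is mounted and the caller is admin over that namespace. */
static int model_may_change_propagation(const struct task *t, const struct mount *m)
{
    if (m->mnt_ns == NULL)                          /* (1) not mounted anywhere */
        return -EINVAL;
    if (!model_ns_capable(t, m->mnt_ns->user_ns))   /* (2) no privilege over its owner */
        return -EPERM;
    return 0;
}
```

In this model a CRIU-style restorer that is privileged in the initial user namespace can still adjust a mount inside a child namespace, while a task that is only root in the child namespace gets -EPERM for a host mount.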

2. `clone_private_mount()`

Container runtimes restrict access to certain sensitive paths by overmounting them. Kubernetes mounts /dev/null over /proc/kcore and mounts /proc/sys read-only, for example. These procfs paths provide access to global host resources and can't be namespaced, so overmounting is the only way to restrict access to them.

The kernel has "locked mounts" to prevent unprivileged users from undoing this: when you create an unprivileged mount namespace, mounts from the parent are glued together. You can't unmount or bind-mount (clone) them individually.

Overlayfs is one place this locking matters: it clones its lowerdir internally. This clone has a safety check to fail if the mount is locked. This is to prevent you from bypassing overmounts by passing a mount with hidden paths as a lowerdir. However, if you passed overlayfs an fd reference to the unlocked mount from the parent namespace (before you unshared), overlayfs didn't see it as locked and cloned it anyway. Since overlayfs doesn't recursively clone child mounts, any overmounts were stripped, exposing the hidden files through the overlay.

The fix (c28f922) checks CAP_SYS_ADMIN in the user namespace that owns the mount, not just the caller's namespace:

--- a/fs/namespace.c
+++ b/fs/namespace.c
 struct vfsmount *clone_private_mount(const struct path *path)
 {
     struct mount *old = real_mount(path->mnt);
     struct mount *new;

+    if (!ns_capable(old->mnt_ns->user_ns, CAP_SYS_ADMIN))
+        return ERR_PTR(-EPERM);

     new = clone_mnt(old, path->dentry, CL_PRIVATE);

The bug became exploitable by unprivileged users when kernel 5.11 made overlayfs mountable from user namespaces.

Rough skeleton for illustration:

if (fork() == 0) {
    // child stays in the host namespaces, where the mounts are not locked
    // get fsctx from parent through whatever mechanism
    // then configure using a lowerdir in our ns (host)
    fsconfig(fsctx, FSCONFIG_SET_STRING, "lowerdir", "/target/with/overmounts", 0);
    fsconfig(fsctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
} else {
    // parent enters new user+mount namespace
    // all inherited mounts are now "locked"
    unshare(CLONE_NEWUSER | CLONE_NEWNS);

    // create overlay filesystem context
    fsctx = fsopen("overlay", 0);

    // pass fsctx to child, wait for it to set it up, then
    // mount the overlay. overmount is removed
    mfd = fsmount(fsctx, 0, 0);
    move_mount(mfd, "", AT_FDCWD, "/mnt/overlay", MOVE_MOUNT_F_EMPTY_PATH);
}

3. The new mount API

The old mount() syscall is a single atomic operation. The new mount API (fsopen/fsconfig/fsmount/move_mount) splits this into steps, returning file descriptors at each stage. Since fds can cross process boundaries, you can create a filesystem context in one namespace and fill it in another. This has some interesting side effects on the kernel privilege checks for these resources.
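For readers who haven't used the new API, here is a minimal sketch of the four steps for a plain tmpfs, all in one process. The sys_* wrappers and mount_tmpfs_stepwise() are names I made up; they call the raw syscalls because older glibc versions ship no wrappers. It assumes kernel headers from 5.2 or later and needs mount privileges to succeed.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <linux/mount.h>  /* FSOPEN_CLOEXEC, FSCONFIG_*, FSMOUNT_CLOEXEC, MOVE_MOUNT_F_EMPTY_PATH */
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_fsopen and friends */
#include <unistd.h>

/* Thin hypothetical wrappers around the raw syscalls. */
static int sys_fsopen(const char *fs, unsigned int flags)
{
    return (int)syscall(SYS_fsopen, fs, flags);
}

static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
                        const void *value, int aux)
{
    return (int)syscall(SYS_fsconfig, fd, cmd, key, value, aux);
}

static int sys_fsmount(int fd, unsigned int flags, unsigned int attr_flags)
{
    return (int)syscall(SYS_fsmount, fd, flags, attr_flags);
}

static int sys_move_mount(int from_dfd, const char *from, int to_dfd,
                          const char *to, unsigned int flags)
{
    return (int)syscall(SYS_move_mount, from_dfd, from, to_dfd, to, flags);
}

/* Mount a small tmpfs at `target` step by step.
 * Returns 0 on success, -1 with errno set on failure. */
static int mount_tmpfs_stepwise(const char *target)
{
    int mfd = -1, ret = -1;

    /* Step 1: open a filesystem context; this yields the first fd. */
    int fsctx = sys_fsopen("tmpfs", FSOPEN_CLOEXEC);
    if (fsctx < 0)
        return -1;

    /* Step 2: configure it, then ask the kernel to create the superblock. */
    if (sys_fsconfig(fsctx, FSCONFIG_SET_STRING, "size", "1m", 0) < 0)
        goto out;
    if (sys_fsconfig(fsctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
        goto out;

    /* Step 3: turn the context into a detached mount; a second fd. */
    mfd = sys_fsmount(fsctx, FSMOUNT_CLOEXEC, 0);
    if (mfd < 0)
        goto out;

    /* Step 4: attach the detached mount at the target path. */
    ret = sys_move_mount(mfd, "", AT_FDCWD, target, MOVE_MOUNT_F_EMPTY_PATH);
out:
    if (mfd >= 0)
        close(mfd);
    close(fsctx);
    return ret;
}
```

Each step takes or returns an fd, so nothing forces steps 1 and 2 to run in the same process or the same namespace.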

Permission checks happen at fsopen() time, but the superblock's user namespace is determined at FSCONFIG_CMD_CREATE time. An unprivileged user can call fsopen("overlay") inside their own user namespace (where they hold privileges), then have a process in init_user_ns call fsconfig(FSCONFIG_CMD_CREATE) on the inherited fd. In kernels before 6.5, the superblock ends up tagged as belonging to init_user_ns, even though an unprivileged user set up the filesystem.

One consequence: the kernel decides whether to force SB_I_NODEV based on which user namespace owns the superblock:

// fs/super.c
if (s->s_user_ns != &init_user_ns)
    s->s_iflags |= SB_I_NODEV;

With the cross-namespace setup, SB_I_NODEV never gets set. As a result, you can open device files through the overlay, bypassing nodev mount restrictions. This matters anywhere nodev is used to limit untrusted filesystems: NFS mounts, or FUSE, where unprivileged users can create arbitrary files including device nodes.
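A toy model of that decision, for illustration only. The struct layouts and the model_* helpers are invented for this post, and the flag value is just a placeholder bit; the single if mirrors the fs/super.c snippet above.

```c
#include <stdbool.h>

/* Toy stand-ins; only the fields involved in the decision are modelled. */
struct user_namespace { const char *name; };
struct super_block {
    struct user_namespace *s_user_ns;
    unsigned long s_iflags;
};

static struct user_namespace init_user_ns = { "init" };

#define SB_I_NODEV 0x4UL  /* placeholder bit for the model */

/* Tag a new superblock with the userns of whoever triggered its creation,
 * then apply the same rule as the fs/super.c snippet. */
static void model_create_super(struct super_block *s, struct user_namespace *creator_ns)
{
    s->s_user_ns = creator_ns;
    s->s_iflags = 0;
    if (s->s_user_ns != &init_user_ns)
        s->s_iflags |= SB_I_NODEV;
}

/* Device nodes are usable only if the superblock is not marked nodev. */
static bool model_sb_allows_devices(const struct super_block *s)
{
    return !(s->s_iflags & SB_I_NODEV);
}
```

If the create step runs with an unprivileged user namespace as creator_ns, device nodes are refused; run the same step from init_user_ns and they are allowed, which is exactly the difference the cross-namespace trick exploits.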

The ns_capable() check added to clone_private_mount() (section 2) closes this path. The clone now requires privilege over the mount's owning namespace, which an unprivileged attacker cannot satisfy. This bug only affects kernels before 6.5, the release in which overlayfs adopted the new fs_context infrastructure internally (1784fbc).

Background reading

Namespace overview:

User namespace security challenges:

Anatomy of a user namespaces vulnerability (LWN)
Filesystem mounts in user namespaces (LWN)

The new mount API:

Mounting into mount namespaces (brauner.io)