    fs: don't block i_writecount during exec · 2a010c41
    Christian Brauner authored
    Back in 2021 we already discussed removing deny_write_access() for
    executables. Back then I was hesitant because I thought that this might
    cause issues in userspace. But even back then I had started taking some
    notes on what could potentially depend on this and I didn't come up with
    a lot so I've changed my mind and I would like to try this.
    
    Here are some of the notes that I took:
    
    (1) The deny_write_access() mechanism is causing really pointless issues
        such as [1]. If a thread in a thread-group opens a file writable,
        writes some stuff, closes the file descriptor, and then calls
        execve(), the execve() can fail with ETXTBSY because another thread
        in the thread-group could have concurrently called fork(), leaving
        the child with a copy of the writable file descriptor.
        Multi-threaded runtimes such as Go suffer from this.
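
        The effect behind (1) can be sketched deterministically, without
        the race: keep a writable descriptor open, as a concurrently forked
        child would, and try to execute the file. This is a minimal sketch,
        not code from the patch; the helper name, the /tmp path and the
        /bin/sh script are assumptions made for illustration. On a kernel
        that enforces deny_write_access() during exec the call is expected
        to fail with ETXTBSY; with this change applied it should succeed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: execve() a script while a writable descriptor for
 * it is still open. Returns 0 if the exec succeeded, the errno reported
 * by execve() if it failed, or -1 on a setup error.
 */
int exec_while_open_for_write(void)
{
    char path[] = "/tmp/etxtbsy-demo-XXXXXX";   /* assumed scratch location */
    static const char script[] = "#!/bin/sh\nexit 0\n";
    const ssize_t len = (ssize_t)(sizeof(script) - 1);
    int status, result = -1;
    pid_t pid;

    int wfd = mkstemp(path);        /* opened O_RDWR, so a writer exists */
    if (wfd < 0)
        return -1;
    if (write(wfd, script, (size_t)len) != len || fchmod(wfd, 0700) < 0)
        goto out;

    pid = fork();                   /* the child inherits wfd */
    if (pid < 0)
        goto out;
    if (pid == 0) {
        char *argv[] = { path, NULL };
        char *envp[] = { NULL };
        execve(path, argv, envp);
        _exit(errno);               /* only reached if execve() failed */
    }
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);
out:
    close(wfd);
    unlink(path);
    return result;
}
```

        Comparing the return value against ETXTBSY shows which behavior the
        running kernel has. A noexec mount on /tmp would show up as EACCES
        instead.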
    
    (2) There are userspace attacks that rely on overwriting the binary of a
        running process. These attacks are _mitigated_ but _not at all
        prevented_ from occurring by the deny_write_access() mechanism.
    
        I'll go over some details. The clearest example of such attacks was
        the attack against runC in CVE-2019-5736 (cf. [3]).
    
        An attacker could compromise the runC host binary from inside a
        _privileged_ runC container. The malicious binary could then be used
        to take over the host.
    
        (It is crucial to note that this attack is _not_ possible with
         unprivileged containers. IOW, the setup here is already insecure.)
    
        The attack can be made when attaching to a running container or when
        starting a container running a specially crafted image. For example,
        when runC attaches to a container the attacker can trick it into
        executing itself.
    
        This could be done by replacing the target binary inside the
        container with a custom binary pointing back at the runC binary
        itself. As an example, if the target binary was /bin/bash, this
        could be replaced with an executable script specifying the
        interpreter path #!/proc/self/exe.
    
        As such when /bin/bash is executed inside the container, instead the
        target of /proc/self/exe will be executed. That magic link will
        point to the runC binary on the host. The attacker can then proceed
        to write to the target of /proc/self/exe to try and overwrite the
        runC binary on the host.
    
        However, this will not succeed because of deny_write_access(). Now,
        one might think that this would prevent the attack but it doesn't.
    
        The attacker has multiple ways to overcome this:
        * Open a file descriptor to /proc/self/exe using the O_PATH flag and
          then proceed to reopen the binary as O_WRONLY through
          /proc/self/fd/<nr> and try to write to it in a busy loop from a
          separate process. Ultimately it will succeed when the runC binary
          exits. After this the runC binary is compromised and can be used
          to attack other containers or the host itself.
        * Use a malicious shared library annotating a function in there with
          the constructor attribute making the malicious function run as an
          initializer. The malicious library will then open /proc/self/exe
          to create a new entry under /proc/self/fd/<nr>. It'll then call
          exec to a) force runC to exit and b) hand the file descriptor off
          to a program that then reopens /proc/self/fd/<nr> for writing
          (which is now possible because runC has exited) and overwrites
          that binary.
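
        Both variants rest on the same primitive: a descriptor that grants
        no write access by itself can be reopened for writing through its
        /proc/self/fd/<nr> magic link. A minimal sketch of that primitive
        against a harmless scratch file follows; the function names and the
        /tmp path are assumptions for illustration, and /proc must be
        mounted.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: reopen an existing descriptor for writing through
 * its /proc/self/fd/<nr> magic link. Returns the new descriptor, or -1
 * with errno set (ETXTBSY if the kernel currently denies writes).
 */
int reopen_for_write(int path_fd)
{
    char link[64];

    snprintf(link, sizeof(link), "/proc/self/fd/%d", path_fd);
    return open(link, O_WRONLY | O_CLOEXEC);
}

/* Demonstrates the reopen on a scratch file. Returns 0 on success. */
int opath_reopen_demo(void)
{
    char path[] = "/tmp/opath-demo-XXXXXX";     /* assumed scratch location */
    int ret = -1;

    int tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    close(tmp);

    int path_fd = open(path, O_PATH | O_CLOEXEC);
    if (path_fd >= 0) {
        /* Writing through the O_PATH descriptor itself fails with EBADF ... */
        int direct_denied = write(path_fd, "x", 1) < 0 && errno == EBADF;
        /* ... but the descriptor reopened via /proc is writable. */
        int wfd = reopen_for_write(path_fd);
        if (wfd >= 0) {
            if (direct_denied && write(wfd, "x", 1) == 1)
                ret = 0;
            close(wfd);
        }
        close(path_fd);
    }
    unlink(path);
    return ret;
}
```

        The reopen goes through a normal permission check on the underlying
        file, which is why the write only succeeds once nothing denies write
        access to it anymore.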
    
        To sum up: the deny_write_access() mechanism doesn't prevent such
        attacks in insecure setups. It just makes them minimally harder.
        That's all.
    
        The only way back then to prevent this was to create a temporary copy
        of the calling binary itself when it starts or attaches to
        containers. So what I did back then for LXC (and Aleksa for runC)
        was to create an anonymous, in-memory file using the memfd_create()
        system call and to copy itself into the temporary in-memory file,
        which is then sealed to prevent further modifications. This sealed,
        in-memory file copy is then executed instead of the original on-disk
        binary.
    
        Any compromising write operations from a privileged container to the
        host binary will then write to the temporary in-memory binary and
        not to the host binary on-disk, preserving the integrity of the host
        binary. And since the temporary in-memory binary is sealed, writes to
        it will fail as well.
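
        The copy-and-seal step can be sketched as below. This is an
        illustration under stated assumptions, not the actual LXC or runC
        code: the function names, the memfd name and the chosen set of seals
        are hypothetical. It copies a file into a memfd_create() file and
        seals it against further modification.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the file at `path` into an anonymous in-memory
 * file and seal it. Returns the sealed descriptor, or -1 on error.
 */
int sealed_memfd_copy(const char *path)
{
    char buf[65536];
    ssize_t n;

    int src = open(path, O_RDONLY | O_CLOEXEC);
    if (src < 0)
        return -1;
    int mfd = memfd_create("sealed-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (mfd < 0) {
        close(src);
        return -1;
    }
    while ((n = read(src, buf, sizeof(buf))) > 0) {
        for (ssize_t off = 0; off < n; ) {      /* handle short writes */
            ssize_t w = write(mfd, buf + off, (size_t)(n - off));
            if (w < 0)
                goto fail;
            off += w;
        }
    }
    if (n < 0)
        goto fail;
    /* No more writes, no resizing, and no changing the seals afterwards. */
    if (fcntl(mfd, F_ADD_SEALS,
              F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL) < 0)
        goto fail;
    close(src);
    return mfd;
fail:
    close(src);
    close(mfd);
    return -1;
}

/* Copies the running binary and checks that the copy rejects writes. */
int sealed_copy_demo(void)
{
    int mfd = sealed_memfd_copy("/proc/self/exe");
    if (mfd < 0)
        return -1;
    int seals = fcntl(mfd, F_GET_SEALS);
    int write_denied = pwrite(mfd, "x", 1, 0) < 0 && errno == EPERM;
    close(mfd);
    return (seals >= 0 && (seals & F_SEAL_WRITE) && write_denied) ? 0 : -1;
}
```

        A caller would then execute the sealed copy instead of the on-disk
        binary, for example with fexecve() on the returned descriptor.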
    
        The point is that deny_write_access() is useless for preventing
        these attacks.
    
    (3) Denying write access to an inode because it's currently used in an
        exec path could easily be done on an LSM level. It might need an
        additional hook but that should be about it.
    
    (4) The MAP_DENYWRITE flag for mmap() was deprecated a long time ago,
        so while we do protect the main executable, the bigger portion of
        the things you'd think need protecting, such as the shared
        libraries, isn't protected. IOW, we let anyone happily overwrite
        shared libraries.
    
    (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
        (5.1) We removed the legacy uselib() protection for preventing
              overwriting of shared libraries. Nobody cared in 3 years.
        (5.2) We allow write access to the ELF interpreter after exec has
              completed, treating it on a par with shared libraries.
    
    Yes, someone in userspace could potentially be relying on this. It's not
    completely out of the realm of possibility but let's find out if that's
    actually the case and not guess.
    
    Link: https://github.com/golang/go/issues/22315 [1]
    Link: 49624efa ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
    Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
    Link: https://lwn.net/Articles/866493
    Link: https://github.com/golang/go/issues/22220
    Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
    Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
    Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
    Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
    Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
    Link: https://github.com/buildkite/agent/pull/2736
    Link: https://github.com/rust-lang/rust/issues/114554
    Link: https://bugs.openjdk.org/browse/JDK-8068370
    Link: https://github.com/dotnet/runtime/issues/58964
    Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
    
    
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Christian Brauner <brauner@kernel.org>