Forum | Documentation | Website | Blog

Skip to content
Snippets Groups Projects
  1. Jun 25, 2024
  2. Feb 15, 2024
  3. Jan 22, 2024
  4. Dec 21, 2023
  5. Oct 18, 2023
  6. Jul 14, 2023
    • Christian Brauner's avatar
      attr: block mode changes of symlinks · 5d1f903f
      Christian Brauner authored
      Changing the mode of symlinks is meaningless as the vfs doesn't take the
      mode of a symlink into account during path lookup permission checking.
      
      However, the vfs doesn't block mode changes on symlinks. This however,
      has lead to an untenable mess roughly classifiable into the following
      two categories:
      
      (1) Filesystems that don't implement a i_op->setattr() for symlinks.
      
          Such filesystems may or may not know that without i_op->setattr()
          defined, notify_change() falls back to simple_setattr() causing the
          inode's mode in the inode cache to be changed.
      
          That's a generic issue as this will affect all non-size changing
          inode attributes including ownership changes.
      
          Example: afs
      
      (2) Filesystems that fail with EOPNOTSUPP but change the mode of the
          symlink nonetheless.
      
          Some filesystems will happily update the mode of a symlink but still
          return EOPNOTSUPP. This is the biggest source of confusion for
          userspace.
      
          The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
          comes from filesystems that call posix_acl_chmod(), e.g., btrfs via
      
              if (!err && attr->ia_valid & ATTR_MODE)
                      err = posix_acl_chmod(idmap, dentry, inode->i_mode);
      
          Filesystems including btrfs don't implement i_op->set_acl() so
          posix_acl_chmod() will report EOPNOTSUPP.
      
          When posix_acl_chmod() is called, most filesystems will have
          finished updating the inode.
      
          Perversely, this has the consequences that this behavior may depend
          on two kconfig options and mount options:
      
          * CONFIG_POSIX_ACL={y,n}
          * CONFIG_${FSTYPE}_POSIX_ACL={y,n}
          * Opt_acl, Opt_noacl
      
          Example: btrfs, ext4, xfs
      
      The only way to change the mode on a symlink currently involves abusing
      an O_PATH file descriptor in the following manner:
      
              fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
      
              char path[PATH_MAX];
              snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
              chmod(path, 0000);
      
      But for most major filesystems with POSIX ACL support such as btrfs,
      ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
      the mode still updated due to the aforementioned posix_acl_chmod()
      nonsense.
      
      So, given that for all major filesystems this would fail with EOPNOTSUPP
      and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
      changes on symlinks we should just try and block mode changes on
      symlinks directly in the vfs and have a clean break with this nonsense.
      
      If this causes any regressions, we do the next best thing and fix up all
      filesystems that do return EOPNOTSUPP with the mode updated to not call
      posix_acl_chmod() on symlinks.
      
      But as usual, let's try the clean cut solution first. It's a simple
      patch that can be easily reverted. Not marking this for backport as I'll
      do that manually if we're reasonably sure that this works and there are
      no strong objections.
      
      We could block this in chmod_common() but it's more appropriate to do it
      notify_change() as it will also mean that we catch filesystems that
      change symlink permissions explicitly or accidently.
      
      Similar proposals were floated in the past as in [3] and [4] and again
      recently in [5]. There's also a couple of bugs about this inconsistency
      as in [6] and [7].
      
      Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
      Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
      Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
      Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
      Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
      Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
      Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17
      
       [7]
      Reviewed-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
      Suggested-by: default avatarChristoph Hellwig <hch@lst.de>
      Suggested-by: default avatarFlorian Weimer <fweimer@redhat.com>
      Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      5d1f903f
  7. Jul 13, 2023
  8. Mar 30, 2023
  9. Jan 19, 2023
    • Christian Brauner's avatar
      fs: port vfs{g,u}id helpers to mnt_idmap · 4d7ca409
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      4d7ca409
    • Christian Brauner's avatar
      fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap · e67fe633
      Christian Brauner authored
      Convert to struct mnt_idmap.
      Remove legacy file_mnt_user_ns() and mnt_user_ns().
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      e67fe633
    • Christian Brauner's avatar
      fs: port i_{g,u}id_{needs_}update() to mnt_idmap · 0dbe12f2
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      0dbe12f2
    • Christian Brauner's avatar
      fs: port privilege checking helpers to mnt_idmap · 9452e93e
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      9452e93e
    • Christian Brauner's avatar
      fs: port inode_owner_or_capable() to mnt_idmap · 01beba79
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig...
      01beba79
    • Christian Brauner's avatar
      fs: port xattr to mnt_idmap · 39f60c1c
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      39f60c1c
    • Christian Brauner's avatar
      fs: port ->permission() to pass mnt_idmap · 4609e1f1
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch...
      4609e1f1
    • Christian Brauner's avatar
      fs: port ->setattr() to pass mnt_idmap · c1632a0f
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@ls...>
      c1632a0f
  10. Jan 18, 2023
    • Christian Brauner's avatar
      fs: port vfs_*() helpers to struct mnt_idmap · abf08576
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed
      
       ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      abf08576
  11. Jan 11, 2023
  12. Oct 18, 2022
    • Christian Brauner's avatar
      attr: use consistent sgid stripping checks · ed5a7047
      Christian Brauner authored
      
      Currently setgid stripping in file_remove_privs()'s should_remove_suid()
      helper is inconsistent with other parts of the vfs. Specifically, it only
      raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
      inode isn't in the caller's groups and the caller isn't privileged over the
      inode although we require this already in setattr_prepare() and
      setattr_copy() and so all filesystem implement this requirement implicitly
      because they have to use setattr_{prepare,copy}() anyway.
      
      But the inconsistency shows up in setgid stripping bugs for overlayfs in
      xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
      generic/687). For example, we test whether suid and setgid stripping works
      correctly when performing various write-like operations as an unprivileged
      user (fallocate, reflink, write, etc.):
      
      echo "Test 1 - qa_user, non-exec file $verb"
      setup_testfile
      chmod a+rws $junk_file
      commit_and_check "$qa_user" "$verb" 64k 64k
      
      The test basically creates a file with 6666 permissions. While the file has
      the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
      regular filesystem like xfs what will happen is:
      
      sys_fallocate()
      -> vfs_fallocate()
         -> xfs_file_fallocate()
            -> file_modified()
               -> __file_remove_privs()
                  -> dentry_needs_remove_privs()
                     -> should_remove_suid()
                  -> __remove_privs()
                     newattrs.ia_valid = ATTR_FORCE | kill;
                     -> notify_change()
                        -> setattr_copy()
      
      In should_remove_suid() we can see that ATTR_KILL_SUID is raised
      unconditionally because the file in the test has S_ISUID set.
      
      But we also see that ATTR_KILL_SGID won't be set because while the file
      is S_ISGID it is not S_IXGRP (see above) which is a condition for
      ATTR_KILL_SGID being raised.
      
      So by the time we call notify_change() we have attr->ia_valid set to
      ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
      ATTR_KILL_SUID is set and does:
      
      ia_valid = attr->ia_valid |= ATTR_MODE
      attr->ia_mode = (inode->i_mode & ~S_ISUID);
      
      which means that when we call setattr_copy() later we will definitely
      update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.
      
      Now we call into the filesystem's ->setattr() inode operation which will
      end up calling setattr_copy(). Since ATTR_MODE is set we will hit:
      
      if (ia_valid & ATTR_MODE) {
              umode_t mode = attr->ia_mode;
              vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
              if (!vfsgid_in_group_p(vfsgid) &&
                  !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                      mode &= ~S_ISGID;
              inode->i_mode = mode;
      }
      
      and since the caller in the test is neither capable nor in the group of the
      inode the S_ISGID bit is stripped.
      
      But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
      has the consequence that neither the setgid nor the suid bits are stripped
      even though it should be stripped because the inode isn't in the caller's
      groups and the caller isn't privileged over the inode.
      
      If overlayfs is in the mix things become a bit more complicated and the bug
      shows up more clearly. When e.g., ovl_setattr() is hit from
      ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
      ATTR_KILL_SGID might be raised but because the check in notify_change() is
      questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
      stripped the S_ISGID bit isn't removed even though it should be stripped:
      
      sys_fallocate()
      -> vfs_fallocate()
         -> ovl_fallocate()
            -> file_remove_privs()
               -> dentry_needs_remove_privs()
                  -> should_remove_suid()
               -> __remove_privs()
                  newattrs.ia_valid = ATTR_FORCE | kill;
                  -> notify_change()
                     -> ovl_setattr()
                        // TAKE ON MOUNTER'S CREDS
                        -> ovl_do_notify_change()
                           -> notify_change()
                        // GIVE UP MOUNTER'S CREDS
           // TAKE ON MOUNTER'S CREDS
           -> vfs_fallocate()
              -> xfs_file_fallocate()
                 -> file_modified()
                    -> __file_remove_privs()
                       -> dentry_needs_remove_privs()
                          -> should_remove_suid()
                       -> __remove_privs()
                          newattrs.ia_valid = attr_force | kill;
                          -> notify_change()
      
      The fix for all of this is to make file_remove_privs()'s
      should_remove_suid() helper to perform the same checks as we already
      require in setattr_prepare() and setattr_copy() and have notify_change()
      not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
      first place because the caller must calculate the flags via
      should_remove_suid() anyway which would raise ATTR_KILL_SGID.
      
      While we're at it we move should_remove_suid() from inode.c to attr.c
      where it belongs with the rest of the iattr helpers. Especially since it
      returns ATTR_KILL_S{G,U}ID flags. We also rename it to
      setattr_should_drop_suidgid() to better reflect that it indicates both
      setuid and setgid bit removal and also that it returns attr flags.
      
      Running xfstests with this doesn't report any regressions. We should really
      try and use consistent checks.
      
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      ed5a7047
    • Christian Brauner's avatar
      attr: add setattr_should_drop_sgid() · 72ae017c
      Christian Brauner authored
      
      The current setgid stripping logic during write and ownership change
      operations is inconsistent and strewn over multiple places. In order to
      consolidate it and make more consistent we'll add a new helper
      setattr_should_drop_sgid(). The function retains the old behavior where
      we remove the S_ISGID bit unconditionally when S_IXGRP is set but also
      when it isn't set and the caller is neither in the group of the inode
      nor privileged over the inode.
      
      We will use this helper both in write operation permission removal such
      as file_remove_privs() as well as in ownership change operations.
      
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      72ae017c
    • Christian Brauner's avatar
      fs: move should_remove_suid() · e243e3f9
      Christian Brauner authored
      
      Move the helper from inode.c to attr.c. This keeps the the core of the
      set{g,u}id stripping logic in one place when we add follow-up changes.
      It is the better place anyway, since should_remove_suid() returns
      ATTR_KILL_S{G,U}ID flags.
      
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      e243e3f9
    • Christian Brauner's avatar
      attr: add in_group_or_capable() · 11c2a870
      Christian Brauner authored
      
      In setattr_{copy,prepare}() we need to perform the same permission
      checks to determine whether we need to drop the setgid bit or not.
      Instead of open-coding it twice add a simple helper the encapsulates the
      logic. We will reuse this helpers to make dropping the setgid bit during
      write operations more consistent in a follow up patch.
      
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      11c2a870
  13. Aug 08, 2022
    • David Howells's avatar
      vfs: Check the truncate maximum size in inode_newsize_ok() · e2ebff9c
      David Howells authored
      If something manages to set the maximum file size to MAX_OFFSET+1, this
      can cause the xfs and ext4 filesystems at least to become corrupt.
      
      Ordinarily, the kernel protects against userspace trying this by
      checking the value early in the truncate() and ftruncate() system calls
      calls - but there are at least two places that this check is bypassed:
      
       (1) Cachefiles will round up the EOF of the backing file to DIO block
           size so as to allow DIO on the final block - but this might push
           the offset negative. It then calls notify_change(), but this
           inadvertently bypasses the checking. This can be triggered if
           someone puts an 8EiB-1 file on a server for someone else to try and
           access by, say, nfs.
      
       (2) ksmbd doesn't check the value it is given in set_end_of_file_info()
           and then calls vfs_truncate() directly - which also bypasses the
           check.
      
      In both cases, it is potentially possible for a network filesystem to
      cause a disk filesystem to be corrupted: cachefiles in the client's
      cache filesystem; ksmbd in the server's filesystem.
      
      nfsd is okay as it checks the value, but we can then remove this check
      too.
      
      Fix this by adding a check to inode_newsize_ok(), as called from
      setattr_prepare(), thereby catching the issue as filesystems set up to
      perform the truncate with minimal opportunity for bypassing the new
      check.
      
      Fixes: 1f08c925 ("cachefiles: Implement backing file wrangling")
      Fixes: f4415848
      
       ("cifsd: add file operations")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reported-by: default avatarJeff Layton <jlayton@kernel.org>
      Tested-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarNamjae Jeon <linkinjeon@kernel.org>
      Cc: stable@kernel.org
      Acked-by: default avatarAlexander Viro <viro@zeniv.linux.org.uk>
      cc: Steve French <sfrench@samba.org>
      cc: Hyunchul Lee <hyc.lee@gmail.com>
      cc: Chuck Lever <chuck.lever@oracle.com>
      cc: Dave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e2ebff9c
  14. Jun 27, 2022
  15. Jun 26, 2022
    • Christian Brauner's avatar
      attr: port attribute changes to new types · b27c82e1
      Christian Brauner authored
      Now that we introduced new infrastructure to increase the type safety
      for filesystems supporting idmapped mounts port the first part of the
      vfs over to them.
      
      This ports the attribute changes codepaths to rely on the new better
      helpers using a dedicated type.
      
      Before this change we used to take a shortcut and place the actual
      values that would be written to inode->i_{g,u}id into struct iattr. This
      had the advantage that we moved idmappings mostly out of the picture
      early on but it made reasoning about changes more difficult than it
      should be.
      
      The filesystem was never explicitly told that it dealt with an idmapped
      mount. The transition to the value that needed to be stored in
      inode->i_{g,u}id appeared way too early and increased the probability of
      bugs in various codepaths.
      
      We know place the same value in struct iattr no matter if this is an
      idmapped mount or not. The vfs will only deal with type safe
      vfs{g,u}id_t. This makes it massively safer to perform permission checks
      as the type will tell us what checks we need to perform and what helpers
      we need to use.
      
      Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
      inode->i_{g,u}id since they are different types. Instead they need to
      use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
      vfs{g,u}id into the filesystem.
      
      The other nice effect is that filesystems like overlayfs don't need to
      care about idmappings explicitly anymore and can simply set up struct
      iattr accordingly directly.
      
      Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
      Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarSeth Forshee <sforshee@digitalocean.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      b27c82e1
    • Christian Brauner's avatar
      security: pass down mount idmapping to setattr hook · 0e363cf3
      Christian Brauner authored
      Before this change we used to take a shortcut and place the actual
      values that would be written to inode->i_{g,u}id into struct iattr. This
      had the advantage that we moved idmappings mostly out of the picture
      early on but it made reasoning about changes more difficult than it
      should be.
      
      The filesystem was never explicitly told that it dealt with an idmapped
      mount. The transition to the value that needed to be stored in
      inode->i_{g,u}id appeared way too early and increased the probability of
      bugs in various codepaths.
      
      We know place the same value in struct iattr no matter if this is an
      idmapped mount or not. The vfs will only deal with type safe
      vfs{g,u}id_t. This makes it massively safer to perform permission checks
      as the type will tell us what checks we need to perform and what helpers
      we need to use.
      
      Adapt the security_inode_setattr() helper to pass down the mount's
      idmapping to account for that change.
      
      Link: https://lore.kernel.org/r/20220621141454.2914719-8-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarSeth Forshee <sforshee@digitalocean.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      0e363cf3
    • Christian Brauner's avatar
      fs: port to iattr ownership update helpers · 35faf310
      Christian Brauner authored
      Earlier we introduced new helpers to abstract ownership update and
      remove code duplication. This converts all filesystems supporting
      idmapped mounts to make use of these new helpers.
      
      For now we always pass the initial idmapping which makes the idmapping
      functions these helpers call nops.
      
      This is done because we currently always pass the actual value to be
      written to i_{g,u}id via struct iattr. While this allowed us to treat
      the {g,u}id values in struct iattr as values that can be directly
      written to inode->i_{g,u}id it also increases the potential for
      confusion for filesystems.
      
      Now that we are have dedicated types to prevent this confusion we will
      ultimately only map the value from the idmapped mount into a filesystem
      value that can be written to inode->i_{g,u}id when the filesystem
      actually updates the inode. So pass down the initial idmapping until we
      finished that conversion at which point we pass down the mount's
      idmapping.
      
      No functional changes intended.
      
      Link: https://lore.kernel.org/r/20220621141454.2914719-6-brauner@kernel.org
      
      
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarSeth Forshee <sforshee@digitalocean.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      35faf310
  16. Jun 14, 2022
    • Christian Brauner's avatar
      fs: account for group membership · 168f9128
      Christian Brauner authored
      When calling setattr_prepare() to determine the validity of the
      attributes the ia_{g,u}id fields contain the value that will be written
      to inode->i_{g,u}id. This is exactly the same for idmapped and
      non-idmapped mounts and allows callers to pass in the values they want
      to see written to inode->i_{g,u}id.
      
      When group ownership is changed a caller whose fsuid owns the inode can
      change the group of the inode to any group they are a member of. When
      searching through the caller's groups we need to use the gid mapped
      according to the idmapped mount otherwise we will fail to change
      ownership for unprivileged users.
      
      Consider a caller running with fsuid and fsgid 1000 using an idmapped
      mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
      owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
      idmapped mount.
      
      The caller now requests the gid of the file to be changed to 1000 going
      through the idmapped mount. In the vfs we will immediately map the
      requested gid to the value that will need to be written to inode->i_gid
      and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
      1000 we place 65534 in attr->ia_gid.
      
      When we check whether the caller is allowed to change group ownership we
      first validate that their fsuid matches the inode's uid. The
      inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
      Since the caller's fsuid is 1000 we pass the check.
      
      We now check whether the caller is allowed to change inode->i_gid to the
      requested gid by calling in_group_p(). This will compare the passed in
      gid to the caller's fsgid and search the caller's additional groups.
      
      Since we're dealing with an idmapped mount we need to pass in the gid
      mapped according to the idmapped mount. This is akin to checking whether
      a caller is privileged over the future group the inode is owned by. And
      that needs to take the idmapped mount into account. Note, all helpers
      are nops without idmapped mounts.
      
      New regression test sent to xfstests.
      
      Link: https://github.com/lxc/lxd/issues/10537
      Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org
      Fixes: 2f221d6f
      
       ("attr: handle idmapped mounts")
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org # 5.15+
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarSeth Forshee <sforshee@digitalocean.com>
      Signed-off-by: default avatarChristian Brauner (Microsoft) <brauner@kernel.org>
      168f9128
  17. Nov 17, 2021
    • Christian Brauner's avatar
      fs: handle circular mappings correctly · 96821970
      Christian Brauner authored
      When calling setattr_prepare() to determine the validity of the attributes the
      ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id.
      When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id
      matches the current {g,u}id attribute the attribute change is allowed.
      
      The value in ia_{g,u}id does already account for idmapped mounts and will have
      taken the relevant idmapping into account. So in order to verify that the
      {g,u}id attribute isn't changed we simple need to compare the ia_{g,u}id value
      against the inode's i_{g,u}id value.
      
      This only has any meaning for idmapped mounts as idmapping helpers are
      idempotent without them. And for idmapped mounts this really only has a meaning
      when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped
      to id 1001 and id 1001 is mapped to id 1000. Such ciruclar mappings can e.g. be
      useful when sharing the same home directory between multiple users at the same
      time.
      
      As an e...
      96821970
  18. Aug 13, 2021
  19. Jan 24, 2021
    • Christian Brauner's avatar
      ima: handle idmapped mounts · a2d2329e
      Christian Brauner authored
      IMA does sometimes access the inode's i_uid and compares it against the
      rules' fowner. Enable IMA to handle idmapped mounts by passing down the
      mount's user namespace. We simply make use of the helpers we introduced
      before. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com
      
      
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      a2d2329e
    • Christian Brauner's avatar
      fs: make helpers idmap mount aware · 549c7297
      Christian Brauner authored
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested we simply extend the exisiting inode method instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesnt't leave us with additional inode methods.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      549c7297
    • Christian Brauner's avatar
      commoncap: handle idmapped mounts · 71bc356f
      Christian Brauner authored
      When interacting with user namespace and non-user namespace aware
      filesystem capabilities the vfs will perform various security checks to
      determine whether or not the filesystem capabilities can be used by the
      caller, whether they need to be removed and so on. The main
      infrastructure for this resides in the capability codepaths but they are
      called through the LSM security infrastructure even though they are not
      technically an LSM or optional. This extends the existing security hooks
      security_inode_removexattr(), security_inode_killpriv(),
      security_inode_getsecurity() to pass down the mount's user namespace and
      makes them aware of idmapped mounts.
      
      In order to actually get filesystem capabilities from disk the
      capability infrastructure exposes the get_vfs_caps_from_disk() helper.
      For user namespace aware filesystem capabilities a root uid is stored
      alongside the capabilities.
      
      In order to determine whether the caller can make use of the filesystem
      capability or whether it needs to be ignored it is translated according
      to the superblock's user namespace. If it can be translated to uid 0
      according to that id mapping the caller can use the filesystem
      capabilities stored on disk. If we are accessing the inode that holds
      the filesystem capabilities through an idmapped mount we map the root
      uid according to the mount's user namespace. Afterwards the checks are
      identical to non-idmapped mounts: reading filesystem caps from disk
      enforces that the root uid associated with the filesystem capability
      must have a mapping in the superblock's user namespace and that the
      caller is either in the same user namespace or is a descendant of the
      superblock's user namespace. For filesystems that are mountable inside
      user namespace the caller can just mount the filesystem and won't
      usually need to idmap it. If they do want to idmap it they can create an
      idmapped mount and mark it with a user namespace they created and which
      is thus a descendant of s_user_ns. For filesystems that are not
      mountable inside user namespaces the descendant rule is trivially true
      because the s_user_ns will be the initial user namespace.
      
      If the initial user namespace is passed nothing changes so non-idmapped
      mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarJames Morris <jamorris@linux.microsoft.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      71bc356f
    • Christian Brauner's avatar
      attr: handle idmapped mounts · 2f221d6f
      Christian Brauner authored
      When file attributes are changed most filesystems rely on the
      setattr_prepare(), setattr_copy(), and notify_change() helpers for
      initialization and permission checking. Let them handle idmapped mounts.
      If the inode is accessed through an idmapped mount map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Helpers that perform checks on the ia_uid and ia_gid fields in struct
      iattr assume that ia_uid and ia_gid are intended values and have already
      been mapped correctly at the userspace-kernelspace boundary as we
      already do today. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      2f221d6f
    • Christian Brauner's avatar
      inode: make init and permission helpers idmapped mount aware · 21cb47be
      Christian Brauner authored
      The inode_owner_or_capable() helper determines whether the caller is the
      owner of the inode or is capable with respect to that inode. Allow it to
      handle idmapped mounts. If the inode is accessed through an idmapped
      mount it according to the mount's user namespace. Afterwards the checks
      are identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Similarly, allow the inode_init_owner() helper to handle idmapped
      mounts. It initializes a new inode on idmapped mounts by mapping the
      fsuid and fsgid of the caller from the mount's user namespace. If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      ...
      21cb47be
    • Christian Brauner's avatar
      namei: make permission helpers idmapped mount aware · 47291baa
      Christian Brauner authored
      The two helpers inode_permission() and generic_permission() are used by
      the vfs to perform basic permission checking by verifying that the
      caller is privileged over an inode. In order to handle idmapped mounts
      we extend the two helpers with an additional user namespace argument.
      On idmapped mounts the two helpers will make sure to map the inode
      according to the mount's user namespace and then peform identical
      permission checks to inode_permission() and generic_permission(). If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJames Morris <jamorris@linux.microsoft.com>
      Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      47291baa