  1. Jul 09, 2024
  2. Jul 03, 2024
  3. Apr 26, 2024
  4. Apr 24, 2024
  5. Apr 22, 2024
  6. Apr 09, 2024
  7. Feb 23, 2024
  8. Feb 22, 2024
  9. Dec 20, 2023
  10. Oct 25, 2023
    • mempolicy: alloc_pages_mpol() for NUMA policy without vma · ddc1a5cb
      Hugh Dickins authored
      Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
      allocation.  alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
      principal actor for passing mempolicy choice down to __alloc_pages(),
      rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
      
      vma_alloc_folio() and alloc_pages() remain, but as wrappers around
      alloc_pages_mpol().  alloc_pages_bulk_*() untouched, except to provide the
      additional args to policy_nodemask(), which subsumes policy_node(). 
      Cleanup throughout, cutting out some unhelpful "helpers".
      
      It would all be much simpler without MPOL_INTERLEAVE, but that adds a
      dynamic to the constant mpol: complicated by v3.6 commit 09c231cb
      ("tmpfs: distribute interleave better across nodes"), which added ino bias
      to the interleave, hidden from mm/mempolicy.c until this commit.
      
      Hence "ilx" throughout, the "interleave index".  Originally I thought it
      could be done just with nid, but that's wrong: the nodemask may come from
      the shared policy layer below a shmem vma, or it may come from the task
      layer above a shmem vma; and without the final nodemask the nodeid cannot
      be decided.  And how ilx is applied depends also on page order.
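
A minimal sketch of why the index, rather than a node id, has to be passed down. It assumes a 64-bit mask with one bit per node; the `toy_` helpers and the bias-plus-offset formula are illustrative assumptions, not the functions in mm/mempolicy.c.

```c
#include <assert.h>
#include <stdint.h>

/* Count the nodes present in a mask (one bit per node id). */
static unsigned int toy_mask_weight(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)  /* clear the lowest set bit each round */
        n++;
    return n;
}

/*
 * Hypothetical index computation: a per-file bias plus the page offset,
 * scaled down by the allocation order so that one large folio advances
 * the index by a single step.
 */
static unsigned long toy_interleave_index(unsigned long bias,
                                          unsigned long pgoff,
                                          unsigned int order)
{
    return bias + (pgoff >> order);
}

/* Pick the (ilx mod weight)-th node of the final nodemask. */
static int toy_interleave_node(uint64_t nodemask, unsigned long ilx)
{
    unsigned int weight = toy_mask_weight(nodemask);
    unsigned int target, nid;

    assert(weight > 0);             /* an empty mask has no valid answer */
    target = (unsigned int)(ilx % weight);
    for (nid = 0; nid < 64; nid++) {
        if (!(nodemask & (UINT64_C(1) << nid)))
            continue;               /* node not in the mask */
        if (target-- == 0)
            return (int)nid;
    }
    return -1;                      /* not reached when weight > 0 */
}
```

With mask 0xB (nodes 0, 1, 3) index 2 selects node 3, while with mask 0x6 (nodes 1, 2) the same index selects node 1. The node can therefore only be chosen once the final nodemask is known.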
      
      The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
      with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
      passed down from vma-less alloc_pages() is also used as a hint not to use
      THP-style hugepage allocation - to avoid the overhead of a hugepage arg
      (though I don't understand why we never just added a GFP bit for THP - if
      it actually needs a different allocation strategy from other pages of the
      same order).  vma_alloc_folio() still carries its hugepage arg here, but
      it is not used, and should be removed when agreed.
      
      get_vma_policy() no longer allows a NULL vma: over time I believe we've
      eradicated all the places which used to need it, e.g. swapoff and madvise
      used to pass a NULL vma to read_swap_cache_async(), but now know the vma.
      
      [hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()]
        Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com
      Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com
      Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com
      
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  11. Oct 18, 2023
  12. Aug 18, 2023
  13. Aug 15, 2023
    • sysctl: Add a size arg to __register_sysctl_table · bff97cf1
      Joel Granados authored
      
      We make these changes in order to prepare __register_sysctl_table and
      its callers for when we remove the sentinel element (empty element at
      the end of ctl_table arrays). We don't actually remove any sentinels in
      this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
      is available when the removal occurs.
      
      We add a table_size argument to __register_sysctl_table and adjust
      callers, all of which pass ctl_table pointers and need an explicit call
      to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
      order to forward the size of the array pointer received from the network
      register calls.
      
      The new table_size argument does not yet have any effect in the
      init_header call which is still dependent on the sentinel's presence.
      table_size *does* however drive the `kzalloc` allocation in
      __register_sysctl_table with no adverse effects as the allocated memory
      is either one element greater than the calculated ctl_table array (for
      the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
      of the calculated ctl_table array (for the call from sysctl_net.c and
      register_sysctl). This approach will allow us to "just" remove the
      sentinel without further changes to __register_sysctl_table as
      table_size will represent the exact size for all the callers at that
      point.
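
A minimal sketch of the difference between sentinel-terminated and size-driven table walks. `struct toy_ctl_entry` and the `toy_` functions are hypothetical stand-ins, not the kernel's ctl_table API; only the `ARRAY_SIZE` idiom is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for a ctl_table entry. */
struct toy_ctl_entry {
    const char *procname;   /* NULL marks the sentinel */
    int *data;
};

static int toy_msg_max, toy_msg_default;

/* Two real entries followed by the empty sentinel element. */
static const struct toy_ctl_entry toy_table[] = {
    { "msg_max", &toy_msg_max },
    { "msg_default", &toy_msg_default },
    { NULL, NULL },
};

/* Old style: the callee finds the end by scanning for the sentinel. */
static size_t toy_count_by_sentinel(const struct toy_ctl_entry *table)
{
    size_t n = 0;

    while (table[n].procname)
        n++;
    return n;
}

/*
 * Size-driven style: the caller passes the element count, because the
 * callee only sees a pointer and cannot apply ARRAY_SIZE itself.
 * Skipping NULL names lets it cope with a size that still includes the
 * sentinel as well as with an exact size.
 */
static size_t toy_count_by_size(const struct toy_ctl_entry *table,
                                size_t table_size)
{
    size_t i, n = 0;

    for (i = 0; i < table_size; i++)
        if (table[i].procname)
            n++;
    return n;
}
```

Here `ARRAY_SIZE(toy_table)` is 3, one more than the number of real entries, which corresponds to the "one element greater" case described above. Passing `ARRAY_SIZE(toy_table) - 1` corresponds to the exact-size case.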
      
      Signed-off-by: Joel Granados <j.granados@samsung.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  14. Jul 24, 2023
  15. Jul 11, 2023
  16. Feb 09, 2023
  17. Jan 27, 2023
  18. Jan 19, 2023
    • fs: port ->permission() to pass mnt_idmap · 4609e1f1
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done, all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the
      two, eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
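
A minimal sketch of the type-safety argument. The `toy_` types and the offset arithmetic are assumptions made for illustration and do not reflect how real idmappings are computed; the point is only that a dedicated type turns a silent argument swap into a compile error.

```c
#include <assert.h>

/* Toy namespace: a single numeric offset (not how real idmappings work). */
struct toy_user_ns {
    long offset;
};

/* Dedicated wrapper type, playing the role the text gives struct mnt_idmap. */
struct toy_mnt_idmap {
    struct toy_user_ns ns;
};

/* Old shape: two parameters of the same type, so swapped arguments compile. */
static long toy_map_two_ns(const struct toy_user_ns *mnt_ns,
                           const struct toy_user_ns *fs_ns, long id)
{
    return id - fs_ns->offset + mnt_ns->offset;
}

/*
 * New shape: the mount side has its own type. Passing the filesystem
 * namespace in its place is rejected by the compiler as an incompatible
 * pointer type instead of producing a wrong id at run time.
 */
static long toy_map_idmap(const struct toy_mnt_idmap *idmap,
                          const struct toy_user_ns *fs_ns, long id)
{
    return id - fs_ns->offset + idmap->ns.offset;
}
```

With a filesystem offset of 0 and a mount offset of 1000, `toy_map_two_ns(&mnt, &fs, 5)` yields 1005, while the swapped call `toy_map_two_ns(&fs, &mnt, 5)` compiles cleanly and yields -995. The equivalent swap with `toy_map_idmap()` does not build.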
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch...
    • fs: port ->create() to pass mnt_idmap · 6c960e68
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done, all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the
      two, eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  19. Jan 18, 2023
    • fs: port vfs_*() helpers to struct mnt_idmap · abf08576
      Christian Brauner authored
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done, all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the
      two, eliminating this class of bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  20. Dec 11, 2022
  21. Dec 05, 2022
    • ipc/sem: Fix dangling sem_array access in semtimedop race · b52be557
      Jann Horn authored
      When __do_semtimedop() goes to sleep because it has to wait for a
      semaphore value becoming zero or becoming bigger than some threshold, it
      links the on-stack sem_queue to the sem_array, then goes to sleep
      without holding a reference on the sem_array.
      
      When __do_semtimedop() comes back out of sleep, one of two things must
      happen:
      
       a) We prove that the on-stack sem_queue has been disconnected from the
          (possibly freed) sem_array, making it safe to return from the stack
          frame that the sem_queue exists in.
      
       b) We stabilize our reference to the sem_array, lock the sem_array, and
          detach the sem_queue from the sem_array ourselves.
      
      sem_array has RCU lifetime, so for case (b), the reference can be
      stabilized inside an RCU read-side critical section by locklessly
      checking whether the sem_queue is still connected to the sem_array.
      
      However, the current code does the lockless check on sem_queue before
      starting an RCU read-side critical section, so the result of the
      lockless check immediately becomes useless.
      
      Fix it by doing rcu_read_lock() before the lockless check.  Now RCU
      ensures that if we observe the object being on our queue, the object
      can't be freed until rcu_read_unlock().
      
      This bug is only hittable on kernel builds with full preemption support
      (either CONFIG_PREEMPT or PREEMPT_DYNAMIC with preempt=full).
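
The ordering problem can be modelled in a single thread. This is a toy model under loud assumptions: the `toy_` functions only imitate the rule "an object is not freed while a reader is inside a critical section", they are not the kernel's RCU primitives, and the removal call stands in for a preemption window on a fully preemptible build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for sem_array and the on-stack sem_queue. */
struct toy_array {
    bool freed;                     /* set instead of really freeing memory */
};

struct toy_queue {
    struct toy_array *linked_to;    /* NULL once disconnected */
};

static int toy_readers;                 /* active read-side sections */
static struct toy_array *toy_deferred;  /* object waiting for readers to leave */

static void toy_rcu_read_lock(void)
{
    toy_readers++;
}

static void toy_rcu_read_unlock(void)
{
    assert(toy_readers > 0);
    if (--toy_readers == 0 && toy_deferred) {
        toy_deferred->freed = true;     /* the deferred free happens now */
        toy_deferred = NULL;
    }
}

/* Remover: unlink the queue, then free at once only if no reader is active. */
static void toy_remove_array(struct toy_queue *q, struct toy_array *a)
{
    q->linked_to = NULL;
    if (toy_readers == 0)
        a->freed = true;
    else
        toy_deferred = a;
}

/* Returns true if the woken task would go on to use a freed array. */
static bool toy_wakeup(bool lock_before_check)
{
    struct toy_array a = { false };
    struct toy_queue q = { &a };
    struct toy_array *seen;
    bool uses_freed;

    if (lock_before_check)
        toy_rcu_read_lock();
    seen = q.linked_to;             /* the lockless check */
    toy_remove_array(&q, &a);       /* removal lands right after the check */
    if (!lock_before_check)
        toy_rcu_read_lock();        /* too late: the check result is stale */
    uses_freed = seen != NULL && seen->freed;
    toy_rcu_read_unlock();
    return uses_freed;
}
```

In this model `toy_wakeup(false)` reports a use of a freed array, whereas `toy_wakeup(true)` does not, because the free is deferred until the reader leaves its critical section.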
      
      Fixes: 370b262c ("ipc/sem: avoid idr tree lookup for interrupted semop")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. Nov 22, 2022
    • ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz authored
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
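
A minimal sketch of the forwarding pattern, assuming a wrapper layer that keeps a pointer to the underlying operations. The `toy_` structures and the `lock_refs` counter are hypothetical and only imitate per-vma state that the underlying open and close callbacks maintain.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma;

/* Hypothetical operations table with just the two callbacks of interest. */
struct toy_vm_ops {
    void (*open)(struct toy_vma *vma);
    void (*close)(struct toy_vma *vma);
};

struct toy_vma {
    const struct toy_vm_ops *inner_ops; /* underlying ops, may be NULL */
    int lock_refs;                      /* state the inner ops maintain */
};

/* Underlying layer: keeps its per-vma state in step with open and close. */
static void toy_inner_open(struct toy_vma *vma)
{
    vma->lock_refs++;
}

static void toy_inner_close(struct toy_vma *vma)
{
    vma->lock_refs--;
}

static const struct toy_vm_ops toy_inner_ops = {
    toy_inner_open,
    toy_inner_close,
};

/* Wrapper layer: always forward when the underlying callback exists. */
static void toy_wrapper_open(struct toy_vma *vma)
{
    if (vma->inner_ops && vma->inner_ops->open)
        vma->inner_ops->open(vma);
    /* the wrapper's own bookkeeping would follow here */
}

static void toy_wrapper_close(struct toy_vma *vma)
{
    /* the wrapper's own bookkeeping would come first here */
    if (vma->inner_ops && vma->inner_ops->close)
        vma->inner_ops->close(vma);
}
```

If the wrapper skipped the forwarding, `lock_refs` would never change on open or close, which is the kind of stale per-vma state the text describes.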
      
      Link: h...
  23. Oct 28, 2022
  24. Oct 03, 2022
  25. Sep 26, 2022
  26. Sep 12, 2022
  27. Jul 19, 2022
  28. Jul 17, 2022
  29. Jun 22, 2022
    • ipc: Free mq_sysctls if ipc namespace creation failed · db7cfc38
      Alexey Gladkov authored
      
      Dmitry Vyukov pointed out that if setup_ipc_sysctls() fails, mq_sysctls
      must be freed before returning.
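
A minimal sketch of the unwinding rule stated above: when a later setup step fails, every earlier step is undone before returning. All `toy_` names and the `toy_live` counter are hypothetical; the real functions in ipc/namespace.c are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

static int toy_live;    /* number of resources currently held */

/* Each setup step either fails cleanly or takes one resource. */
static bool toy_setup_mq(bool fail)
{
    if (fail)
        return false;
    toy_live++;
    return true;
}

static bool toy_setup_ipc(bool fail)
{
    if (fail)
        return false;
    toy_live++;
    return true;
}

static void toy_retire_mq(void)
{
    toy_live--;
}

/* Returns 0 on success, -1 on failure with nothing left allocated. */
static int toy_create_ns(bool mq_fails, bool ipc_fails)
{
    if (!toy_setup_mq(mq_fails))
        goto out;
    if (!toy_setup_ipc(ipc_fails))
        goto undo_mq;   /* the step that must not be skipped */
    return 0;

undo_mq:
    toy_retire_mq();
out:
    return -1;
}
```

Calling `toy_create_ns(false, true)` returns -1 and leaves `toy_live` at 0. Jumping straight to `out` on the second failure would leave it at 1, which is the leak pattern in the report below.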
      
      executing program
      BUG: memory leak
      unreferenced object 0xffff888112fc9200 (size 512):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          ef d3 60 85 ff ff ff ff 0c 9b d2 12 81 88 ff ff  ..`.............
          04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff814b6eb3>] kmemdup+0x23/0x50 mm/util.c:129
          [<ffffffff82219a9b>] kmemdup include/linux/fortify-string.h:456 [inline]
          [<ffffffff82219a9b>] setup_mq_sysctls+0x4b/0x1c0 ipc/mq_sysctl.c:89
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      BUG: memory leak
      unreferenced object 0xffff888112fd5f00 (size 256):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          00 92 fc 12 81 88 ff ff 00 00 00 00 01 00 00 00  ................
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff816fea1b>] kmalloc include/linux/slab.h:605 [inline]
          [<ffffffff816fea1b>] kzalloc include/linux/slab.h:733 [inline]
          [<ffffffff816fea1b>] __register_sysctl_table+0x7b/0x7f0 fs/proc/proc_sysctl.c:1344
          [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      BUG: memory leak
      unreferenced object 0xffff888112fbba00 (size 256):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          78 ba fb 12 81 88 ff ff 00 00 00 00 01 00 00 00  x...............
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline]
          [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline]
          [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline]
          [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline]
          [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373
          [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      BUG: memory leak
      unreferenced object 0xffff888112fbb900 (size 256):
        comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s)
        hex dump (first 32 bytes):
          78 b9 fb 12 81 88 ff ff 00 00 00 00 01 00 00 00  x...............
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline]
          [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline]
          [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline]
          [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline]
          [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373
          [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112
          [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline]
          [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91
          [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90
          [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
          [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165
          [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline]
          [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline]
          [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234
          [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Reported-by: <syzbot+b4b0d1b35442afbf6fd2@syzkaller.appspotmail.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/000000000000f5004705e1db8bad@google.com
      Link: https://lkml.kernel.org/r/20220622200729.2639663-1-legion@kernel.org
      
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  30. May 09, 2022
  31. May 03, 2022