  1. Jul 15, 2024
    • mm/memcg: alignment memcg_data define condition · a52c6330
      Alex Shi (Tencent) authored
      commit 21c690a3 ("mm: introduce slabobj_ext to support slab object
      extensions") changed the folio/page->memcg_data define condition from
      MEMCG to SLAB_OBJ_EXT. This change leaves memcg_data exposed when !MEMCG.

      As Vlastimil Babka suggested, add _unused_slab_obj_exts to the SLAB_MATCH
      check for slab.obj_exts when !MEMCG. That resolves the field-matching
      issue and cleans up the feature logic.
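
      A rough userspace sketch of the SLAB_MATCH idea referenced above
      (compile-time field-offset matching between two overlaid structs); the
      struct layouts and names are simplified stand-ins, not the kernel's:

      /* Build with any C11 compiler; the assertion fails at compile time if
       * the two fields ever drift apart. */
      #include <stddef.h>
      #include <assert.h>

      struct fake_page {
              unsigned long flags;
              unsigned long _unused_slab_obj_exts; /* placeholder for slab.obj_exts */
      };

      struct fake_slab {
              unsigned long flags;
              unsigned long obj_exts;
      };

      #define SLAB_MATCH(pg, sl) \
              static_assert(offsetof(struct fake_page, pg) == \
                            offsetof(struct fake_slab, sl), \
                            "offset mismatch: " #pg " vs " #sl)

      SLAB_MATCH(_unused_slab_obj_exts, obj_exts);

      int main(void) { return 0; }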
      
      Signed-off-by: Alex Shi (Tencent) <alexs@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yoann Congal <yoann.congal@smile.fr>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      a52c6330
  2. Jul 11, 2024
  3. Jul 09, 2024
    • mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio · f708f697
      Miaohe Lin authored
      A kernel crash was observed when migrating hugetlb folio:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000008
      PGD 0 P4D 0
      Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66
      RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0
      RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097
      RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0
      RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868
      R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0
      R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001
      FS:  00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       __folio_migrate_mapping+0x59e/0x950
       __migrate_folio.con...
      f708f697
    • mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio() · 5596d9e8
      Miaohe Lin authored
      There is a potential race between __update_and_free_hugetlb_folio() and
      try_memory_failure_hugetlb():
      
       CPU1					CPU2
       __update_and_free_hugetlb_folio	try_memory_failure_hugetlb
      					 folio_test_hugetlb
      					  -- It's still hugetlb folio.
        folio_clear_hugetlb_hwpoison
        					  spin_lock_irq(&hugetlb_lock);
      					   __get_huge_page_for_hwpoison
      					    folio_set_hugetlb_hwpoison
      					  spin_unlock_irq(&hugetlb_lock);
        spin_lock_irq(&hugetlb_lock);
        __folio_clear_hugetlb(folio);
         -- Hugetlb flag is cleared but too late.
        spin_unlock_irq(&hugetlb_lock);
      
      When the above race occurs, the raw error page info is leaked.  Even
      worse, the raw error pages won't have the hwpoisoned flag set and will
      reach the pcplists/buddy allocator.  Fix this issue by deferring
      folio_clear_hugetlb_hwpoison() until __folio_clear_hugetlb() is done, so
      that all raw error pages will have the hwpoisoned flag set.
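
      A loose userspace analogy of the ordering invariant the fix establishes,
      with made-up names and a pthread mutex standing in for hugetlb_lock; this
      is an illustration, not the kernel code:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static bool is_hugetlb = true;   /* analog of the hugetlb flag */
      static bool hwpoison_record;     /* analog of the per-folio hwpoison info */

      /* analog of try_memory_failure_hugetlb(): only records poison while the
       * object still tests as hugetlb, checked under the lock */
      static void *poison_thread(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&lock);
              if (is_hugetlb)
                      hwpoison_record = true;
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      /* analog of the fixed __update_and_free_hugetlb_folio(): clear the
       * hugetlb marker first and only then consume the poison record, both
       * under the lock, so a racing poison_thread() either records before we
       * look or is refused -- nothing can be recorded and then silently lost */
      static void free_path(void)
      {
              pthread_mutex_lock(&lock);
              is_hugetlb = false;
              if (hwpoison_record) {
                      printf("poison record handled before free\n");
                      hwpoison_record = false;
              }
              pthread_mutex_unlock(&lock);
      }

      int main(void)
      {
              pthread_t t;

              pthread_create(&t, NULL, poison_thread, NULL);
              free_path();
              pthread_join(t, NULL);
              return 0;
      }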
      
      Link: https://lkml.kernel.org/r/20240708025127.107713-1-linmiaohe@huawei.com
      Fixes: 32c87719 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5596d9e8
    • filemap: replace pte_offset_map() with pte_offset_map_nolock() · 24be02a4
      ZhangPeng authored
      The vmf->ptl in filemap_fault_recheck_pte_none() is still the one set by
      handle_pte_fault().  But at the same time, we have done pte_unmap(vmf->pte).
      After pte_unmap(vmf->pte) and rcu_read_unlock(), the page table may be
      racily changed and vmf->ptl may fail to protect the actual page table.
      Fix this by replacing pte_offset_map() with pte_offset_map_nolock().

      As David said, the PTL pointer might be stale, so if we continue to use
      it in filemap_fault_recheck_pte_none(), it might trigger a UAF.  Also, if
      taking the PTL fails, the issue fixed by commit 58f327f2 ("filemap: avoid
      unnecessary major faults in filemap_fault()") might reappear.
      
      Link: https://lkml.kernel.org/r/20240313012913.2395414-1-zhangpeng362@huawei.com
      Fixes: 58f327f2 ("filemap: avoid unnecessary major faults in filemap_fault()")
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      24be02a4
  4. Jul 08, 2024
  5. Jul 06, 2024
    • mm: fix crashes from deferred split racing folio migration · be9581ea
      Hugh Dickins authored
      Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
      flags when freeing, yet the flags shown are not bad: PG_locked had been
      set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
      deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
      symptoms implying double free by deferred split and large folio migration.
      
      6.7 commit 9bcef597 ("mm: memcg: fix split queue list crash when large
      folio migration") was right to fix the memcg-dependent locking broken in
      85ce2c51 ("memcontrol: only transfer the memcg data for migration"),
      but missed a subtlety of deferred_split_scan(): it moves folios to its own
      local list to work on them without split_queue_lock, during which time
      folio->_deferred_list is not empty, but even the "right" lock does nothing
      to secure the folio and the list it is on.
      
      Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
      folio_migrate_mapping() can avoid the race by calling
      folio_undo_large_rmappable() while the old folio's reference count is
      temporarily frozen to 0 - adding such a freeze in the !mapping case too
      (originally, folio lock and unmapping and no swap cache left an anon folio
      unreachable, so no freezing was needed there: but the deferred split queue
      offers a way to reach it).
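
      A small userspace sketch of the refcount-freeze idea relied on here, using
      C11 atomics and illustrative names (not the kernel's folio helpers):

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      static atomic_int refcount = 1;

      /* analog of folio_ref_freeze(): succeed only if the count equals the
       * expected value, atomically replacing it with 0 */
      static bool ref_freeze(int expected)
      {
              int old = expected;

              return atomic_compare_exchange_strong(&refcount, &old, 0);
      }

      /* analog of folio_try_get(): take a reference unless the count is zero */
      static bool try_get(void)
      {
              int old = atomic_load(&refcount);

              while (old != 0)
                      if (atomic_compare_exchange_weak(&refcount, &old, old + 1))
                              return true;
              return false;
      }

      int main(void)
      {
              if (ref_freeze(1)) {
                      /* safe window: try_get() must fail while the count is 0 */
                      printf("frozen; try_get() -> %d\n", try_get());
                      atomic_store(&refcount, 1);  /* analog of unfreeze */
              }
              return 0;
      }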
      
      Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
      Fixes: 9bcef597 ("mm: memcg: fix split queue list crash when large folio migration")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be9581ea
    • mm: gup: stop abusing try_grab_folio · f442fa61
      Yang Shi authored
      A kernel warning was reported when pinning folio in CMA memory when
      launching SEV virtual machine.  The splat looks like:
      
      [  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
      [  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
      [  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
      [  464.325515] Call Trace:
      [  464.325520]  <TASK>
      [  464.325523]  ? __get_user_pages+0x423/0x520
      [  464.325528]  ? __warn+0x81/0x130
      [  464.325536]  ? __get_user_pages+0x423/0x520
      [  464.325541]  ? report_bug+0x171/0x1a0
      [  464.325549]  ? handle_bug+0x3c/0x70
      [  464.325554]  ? exc_invalid_op+0x17/0x70
      [  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
      [  464.325567]  ? __get_user_pages+0x423/0x520
      [  464.325575]  __gup_longterm_locked+0x212/0x7a0
      [  464.325583]  internal_get_user_pages_fast+0xfb/0x190
      [  464.325590]  pin_user_pages_fast+0x47/0x60
      [  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
      [  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]
      
      Per the analysis done by yangge, when starting the SEV virtual machine, it
      calls pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory.  But
      the page is in a CMA area, so fast GUP fails and falls back to the slow
      path due to the longterm pinnable check in try_grab_folio().

      The slow path tries to pin the pages and then migrate them out of the CMA
      area.  But the slow path also uses try_grab_folio() to pin the page, so it
      fails due to the same check, and the above warning is triggered.

      In addition, try_grab_folio() is supposed to be used in the fast path, and
      it elevates the folio refcount with an add-ref-unless-zero operation.  In
      the slow path we are guaranteed to have at least one stable reference, so
      a simple atomic add can be used.  The performance difference should be
      trivial, but the misuse is confusing and misleading.

      Rename try_grab_folio() to try_grab_folio_fast(), and try_grab_page() to
      try_grab_folio(), and use them in the proper paths.  This solves both the
      abuse and the kernel warning.

      The proper naming makes their use cases clearer and should prevent future
      abuse.
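
      A rough userspace illustration of the two reference-taking styles
      contrasted above, with illustrative names rather than the kernel's
      helpers:

      #include <stdatomic.h>
      #include <stdbool.h>

      static atomic_int refs = 1;

      /* fast-path style: must fail if the count has already dropped to zero,
       * because the object may be concurrently freed */
      static bool get_unless_zero(void)
      {
              int old = atomic_load(&refs);

              while (old != 0)
                      if (atomic_compare_exchange_weak(&refs, &old, old + 1))
                              return true;
              return false;
      }

      /* slow-path style: the caller already holds a stable reference, so a
       * plain atomic add is enough */
      static void get_stable(void)
      {
              atomic_fetch_add(&refs, 1);
      }

      int main(void)
      {
              get_stable();                   /* ok: we hold the initial reference */
              return get_unless_zero() ? 0 : 1;
      }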
      
      peterx said:
      
      : The user will see the pin fails, for gup-slow it further triggers the WARN
      : right below that failure (as in the original report):
      : 
      :         folio = try_grab_folio(page, page_increm - 1,
      :                                 foll_flags);
      :         if (WARN_ON_ONCE(!folio)) { <------------------------ here
      :                 /*
      :                         * Release the 1st page ref if the
      :                         * folio is problematic, fail hard.
      :                         */
      :                 gup_put_folio(page_folio(page), 1,
      :                                 foll_flags);
      :                 ret = -EFAULT;
      :                 goto out;
      :         }
      
      [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/
      
      [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast]
        Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com
      Fixes: 57edfcfd ("mm/gup: accelerate thp gup even for "pages != NULL"")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: yangge <yangge1116@126.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f442fa61
  6. Jul 04, 2024
    • mm, slab: move allocation tagging code in the alloc path into a hook · 302a3ea3
      Suren Baghdasaryan authored
      
      Move the allocation tagging specific code in the allocation path into
      alloc_tagging_slab_alloc_hook, similar to how the freeing path uses
      alloc_tagging_slab_free_hook.  No functional changes, just code
      cleanup.
      
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      302a3ea3
    • mm/hugetlb_vmemmap: fix race with speculative PFN walkers · bd225530
      Yu Zhao authored
      While investigating HVO for THPs [1], it turns out that speculative PFN
      walkers like compaction can race with vmemmap modifications, e.g.,
      
        CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
        -------------------------------  ------------------------------
        Allocates an LRU folio page1
                                         Sees page1
        Frees page1
      
        Allocates a hugeTLB folio page2
        (page1 being a tail of page2)
      
        Updates vmemmap mapping page1
                                         get_page_unless_zero(page1)
      
      Even though page1->_refcount is zero after HVO, get_page_unless_zero() can
      still try to modify this read-only field, resulting in a crash.
      
      An independent report [2] confirmed this race.
      
      There are two discussed approaches to fix this race:
      1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
         triggering a PF.
      2. Use RCU to make sure get_page_unless_zero() either sees zero
         page->_refcount through the old vmemmap or non-zero page->_refcount
         through the new one.
      
      The second approach is preferred here because:
      1. It can prevent illegal modifications to struct page[] that has been
         HVO'ed;
      2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
         similar races in other places, e.g., arch_remove_memory() on x86
         [3], which frees vmemmap mapping offlined struct page[].
      
      While adding synchronize_rcu(), the goal is to be surgical rather than
      optimized.  Specifically, calls to synchronize_rcu() on the error handling
      paths could be coalesced, but for simplicity this is not done; notably,
      this fix still removes ~50% more lines than it adds.
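
      A minimal userspace sketch of the second approach, assuming liburcu
      (build with -lurcu); all names are illustrative and this is not the
      kernel implementation:

      #include <urcu.h>
      #include <stdatomic.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct fake_page {
              atomic_int _refcount;
      };

      static struct fake_page *current_view;

      /* reader: analog of get_page_unless_zero() done entirely inside an RCU
       * read-side critical section */
      static int reader_try_get(void)
      {
              int got = 0;

              rcu_read_lock();
              struct fake_page *p = rcu_dereference(current_view);
              int old = atomic_load(&p->_refcount);

              while (old != 0)
                      if (atomic_compare_exchange_weak(&p->_refcount, &old, old + 1)) {
                              got = 1;
                              break;
                      }
              rcu_read_unlock();
              return got;
      }

      /* writer: analog of the vmemmap update path -- publish the new view,
       * wait for a grace period, then reuse the old one */
      static void writer_replace(void)
      {
              struct fake_page *newp = calloc(1, sizeof(*newp)); /* refcount == 0 */
              struct fake_page *old = current_view;

              rcu_assign_pointer(current_view, newp);
              synchronize_rcu();      /* no reader can still see "old" after this */
              free(old);              /* now safe to reuse / make read-only */
      }

      int main(void)
      {
              struct fake_page *p = calloc(1, sizeof(*p));

              rcu_register_thread();
              atomic_store(&p->_refcount, 1);
              current_view = p;

              printf("try_get: %d\n", reader_try_get());
              writer_replace();
              printf("try_get after replace: %d\n", reader_try_get());

              rcu_unregister_thread();
              return 0;
      }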
      
      According to the hugetlb_optimize_vmemmap section in
      Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or
      freeing hugeTLB pages "~2x slower than before".  Having synchronize_rcu()
      on top makes those operations even worse, and this also affects the user
      interface /proc/sys/vm/nr_overcommit_hugepages.
      
      This is *very* hard to trigger:
      
      1. Most hugeTLB use cases I know of are static, i.e., reserved at
         boot time, because allocating at runtime is not reliable at all.
      
      2. On top of that, someone has to be very unlucky to trip over the
         race above, because the window is so small -- I wasn't able to
         trigger it with a stress test that does nothing but that (with
         THPs though).
      
      [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
      [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
      [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/
      
      Link: https://lkml.kernel.org/r/20240627222705.2974207-1-yuzhao@google.com
      
      
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Yang Shi <yang@os.amperecomputing.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bd225530
    • cachestat: do not flush stats in recency check · 5a4d8944
      Nhat Pham authored
      syzbot detects that cachestat() is flushing stats, which can sleep, in its
      RCU read section (see [1]).  This is done in the workingset_test_recent()
      step (which checks if the folio's eviction is recent).
      
      Move the stat flushing step to before the RCU read section of cachestat,
      and skip stat flushing during the recency check.
      
      [1]: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/
      
      Link: https://lkml.kernel.org/r/20240627201737.3506959-1-nphamcs@gmail.com
      Fixes: b0068472 ("mm: workingset: move the stats flush into workingset_test_recent()")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Reported-by: <syzbot+b7f13b2d0cc156edf61a@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/
      Debugged-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: <stable@vger.kernel.org>	[6.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a4d8944
    • mm/shmem: disable PMD-sized page cache if needed · 9fd154ba
      Gavin Shan authored
      For shmem files, it's possible that PMD-sized page cache can't be
      supported by xarray.  For example, a 512MB page cache entry on ARM64 with
      a 64KB base page size can't be supported by xarray.  This leads to errors
      such as the following when that sort of xarray entry is split.
      
      WARNING: CPU: 34 PID: 7578 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
      Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6   \
      nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject        \
      nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4  \
      ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs  \
      libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net \
      net_failover virtio_console virtio_blk failover dimlib virtio_mmio
      CPU: 34 PID: 7578 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
      Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
      pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      pc : xas_split_alloc+0xf8/0x128
      lr : split_huge_page_to_list_to_order+0x1c4/0x720
      sp : ffff8000882af5f0
      x29: ffff8000882af5f0 x28: ffff8000882af650 x27: ffff8000882af768
      x26: 0000000000000cc0 x25: 000000000000000d x24: ffff00010625b858
      x23: ffff8000882af650 x22: ffffffdfc0900000 x21: 0000000000000000
      x20: 0000000000000000 x19: ffffffdfc0900000 x18: 0000000000000000
      x17: 0000000000000000 x16: 0000018000000000 x15: 52f8004000000000
      x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
      x11: 52f8000000000000 x10: 52f8e1c0ffff6000 x9 : ffffbeb9619a681c
      x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00010b02ddb0
      x5 : ffffbeb96395e378 x4 : 0000000000000000 x3 : 0000000000000cc0
      x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
      Call trace:
       xas_split_alloc+0xf8/0x128
       split_huge_page_to_list_to_order+0x1c4/0x720
       truncate_inode_partial_folio+0xdc/0x160
       shmem_undo_range+0x2bc/0x6a8
       shmem_fallocate+0x134/0x430
       vfs_fallocate+0x124/0x2e8
       ksys_fallocate+0x4c/0xa0
       __arm64_sys_fallocate+0x24/0x38
       invoke_syscall.constprop.0+0x7c/0xd8
       do_el0_svc+0xb4/0xd0
       el0_svc+0x44/0x1d8
       el0t_64_sync_handler+0x134/0x150
       el0t_64_sync+0x17c/0x180
      
      Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is larger
      than MAX_PAGECACHE_ORDER.  As Matthew Wilcox pointed out, the page cache
      in a shmem file wasn't represented by a multi-index entry and didn't have
      this limitation when the xarray entry is split, until commit 6b24ca4a
      ("mm: Use multi-index entries in the page cache").
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-5-gshan@redhat.com
      Fixes: 6b24ca4a ("mm: Use multi-index entries in the page cache")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9fd154ba
    • mm/filemap: skip to create PMD-sized page cache if needed · 3390916a
      Gavin Shan authored
      On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB.  The
      PMD-sized page cache can't be supported by xarray as the following error
      messages indicate.
      
      ------------[ cut here ]------------
      WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
      Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib  \
      nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct    \
      nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4    \
      ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm      \
      fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64      \
      sha1_ce virtio_net net_failover virtio_console virtio_blk failover \
      dimlib virtio_mmio
      CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9
      Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
      pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
      pc : xas_split_alloc+0xf8/0x128
      lr : split_huge_page_to_list_to_order+0x1c4/0x720
      sp : ffff800087a4f6c0
      x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff
      x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858
      x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000
      x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000
      x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000
      x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020
      x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28
      x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8
      x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40
      x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
      Call trace:
       xas_split_alloc+0xf8/0x128
       split_huge_page_to_list_to_order+0x1c4/0x720
       truncate_inode_partial_folio+0xdc/0x160
       truncate_inode_pages_range+0x1b4/0x4a8
       truncate_pagecache_range+0x84/0xa0
       xfs_flush_unmap_range+0x70/0x90 [xfs]
       xfs_file_fallocate+0xfc/0x4d8 [xfs]
       vfs_fallocate+0x124/0x2e8
       ksys_fallocate+0x4c/0xa0
       __arm64_sys_fallocate+0x24/0x38
       invoke_syscall.constprop.0+0x7c/0xd8
       do_el0_svc+0xb4/0xd0
       el0_svc+0x44/0x1d8
       el0t_64_sync_handler+0x134/0x150
       el0t_64_sync+0x17c/0x180
      
      Fix it by skipping the allocation of PMD-sized page cache when its order
      is larger than MAX_PAGECACHE_ORDER.  For this specific case, we fall back
      to the regular path where the readahead window is determined by BDI's
      sysfs file (read_ahead_kb).
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-4-gshan@redhat.com
      Fixes: 4687fdbb ("mm/filemap: Support VM_HUGEPAGE for file mappings")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3390916a
    • mm/readahead: limit page cache size in page_cache_ra_order() · 1f789a45
      Gavin Shan authored
      In page_cache_ra_order(), the maximal order of the page cache to be
      allocated shouldn't be larger than MAX_PAGECACHE_ORDER.  Otherwise, it's
      possible the large page cache can't be supported by xarray when the
      corresponding xarray entry is split.
      
      For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size is
      64KB.  The PMD-sized page cache can't be supported by xarray.
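
      A trivial sketch of the clamp being added, using an illustrative value
      for MAX_PAGECACHE_ORDER (not the kernel's definition):

      #include <stdio.h>

      #define MAX_PAGECACHE_ORDER 8   /* illustrative limit, not the kernel's value */

      static unsigned int limit_order(unsigned int order)
      {
              return order > MAX_PAGECACHE_ORDER ? MAX_PAGECACHE_ORDER : order;
      }

      int main(void)
      {
              unsigned int hpage_pmd_order = 13;      /* ARM64 with 64KB base pages */

              printf("requested order %u -> used order %u\n",
                     hpage_pmd_order, limit_order(hpage_pmd_order));
              return 0;
      }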
      
      Link: https://lkml.kernel.org/r/20240627003953.1262512-3-gshan@redhat.com
      Fixes: 793917d9 ("mm/readahead: Add large folio readahead")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zhenyu Zhang <zhenyzha@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f789a45
    • mm/damon/core: merge regions aggressively when max_nr_regions is unmet · 310d6c15
      SeongJae Park authored
      DAMON keeps the number of regions under max_nr_regions by skipping region
      split operations when doing so could make the number exceed the limit.
      This works well for preventing violations of the limit.  But if a
      violation somehow happens, DAMON cannot always recover.  In detail, if the
      real number of regions with distinct access patterns is higher than the
      limit, the mechanism cannot reduce the number back below the limit.  In
      such a case, the system could suffer from the high monitoring overhead of
      DAMON.

      The violation can actually happen.  For example, the user could reduce
      max_nr_regions while DAMON is running, to a value lower than the current
      number of regions.  Fix the problem by repeating the merge operations in
      kdamond_merge_regions() with increasing aggressiveness until the limit is
      met.
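
      A userspace sketch of the retry-with-increasing-threshold idea, with
      simplified stand-ins for DAMON's regions and merge logic:

      #include <stdio.h>
      #include <stdlib.h>

      #define NR 8

      struct region { int nr_accesses; int valid; };

      /* merge each region into its predecessor when their access counts differ
       * by no more than the threshold; returns how many regions remain */
      static int merge_pass(struct region *r, int n, int threshold)
      {
              int remaining = 0, last = -1;

              for (int i = 0; i < n; i++) {
                      if (!r[i].valid)
                              continue;
                      if (last >= 0 &&
                          abs(r[i].nr_accesses - r[last].nr_accesses) <= threshold) {
                              r[last].nr_accesses =
                                      (r[last].nr_accesses + r[i].nr_accesses) / 2;
                              r[i].valid = 0;
                              continue;
                      }
                      last = i;
                      remaining++;
              }
              return remaining;
      }

      int main(void)
      {
              struct region r[NR] = {
                      { 1, 1 }, { 3, 1 }, { 9, 1 }, { 2, 1 },
                      { 8, 1 }, { 8, 1 }, { 0, 1 }, { 5, 1 },
              };
              int max_nr_regions = 3, threshold, nr;

              /* repeat with increasing aggressiveness until the limit is met */
              for (threshold = 1;
                   (nr = merge_pass(r, NR, threshold)) > max_nr_regions;
                   threshold *= 2)
                      ;
              printf("%d regions left (limit %d)\n", nr, max_nr_regions);
              return 0;
      }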
      
      [sj@kernel.org: increase regions merge aggressiveness while respecting min_nr_regions]
        Link: https://lkml.kernel.org/r/20240626164753.46270-1-sj@kernel.org
      [sj@kernel.org: ensure max threshold attempt for max_nr_regions violation]
        Link: https://lkml.kernel.org/r/20240627163153.75969-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240624175814.89611-1-sj@kernel.org
      Fixes: b9a6ac4e ("mm/damon: adaptively adjust regions")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      310d6c15
    • mm: vmalloc: check if a hash-index is in cpu_possible_mask · a34acf30
      Uladzislau Rezki (Sony) authored
      The problem is that there are systems where cpu_possible_mask has gaps
      between set CPUs, for example SPARC.  In this scenario the addr_to_vb_xa()
      hash function can return an index that, via the per_cpu() macro, accesses
      the area of a CPU that is not possible and was never set up.  This results
      in an oops on SPARC.

      A per-CPU vmap_block_queue is also used as a hash table, incorrectly
      assuming cpu_possible_mask has no gaps.  Fix it by adjusting the index to
      the next possible CPU.
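
      A userspace sketch of the adjustment, using an illustrative possible-CPU
      bitmap with gaps (as can happen on SPARC):

      #include <stdio.h>

      #define NR_CPUS 16

      static const int cpu_possible[NR_CPUS] = {
              [0] = 1, [1] = 1, [8] = 1, [9] = 1,     /* gaps elsewhere */
      };

      /* round a hash index up to the next possible CPU, wrapping around */
      static int nearest_possible_cpu(int idx)
      {
              for (int i = 0; i < NR_CPUS; i++) {
                      int cpu = (idx + i) % NR_CPUS;

                      if (cpu_possible[cpu])
                              return cpu;
              }
              return 0;
      }

      int main(void)
      {
              unsigned long addr = 0xdeadbeefUL;
              int idx = (addr >> 12) % NR_CPUS;       /* addr_to_vb_xa()-style hash */

              printf("hash index %d -> possible CPU %d\n",
                     idx, nearest_possible_cpu(idx));
              return 0;
      }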
      
      Link: https://lkml.kernel.org/r/20240626140330.89836-1-urezki@gmail.com
      Fixes: 062eacf5 ("mm: vmalloc: remove a global vmap_blocks xarray")
      Reported-by: Nick Bowler <nbowler@draconx.ca>
      Closes: https://lore.kernel.org/linux-kernel/ZntjIE6msJbF8zTa@MiWiFi-R3L-srv/T/
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hailong.Liu <hailong.liu@oppo.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a34acf30
    • mm: page_ref: remove folio_try_get_rcu() · fa2690af
      Yang Shi authored
      The below bug was reported on a non-SMP kernel:
      
      [  275.267158][ T4335] ------------[ cut here ]------------
      [  275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275!
      [  275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI
      [  275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1
      [  275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
      [  275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [  275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202
      [  275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000
      [  275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034
      [  275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006
      [  275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136
      [  275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008
      [  275.276790][ T4335] FS:  00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000
      [  275.277570][ T4335] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0
      [  275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  275.280201][ T4335] Call Trace:
      [  275.280499][ T4335]  <TASK>
      [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
      [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153)
      [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174)
      [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212)
      [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264)
      [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
      [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3))
      [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148)
      [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1))
      [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188)
      [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825)
      [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1))
      [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209)
      [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204)
      [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722)
      [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106)
      [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350)
      [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350)
      [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1))
      [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573)
      [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360
      [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10
      [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573)
      [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0
      [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228)
      [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10
      [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284)
      [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259)
      [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
      [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291)
      [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112)
      [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1))
      [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1))
      [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
      [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359)
      [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97)
      [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      
      This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in
      folio_ref_try_add_rcu() on non-SMP kernels.

      process_vm_readv() calls GUP to pin the THP.  An optimization for pinning
      THPs introduced by commit 57edfcfd ("mm/gup: accelerate thp gup even for
      "pages != NULL"") calls try_grab_folio() to pin the THP, but
      try_grab_folio() is supposed to be called in atomic context on non-SMP
      kernels, for example, with irqs or preemption disabled, due to the
      optimization introduced by commit e286781d ("mm: speculative page
      references").

      Commit efa7df3e ("mm: align larger anonymous mappings on THP boundaries")
      is not actually the root cause, although the problem was bisected to it.
      It just makes the problem more likely to be exposed.

      The follow-up discussion suggested that the optimization for non-SMP
      kernels may be outdated and no longer worth it [1].  So remove the
      optimization to silence the BUG.

      However, calling try_grab_folio() in the GUP slow path is actually
      unnecessary, so a following patch will clean this up.
      
      [1] https://lore.kernel.org/linux-mm/821cf1d6-92b9-4ac4-bacc-d8f2364ac14f@paulmck-laptop/
      
      Link: https://lkml.kernel.org/r/20240625205350.1777481-1-yang@os.amperecomputing.com
      Fixes: 57edfcfd ("mm/gup: accelerate thp gup even for "pages != NULL"")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
      Cc: <stable@vger.kernel.org>	[6.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa2690af
  7. Jul 03, 2024
    • mm: avoid overflows in dirty throttling logic · 385d838d
      Jan Kara authored
      The dirty throttling logic is interspersed with assumptions that dirty
      limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
      fit into 64-bits).  If limits end up being larger, we will hit overflows,
      possible divisions by 0 etc.  Fix these problems by never allowing so
      large dirty limits as they have dubious practical value anyway.  For
      dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
      so large limits.  For dirty_ratio / dirty_background_ratio it isn't so
      simple as the dirty limit is computed from the amount of available memory
      which can change due to memory hotplug etc.  So when converting dirty
      limits from ratios to numbers of pages, we just don't allow the result to
      exceed UINT_MAX.
      
      This is a root-only triggerable problem which occurs when the operator
      sets dirty limits to >16 TB.
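
      A small sketch of the ratio-to-pages conversion with the overflow guard
      described above; the names and numbers are illustrative, not the kernel's:

      #include <stdio.h>
      #include <stdint.h>
      #include <limits.h>

      /* convert a dirty_ratio percentage of available memory into a page count,
       * refusing to return anything that does not fit in 32 bits */
      static uint32_t dirty_limit_pages(uint64_t available_pages, unsigned int ratio)
      {
              uint64_t limit = available_pages * ratio / 100;

              return limit > UINT_MAX ? UINT_MAX : (uint32_t)limit;
      }

      int main(void)
      {
              /* 32 TB of 4KB pages with a 60% ratio would overflow 32 bits */
              uint64_t pages = (32ULL << 40) >> 12;

              printf("limit = %u pages\n", dirty_limit_pages(pages, 60));
              return 0;
      }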
      
      Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz
      
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: Zach O'Keefe <z...
      385d838d
    • Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again" · 30139c70
      Jan Kara authored
      Patch series "mm: Avoid possible overflows in dirty throttling".
      
      Dirty throttling logic assumes dirty limits in page units fit into
      32-bits.  This patch series makes sure this is true (see patch 2/2 for
      more details).
      
      
      This patch (of 2):
      
      This reverts commit 9319b647.
      
      The commit is broken in several ways.  Firstly, the removed (u64) cast
      from the multiplication will introduce a multiplication overflow on 32-bit
      archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
      default settings with 4GB of RAM will trigger this).  Secondly, the
      div64_u64() is unnecessarily expensive on 32-bit archs.  We have
      div64_ul() in case we want to be safe & cheap.  Thirdly, if dirty
      thresholds are larger than 1<<32 pages, then dirty balancing is going to
      blow up in many other spectacular ways anyway so trying to fix one
      possible overflow is just moot.
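
      A quick userspace demonstration of the first point, with arbitrary but
      realistic values; the kernel expression it mimics is only sketched here:

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              uint32_t wb_thresh = 200000;    /* pages, ~800MB */
              uint32_t bg_thresh = 100000;    /* pages, ~400MB */

              /* 32-bit product wraps: 200000 * 100000 = 2e10 > 2^32 */
              uint32_t wrapped = wb_thresh * bg_thresh;

              /* widening one operand keeps the intermediate result in 64 bits */
              uint64_t correct = (uint64_t)wb_thresh * bg_thresh;

              printf("wrapped=%u correct=%llu\n", wrapped,
                     (unsigned long long)correct);
              return 0;
      }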
      
      Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz
      Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz
      Fixes: 9319b647 ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      30139c70
    • mm/util: Use dedicated slab buckets for memdup_user() · d73778e4
      Kees Cook authored
      Both memdup_user() and vmemdup_user() handle allocations that are
      regularly used for exploiting use-after-free type confusion flaws in
      the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4]
      respectively).
      
      Since both are designed for contents coming from userspace, it allows
      for userspace-controlled allocation sizes. Use a dedicated set of kmalloc
      buckets so these allocations do not share caches with the global kmalloc
      buckets.
      
      After a fresh boot under Ubuntu 23.10, we can see the caches are already
      in active use:
      
       # grep ^memdup /proc/slabinfo
       memdup_user-8k         4      4   8192    4    8 : ...
       memdup_user-4k         8      8   4096    8    8 : ...
       memdup_user-2k        16     16   2048   16    8 : ...
       memdup_user-1k         0      0   1024   16    4 : ...
       memdup_user-512        0      0    512   16    2 : ...
       memdup_user-256        0      0    256   16    1 : ...
       memdup_user-128        0      0    128   32    1 : ...
       memdup_user-64       256    256     64   64    1 : ...
       memdup_user-32       512    512     32  128    1 : ...
       memdup_user-16      1024   1024     16  256    1 : ...
       memdup_user-8       2048   2048      8  512    1 : ...
       memdup_user-192        0      0    192   21    1 : ...
       memdup_user-96       168    168     96   42    1 : ...
      
      Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1]
      Link: https://duasynt.com/blog/linux-kernel-heap-spray [2]
      Link: https://etenal.me/archives/1336 [3]
      Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4]
      Signed-off-by: Kees Cook <kees@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      d73778e4
    • mm/slab: Introduce kmem_buckets_create() and family · b32801d1
      Kees Cook authored
      Dedicated caches are available for fixed size allocations via
      kmem_cache_alloc(), but for dynamically sized allocations there is only
      the global kmalloc API's set of buckets available. This means it isn't
      possible to separate specific sets of dynamically sized allocations into
      a separate collection of caches.
      
      This leads to a use-after-free exploitation weakness in the Linux
      kernel since many heap memory spraying/grooming attacks depend on using
      userspace-controllable dynamically sized allocations to collide with
      fixed size allocations that end up in same cache.
      
      While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
      against these kinds of "type confusion" attacks, including for fixed
      same-size heap objects, we can create a complementary deterministic
      defense for dynamically sized allocations that are directly user
      controlled. Addressing these cases is limited in scope, so isolating these
      kinds of interfaces will not become an unbounded game of whack-a-mole. For
      example, many pass through memdup_user(), making isolation there very
      effective.
      
      In order to isolate user-controllable dynamically-sized
      allocations from the common system kmalloc allocations, introduce
      kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
      kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
      kmem_buckets_alloc_track_caller() for where caller tracking is
      needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
      is needed. Note that these caches are specifically flagged with
      SLAB_NO_MERGE, since merging would defeat the entire purpose of the
      mitigation.
      
      This can also be used in the future to extend allocation profiling's use
      of code tagging to implement per-caller allocation cache isolation[1]
      even for dynamic allocations.
      
      Memory allocation pinning[2] is still needed to plug the Use-After-Free
      cross-allocator weakness (where attackers can arrange to free an
      entire slab page and have it reallocated to a different cache),
      but that is an existing and separate issue which is complementary
      to this improvement. Development continues for that feature via the
      SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
      complementary improvement).
      
      Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
      Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
      Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
      Signed-off-by: Kees Cook <kees@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      b32801d1
    • mm/slab: Introduce kvmalloc_buckets_node() that can take kmem_buckets argument · 2e8000b8
      Kees Cook authored
      
      Plumb kmem_buckets arguments through kvmalloc_node_noprof() so it is
      possible to provide an API to perform kvmalloc-style allocations with
      a particular set of buckets. Introduce kvmalloc_buckets_node() that takes a
      kmem_buckets argument.
      
      Signed-off-by: Kees Cook <kees@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      2e8000b8
    • mm/slab: Plumb kmem_buckets into __do_kmalloc_node() · 67f2df3b
      Kees Cook authored
      
      Introduce CONFIG_SLAB_BUCKETS which provides the infrastructure to
      support separated kmalloc buckets (in the following kmem_buckets_create()
      patches and future codetag-based separation). Since this will provide
      a mitigation for a very common case of exploits, it is recommended to
      enable this feature for general purpose distros. By default, the new
      Kconfig will be enabled if CONFIG_SLAB_FREELIST_HARDENED is enabled (and
      it is added to the hardening.config Kconfig fragment).
      
      To be able to choose which buckets to allocate from, make the buckets
      available to the internal kmalloc interfaces by adding them as the
      second argument, rather than depending on the buckets being chosen from
      the fixed set of global buckets. Where the bucket is not available,
      pass NULL, which means "use the default system kmalloc bucket set"
      (the prior existing behavior), as implemented in kmalloc_slab().
      
      To avoid adding the extra argument when !CONFIG_SLAB_BUCKETS, only the
      top-level macros and static inlines use the buckets argument (where
      they are stripped out and compiled out respectively). The actual extern
      functions can then be built without the argument, and the internals
      fall back to the global kmalloc buckets unconditionally.
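
      A userspace sketch of the argument-stripping pattern described above; all
      names are made up for illustration (compile with -DCONFIG_SLAB_BUCKETS to
      see the other branch):

      #include <stdio.h>
      #include <stdlib.h>

      static void *alloc_from_bucket(const char *bucket, size_t size)
      {
              printf("allocating %zu bytes from bucket %s\n", size, bucket);
              return malloc(size);
      }

      static void *alloc_default(size_t size)
      {
              printf("allocating %zu bytes from the default buckets\n", size);
              return malloc(size);
      }

      #ifdef CONFIG_SLAB_BUCKETS
      #define bucket_alloc(b, size)   alloc_from_bucket((b), (size))
      #else
      #define bucket_alloc(b, size)   alloc_default(size) /* bucket argument stripped */
      #endif

      int main(void)
      {
              free(bucket_alloc("memdup_user", 64));
              return 0;
      }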
      
      Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Kees Cook <kees@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      67f2df3b
    • mm/slab: Introduce kmem_buckets typedef · 72e0fe22
      Kees Cook authored
      
      Encapsulate the concept of a single set of kmem_caches that are used
      for the kmalloc size buckets. Redefine kmalloc_caches as an array
      of these buckets (for the different global cache buckets).
      
      Signed-off-by: Kees Cook <kees@kernel.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      72e0fe22
    • slab, rust: extend kmalloc() alignment guarantees to remove Rust padding · ad59baa3
      Vlastimil Babka authored
      Slab allocators have been guaranteeing natural alignment for
      power-of-two sizes since commit 59bb4798 ("mm, sl[aou]b: guarantee
      natural alignment for kmalloc(power-of-two)"), while any other sizes are
      guaranteed to be aligned only to ARCH_KMALLOC_MINALIGN bytes (although
      in practice they are aligned to more than that in non-debug scenarios).
      
      Rust's allocator API specifies size and alignment per allocation, which
      have to satisfy the following rules, per Alice Ryhl [1]:
      
        1. The alignment is a power of two.
        2. The size is non-zero.
        3. When you round up the size to the next multiple of the alignment,
           then it must not overflow the signed type isize / ssize_t.
      
      In order to map this to kmalloc()'s guarantees, some requested
      allocation sizes have to be padded to the next power-of-two size [2].
      For example, an allocation of size 96 and alignment of 32 will be padded
      to an allocation of size 128, because the existing kmalloc-96 bucket
      doesn't guarantee alignment above ARCH_KMALLOC_MINALIGN.  Without slab
      debugging active, the layout of the kmalloc-96 slabs however naturally
      aligns the objects to 32 bytes, so extending the size to 128 bytes is
      wasteful.
      
      To improve the situation we can extend the kmalloc() alignment
      guarantees in a way that
      
      1) doesn't change the current slab layout (and thus does not increase
         internal fragmentation) when slab debugging is not active
      2) reduces waste in the Rust allocator use case
      3) is a superset of the current guarantee for power-of-two sizes.
      
      The extended guarantee is that alignment is at least the largest
      power-of-two divisor of the requested size. For power-of-two sizes the
      largest divisor is the size itself, but let's keep this case documented
      separately for clarity.
      
      For current kmalloc size buckets, it means kmalloc-96 will guarantee
      alignment of 32 bytes and kmalloc-192 will guarantee 64 bytes.
      
      This covers the rules 1 and 2 above of Rust's API as long as the size is
      a multiple of the alignment. The Rust layer should now only need to
      round up the size to the next multiple if it isn't, while enforcing the
      rule 3.
      
      Implementation-wise, this changes the alignment calculation in
      create_boot_cache().  While at it, also do the calculation only for caches
      with the SLAB_KMALLOC flag, because the function is also used to create
      the initial kmem_cache and kmem_cache_node caches, where no alignment
      guarantee is necessary.
      
      In the Rust allocator's krealloc_aligned(), remove the code that padded
      sizes to the next power of two (suggested by Alice Ryhl) as it's no
      longer necessary with the new guarantees.
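
      A quick userspace check of the extended guarantee described above: the
      minimum alignment is the largest power-of-two divisor of the size, which
      is (size & -size):

      #include <stdio.h>

      static unsigned long min_align_for(unsigned long size)
      {
              return size & -size;    /* largest power-of-two divisor */
      }

      int main(void)
      {
              unsigned long sizes[] = { 96, 128, 192, 1024 };

              for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                      printf("kmalloc-%lu -> at least %lu-byte alignment\n",
                             sizes[i], min_align_for(sizes[i]));
              return 0;
      }

      For 96 and 192 this prints 32 and 64, matching the buckets mentioned
      above.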
      
      Reported-by: Alice Ryhl <aliceryhl@google.com>
      Reported-by: Boqun Feng <boqun.feng@gmail.com>
      Link: https://lore.kernel.org/all/CAH5fLggjrbdUuT-H-5vbQfMazjRDpp2%2Bk3%3DYhPyS17ezEqxwcw@mail.gmail.com/ [1]
      Link: https://lore.kernel.org/all/CAH5fLghsZRemYUwVvhk77o6y1foqnCeDzW4WZv6ScEWna2+_jw@mail.gmail.com/ [2]
      Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Alice Ryhl <aliceryhl@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      ad59baa3
  8. Jul 02, 2024
  9. Jun 25, 2024
    • vfs: remove redundant smp_mb for thp handling in do_dentry_open · 8e344782
      Mateusz Guzik authored
      
      opening for write performs:
      
      if (f->f_mode & FMODE_WRITE) {
      [snip]
              smp_mb();
              if (filemap_nr_thps(inode->i_mapping)) {
      [snip]
              }
      }
      
      filemap_nr_thps() on kernels built without CONFIG_READ_ONLY_THP_FOR_FS
      expands to 0, allowing the compiler to eliminate the entire thing, with
      the exception of the fence (and the branch leading there).

      As it happens, the required synchronisation between i_writecount and
      nr_thps changes is already provided by the full fence coming from
      get_write_access -> atomic_inc_unless_negative, thus the smp_mb instance
      above can be removed regardless of CONFIG_READ_ONLY_THP_FOR_FS.
      
      While I updated commentary in places claiming to match the now-removed
      fence, I did not try to patch them to act on the compile option.
      
      I did not bother benchmarking it; not issuing a spurious full fence in
      the fast path does not need justification from a performance standpoint.
      
      Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
      Link: https://lore.kernel.org/r/20240624085402.493630-1-mjguzik@gmail.com
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  10. Jun 24, 2024
    • mm/memory: don't require head page for do_set_pmd() · ab1ffc86
      Andrew Bresticker authored
      The requirement that the head page be passed to do_set_pmd() was added in
      commit ef37b2ea ("mm/memory: page_add_file_rmap() ->
      folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
      finish_fault() and filemap_map_pages() paths if the page to be inserted is
      anything but the head page for an otherwise suitable vma and pmd-sized
      page.
      
      Matthew said:
      
      : We're going to stop using PMDs to map large folios unless the fault is
      : within the first 4KiB of the PMD.  No idea how many workloads that
      : affects, but it only needs to be backported as far as v6.8, so we may
      : as well backport it.
      
      Link: https://lkml.kernel.org/r/20240611153216.2794513-1-abrestic@rivosinc.com
      Fixes: ef37b2ea ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
      Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab1ffc86
    • mm/page_alloc: Separate THP PCP into movable and non-movable categories · bf14ed81
      yangge authored
      Since commit 5d0a661d ("mm/page_alloc: use only one PCP list for
      THP-sized allocations") no longer differentiates the migration type of
      pages in the THP-sized PCP list, it's possible that non-movable allocation
      requests may get a CMA page from the list, which is not acceptable in some
      cases.

      If a large amount of CMA memory is configured in the system (for example,
      CMA memory accounts for 50% of the system memory), starting a virtual
      machine with device passthrough will get stuck.  While starting the
      virtual machine, it calls pin_user_pages_remote(..., FOLL_LONGTERM, ...)
      to pin memory.  Normally, if a page is present and in a CMA area,
      pin_user_pages_remote() will migrate the page from the CMA area to a
      non-CMA area because of the FOLL_LONGTERM flag.  But if non-movable
      allocation requests return CMA memory, migrate_longterm_unpinnable_pages()
      will migrate a CMA page to another CMA page, which fails the check in
      check_and_migrate_movable_pages() and causes the migration to loop
      endlessly.
      
      Call trace:
      pin_user_pages_remote
      --__gup_longterm_locked // endless loops in this function
      ----_get_user_pages_locked
      ----check_and_migrate_movable_pages
      ------migrate_longterm_unpinnable_pages
      --------alloc_migration_target
      
This problem also has a negative impact on CMA itself.  For example,
when CMA is borrowed by THP and we need to reclaim it through
cma_alloc() or dma_alloc_coherent(), we must move those pages out to
ensure CMA's users can retrieve that contiguous memory.  If CMA's memory
is occupied by non-movable pages, we can't relocate them, and as a
result cma_alloc() is more likely to fail.
      
To fix the problem above, add one more PCP list for THP; this does not
introduce a new cacheline for struct per_cpu_pages.  THP then has two
PCP lists: one used by MOVABLE allocations and the other used by
UNMOVABLE allocations.  The MOVABLE category covers __GFP_MOVABLE
requests, while the UNMOVABLE category covers both unmovable and
reclaimable requests.
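
An illustrative standalone sketch of the categorisation (names are
hypothetical, not the kernel's):

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Two PCP lists for THP-sized pages instead of one: movable requests use
   * one list, unmovable/reclaimable requests use the other, so a CMA page
   * freed by a movable user can no longer end up satisfying an unmovable
   * request.
   */
  enum thp_pcp_list { THP_PCP_UNMOVABLE = 0, THP_PCP_MOVABLE = 1 };

  static enum thp_pcp_list thp_pcp_index(bool request_is_movable)
  {
          return request_is_movable ? THP_PCP_MOVABLE : THP_PCP_UNMOVABLE;
  }

  int main(void)
  {
          printf("movable request   -> THP PCP list %d\n", thp_pcp_index(true));
          printf("unmovable request -> THP PCP list %d\n", thp_pcp_index(false));
          return 0;
  }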
      
      Link: https://lkml.kernel.org/r/1718845190-4456-1-git-send-email-yangge1116@126.com
      Fixes: 5d0a661d
      
       ("mm/page_alloc: use only one PCP list for THP-sized allocations")
      Signed-off-by: default avataryangge <yangge1116@126.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bf14ed81
    • Zi Yan's avatar
      mm/migrate: make migrate_pages_batch() stats consistent · c6408250
      Zi Yan authored
      As Ying pointed out in [1], stats->nr_thp_failed needs to be updated to
      avoid stats inconsistency between MIGRATE_SYNC and MIGRATE_ASYNC when
      calling migrate_pages_batch().
      
Otherwise, when migrate_pages_batch() is called via
migrate_pages(MIGRATE_ASYNC), nr_thp_failed is not increased, while when
migrate_pages_batch() is called via migrate_pages(MIGRATE_SYNC*),
nr_thp_failed is increased in migrate_pages_sync() via
stats->nr_thp_failed += astats.nr_thp_split.
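
A standalone sketch of the inconsistency (simplified names, not the
kernel code): if only the sync wrapper folds the split count into the
failure count, the two entry points report different nr_thp_failed for
the same outcome.

  #include <stdio.h>

  struct demo_stats { long nr_thp_failed, nr_thp_split; };

  /* batch migration records the split but not the failure */
  static void batch(struct demo_stats *s, long splits)
  {
          s->nr_thp_split += splits;
  }

  static void async_only(struct demo_stats *s)
  {
          batch(s, 1);
  }

  static void sync_path(struct demo_stats *s)
  {
          struct demo_stats a = { 0, 0 };

          batch(&a, 1);
          s->nr_thp_split  += a.nr_thp_split;
          s->nr_thp_failed += a.nr_thp_split;   /* only this path folds it in */
  }

  int main(void)
  {
          struct demo_stats s1 = { 0, 0 }, s2 = { 0, 0 };

          async_only(&s1);
          sync_path(&s2);
          printf("async: nr_thp_failed=%ld  sync: nr_thp_failed=%ld\n",
                 s1.nr_thp_failed, s2.nr_thp_failed);   /* prints 0 vs 1 */
          return 0;
  }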
      
      [1] https://lore.kernel.org/linux-mm/87msnq7key.fsf@yhuang6-desk2.ccr.corp.intel.com/
      
      Link: https://lkml.kernel.org/r/20240620012712.19804-1-zi.yan@sent.com
      Link: https://lkml.kernel.org/r/20240618134151.29214-1-zi.yan@sent.com
      Fixes: 7262f208
      
       ("mm/migrate: split source folio if it is on deferred split list")
      Signed-off-by: default avatarZi Yan <ziy@nvidia.com>
      Suggested-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c6408250
    • Andrey Konovalov's avatar
      kasan: fix bad call to unpoison_slab_object · 1c61990d
      Andrey Konovalov authored
      Commit 29d7355a ("kasan: save alloc stack traces for mempool") messed
      up one of the calls to unpoison_slab_object: the last two arguments are
      supposed to be GFP flags and whether to init the object memory.
      
      Fix the call.
      
      Without this fix, __kasan_mempool_unpoison_object provides the object's
      size as GFP flags to unpoison_slab_object, which can cause LOCKDEP reports
      (and probably other issues).
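
A minimal sketch of this class of bug (the function below is a
hypothetical stand-in for illustration, not the kernel's signature): a
size passed where gfp_t flags are expected compiles silently because both
are integer-typed, so the mix-up only shows up at runtime.

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  typedef unsigned int gfp_t;

  /* hypothetical stand-in: the last two arguments are flags and init */
  static void unpoison_object(void *obj, size_t size, gfp_t flags, bool init)
  {
          printf("obj=%p size=%zu flags=%#x init=%d\n", obj, size, flags, (int)init);
  }

  int main(void)
  {
          char obj[32];
          gfp_t flags = 0x400u;                 /* pretend GFP flag bits */

          /* buggy shape: the size lands in the flags slot, flags in the init slot */
          unpoison_object((void *)obj, sizeof(obj), (gfp_t)sizeof(obj), flags);
          /* fixed shape: arguments in the intended order */
          unpoison_object((void *)obj, sizeof(obj), flags, false);
          return 0;
  }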
      
      Link: https://lkml.kernel.org/r/20240614143238.60323-1-andrey.konovalov@linux.dev
      Fixes: 29d7355a
      
       ("kasan: save alloc stack traces for mempool")
      Signed-off-by: default avatarAndrey Konovalov <andreyknvl@gmail.com>
      Reported-by: default avatarBrad Spengler <spender@grsecurity.net>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1c61990d
    • Suren Baghdasaryan's avatar
      mm: handle profiling for fake memory allocations during compaction · 34a023dc
      Suren Baghdasaryan authored
      During compaction isolated free pages are marked allocated so that they
      can be split and/or freed.  For that, post_alloc_hook() is used inside
      split_map_pages() and release_free_list().  split_map_pages() marks free
      pages allocated, splits the pages and then lets
      alloc_contig_range_noprof() free those pages.  release_free_list() marks
free pages and immediately frees them.  This usage of post_alloc_hook()
affects memory allocation profiling because these functions might not be
called from an instrumented allocator, so current->alloc_tag is NULL,
which causes warnings when debugging is enabled
(CONFIG_MEM_ALLOC_PROFILING_DEBUG=y).  To avoid that, wrap such
post_alloc_hook() calls in an instrumented function that acts as the
allocator charged for these fake allocations.  Note that these
allocations are very short-lived until they are freed, so the associated
counters should usually read 0.
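
A conceptual standalone sketch (hypothetical names; the kernel's memory
allocation profiling is built around its alloc_hooks()/_noprof wrappers):
the uninstrumented hook is wrapped by an instrumented function that
installs a tag for the duration of the call, so the fake allocations have
somewhere to be charged instead of tripping the NULL-tag warning.

  #include <stdio.h>

  struct alloc_tag_demo { const char *site; long bytes; };

  static struct alloc_tag_demo *current_tag;        /* models current->alloc_tag */

  static void post_alloc_hook_demo(long bytes)
  {
          if (!current_tag) {
                  printf("WARNING: allocation with no tag\n");
                  return;
          }
          current_tag->bytes += bytes;
  }

  /* instrumented wrapper: compaction's fake allocations get charged here */
  static struct alloc_tag_demo compaction_tag = { "compaction", 0 };

  static void post_alloc_hook_instrumented(long bytes)
  {
          struct alloc_tag_demo *old = current_tag;

          current_tag = &compaction_tag;
          post_alloc_hook_demo(bytes);
          current_tag = old;
  }

  int main(void)
  {
          post_alloc_hook_demo(4096);               /* would trigger the warning */
          post_alloc_hook_instrumented(4096);       /* charged to "compaction" */
          printf("%s charged %ld bytes\n", compaction_tag.site, compaction_tag.bytes);
          return 0;
  }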
      
      Link: https://lkml.kernel.org/r/20240614230504.3849136-1-surenb@google.com
      
      
      Signed-off-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Sourav Panda <souravpanda@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      34a023dc
    • Suren Baghdasaryan's avatar
      mm/slab: fix 'variable obj_exts set but not used' warning · b4601d09
      Suren Baghdasaryan authored
slab_post_alloc_hook() uses prepare_slab_obj_exts_hook() to obtain a
slabobj_ext object.  Currently the only user of the slabobj_ext object in
this path is memory allocation profiling, so when it is not enabled this
object is not needed.  Obtaining it anyway also generates a warning when
compiling with CONFIG_MEM_ALLOC_PROFILING=n.  Move the code under this
configuration option to fix the warning.  If more slabobj_ext users
appear in the future, the code will have to be changed back to call
prepare_slab_obj_exts_hook().
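
A minimal sketch of the compile-guard pattern (hypothetical names): when
the only consumer of a value is compiled out, obtain the value under the
same guard and the "variable set but not used" warning disappears.

  #include <stdio.h>

  /* flip to 1 to mimic CONFIG_MEM_ALLOC_PROFILING=y */
  #define MEM_ALLOC_PROFILING 0

  int prepare_obj_exts(void)
  {
          return 42;                                /* pretend slabobj_ext handle */
  }

  static void slab_post_alloc_demo(void)
  {
  #if MEM_ALLOC_PROFILING
          int obj_exts = prepare_obj_exts();        /* fetched only when consumed */

          printf("profiling enabled, obj_exts=%d\n", obj_exts);
  #else
          /* profiling disabled: nothing fetched, no "set but not used" variable */
          printf("profiling disabled\n");
  #endif
  }

  int main(void)
  {
          slab_post_alloc_demo();
          return 0;
  }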
      
      Link: https://lkml.kernel.org/r/20240614225951.3845577-1-surenb@google.com
      Fixes: 4b873696
      
       ("mm/slab: add allocation accounting into slab allocation and free paths")
      Signed-off-by: default avatarSuren Baghdasaryan <surenb@google.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202406150444.F6neSaiy-lkp@intel.com/
      
      
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b4601d09
    • Jeff Xu's avatar
      /proc/pid/smaps: add mseal info for vma · 399ab86e
      Jeff Xu authored
Add "sl" to the VMA flags shown in /proc/pid/smaps to indicate that the vma is sealed.
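
With this change the VmFlags line of a sealed mapping in /proc/<pid>/smaps
carries the new two-letter code; an illustrative line for a sealed private
anonymous mapping might look like the following (the surrounding flags
depend on the mapping):

  VmFlags: rd wr mr mw me ac sl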
      
      Link: https://lkml.kernel.org/r/20240614232014.806352-2-jeffxu@google.com
      Fixes: 8be7258a
      
       ("mseal: add mseal syscall")
      Signed-off-by: default avatarJeff Xu <jeffxu@chromium.org>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stephen Röttger <sroettger@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      399ab86e
    • Zhaoyang Huang's avatar
      mm: fix incorrect vbq reference in purge_fragmented_block · 8c61291f
      Zhaoyang Huang authored
      xa_for_each() in _vm_unmap_aliases() loops through all vbs.  However,
      since commit 062eacf5 ("mm: vmalloc: remove a global vmap_blocks
      xarray") the vb from xarray may not be on the corresponding CPU
      vmap_block_queue.  Consequently, purge_fragmented_block() might use the
      wrong vbq->lock to protect the free list, leading to vbq->free breakage.
      
      Incorrect lock protection can exhaust all vmalloc space as follows:
      CPU0                                            CPU1
      +--------------------------------------------+
      |    +--------------------+     +-----+      |
      +--> |                    |---->|     |------+
           | CPU1:vbq free_list |     | vb1 |
      +--- |                    |<----|     |<-----+
      |    +--------------------+     +-----+      |
      +--------------------------------------------+
      
      _vm_unmap_aliases()                             vb_alloc()
                                                      new_vmap_block()
      xa_for_each(&vbq->vmap_blocks, idx, vb)
      --> vb in CPU1:vbq->freelist
      
      purge_fragmented_block(vb)
      spin_lock(&vbq->lock)                           spin_lock(&vbq->lock)
      --> use CPU0:vbq->lock                          --> use CPU1:vbq->lock
      
      list_del_rcu(&vb->free_list)                    list_add_tail_rcu(&vb->free_list, &vbq->free)
          __list_del(vb->prev, vb->next)
              next->prev = prev
          +--------------------+
          |                    |
          | CPU1:vbq free_list |
      +---|                    |<--+
      |   +--------------------+   |
      +----------------------------+
                                                      __list_add(new, head->prev, head)
      +--------------------------------------------+
      |    +--------------------+     +-----+      |
      +--> |                    |---->|     |------+
           | CPU1:vbq free_list |     | vb2 |
      +--- |                    |<----|     |<-----+
      |    +--------------------+     +-----+      |
      +--------------------------------------------+
      
              prev->next = next
      +--------------------------------------------+
      |----------------------------+               |
      |    +--------------------+  |  +-----+      |
      +--> |                    |--+  |     |------+
           | CPU1:vbq free_list |     | vb2 |
      +--- |                    |<----|     |<-----+
      |    +--------------------+     +-----+      |
      +--------------------------------------------+
Here’s the resulting list breakage: all vbs that were to be linked after
‘prev’ can no longer be reached by list_for_each_entry_rcu(vb, &vbq->free,
free_list) in vb_alloc().  Thus, vmalloc space is eventually exhausted.
      
This issue affects both erofs and f2fs; the stacktraces are as follows:
      erofs:
      [<ffffffd4ffb93ad4>] __switch_to+0x174
      [<ffffffd4ffb942f0>] __schedule+0x624
      [<ffffffd4ffb946f4>] schedule+0x7c
      [<ffffffd4ffb947cc>] schedule_preempt_disabled+0x24
      [<ffffffd4ffb962ec>] __mutex_lock+0x374
      [<ffffffd4ffb95998>] __mutex_lock_slowpath+0x14
      [<ffffffd4ffb95954>] mutex_lock+0x24
      [<ffffffd4fef2900c>] reclaim_and_purge_vmap_areas+0x44
      [<ffffffd4fef25908>] alloc_vmap_area+0x2e0
      [<ffffffd4fef24ea0>] vm_map_ram+0x1b0
      [<ffffffd4ff1b46f4>] z_erofs_lz4_decompress+0x278
      [<ffffffd4ff1b8ac4>] z_erofs_decompress_queue+0x650
      [<ffffffd4ff1b8328>] z_erofs_runqueue+0x7f4
      [<ffffffd4ff1b66a8>] z_erofs_read_folio+0x104
      [<ffffffd4feeb6fec>] filemap_read_folio+0x6c
      [<ffffffd4feeb68c4>] filemap_fault+0x300
      [<ffffffd4fef0ecac>] __do_fault+0xc8
      [<ffffffd4fef0c908>] handle_mm_fault+0xb38
      [<ffffffd4ffb9f008>] do_page_fault+0x288
      [<ffffffd4ffb9ed64>] do_translation_fault[jt]+0x40
      [<ffffffd4fec39c78>] do_mem_abort+0x58
      [<ffffffd4ffb8c3e4>] el0_ia+0x70
      [<ffffffd4ffb8c260>] el0t_64_sync_handler[jt]+0xb0
      [<ffffffd4fec11588>] ret_to_user[jt]+0x0
      
      f2fs:
      [<ffffffd4ffb93ad4>] __switch_to+0x174
      [<ffffffd4ffb942f0>] __schedule+0x624
      [<ffffffd4ffb946f4>] schedule+0x7c
      [<ffffffd4ffb947cc>] schedule_preempt_disabled+0x24
      [<ffffffd4ffb962ec>] __mutex_lock+0x374
      [<ffffffd4ffb95998>] __mutex_lock_slowpath+0x14
      [<ffffffd4ffb95954>] mutex_lock+0x24
      [<ffffffd4fef2900c>] reclaim_and_purge_vmap_areas+0x44
      [<ffffffd4fef25908>] alloc_vmap_area+0x2e0
      [<ffffffd4fef24ea0>] vm_map_ram+0x1b0
      [<ffffffd4ff1a3b60>] f2fs_prepare_decomp_mem+0x144
      [<ffffffd4ff1a6c24>] f2fs_alloc_dic+0x264
      [<ffffffd4ff175468>] f2fs_read_multi_pages+0x428
      [<ffffffd4ff17b46c>] f2fs_mpage_readpages+0x314
      [<ffffffd4ff1785c4>] f2fs_readahead+0x50
      [<ffffffd4feec3384>] read_pages+0x80
      [<ffffffd4feec32c0>] page_cache_ra_unbounded+0x1a0
      [<ffffffd4feec39e8>] page_cache_ra_order+0x274
      [<ffffffd4feeb6cec>] do_sync_mmap_readahead+0x11c
      [<ffffffd4feeb6764>] filemap_fault+0x1a0
      [<ffffffd4ff1423bc>] f2fs_filemap_fault+0x28
      [<ffffffd4fef0ecac>] __do_fault+0xc8
      [<ffffffd4fef0c908>] handle_mm_fault+0xb38
      [<ffffffd4ffb9f008>] do_page_fault+0x288
      [<ffffffd4ffb9ed64>] do_translation_fault[jt]+0x40
      [<ffffffd4fec39c78>] do_mem_abort+0x58
      [<ffffffd4ffb8c3e4>] el0_ia+0x70
      [<ffffffd4ffb8c260>] el0t_64_sync_handler[jt]+0xb0
      [<ffffffd4fec11588>] ret_to_user[jt]+0x0
      
To fix this, introduce a cpu field within vmap_block to record which
CPU's vmap_block_queue this vb belongs to.
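
A standalone sketch of the idea (simplified, hypothetical names): each
block records the CPU whose queue it was placed on at creation time, and
the purge path derives the queue, and hence the lock, from that field
rather than from the CPU it happens to run on.

  #include <stdio.h>

  #define NR_CPUS_DEMO 4

  struct vbq_demo { int dummy_lock; };
  struct vb_demo  { int cpu; };                     /* new field recording the owner */

  static struct vbq_demo vbq[NR_CPUS_DEMO];

  static struct vbq_demo *owning_queue(struct vb_demo *vb)
  {
          /*
           * Correct: lock the queue the block was actually put on, derived
           * from the recorded CPU, not the queue of whichever CPU happens
           * to be walking the xarray during purge.
           */
          return &vbq[vb->cpu];
  }

  int main(void)
  {
          struct vb_demo vb = { .cpu = 1 };

          printf("vb belongs to queue %ld\n", (long)(owning_queue(&vb) - vbq));
          return 0;
  }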
      
      Link: https://lkml.kernel.org/r/20240614021352.1822225-1-zhaoyang.huang@unisoc.com
      Link: https://lkml.kernel.org/r/20240607023116.1720640-1-zhaoyang.huang@unisoc.com
      Fixes: fc1e0d98
      
       ("mm/vmalloc: prevent stale TLBs in fully utilized blocks")
      Signed-off-by: default avatarZhaoyang Huang <zhaoyang.huang@unisoc.com>
      Suggested-by: default avatarHailong.Liu <hailong.liu@oppo.com>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8c61291f
    • Chengming Zhou's avatar
      slab: delete useless RED_INACTIVE and RED_ACTIVE · 4a24bbab
      Chengming Zhou authored
      
These seem useless since we use SLUB_RED_INACTIVE and SLUB_RED_ACTIVE
instead, so just delete them; no functional change.
      
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarChengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: default avatarChristoph Lameter (Ampere) <cl@linux.com>
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      4a24bbab
  11. Jun 19, 2024
    • James Gowans's avatar
      memblock: Move late alloc warning down to phys alloc · 94ff46de
      James Gowans authored
      
      If a driver/subsystem tries to do an allocation after the memblock
      allocations have been freed and the memory handed to the buddy
      allocator, it will not actually be legal to use that allocation: the
      buddy allocator owns the memory. Currently this mis-use is handled by
      the memblock function which does allocations and returns virtual
      addresses by printing a warning and doing a kmalloc instead. However
the physical allocation function does not do this check - callers of
the physical alloc function are unprotected against mis-use.
      
      Improve the error catching here by moving the check into the physical
      allocation function which is used by the virtual addr allocation
      function.
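
A standalone sketch of the idea (simplified, hypothetical names; the
existing check in the virtual-address path is based on
slab_is_available()): with the "too late for the early allocator" check
living in the physical allocation function, the virtual-address wrapper
and direct physical callers are both covered.

  #include <stdbool.h>
  #include <stdio.h>

  static bool buddy_owns_memory;                    /* models "slab/buddy is up" */

  static unsigned long phys_alloc_demo(unsigned long size)
  {
          if (buddy_owns_memory) {
                  printf("WARNING: early allocator used after boot (%lu bytes)\n", size);
                  return 0;                         /* caller must use the runtime allocator */
          }
          return 0x100000;                          /* pretend physical address */
  }

  /* the virtual-address wrapper relies on the check in the physical allocator */
  static void *virt_alloc_demo(unsigned long size)
  {
          unsigned long phys = phys_alloc_demo(size);

          return phys ? (void *)phys : NULL;        /* sketch only: no real mapping */
  }

  int main(void)
  {
          buddy_owns_memory = true;
          virt_alloc_demo(64);                      /* warns via phys_alloc_demo() */
          phys_alloc_demo(64);                      /* direct physical callers now warn too */
          return 0;
  }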
      
      Signed-off-by: default avatarJames Gowans <jgowans@amazon.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alex Graf <graf@amazon.de>
      Link: https://lore.kernel.org/r/20240619095555.85980-1-jgowans@amazon.com
      
      
      Signed-off-by: default avatarMike Rapoport (IBM) <rppt@kernel.org>
      94ff46de
    • Steven Rostedt (Google)'s avatar
      mm/memblock: Add "reserve_mem" to reserved named memory at boot up · 1e4c64b7
      Steven Rostedt (Google) authored
      In order to allow for requesting a memory region that can be used for
      things like pstore on multiple machines where the memory layout is not the
      same, add a new option to the kernel command line called "reserve_mem".
      
      The format is:  reserve_mem=nn:align:name
      
Where nn is the amount of memory to find and reserve at the given
alignment align.  The name field allows another subsystem to retrieve
where the memory was found. For example:
      
        reserve_mem=12M:4096:oops ramoops.mem_name=oops
      
      Where ramoops.mem_name will tell ramoops that memory was reserved for it
      via the reserve_mem option and it can find it by calling:
      
  if (reserve_mem_find_by_name("oops", &start, &size)) {
	// start holds the start address and size holds the size given
	...
  }
      
      This is typically used for systems that do not wipe the RAM, and this
      command line will try to reserve the same physical memory on soft reboots.
      Note, it is not guaranteed to be the same location. For example, if KASLR
      places the kernel at the location of where the RAM reservation was from a
      previous boot, the new reservation will be at a different location.  Any
subsystem using this feature must add a way to verify that the contents of
the physical memory are from a previous boot, as there may be cases where
the memory will not be located at the same place.
      
Not all systems will work either: there could be bit flips if the reboot
goes through the BIOS.  Using kexec to reboot the machine is likely to
give better results in such cases.
      
      Link: https://lore.kernel.org/all/ZjJVnZUX3NZiGW6q@kernel.org/
      
      
      
      Suggested-by: default avatarMike Rapoport <rppt@kernel.org>
      Tested-by: default avatarGuilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      Link: https://lore.kernel.org/r/20240613155527.437020271@goodmis.org
      
      
      Signed-off-by: default avatarMike Rapoport (IBM) <rppt@kernel.org>
      1e4c64b7
  12. Jun 16, 2024