  1. May 05, 2024
  2. Apr 25, 2024
  3. Feb 22, 2024
  4. Feb 21, 2024
    • mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers · c5f1e2d1
      Sumanth Korikkar authored
      Patch series "implement "memmap on memory" feature on s390".
      
      This series provides "memmap on memory" support on the s390 platform.
      "memmap on memory" allows the struct pages array to be allocated from
      the hotplugged memory range instead of from main system memory.
      
      s390 currently preallocates the struct pages array for all potentially
      possible memory, which ensures memory onlining always succeeds, but at
      the cost of significant memory consumption from the available system
      memory at boot time.  In certain extreme configurations, this could lead
      to IPL failure.
      
      "memmap on memory" ensures the struct pages array is populated from the
      self-contained hotplugged memory range instead of depleting the
      available system memory, which could eliminate IPL failure on the s390
      platform.
      
      On other platforms, the system might go OOM when the physically
      hotplugged memory depletes the available memory before it is onlined.
      Hence, the "memmap on memory" feature was introduced as described in
      commit a08a2ae3 ("mm,memory_hotplug: allocate memmap from the added
      memory range").
      
      Unlike on other architectures, s390 memory blocks are not physically
      accessible until they are online.  To make them physically accessible,
      two new memory notifiers, MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE, are
      added.  These notifiers let the hypervisor be told that the memory
      should be made physically accessible.  This allows for "memmap on
      memory" initialization during the memory hotplug onlining phase, which
      is performed before calling the MEM_GOING_ONLINE notifier.
      
      Patch 1 introduces MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers
      to prepare the transition of memory to and from a physically accessible
      state.  New mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced to ensure
      altmap cannot be written when adding memory - before it is set online. 
      This enhancement is crucial for implementing the "memmap on memory"
      feature for s390 in a subsequent patch.
      
      Patch 2 allocates vmemmap pages from the self-contained memory range for
      s390.  It allocates the memory map (struct pages array) from the
      hotplugged memory range, rather than using system memory, by passing
      altmap to vmemmap functions.
      
      Patch 3 removes unhandled memory notifier types on s390.
      
      Patch 4 implements MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers
      on s390.  The MEM_PREPARE_ONLINE memory notifier makes a memory block
      physically accessible via the sclp assign command.  The notifier ensures
      self-contained memory maps are accessible and hence enables "memmap on
      memory" on s390.  The MEM_FINISH_OFFLINE memory notifier shifts the
      memory block to an inaccessible state via the sclp unassign command.
      
      Patch 5 finally enables MHP_MEMMAP_ON_MEMORY on s390.
      
      
      This patch (of 5):
      
      Introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to
      prepare the transition of memory to and from a physically accessible
      state.  This enhancement is crucial for implementing the "memmap on
      memory" feature for s390 in a subsequent patch.
      
      Platforms such as x86 can support physical memory hotplug via ACPI.  When
      there is physical memory hotplug, an ACPI event leads to the memory
      addition with the following callchain:
      
      acpi_memory_device_add()
        -> acpi_memory_enable_device()
           -> __add_memory()
      
      After this, the hotplugged memory is physically accessible, and altmap
      support is prepared, before the "memmap on memory" initialization in
      memory_block_online() is called.
      
      On s390, memory hotplug works in a different way.  The available hotplug
      memory has to be defined upfront in the hypervisor, but it is made
      physically accessible only when the user sets it online via sysfs,
      currently in the MEM_GOING_ONLINE notifier.  This is too late and "memmap
      on memory" initialization is performed before calling MEM_GOING_ONLINE
      notifier.
      
      During the memory hotplug addition phase, altmap support is prepared.
      During the memory onlining phase, s390 requires the memory to be made
      physically accessible first; only then can the "memmap on memory"
      initialization process start.
      
      The memory provider will handle new MEM_PREPARE_ONLINE /
      MEM_FINISH_OFFLINE notifications and make the memory accessible.
      
      The mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced and is relevant when
      used along with MHP_MEMMAP_ON_MEMORY, because the altmap cannot be written
      (e.g., poisoned) when adding memory -- before it is set online.  This
      allows for adding memory with an altmap that is not currently made
      available by a hypervisor.  When onlining that memory, the hypervisor can
      be instructed to make that memory accessible via the new notifiers and the
      onlining phase will not require any memory allocations, which is helpful
      in low-memory situations.
      
      All architectures ignore unknown memory notifiers.  Therefore, the
      introduction of these new notifiers does not result in any functional
      modifications across architectures.
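
The ordering described above can be illustrated with a minimal userspace sketch. It is not kernel code: struct mem_block, provider_notify(), init_memmap_on_memory(), model_block_online() and model_block_offline() are hypothetical stand-ins, and only the event names are taken from the commit text. The sketch assumes the memory provider makes the block accessible when it sees MEM_PREPARE_ONLINE, so that the memmap can be initialised inside the block before MEM_GOING_ONLINE is sent.

```c
#include <stdbool.h>

/* Event names follow the commit text; the numeric values are arbitrary. */
enum mem_event {
    MEM_PREPARE_ONLINE,
    MEM_GOING_ONLINE,
    MEM_FINISH_OFFLINE,
};

/* Hypothetical model of one hotplugged memory block. */
struct mem_block {
    bool accessible;   /* has the hypervisor made the range usable? */
    bool memmap_ready; /* has the self-contained memmap been written? */
    bool online;
};

/* Hypothetical memory-provider callback; unknown events are ignored. */
int provider_notify(struct mem_block *blk, enum mem_event ev)
{
    switch (ev) {
    case MEM_PREPARE_ONLINE:
        blk->accessible = true;
        break;
    case MEM_FINISH_OFFLINE:
        blk->accessible = false;
        break;
    default:
        break;
    }
    return 0;
}

/* Writing the memmap into the block only works once it is accessible. */
int init_memmap_on_memory(struct mem_block *blk)
{
    if (!blk->accessible)
        return -1;
    blk->memmap_ready = true;
    return 0;
}

/* Onlining: PREPARE_ONLINE, then memmap init, then GOING_ONLINE. */
int model_block_online(struct mem_block *blk)
{
    provider_notify(blk, MEM_PREPARE_ONLINE);
    if (init_memmap_on_memory(blk) != 0) {
        provider_notify(blk, MEM_FINISH_OFFLINE);
        return -1;
    }
    provider_notify(blk, MEM_GOING_ONLINE);
    blk->online = true;
    return 0;
}

/* Offlining: tear down first, make the block inaccessible last. */
void model_block_offline(struct mem_block *blk)
{
    blk->online = false;
    blk->memmap_ready = false;
    provider_notify(blk, MEM_FINISH_OFFLINE);
}
```

In this model, calling init_memmap_on_memory() on a freshly added block fails, while model_block_online() succeeds because the PREPARE_ONLINE step runs first.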
      
      Link: https://lkml.kernel.org/r/20240108132747.3238763-1-sumanthk@linux.ibm.com
      Link: https://lkml.kernel.org/r/20240108132747.3238763-2-sumanthk@linux.ibm.com
      
      
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Suggested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. Jan 12, 2024
  6. Jan 08, 2024
  7. Dec 10, 2023
  8. Dec 06, 2023
    • mm/memory_hotplug: fix error handling in add_memory_resource() · f42ce5f0
      Sumanth Korikkar authored
      In add_memory_resource(), creation of memory block devices occurs after
      successful call to arch_add_memory().  However, creation of memory block
      devices could fail.  In that case, arch_remove_memory() is called to
      perform necessary cleanup.
      
      Currently, with or without altmap support, arch_remove_memory() is always
      passed altmap set to NULL during error handling.  This leads to the
      struct pages being freed using free_pages(), even though the allocation
      might have been performed with altmap support via
      altmap_alloc_block_buf().
      
      Fix the error handling by passing altmap in arch_remove_memory(). This
      ensures the following:
      * When altmap is disabled, deallocation of the struct pages array occurs
        via free_pages().
      * When altmap is enabled, deallocation occurs via vmem_altmap_free().
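
The effect of the fix can be sketched in a few lines of userspace C. This is only a model under stated assumptions: struct altmap_model, memmap_free_path() and the two cleanup_*() helpers are hypothetical names, and the enum merely labels which of the two deallocation routes named above would be taken.

```c
#include <stddef.h>

/* Hypothetical stand-in for the altmap descriptor. */
struct altmap_model {
    unsigned long base_pfn;
};

/* Which deallocation route the cleanup ends up using. */
enum free_path {
    FREE_VIA_PAGE_ALLOCATOR, /* corresponds to free_pages() */
    FREE_VIA_ALTMAP,         /* corresponds to vmem_altmap_free() */
};

/* The route is chosen purely by whether an altmap was passed in. */
enum free_path memmap_free_path(const struct altmap_model *altmap)
{
    return altmap ? FREE_VIA_ALTMAP : FREE_VIA_PAGE_ALLOCATOR;
}

/* Before the fix: the error path always passed NULL. */
enum free_path cleanup_before_fix(const struct altmap_model *altmap)
{
    (void)altmap; /* ignored, which is the bug being modelled */
    return memmap_free_path(NULL);
}

/* After the fix: the error path forwards the altmap it allocated with. */
enum free_path cleanup_after_fix(const struct altmap_model *altmap)
{
    return memmap_free_path(altmap);
}
```

With a non-NULL altmap, the pre-fix model reports the page-allocator route although the allocation came from the altmap, whereas the post-fix model matches the allocation route in both cases.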
      
      Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
      Fixes: a08a2ae3 ("mm,memory_hotplug: allocate memmap from the added memory range")
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory_hotplug: add missing mem_hotplug_lock · 001002e7
      Sumanth Korikkar authored
      From Documentation/core-api/memory-hotplug.rst:
      When adding/removing/onlining/offlining memory or adding/removing
      heterogeneous/device memory, we should always hold the mem_hotplug_lock
      in write mode to serialise memory hotplug (e.g. access to global/zone
      variables).
      
      mhp_(de)init_memmap_on_memory() functions can change zone stats and
      struct page content, but they are currently called w/o the
      mem_hotplug_lock.
      
      When a memory block is being offlined and kmemleak goes through each
      populated zone, the following theoretical race condition could occur:
      CPU 0:					     | CPU 1:
      memory_offline()			     |
      -> offline_pages()			     |
      	-> mem_hotplug_begin()		     |
      	   ...				     |
      	-> mem_hotplug_done()		     |
      					     | kmemleak_scan()
      					     | -> get_online_mems()
      					     |    ...
      -> mhp_deinit_memmap_on_memory()	     |
        [not protected by mem_hotplug_begin/done()]|
        Marks memory section as offline,	     |   Retrieves zone_start_pfn
        poisons vmemmap struct pages and updates   |   and struct page members.
        the zone related data			     |
         					     |    ...
         					     | -> put_online_mems()
      
      Fix this by ensuring mem_hotplug_lock is taken before performing
      mhp_init_memmap_on_memory().  Also ensure that
      mhp_deinit_memmap_on_memory() holds the lock.
      
      online/offline_pages() are currently only called from
      memory_block_online/offline(), so it is safe to move the locking there.
      
      Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
      Fixes: a08a2ae3 ("mm,memory_hotplug: allocate memmap from the added memory range")
      Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: <stable@vger.kernel.org>	[5.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  9. Oct 25, 2023
  10. Oct 04, 2023
  11. Aug 21, 2023
  12. Aug 18, 2023
  13. Jun 23, 2023
  14. Jun 19, 2023
  15. Jun 09, 2023
  16. Apr 18, 2023
  17. Apr 05, 2023
  18. Feb 20, 2023
  19. Feb 13, 2023
  20. Oct 03, 2022
  21. Sep 11, 2022
  22. Jul 29, 2022
  23. Jul 03, 2022
  24. Jun 16, 2022
  25. May 13, 2022
    • mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl · 78f39084
      Muchun Song authored
      We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
      reboot the server to enable or disable the feature of optimizing vmemmap
      pages associated with HugeTLB pages.  However, rebooting usually takes a
      long time.  So add a sysctl to enable or disable the feature at runtime
      without rebooting.  Why do we need this?  There are 3 use cases.
      
      1) The feature of minimizing overhead of struct page associated with
         each HugeTLB is disabled by default without passing
         "hugetlb_free_vmemmap=on" to the boot cmdline.  When we (ByteDance)
         deliver the servers to the users who want to enable this feature, they
         have to configure the grub (change boot cmdline) and reboot the
         servers, whereas rebooting usually takes a long time (we have thousands
         of servers).  It's a very bad experience for the users.  So we need an
         approach to enable this feature after rebooting.  This is a use case in
         our practical environment.
      
      2) In some use cases, HugeTLB pages are allocated 'on the fly' instead
         of being pulled from the HugeTLB pool; those workloads would be
         affected with this feature enabled.  Those workloads can be
         identified by the fact that they never explicitly allocate huge
         pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages'
         and then let the pages be allocated from the buddy allocator at fault
         time.  We can confirm it is a real use case from the commit
         099730d6.  For those workloads, the page fault time could be ~2x
         slower than before.  We suspect those users want to disable this
         feature if the system has enabled this before and they don't think the
         memory savings benefit is enough to make up for the performance drop.
      
      3) The workload which wants vmemmap pages to be optimized and the
         workload which wants to set 'nr_overcommit_hugepages' and does not want
         the extra overhead at fault time when the overcommitted pages are
         allocated from the buddy allocator may be deployed on the same server.
         The user could enable this feature and set 'nr_hugepages' and
         'nr_overcommit_hugepages', then disable the feature.  In this case, the
         overcommitted HugeTLB pages will not encounter the extra overhead at
         fault time.
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on · 6e02c46b
      Muchun Song authored
      Optimizing HugeTLB vmemmap pages is not compatible with allocating memmap
      on hot added memory.  If "hugetlb_free_vmemmap=on" and
      "memory_hotplug.memmap_on_memory" are both passed on the kernel command
      line, optimizing hugetlb pages takes precedence.  However, the global
      variable memmap_on_memory will still be set to 1, even though we will not
      try to allocate memmap on hot added memory.
      
      Also introduce mhp_memmap_on_memory() helper to move the definition of
      "memmap_on_memory" to the scope of CONFIG_MHP_MEMMAP_ON_MEMORY.  In the
      next patch, mhp_memmap_on_memory() will also be exported to be used in
      hugetlb_vmemmap.c.
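
The precedence rule described above can be stated as a one-line predicate. The following is a userspace sketch only: effective_memmap_on_memory() is a hypothetical name, not the kernel helper, and it assumes both command-line settings are available as plain booleans.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the override: the memmap_on_memory request is
 * honoured only when HugeTLB vmemmap optimisation is not enabled.
 */
bool effective_memmap_on_memory(bool memmap_on_memory_requested,
                                bool hugetlb_free_vmemmap_on)
{
    return memmap_on_memory_requested && !hugetlb_free_vmemmap_on;
}
```

In this model, passing both options yields false, so the reported value agrees with what the hotplug path would actually do.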
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-3-songmuchun@bytedance.com
      
      
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make alloc_contig_range work at pageblock granularity · b2c9e2fb
      Zi Yan authored
      alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid
      merging pageblocks with different migratetypes.  It might unnecessarily
      convert extra pageblocks at the beginning and at the end of the range. 
      Change alloc_contig_range() to work at pageblock granularity.
      
      Special handling is needed for free pages and in-use pages across the
      boundaries of the range specified by alloc_contig_range(), because these
      partially isolated pages cause free page accounting issues.  The free
      pages will be split and freed into separate migratetype lists; the
      in-use pages will be migrated, then the freed pages will be handled in
      the aforementioned way.
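
The difference between the two granularities is plain alignment arithmetic, sketched below in userspace C. The constants MODEL_PAGEBLOCK_NR_PAGES and MODEL_MAX_ORDER_NR_PAGES are illustrative assumptions (the real values depend on architecture and configuration), and the helper names are hypothetical. The sketch only computes which page frame range would be covered; it does not model isolation or migration.

```c
/* Illustrative sizes only; both must be powers of two. */
#define MODEL_PAGEBLOCK_NR_PAGES 512UL
#define MODEL_MAX_ORDER_NR_PAGES 1024UL

/* Half-open page frame number range [start, end). */
struct pfn_range {
    unsigned long start;
    unsigned long end;
};

/* Round a pfn down to a multiple of nr (nr is a power of two). */
unsigned long pfn_align_down(unsigned long pfn, unsigned long nr)
{
    return pfn & ~(nr - 1);
}

/* Round a pfn up to a multiple of nr (nr is a power of two). */
unsigned long pfn_align_up(unsigned long pfn, unsigned long nr)
{
    return (pfn + nr - 1) & ~(nr - 1);
}

/* Range that must be covered when working at the given granularity. */
struct pfn_range isolation_range(unsigned long start, unsigned long end,
                                 unsigned long granule)
{
    struct pfn_range r;

    r.start = pfn_align_down(start, granule);
    r.end = pfn_align_up(end, granule);
    return r;
}
```

For a request of [1536, 2000), the pageblock granule covers [1536, 2048), a single pageblock in this model, while the larger granule covers [1024, 2048) and so pulls in one extra pageblock at the start.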
      
      [ziy@nvidia.com: fix deadlock/crash]
        Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com
      Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com
      
      
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Eric Ren <renzhengeek@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory_hotplug: use pgprot_val to get value of pgprot · 6366238b
      liusongtang authored
      pgprot.pgprot is non-portable code.  It should be replaced by portable
      macro pgprot_val.
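
Why the accessor is more portable can be shown with a small userspace mock. The two pgprot_t definitions below are assumptions made for illustration, not copies of any architecture's headers; MODEL_PLAIN_PGPROT and pgprot_same() are hypothetical names. The point is that code written against pgprot_val() compiles with either representation, whereas code that reads a .pgprot member compiles only with the struct one.

```c
/*
 * Two mock representations of pgprot_t. Define MODEL_PLAIN_PGPROT to
 * switch to the plain-integer variant.
 */
#ifdef MODEL_PLAIN_PGPROT
typedef unsigned long pgprot_t;
#define pgprot_val(x) (x)
#define __pgprot(x)   (x)
#else
typedef struct { unsigned long pgprot; } pgprot_t;
#define pgprot_val(x) ((x).pgprot)
#define __pgprot(x)   ((pgprot_t) { (x) })
#endif

/* Compares two protection values without touching the representation. */
int pgprot_same(pgprot_t a, pgprot_t b)
{
    return pgprot_val(a) == pgprot_val(b);
}
```

pgprot_same() builds unchanged with or without MODEL_PLAIN_PGPROT defined; replacing pgprot_val(a) with a.pgprot would break the plain-integer build.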
      
      Link: https://lkml.kernel.org/r/20220426071302.220646-1-liusongtang@huawei.com
      
      
      Signed-off-by: liusongtang <liusongtang@huawei.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>