  1. Jul 04, 2024
  2. May 07, 2024
  3. Feb 21, 2024
  4. Jan 05, 2024
    • mm: ratelimit stat flush from workingset shrinker · d4a5b369
      Shakeel Butt authored
      One of our workloads (Postgres 14 + sysbench OLTP) regressed on a newer
      upstream kernel, and on further investigation the cause appears to be
      the always-synchronous rstat flush in count_shadow_nodes() added by
      commit f82e6bf9 ("mm: memcg: use rstat for non-hierarchical stats").
      On closer inspection, we don't really need accurate stats in this
      function, as it was already approximating the number of shadow entries
      to keep for maintaining the refault information.  Since there is
      already a 2-second periodic rstat flush, we don't need exact stats
      here.  Let's ratelimit the rstat flush in this code path.
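
      A rough userspace model of the ratelimiting idea (every name here is
      invented for illustration; the kernel's actual helper and timekeeping
      differ):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <time.h>

        #define FLUSH_PERIOD_NS (2ULL * 1000 * 1000 * 1000)  /* mirror the 2s periodic flush */

        static _Atomic unsigned long long last_flush_ns;

        static unsigned long long now_ns(void)
        {
                struct timespec ts;

                clock_gettime(CLOCK_MONOTONIC, &ts);
                return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
        }

        /* Flush only if the stats have not been flushed within one periodic
         * window; callers that tolerate ~2s-stale numbers, like the shadow
         * shrinker, use this instead of an unconditional synchronous flush. */
        static bool flush_stats_ratelimited(void (*do_flush)(void))
        {
                unsigned long long now = now_ns();
                unsigned long long last = atomic_load(&last_flush_ns);

                if (now - last < FLUSH_PERIOD_NS)
                        return false;           /* fresh enough: skip */
                if (!atomic_compare_exchange_strong(&last_flush_ns, &last, now))
                        return false;           /* lost the race: someone else flushes */
                do_flush();
                return true;
        }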
      
      Link: https://lkml.kernel.org/r/20231228073055.4046430-1-shakeelb@google.com
      Fixes: f82e6bf9 ("mm: memcg: use rstat for non-hierarchical stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. Dec 20, 2023
    • mm: memcg: restore subtree stats flushing · 7d7ef0a4
      Yosry Ahmed authored
      Stats flushing for memcg currently follows the following rules:
      - Always flush the entire memcg hierarchy (i.e. flush the root).
      - Only one flusher is allowed at a time. If someone else tries to flush
        concurrently, they skip and return immediately.
      - A periodic flusher flushes all the stats every 2 seconds.
      
      This approach is followed because all flushes are serialized by a
      global rstat spinlock.  On the memcg side, flushing is invoked from
      userspace reads as well as in-kernel flushers (e.g. reclaim, refault,
      etc.).  This approach aims to avoid serializing all flushers on the
      global lock, which can cause a significant performance hit under high
      concurrency.
      
      This approach has the following problems:
      - Occasionally a userspace read of the stats of a non-root cgroup will
        be too expensive as it has to flush the entire hierarchy [1].
      - Sometimes stats accuracy is compromised if there is an ongoing
        flush, and we skip and return before the subtree of interest is
        actually flushed, yielding stale stats (by up to 2s due to periodic
        flushing). This is more visible when reading stats from userspace,
        but can also affect in-kernel flushers.
      
      The latter problem is particularly a concern when userspace reads stats
      after an event occurs, but gets stats from before the event. Examples:
      - When memory usage / pressure spikes, a userspace OOM handler may look
        at the stats of different memcgs to select a victim based on various
        heuristics (e.g. how much private memory will be freed by killing
        this). Reading stale stats from before the usage spike in this case
        may cause a wrongful OOM kill.
      - A proactive reclaimer may read the stats after writing to
        memory.reclaim to measure the success of the reclaim operation. Stale
        stats from before reclaim may give a false negative.
      - Reading the stats of a parent and a child memcg may be inconsistent
        (child larger than parent), if the flush doesn't happen when the
        parent is read, but happens when the child is read.
      
      As for in-kernel flushers, they will occasionally get stale stats.  No
      regressions are currently known from this, but if there are regressions,
      they would be very difficult to debug and link to the source of the
      problem.
      
      This patch aims to fix these problems by restoring subtree flushing and
      removing the unified/coalesced flushing logic that skips flushing if
      there is an ongoing flush.  On its own, this change would introduce a
      significant regression with global stats flushing thresholds; with
      per-memcg stats flushing thresholds it seems to perform really well, as
      the thresholds protect the underlying lock from unnecessary contention.
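
      A self-contained toy of the two ingredients above - per-memcg update
      counters and subtree-scoped flushing - with invented names (this is not
      the memcg code itself):

        #include <stdatomic.h>
        #include <stdbool.h>

        #define FLUSH_THRESHOLD 1024    /* arbitrary: pending updates before a flush pays off */

        struct toy_memcg {
                struct toy_memcg *parent;
                _Atomic long pending_updates;   /* stat updates since the last flush */
        };

        /* Writers: cheap per-update bookkeeping, propagated up the hierarchy. */
        static void toy_stat_updated(struct toy_memcg *memcg, int count)
        {
                for (; memcg; memcg = memcg->parent)
                        atomic_fetch_add(&memcg->pending_updates, count);
        }

        /* Readers: flush only the subtree of interest, and only when it is
         * dirty enough that skipping would yield meaningfully stale numbers. */
        static bool toy_flush_subtree(struct toy_memcg *memcg,
                                      void (*do_flush)(struct toy_memcg *))
        {
                if (atomic_load(&memcg->pending_updates) < FLUSH_THRESHOLD)
                        return false;           /* fresh enough: skip the lock entirely */
                do_flush(memcg);                /* fold stats for this subtree only */
                atomic_store(&memcg->pending_updates, 0);
                return true;
        }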
      
      This patch was tested in two ways to ensure the latency of flushing is
      up to par, on a machine with 384 cpus:
      
      - A synthetic test with 5000 concurrent workers in 500 cgroups doing
        allocations and reclaim, as well as 1000 readers for memory.stat
        (variation of [2]). No regressions were noticed in the total runtime.
        Note that significant regressions in this test are observed with
        global stats thresholds, but not with per-memcg thresholds.
      
      - A synthetic stress test for concurrently reading memcg stats while
        memory allocation/freeing workers are running in the background,
        provided by Wei Xu [3]. With 250k threads reading the stats every
        100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
        of reads take more than 1ms, and no reads take more than 100ms.
      
      [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
      [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
      [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
      
      [akpm@linux-foundation.org: fix mm/zswap.c]
      [yosryahmed@google.com: remove stats flushing mutex]
        Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com
      Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com
      
      
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ivan Babrou <ivan@cloudflare.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutny <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Wei Xu <weixugc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: workingset: move the stats flush into workingset_test_recent() · b0068472
      Yosry Ahmed authored
      The workingset code flushes the stats in workingset_refault() to get
      accurate stats of the eviction memcg.  In preparation for more scoped
      flushing and passing the eviction memcg to the flush call, move the
      call into workingset_test_recent(), where we have a pointer to the
      eviction memcg.
      
      The flush call is sleepable, and cannot be made in an rcu read section. 
      Hence, minimize the rcu read section by also moving it into
      workingset_test_recent().  Furthermore, instead of holding the rcu read
      lock throughout workingset_test_recent(), only hold it briefly to get a
      ref on the eviction memcg.  This allows us to make the flush call after we
      get the eviction memcg.
      
      As for workingset_refault(), nothing else there appears to be protected by
      rcu.  The memcg of the faulted folio (which is not necessarily the same as
      the eviction memcg) is protected by the folio lock, which is held from all
      callsites.  Add a VM_BUG_ON() to make sure this doesn't change from under
      us.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com
      
      
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ivan Babrou <ivan@cloudflare.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutny <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Wei Xu <weixugc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  6. Dec 12, 2023
    • mm/mglru: fix underprotected page cache · 08148805
      Yu Zhao authored
      Unmapped folios accessed through file descriptors can be underprotected. 
      Those folios are added to the oldest generation based on:
      
      1. The fact that they are less costly to reclaim (no need to walk the
         rmap and flush the TLB) and have less impact on performance (don't
         cause major PFs and can be non-blocking if needed again).
      2. The observation that they are likely to be single-use. E.g., for
         client use cases like Android, its apps parse configuration files
         and store the data in heap (anon); for server use cases like MySQL,
         it reads from InnoDB files and holds the cached data for tables in
         buffer pools (anon).
      
      However, the oldest generation can be very short lived, and if so, it
      doesn't provide the PID controller with enough time to respond to a surge
      of refaults.  (Note that the PID controller uses weighted refaults and
      those from evicted generations only take half of the whole weight.) In
      other words, for a short lived generation, the moving average smooths out
      the spike quickly.
      
      To fix the problem:
      1. For folios that are already on LRU, if they can be beyond the
         tracking range of tiers, i.e., five accesses through file
         descriptors, move them to the second oldest generation to give them
         more time to age. (Note that tiers are used by the PID controller
         to statistically determine whether folios accessed multiple times
         through file descriptors are worth protecting.)
      2. When adding unmapped folios to LRU, adjust the placement of them so
         that they are not too close to the tail. The effect of this is
         similar to the above.
      
      On Android, launching 55 apps sequentially:
                                 Before     After      Change
        workingset_refault_anon  25641024   25598972   0%
        workingset_refault_file  115016834  106178438  -8%
      
      Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
      Fixes: ac35a490 ("mm: multi-gen LRU: minimal implementation")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Tested-by: Kalesh Singh <kaleshsingh@google.com>
      Cc: T.J. Mercier <tjmercier@google.com>
      Cc: Kairui Song <ryncsn@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • list_lru: allow explicit memcg and NUMA node selection · 0a97c01c
      Nhat Pham authored
      Patch series "workload-specific and memory pressure-driven zswap
      writeback", v8.
      
      There are currently several issues with zswap writeback:
      
      1. There is only a single global LRU for zswap, making it impossible to
         perform workload-specific shrinking - a memcg under memory pressure
         cannot determine which pages in the pool it owns, and often ends up
         writing pages from other memcgs. This issue has been previously
         observed in practice and mitigated by simply disabling
         memcg-initiated shrinking:
      
         https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
      
         But this solution leaves a lot to be desired, as we still do not
         have an avenue for a memcg to free up its own memory locked up in
         the zswap pool.
      
      2. We only shrink the zswap pool when the user-defined limit is hit.
         This means that if we set the limit too high, cold data that are
         unlikely to be used again will reside in the pool, wasting precious
         memory. It is hard to predict how much zswap space will be needed
         ahead of time, as this depends on the workload (specifically, on
         factors such as memory access patterns and compressibility of the
         memory pages).
      
      This patch series solves these issues by separating the global zswap LRU
      into per-memcg and per-NUMA LRUs, and performing workload-specific (i.e.
      memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
      shrinker does not have any parameter that must be tuned by the user, and
      can be opted in or out on a per-memcg basis.
      
      As a proof of concept, we ran the following synthetic benchmark: build
      the Linux kernel in a memory-limited cgroup, and allocate some cold data
      in tmpfs to see if the shrinker could write it out and improve the
      overall performance.  Depending on the amount of cold data generated, we
      observe a 14% to 35% reduction in kernel CPU time used in the kernel
      builds.
      
      
      This patch (of 6):
      
      The interface of list_lru is based on the assumption that the list node
      and the data it represents belong to the same allocation, placed on the
      correct node/memcg.  While this assumption is valid for existing slab
      object LRUs such as dentries and inodes, it is undocumented, and rather
      inflexible for certain potential list_lru users (such as the upcoming
      zswap shrinker and the THP shrinker).  It has caused us a lot of issues
      during our development.
      
      This patch changes the list_lru interface so that the caller must
      explicitly specify the NUMA node and memcg when adding and removing
      objects.  The old list_lru_add() and list_lru_del() are renamed to
      list_lru_add_obj() and list_lru_del_obj(), respectively.
      
      It also extends the list_lru API with a new function, list_lru_putback,
      which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
      does not increment the LRU node count (as list_lru_isolate does not
      decrement the node count).  list_lru_putback also allows for explicit
      memcg and NUMA node selection.
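
      The reshaped interface described above, roughly (a sketch: the exact
      prototypes are assumptions inferred from this text, not copied from
      include/linux/list_lru.h):

        #include <stdbool.h>

        struct list_lru;
        struct list_head;
        struct mem_cgroup;

        /* Callers now name the NUMA node and memcg explicitly. */
        bool list_lru_add(struct list_lru *lru, struct list_head *item,
                          int nid, struct mem_cgroup *memcg);
        bool list_lru_del(struct list_lru *lru, struct list_head *item,
                          int nid, struct mem_cgroup *memcg);

        /* The old derive-it-from-the-object behavior lives on under new names. */
        bool list_lru_add_obj(struct list_lru *lru, struct list_head *item);
        bool list_lru_del_obj(struct list_lru *lru, struct list_head *item);

        /* Undo a prior list_lru_isolate() without touching the node count. */
        void list_lru_putback(struct list_lru *lru, struct list_head *item,
                              int nid, struct mem_cgroup *memcg);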
      
      Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  7. Oct 04, 2023
    • mm: workingset: dynamically allocate the mm-shadow shrinker · 219c666e
      Qi Zheng authored
      Use new APIs to dynamically allocate the mm-shadow shrinker.
      
      Link: https://lkml.kernel.org/r/20230911094444.68966-20-zhengqi.arch@bytedance.com
      
      
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. Aug 24, 2023
    • mm: memcg: use rstat for non-hierarchical stats · f82e6bf9
      Yosry Ahmed authored
      Currently, memcg uses rstat to maintain aggregated hierarchical stats. 
      Counters are maintained for hierarchical stats at each memcg.  Rstat
      tracks which cgroups have updates on which cpus to keep those counters
      fresh on the read-side.
      
      Non-hierarchical stats are currently not covered by rstat.  Their per-cpu
      counters are summed up on every read, which is expensive.  The original
      implementation did the same.  At some point before rstat, non-hierarchical
      aggregated counters were introduced by commit a983b5eb ("mm:
      memcontrol: fix excessive complexity in memory.stat reporting").  However,
      those counters were updated on the performance critical write-side, which
      caused regressions, so they were later removed by commit 815744d7
      ("mm: memcontrol: don't batch updates of local VM stats and events").  See
      [1] for more detailed history.
      
      Kernel versions in between a983b5eb & 815744d7 (a year and a half)
      enjoyed cheap reads of non-hierarchical stats, s...
  9. Jun 09, 2023
    • Multi-gen LRU: fix workingset accounting · 3af0191a
      Kalesh Singh authored
      On Android app cycle workloads, MGLRU showed a significant reduction in
      workingset refaults although pgpgin/pswpin remained relatively unchanged. 
      This indicated MGLRU may be undercounting workingset refaults.
      
      This has impact on userspace programs, like Android's LMKD, that monitor
      workingset refault statistics to detect thrashing.
      
      It was found that refaults were only accounted if the MGLRU shadow entry
      was for a recently evicted folio.  However, recently evicted folios should
      be accounted as workingset activation, and refaults should be accounted
      regardless of recency.
      
      Fix MGLRU's workingset refault and activation accounting to more closely
      match that of the conventional active/inactive LRU.
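
      In other words (a toy sketch of the rule, not the MGLRU code; the
      anon/file split and the real counter plumbing are omitted):

        #include <stdbool.h>

        enum { WORKINGSET_REFAULT, WORKINGSET_ACTIVATE, NR_WS_STATS };

        static unsigned long ws_stat[NR_WS_STATS];

        static void account_refault(bool recently_evicted)
        {
                ws_stat[WORKINGSET_REFAULT]++;           /* counted regardless of recency */
                if (recently_evicted)
                        ws_stat[WORKINGSET_ACTIVATE]++;  /* recent eviction: an activation */
        }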
      
      Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com
      Fixes: ac35a490 ("mm: multi-gen LRU: minimal implementation")
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: cleanup lru_gen_test_recent() · d7f1afd0
      T.J. Alumbaugh authored
      Avoid passing memcg* and pglist_data* to lru_gen_test_recent()
      since we only use the lruvec anyway.
      
      Link: https://lkml.kernel.org/r/20230522112058.2965866-4-talumbau@google.com
      
      
      Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
      Reviewed-by: Yuanchu Xie <yuanchu@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • workingset: refactor LRU refault to expose refault recency check · ffcb5f52
      Nhat Pham authored
      Patch series "cachestat: a new syscall for page cache state of files",
      v13.
      
      There is currently no good way to query the page cache statistics of large
      files and directory trees.  There is mincore(), but it scales poorly: the
      kernel writes out a lot of bitmap data that userspace has to aggregate,
      when the user really does not care about per-page information in that
      case.  The user also needs to mmap and unmap each file as it goes along,
      which can be quite slow as well.
      
      Some use cases where this information could come in handy:
        * Allowing a database to decide whether to perform an index scan or
          direct table queries based on the in-memory cache state of the
          index.
        * Visibility into the writeback algorithm, for diagnosing performance
          issues.
        * Workload-aware writeback pacing: estimating IO fulfilled by page cache
          (and IO to be done) within a range of a file, allowing for more
          frequent syncing when and where there is IO capacity, and batching
          when there is not.
        * Computing memory usage of large files/directory trees, analogous to
          the du tool for disk usage.
      
      More information about these use cases could be found in this thread:
      https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
      
      This series of patches introduces a new system call, cachestat, that
      summarizes the page cache statistics (number of cached pages, dirty
      pages, pages marked for writeback, evicted pages, etc.) of a file over
      a specified range of bytes.  It also includes a selftest suite that
      tests some typical usage.  Currently, the syscall is only wired up for
      the x86 architecture.
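
      A hedged usage sketch (the struct layout and syscall number below are
      assumptions written out by hand for illustration; prefer the uapi
      headers if your toolchain ships them):

        #include <stdint.h>
        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        #ifndef __NR_cachestat
        #define __NR_cachestat 451              /* assumed x86-64 syscall number */
        #endif

        struct cachestat_range { uint64_t off; uint64_t len; };
        struct cachestat {
                uint64_t nr_cache;              /* cached pages */
                uint64_t nr_dirty;              /* dirty pages */
                uint64_t nr_writeback;          /* pages marked for writeback */
                uint64_t nr_evicted;            /* evicted pages */
                uint64_t nr_recently_evicted;   /* recently evicted pages */
        };

        int main(int argc, char **argv)
        {
                struct cachestat_range range = { 0, 1 << 30 };  /* first 1 GiB of the file */
                struct cachestat cs;
                int fd = open(argc > 1 ? argv[1] : "/etc/hosts", O_RDONLY);

                if (fd < 0 || syscall(__NR_cachestat, fd, &range, &cs, 0))
                        return 1;
                printf("cached %llu dirty %llu writeback %llu evicted %llu recent %llu\n",
                       (unsigned long long)cs.nr_cache,
                       (unsigned long long)cs.nr_dirty,
                       (unsigned long long)cs.nr_writeback,
                       (unsigned long long)cs.nr_evicted,
                       (unsigned long long)cs.nr_recently_evicted);
                close(fd);
                return 0;
        }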
      
      This interface is inspired by past discussion and concerns with fincore,
      which has a similar design (and as a result, issues) as mincore.  Relevant
      links:
      
      https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
      https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
      
      
      I have also developed a small tool that computes the memory usage of
      files and directories, analogous to the du utility.  The user can choose
      between mincore or cachestat (with cachestat exporting more information
      than mincore).  To compare the performance of these two options, I
      benchmarked the tool on the root directory of a Meta server machine,
      for five runs each:
      
      Using cachestat
      real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
      user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
      sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
      
      Using mincore:
      real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
      user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
      sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
      
      I also ran both syscalls on a 2TB sparse file:
      
      Using cachestat:
      real    0m0.009s
      user    0m0.000s
      sys     0m0.009s
      
      Using mincore:
      real    0m37.510s
      user    0m2.934s
      sys     0m34.558s
      
      Very large files like this are the pathological case for mincore.  In
      fact, to compute the stats for a single 2TB file, mincore takes as long as
      cachestat takes to compute the stats for the entire tree!  This could
      easily happen inadvertently when we run it on subdirectories.  Mincore is
      clearly not suitable for a general-purpose command line tool.
      
      Regarding security concerns, cachestat() should not pose any additional
      issues.  The caller already has read permission to the file itself (since
      they need an fd to that file to call cachestat).  This means that the
      caller can access the underlying data in its entirety, which is a much
      greater source of information (and as a result, a much greater security
      risk) than the cache status itself.
      
      The latest API change (in v13 of the patch series) was suggested by Jens
      Axboe.  It allows for a 64-bit length argument, even on 32-bit
      architectures (which was previously not possible due to the limit on the
      number of syscall arguments).  Furthermore, it eliminates the need for
      compatibility handling - every user can use the same ABI.
      
      
      This patch (of 4):
      
      In preparation for computing recently evicted pages in cachestat, refactor
      workingset_refault and lru_gen_refault to expose a helper function that
      would test if an evicted page is recently evicted.
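
      The shape of the refactor, as a sketch (the exact kernel signature is
      an assumption inferred from this description):

        #include <stdbool.h>

        struct folio;

        /* True if the shadow entry records a recent eviction; *workingset is
         * set when the folio was still part of the workingset when evicted. */
        bool workingset_test_recent(void *shadow, bool file, bool *workingset);

        /* workingset_refault()/lru_gen_refault() become callers of this
         * helper, and cachestat can later reuse the same recency test without
         * faulting pages back in. */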
      
      [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
        Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
      Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
      
      
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  10. Apr 18, 2023
  11. Feb 03, 2023
  12. Jan 18, 2023
    • workingset: fix confusion around eviction vs refault container · f78dfc7b
      Johannes Weiner authored
      Refault decisions are made based on the lruvec where the page was evicted,
      as that determined its LRU order while it was alive.  Stats and workingset
      aging must then occur on the lruvec of the new page, as that's the node
      and cgroup that experience the refault and that's the lruvec whose
      nonresident info ages out by a new resident page.  Those lruvecs could be
      different when a page is shared between cgroups, or the refaulting page is
      allocated on a different node.
      
      There are currently two mix-ups:
      
      1. When swap is available, the resident anon set must be considered
         when comparing the refault distance. The comparison is made against
         the right anon set, but the check for swap is not. When pages get
         evicted from a cgroup with swap, and refault in one without, this
         can incorrectly consider a hot refault as cold - and vice
         versa. Fix that by using the eviction cgroup for the swap check.
      
      2. The stats and workingset age are updated against the wrong lruvec
         altogether: the right cgroup but the wrong NUMA node. When a page
         refaults on a different NUMA node, this will have confusing stats
         and distort the workingset age on a different lruvec - again
         possibly resulting in hot/cold misclassifications down the line.
      
      Fix the swap check and the refault pgdat to address both concerns.
      
      This was found during code review.  It hasn't caused notable issues in
      production, suggesting that those refault-migrations are relatively rare
      in practice.
      
      Link: https://lkml.kernel.org/r/20230104222944.2380117-1-nphamcs@gmail.com
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Co-developed-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio · 391655fe
      Yu Zhao authored
      Patch series "mm: multi-gen LRU: memcg LRU", v3.
      
      Overview
      ========
      
      A memcg LRU is a per-node LRU of memcgs.  It is also an LRU of LRUs,
      since each node and memcg combination has an LRU of folios (see
      mem_cgroup_lruvec()).
      
      Its goal is to improve the scalability of global reclaim, which is
      critical to system-wide memory overcommit in data centers.  Note that
      memcg reclaim is currently out of scope.
      
      Its memory bloat is a pointer to each lruvec and negligible to each
      pglist_data.  In terms of traversing memcgs during global reclaim, it
      improves the best-case complexity from O(n) to O(1) and does not affect
      the worst-case complexity O(n).  Therefore, on average, it has a sublinear
      complexity in contrast to the current linear complexity.
      
      The basic structure of a memcg LRU can be understood by an analogy to
      the active/inactive LRU (of folios):
      1. It has the young and the old (generations), i.e., the counterparts
         to the active and the inactive;
      2. The increment of max_seq triggers promotion, i.e., the counterpart
         to activation;
      3. Other events trigger similar operations, e.g., offlining a memcg
         triggers demotion, i.e., the counterpart to deactivation.
      
      In terms of global reclaim, it has two distinct features:
      1. Sharding, which allows each thread to start at a random memcg (in
         the old generation) and improves parallelism;
      2. Eventual fairness, which allows direct reclaim to bail out at will
         and reduces latency without affecting fairness over some time.
      
      The commit message in patch 6 details the workflow:
      https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
      
      The following is a simple test to quickly verify its effectiveness.
      
        Test design:
        1. Create multiple memcgs.
        2. Each memcg contains a job (fio).
        3. All jobs access the same amount of memory randomly.
        4. The system does not experience global memory pressure.
        5. Periodically write to the root memory.reclaim.
      
        Desired outcome:
        1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
           over mean(pgsteal) is close to 0%.
        2. The total pgsteal is close to the total requested through
           memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
           to 100%.
      
        Actual outcome [1]:
                                           MGLRU off    MGLRU on
        stddev(pgsteal) / mean(pgsteal)    75%          20%
        sum(pgsteal) / sum(requested)      425%         95%
      
        ####################################################################
        MEMCGS=128
      
        for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
            mkdir /sys/fs/cgroup/memcg$memcg
        done
      
        start() {
            echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs
      
            fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
                --filename=/dev/zero --size=1920M --rw=randrw \
                --rate=64m,64m --random_distribution=random \
                --fadvise_hint=0 --time_based --runtime=10h \
                --group_reporting --minimal
        }
      
        for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
            start &
        done
      
        sleep 600
      
        for ((i = 0; i < 600; i++)); do
            echo 256m >/sys/fs/cgroup/memory.reclaim
            sleep 6
        done
      
        for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
            grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
        done
        ####################################################################
      
      [1]: This was obtained from running the above script (touches less
           than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
           hour.
      
      
      This patch (of 8):
      
      The new name lru_gen_folio will be more distinct from the coming
      lru_gen_memcg.
      
      Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
      
      
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  13. Dec 11, 2022
  14. Nov 08, 2022
    • mm: vmscan: make rotations a secondary factor in balancing anon vs file · 0538a82c
      Johannes Weiner authored
      We noticed a 2% webserver throughput regression after upgrading from 5.6. 
      This could be tracked down to a shift in the anon/file reclaim balance
      (confirmed with swappiness) that resulted in worse reclaim efficiency and
      thus more kswapd activity for the same outcome.
      
      The change that exposed the problem is aae466b0 ("mm/swap: implement
      workingset detection for anonymous LRU").  By qualifying swapins based on
      their refault distance, it lowered the cost of anon reclaim in this
      workload, in turn causing (much) more anon scanning than before.  Scanning
      the anon list is more expensive due to the higher ratio of mmapped pages
      that may rotate during reclaim, and so the result was an increase in %sys
      time.
      
      Right now, rotations aren't considered a cost when balancing scan pressure
      between LRUs.  We can end up with very few file refaults putting all the
      scan pressure on hot anon pages that are rotated en masse, don't get
      reclaimed, and never push back on the file LRU again.  We still only
      reclaim file cache in that case, but we burn a lot of CPU rotating anon
      pages.  It's "fair" from an LRU age POV, but doesn't reflect the real cost
      it imposes on the system.
      
      Consider rotations as a secondary factor in balancing the LRUs.  This
      doesn't attempt to make a precise comparison between IO cost and CPU cost,
      it just says: if reloads are about comparable between the lists, or
      rotations are overwhelmingly different, adjust for CPU work.
      
      This fixed the regression on our webservers.  It has since been deployed
      to the entire Meta fleet and hasn't caused any problems.
      
      Link: https://lkml.kernel.org/r/20221013193113.726425-1-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  15. Sep 26, 2022
    • mm: multi-gen LRU: minimal implementation · ac35a490
      Yu Zhao authored
      To avoid confusion, the terms "promotion" and "demotion" will be applied
      to the multi-gen LRU, as a new convention; the terms "activation" and
      "deactivation" will be applied to the active/inactive LRU, as usual.
      
      The aging produces young generations.  Given an lruvec, it increments
      max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
      hot pages to the youngest generation when it finds them accessed through
      page tables; the demotion of cold pages happens consequently when it
      increments max_seq.  Promotion in the aging path does not involve any LRU
      list operations, only the updates of the gen counter and
      lrugen->nr_pages[]; demotion, unless it is the result of the increment of
      max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
      aging has the complexity O(nr_hot_pages), since it is only interested in
      hot pages.
      
      The eviction consumes old generations.  Given an lruvec, it increments
      min_seq when lrugen->lists[] indexed by min_seq%M...
  16. Jul 03, 2022
    • mm: shrinkers: provide shrinkers with names · e33c267a
      Roman Gushchin authored
      Currently shrinkers are anonymous objects.  For debugging purposes they
      can be identified by count/scan function names, but that's not always
      useful: e.g. for superblock shrinkers it's nice to have at least an
      idea of which superblock the shrinker belongs to.
      
      This commit adds names to shrinkers.  The register_shrinker() and
      prealloc_shrinker() functions are extended to take a format string and
      arguments to generate a name (see the sketch after the directory listing
      below).
      
      In some cases it's not possible to determine a good name at the time when
      a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
      provided.
      
      The expected format is:
          <subsystem>-<shrinker_type>[:<instance>]-<id>
      For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
      
      After this change the shrinker debugfs directory looks like:
        $ cd /sys/kernel/debug/shrinker/
        $ ls
          dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
          mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
          mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
          rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
          sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
          sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
          sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
          sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
          sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
          sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
          sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
          sb-dax-11           sb-proc-45       sb-tmpfs-35
          sb-debugfs-7        sb-proc-46       sb-tmpfs-40
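
      The registration side, roughly (prototypes are assumptions based on this
      text, not copied from the kernel headers; the call shown is a
      hypothetical example following the naming scheme above):

        struct shrinker;

        int register_shrinker(struct shrinker *shrinker, const char *fmt, ...);
        int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...);
        int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...);

        /* e.g. a filesystem registering its per-superblock shrinker under a
         * name derived from the filesystem type:
         *
         *     prealloc_shrinker(&sb->s_shrink, "sb-%s", sb->s_type->name);
         */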
      
      [roman.gushchin@linux.dev: fix build warnings]
        Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
      
      
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
      
      
      Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  17. Apr 21, 2022
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Shakeel Butt authored
      Daniel Dao has reported [1] a regression on workloads that may trigger
      a lot of refaults (anon and file).  The underlying issue is that
      flushing rstat is expensive.  Although rstat flushes are batched with
      (nr_cpus * MEMCG_BATCH) stat updates, it seems like there are workloads
      which genuinely do stat updates larger than the batch value within a
      short amount of time.  Since the rstat flush can happen in
      performance-critical codepaths like page faults, such workloads can
      suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance-critical codepaths.  More specifically,
      the kernel relies on the async periodic rstat flusher to flush the
      stats, and only if the periodic flusher is delayed by more than twice
      its normal time window does the kernel allow rstat flushing from the
      performance-critical codepaths.
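
      The rule, as a toy helper (names and units invented for illustration):

        #include <stdbool.h>

        #define FLUSH_PERIOD_MS 2000ULL   /* the periodic flusher's normal window */

        /* A hot codepath flushes synchronously only when the periodic flusher
         * has fallen more than two full windows behind; otherwise it trusts
         * the upcoming periodic flush. */
        static bool should_sync_flush(unsigned long long last_periodic_flush_ms,
                                      unsigned long long now_ms)
        {
                return now_ms - last_periodic_flush_ms > 2 * FLUSH_PERIOD_MS;
        }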
      
      Now the question: what are the side-effects of this change?  The worst
      that can happen is that the refault codepath sees 4-second-old lruvec
      stats and causes false (or missed) activations of the refaulted page,
      which may under- or overestimate the workingset size.  That is not very
      concerning, as the kernel can already miss or do false activations.
      
      There are two more codepaths whose flushing behavior is not changed by
      this patch, and we may need to revisit them in the future.  One is the
      writeback stats used by dirty throttling, and the second is the
      deactivation heuristic in reclaim.  For now we are keeping an eye on
      them; if regressions are reported due to these codepaths, we will
      reevaluate then.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. Mar 22, 2022
  19. Mar 21, 2022
  20. Nov 09, 2021
    • vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Johannes Weiner authored
      Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This caused
      problems on highmem hosts: struct inodes could fill up the lowmem zones
      before the cache was reclaimed in the highmem zones.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
      The consequence of dropping page cache prematurely is that we're seeing
      subtle and not-so-subtle failures in all of the above-mentioned
      scenarios, with the workload generally entering unexpected thrashing
      states while losing the ability to reliably detect it.
      
      To fix this on non-highmem systems at least, going back to rotating
      inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
      ("mm: don't reclaim inodes with many attached pages")) and failed
      (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
      attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
      
      Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. Oct 18, 2021
  22. Sep 27, 2021
  23. Sep 23, 2021
    • memcg: flush lruvec stats in the refault · 1f828223
      Shakeel Butt authored
      Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
      and the commit aa48e47e ("memcg: infrastructure to flush memcg
      stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus *
      32) at worst, and for an unbounded amount of time.  The commit aa48e47e
      moved the lruvec stats to the rstat infrastructure and the commit
      7e1c0d6f bounded the error for all the lruvec stats to (nr_cpus *
      32) at worst, for at most 2 seconds.  More specifically, it decoupled
      the number of stats and the number of cgroups from the error rate.
      
      However, this reduction in error comes at the cost of triggering the
      slowpath of stats update more frequently.  Previously, in the slowpath
      the kernel added the stats up the memcg tree.  After aa48e47e, the
      kernel triggers the async lruvec stats flush through queue_work().  This
      caused regression reports from the 0day kernel bot [1] as well as from
      the phoronix test suite [2].
      
      We tried two options to fix the regression:
      
       1) Increase the threshold to trigger the slowpath in lruvec stats
          update codepath from 32 to 512.
      
       2) Remove the slowpath from the lruvec stats update codepath and
          instead flush the stats in the page refault codepath. The assumption
          is that the kernel flushes the stats in a timely manner, so the
          update tree in the refault codepath would be small and not cause a
          performance impact.
      
      Following are the results of will-it-scale/page_fault[1|2|3] benchmark
      on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
      aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1
      (4) 5.15-rc1 with option-2.
      
        test       (1)      (2)               (3)               (4)
        pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
        pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
        pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
      
      From the above result, it seems like the option-2 not only solves the
      regression but also improves the performance for at least these
      benchmarks.
      
      Feng Tang (Intel) ran the aim7 benchmark with these two options and
      confirmed that option-1 reduces the regression but option-2 removes the
      regression.
      
      Michael Larabel (phoronix) ran multiple benchmarks with these options
      and reported the results at [3] and it shows for most benchmarks
      option-2 removes the regression introduced by the commit aa48e47e
      ("memcg: infrastructure to flush memcg stats").
      
      Based on the experiment results, this patch proposes option-2 as the
      solution to resolve the regression.
      
      Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
      Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
      Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
      Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: Michael Larabel <Michael@phoronix.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. Sep 08, 2021
  25. Jun 30, 2021
  26. Jun 29, 2021
  27. May 05, 2021
  28. Feb 24, 2021
  29. Dec 15, 2020
    • mm/lru: move lock into lru_note_cost · 75cc3c91
      Alex Shi authored
      We have to move lru_lock into lru_note_cost, since it cycles up the
      memcg tree, in preparation for the future per-lruvec lru_lock
      replacement.  It's a bit ugly and may cost a bit more locking, but the
      benefit from per-memcg locking could cover the loss.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-11-git-send-email-alex.shi@linux.alibaba.com
      
      
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>