  1. Jul 05, 2024
    • Brian Foster's avatar
      vfs: don't mod negative dentry count when on shrinker list · aabfe57e
      Brian Foster authored
      The nr_dentry_negative counter is intended to only account negative
      dentries that are present on the superblock LRU. Therefore, the LRU
      add, remove and isolate helpers modify the counter based on whether
      the dentry is negative, but the shrinker list related helpers do not
      modify the counter, and the paths that change a dentry between
      positive and negative only do so if DCACHE_LRU_LIST is set.
      
      The problem with this is that a dentry on a shrinker list still has
      DCACHE_LRU_LIST set to indicate ->d_lru is in use. The additional
      DCACHE_SHRINK_LIST flag denotes whether the dentry is on LRU or a
      shrink related list. Therefore if a relevant operation (i.e. unlink)
      occurs while a dentry is present on a shrinker list, and the
      associated codepath only checks for DCACHE_LRU_LIST, then it is
      technically possible to modify the negative dentry count for a
      dentry that is off the LRU. Since the shrinker list related helpers
      do not modify the negative dentry count (because non-LRU dentries
      should not be included in the count) when the dentry is ultimately
      removed from the shrinker list, this can cause the negative dentry
      count to become permanently inaccurate.
      
      This problem can be reproduced via a heavy file create/unlink vs.
      drop_caches workload. On an 80xcpu system, I start 80 tasks each
      running a 1k file create/delete loop, and one task spinning on
      drop_caches. After 10 minutes or so of runtime, the idle/clean cache
      negative dentry count increases from somewhere in the range of 5-10
      entries to several hundred (and increasingly grows beyond
      nr_dentry_unused).
      
      Tweak the logic in the paths that turn a dentry negative or positive
      to filter out the case where the dentry is present on a shrink
      related list. This allows the above workload to maintain an accurate
      negative dentry count.
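
       [Editor's note: a rough sketch of the filtering described above, not
       the literal diff; the helper name d_on_lru_proper() is invented here
       purely for illustration.]

         /* Count a dentry only while it sits on the sb LRU proper,
          * not while it is parked on a shrink list. */
         static inline bool d_on_lru_proper(const struct dentry *dentry)
         {
                 return (dentry->d_flags &
                         (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST;
         }

         /* e.g. on a positive -> negative transition: */
         if (d_on_lru_proper(dentry))
                 this_cpu_inc(nr_dentry_negative);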
      
       Fixes: af0c9af1 ("fs/dcache: Track & report number of negative dentries")
       Signed-off-by: Brian Foster <bfoster@redhat.com>
       Link: https://lore.kernel.org/r/20240703121301.247680-1-bfoster@redhat.com
       Acked-by: Ian Kent <ikent@redhat.com>
       Reviewed-by: Josef Bacik <josef@toxicpanda.com>
       Reviewed-by: Waiman Long <longman@redhat.com>
       Signed-off-by: Christian Brauner <brauner@kernel.org>
      aabfe57e
  2. Jul 02, 2024
    • Christian Brauner's avatar
      fs: better handle deep ancestor chains in is_subdir() · 391b59b0
      Christian Brauner authored
      
      Jan reported that 'cd ..' may take a long time in deep directory
      hierarchies under a bind-mount. If concurrent renames happen it is
      possible to livelock in is_subdir() because it will keep retrying.
      
       Change is_subdir() from simply retrying over and over to retrying once
       and then acquiring the rename lock, to handle deep ancestor chains
       better. The list of alternatives to this approach was less than
       pleasant. Change the scope of the rcu lock to cover the whole walk
       while at it.
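
       [Editor's note: the resulting shape is roughly the following sketch;
       treat it as an approximation of the committed code rather than a
       verbatim copy.]

         bool is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
         {
                 bool subdir;
                 unsigned seq;

                 if (new_dentry == old_dentry)
                         return true;

                 /* the whole walk is now under rcu */
                 rcu_read_lock();
                 seq = read_seqbegin(&rename_lock);
                 subdir = !!d_ancestor(old_dentry, new_dentry);
                 if (read_seqretry(&rename_lock, seq)) {
                         /* a rename raced with us: retry once under the
                          * rename lock instead of spinning indefinitely */
                         read_seqlock_excl(&rename_lock);
                         subdir = !!d_ancestor(old_dentry, new_dentry);
                         read_sequnlock_excl(&rename_lock);
                 }
                 rcu_read_unlock();
                 return subdir;
         }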
      
      A big thanks to Jan and Linus. Both Jan and Linus had proposed
      effectively the same thing just that one version ended up being slightly
      more elegant.
      
       Reported-by: Jan Kara <jack@suse.cz>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Christian Brauner <brauner@kernel.org>
      391b59b0
  3. May 29, 2024
    • Linus Torvalds's avatar
      Revert "vfs: Delete the associated dentry when deleting a file" · 4a4be1ad
      Linus Torvalds authored
      This reverts commit 681ce862.
      
      We gave it a try, but it turns out the kernel test robot did in fact
      find performance regressions for it, so we'll have to look at the more
      involved alternative fixes for Yafang Shao's Elasticsearch load issue.
      
      There were several alternatives discussed, they just weren't as simple
      as this first attempt.
      
      The report is of a -7.4% regression of filebench.sum_operations/s, which
      appears significant enough to trigger my "this patch may get reverted if
      somebody finds a performance regression on some other load" rule.
      
      So it's still the case that we should end up deleting dentries more
      aggressively - or just be better at pruning them later - but it needs a
      bit more finesse than this simple thing.
      
      Link: https://lore.kernel.org/all/202405291318.4dfbb352-oliver.sang@intel.com/
      
      
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a4be1ad
  4. May 22, 2024
    • Yafang Shao's avatar
      vfs: Delete the associated dentry when deleting a file · 681ce862
      Yafang Shao authored
      
      Our applications, built on Elasticsearch[0], frequently create and
      delete files.  These applications operate within containers, some with a
      memory limit exceeding 100GB.  Over prolonged periods, the accumulation
      of negative dentries within these containers can amount to tens of
      gigabytes.
      
      Upon container exit, directories are deleted.  However, due to the
      numerous associated dentries, this process can be time-consuming.  Our
      users have expressed frustration with this prolonged exit duration,
      which constitutes our first issue.
      
      Simultaneously, other processes may attempt to access the parent
      directory of the Elasticsearch directories.  Since the task responsible
      for deleting the dentries holds the inode lock, processes attempting
      directory lookup experience significant delays.  This issue, our second
      problem, is easily demonstrated:
      
        - Task 1 generates negative dentries:
        $ pwd
        ~/test
        $ mkdir es && cd es/ && ./create_and_delete_files.sh
      
        [ After generating tens of GB dentries ]
      
        $ cd ~/test && rm -rf es
      
        [ It will take a long duration to finish ]
      
        - Task 2 attempts to lookup the 'test/' directory
        $ pwd
        ~/test
        $ ls
      
        The 'ls' command in Task 2 experiences prolonged execution as Task 1
        is deleting the dentries.
      
      We've devised a solution to address both issues by deleting associated
      dentry when removing a file.  Interestingly, we've noted that a similar
      patch was proposed years ago[1], although it was rejected citing the
      absence of tangible issues caused by negative dentries.  Given our
      current challenges, we're resubmitting the proposal.  All relevant
      stakeholders from previous discussions have been included for reference.
      
      Some alternative solutions are also under discussion[2][3], such as
      shrinking child dentries outside of the parent inode lock or even
      asynchronously shrinking child dentries.  However, given the
      straightforward nature of the current solution, I believe this approach
      is still necessary.
      
      [ NOTE! This is a pretty fundamental change in how we deal with
        unlinking dentries, and it doesn't change the fact that you can have
        lots of negative dentries from just doing negative lookups.
      
        But the kernel test robot is at least initially happy with this from a
        performance angle, so I'm applying this ASAP just to get more testing
        and as a "known fix for an issue people hit in real life".
      
        Put another way: we should still look at the alternatives, and this
        patch may get reverted if somebody finds a performance regression on
        some other load.       - Linus ]
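
       [Editor's note: a hypothetical fragment showing the effect being
       described, not the applied diff; the point is that the unlinked
       dentry is unhashed instead of being left behind as a negative
       dentry, so the final dput() frees it.]

         spin_lock(&dentry->d_lock);
         __d_drop(dentry);               /* unhash: nothing negative remains */
         spin_unlock(&dentry->d_lock);
         dput(dentry);                   /* last reference gone, dentry freed */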
      
       Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
       Link: https://github.com/elastic/elasticsearch [0]
       Link: https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com [1]
       Link: https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/ [2]
       Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/ [3]
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Wangkai <wangkai86@huawei.com>
      Cc: Colin Walters <walters@verbum.org>
       Tested-by: kernel test robot <oliver.sang@intel.com>
       Link: https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      681ce862
  5. Apr 09, 2024
  6. Feb 27, 2024
  7. Feb 09, 2024
  8. Dec 28, 2023
  9. Dec 12, 2023
    • Nhat Pham's avatar
      list_lru: allow explicit memcg and NUMA node selection · 0a97c01c
      Nhat Pham authored
      Patch series "workload-specific and memory pressure-driven zswap
      writeback", v8.
      
      There are currently several issues with zswap writeback:
      
      1. There is only a single global LRU for zswap, making it impossible to
          perform workload-specific shrinking - a memcg under memory pressure
         cannot determine which pages in the pool it owns, and often ends up
         writing pages from other memcgs. This issue has been previously
         observed in practice and mitigated by simply disabling
         memcg-initiated shrinking:
      
         https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
      
         But this solution leaves a lot to be desired, as we still do not
          have an avenue for a memcg to free up its own memory locked up in
         the zswap pool.
      
      2. We only shrink the zswap pool when the user-defined limit is hit.
         This means that if we set the limit too high, cold data that are
         unlikely to be used again will reside in the pool, wasting precious
         memory. It is hard to predict how much zswap space will be needed
         ahead of time, as this depends on the workload (specifically, on
         factors such as memory access patterns and compressibility of the
         memory pages).
      
      This patch series solves these issues by separating the global zswap LRU
       into per-memcg and per-NUMA LRUs, and performing workload-specific
       (i.e. memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
      shrinker does not have any parameter that must be tuned by the user, and
      can be opted in or out on a per-memcg basis.
      
      As a proof of concept, we ran the following synthetic benchmark: build the
      linux kernel in a memory-limited cgroup, and allocate some cold data in
       tmpfs to see if the shrinker could write them out and improve the overall
      performance.  Depending on the amount of cold data generated, we observe
      from 14% to 35% reduction in kernel CPU time used in the kernel builds.
      
      
      This patch (of 6):
      
       The interface of list_lru is based on the assumption that the list node
       and the data it represents belong to the same object, allocated on the
       correct node/memcg.  While this assumption is valid for existing slab
       object LRUs such as dentries and inodes, it is undocumented, and rather
       inflexible for
      certain potential list_lru users (such as the upcoming zswap shrinker and
      the THP shrinker).  It has caused us a lot of issues during our
      development.
      
       This patch changes the list_lru interface so that the caller must
       explicitly specify the NUMA node and memcg when adding and removing
       objects.  The old
      list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
      list_lru_del_obj(), respectively.
      
      It also extends the list_lru API with a new function, list_lru_putback,
      which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
      does not increment the LRU node count (as list_lru_isolate does not
      decrement the node count).  list_lru_putback also allows for explicit
      memcg and NUMA node selection.
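
       [Editor's note: a sketch of the resulting call shapes; 'lru', 'entry',
       'nid' and 'memcg' are placeholder locals, and the exact prototypes
       should be taken from include/linux/list_lru.h rather than from this
       note.]

         /* Old behaviour, now an explicit "_obj" wrapper: node and memcg
          * are derived from the object's own allocation. */
         list_lru_add_obj(&dentry->d_sb->s_dentry_lru, &dentry->d_lru);

         /* New interface: the caller names the NUMA node and memcg list. */
         list_lru_add(lru, &entry->lru, nid, memcg);

         /* Undo a prior list_lru_isolate() without bumping the node count. */
         list_lru_putback(lru, &entry->lru, nid, memcg);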
      
      Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
      
      
       Signed-off-by: Nhat Pham <nphamcs@gmail.com>
       Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0a97c01c
  10. Nov 25, 2023
    • Vegard Nossum's avatar
      dcache: remove unnecessary NULL check in dget_dlock() · 1b6ae9f6
      Vegard Nossum authored
      dget_dlock() requires dentry->d_lock to be held when called, yet
      contains a NULL check for dentry.
      
      An audit of all calls to dget_dlock() shows that it is never called
      with a NULL pointer (as spin_lock()/spin_unlock() would crash in these
      cases):
      
        $ git grep -W '\<dget_dlock\>'
      
        arch/powerpc/platforms/cell/spufs/inode.c-              spin_lock(&dentry->d_lock);
        arch/powerpc/platforms/cell/spufs/inode.c-              if (simple_positive(dentry)) {
        arch/powerpc/platforms/cell/spufs/inode.c:                      dget_dlock(dentry);
      
        fs/autofs/expire.c-             spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
        fs/autofs/expire.c-             if (simple_positive(child)) {
        fs/autofs/expire.c:                     dget_dlock(child);
      
        fs/autofs/root.c:                       dget_dlock(active);
        fs/autofs/root.c-                       spin_unlock(&active->d_lock);
      
        fs/autofs/root.c:                       dget_dlock(expiring);
        fs/autofs/root.c-                       spin_unlock(&expiring->d_lock);
      
        fs/ceph/dir.c-          if (!spin_trylock(&dentry->d_lock))
        fs/ceph/dir.c-                  continue;
        [...]
        fs/ceph/dir.c:                          dget_dlock(dentry);
      
        fs/ceph/mds_client.c-           spin_lock(&alias->d_lock);
        [...]
        fs/ceph/mds_client.c:                   dn = dget_dlock(alias);
      
        fs/configfs/inode.c-            spin_lock(&dentry->d_lock);
        fs/configfs/inode.c-            if (simple_positive(dentry)) {
        fs/configfs/inode.c:                    dget_dlock(dentry);
      
        fs/libfs.c:                             found = dget_dlock(d);
        fs/libfs.c-                     spin_unlock(&d->d_lock);
      
        fs/libfs.c:             found = dget_dlock(child);
        fs/libfs.c-     spin_unlock(&child->d_lock);
      
        fs/libfs.c:                             child = dget_dlock(d);
        fs/libfs.c-                     spin_unlock(&d->d_lock);
      
        fs/ocfs2/dcache.c:                      dget_dlock(dentry);
        fs/ocfs2/dcache.c-                      spin_unlock(&dentry->d_lock);
      
        include/linux/dcache.h:static inline struct dentry *dget_dlock(struct dentry *dentry)
      
      After taking out the NULL check, dget_dlock() becomes almost identical
      to __dget_dlock(); the only difference is that dget_dlock() returns the
      dentry that was passed in. These are static inline helpers, so we can
      rely on the compiler to discard unused return values. We can therefore
      also remove __dget_dlock() and replace calls to it by dget_dlock().
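
       [Editor's note: the surviving helper is, roughly, the following
       sketch.]

         /* Caller must hold dentry->d_lock. */
         static inline struct dentry *dget_dlock(struct dentry *dentry)
         {
                 dentry->d_lockref.count++;
                 return dentry;
         }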
      
      Also fix up and improve the kerneldoc comments while we're at it.
      
      Al Viro pointed out that we can also clean up some of the callers to
      make use of the returned value and provided a bit more info for the
      kerneldoc.
      
      While preparing v2 I also noticed that the tabs used in the kerneldoc
      comments were causing the kerneldoc to get parsed incorrectly so I also
      fixed this up (including for d_unhashed, which is otherwise unrelated).
      
      Testing: x86 defconfig build + boot; make htmldocs for the kerneldoc
      warning. objdump shows there are code generation changes.
      
      Link: https://lore.kernel.org/all/20231022164520.915013-1-vegard.nossum@oracle.com/
      
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: linux-doc@vger.kernel.org
       Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1b6ae9f6
    • Al Viro's avatar
      kill DCACHE_MAY_FREE · 1b327b5a
      Al Viro authored
      
      With the new ordering in __dentry_kill() it has become redundant -
      it's set if and only if both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST
      are set.
      
      We set it in __dentry_kill(), after having set DCACHE_DENTRY_KILLED
      with the only condition being that DCACHE_SHRINK_LIST is there;
      all of that is done without dropping ->d_lock and the only place
      that checks that flag (shrink_dentry_list()) does so under ->d_lock,
      after having found the victim on its shrink list.  Since DCACHE_SHRINK_LIST
      is set only when placing dentry into shrink list and removed only by
      shrink_dentry_list() itself, a check for DCACHE_DENTRY_KILLED in
      there would be equivalent to check for DCACHE_MAY_FREE.
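
       [Editor's note: in shrink_dentry_list() terms the equivalence above
       means, roughly, the following; sketch only.]

         /* dentry found on our shrink list, checked under ->d_lock */
         bool can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
         /* ...which is exactly what DCACHE_MAY_FREE used to answer */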
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1b327b5a
    • Al Viro's avatar
      __d_unalias() doesn't use inode argument · ef69f050
      Al Viro authored
      
      ... and hasn't since 2015.
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ef69f050
    • Al Viro's avatar
      d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant · f9f677c5
      Al Viro authored
      
      We only search in the damn thing under hlist_bl_lock(); RCU variant
      of insertion was, IIRC, pretty much cargo-culted - mea culpa...
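
       [Editor's note: a sketch of the simplification; 'b' stands for the
       in-lookup hash bucket and the field name is quoted from memory, so
       check the tree before relying on it.]

         hlist_bl_lock(b);
         /* lookups take the same lock, so the plain variant is enough */
         hlist_bl_add_head(&dentry->d_u.d_in_lookup_hash, b);
         hlist_bl_unlock(b);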
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f9f677c5
    • Al Viro's avatar
      get rid of DCACHE_GENOCIDE · 57851607
      Al Viro authored
      
      ... now that we never call d_genocide() other than from kill_litter_super()
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      57851607
    • Al Viro's avatar
       kill d_instantiate_anon(), fold __d_instantiate_anon() into remaining caller · f2824db1
      Al Viro authored
      
      now that the only user of d_instantiate_anon() is gone...
      [braino fix folded - kudos to Dan Carpenter]
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f2824db1
    • Al Viro's avatar
      retain_dentry(): introduce a trimmed-down lockless variant · 6367b491
      Al Viro authored
      
      	fast_dput() contains a small piece of code, preceded by scary
      comments about 5 times longer than it.	What is actually done there is
      a trimmed-down subset of retain_dentry() - in some situations we can
      tell that retain_dentry() would have returned true without ever needing
      ->d_lock and that's what that code checks.  If these checks come true
      fast_dput() can declare that we are done, without bothering with ->d_lock;
      otherwise it has to take the lock and do full variant of retain_dentry()
      checks.
      
      	Trimmed-down variant of the checks is hard to follow and
      it's asking for trouble - if we ever decide to change the rules in
      retain_dentry(), we'll have to remember to update that code.  It turns
      out that an equivalent variant of these checks more obviously parallel
      to retain_dentry() is not just possible, but easy to unify with
      retain_dentry() itself, passing it a new boolean argument ('locked')
      to distinguish between the full semantics and trimmed down one.
      
      	Note that in lockless case true is returned only when locked
      variant would have returned true without ever needing the lock; false
      means "punt to the locking path and recheck there".
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6367b491
    • Al Viro's avatar
      __dentry_kill(): new locking scheme · 1c18edd1
      Al Viro authored
      
      Currently we enter __dentry_kill() with parent (along with the victim
      dentry and victim's inode) held locked.  Then we
      	mark dentry refcount as dead
      	call ->d_prune()
      	remove dentry from hash
      	remove it from the parent's list of children
      	unlock the parent, don't need it from that point on
      	detach dentry from inode, unlock dentry and drop the inode
      (via ->d_iput())
      	call ->d_release()
      	regain the lock on dentry
      	check if it's on a shrink list (in which case freeing its empty husk
      has to be left to shrink_dentry_list()) or not (in which case we can free it
      ourselves).  In the former case, mark it as an empty husk, so that
      shrink_dentry_list() would know it can free the sucker.
      	drop the lock on dentry
      ... and usually the caller proceeds to drop a reference on the parent,
      possibly retaking the lock on it.
      
      That is painful for a bunch of reasons, starting with the need to take locks
      out of order, but not limited to that - the parent of positive dentry can
      change if we drop its ->d_lock, so getting these locks has to be done with
      care.  Moreover, as soon as dentry is out of the parent's list of children,
      shrink_dcache_for_umount() won't see it anymore, making it appear as if
      the parent is inexplicably busy.  We do work around that by having
      shrink_dentry_list() decrement the parent's refcount first and put it on
      shrink list to be evicted once we are done with __dentry_kill() of child,
      but that may in some cases lead to ->d_iput() on child called after the
      parent got killed.  That doesn't happen in cases where in-tree ->d_iput()
      instances might want to look at the parent, but that's brittle as hell.
      
      Solution: do removal from the parent's list of children in the very
      end of __dentry_kill().  As the result, the callers do not need to
      lock the parent and by the time we really need the parent locked,
      dentry is negative and is guaranteed not to be moved around.
      
      It does mean that ->d_prune() will be called with parent not locked.
      It also means that we might see dentries in process of being torn
      down while going through the parent's list of children; those dentries
      will be unhashed, negative and with refcount marked dead.  In practice,
      that's enough for in-tree code that looks through the list of children
      to do the right thing as-is.  Out-of-tree code might need to be adjusted.
      
      Calling conventions: __dentry_kill(dentry) is called with dentry->d_lock
      held, along with ->i_lock of its inode (if any).  It either returns
      the parent (locked, with refcount decremented to 0) or NULL (if there'd
      been no parent or if refcount decrement for parent hadn't reached 0).
      
      lock_for_kill() is adjusted for new requirements - it doesn't touch
      the parent's ->d_lock at all.
      
      Callers adjusted.  Note that for dput() we don't need to bother with
      fast_dput() for the parent - we just need to check retain_dentry()
      for it, since its ->d_lock is still held since the moment when
      __dentry_kill() had taken it to remove the victim from the list of
      children.
      
      The kludge with early decrement of parent's refcount in
      shrink_dentry_list() is no longer needed - shrink_dcache_for_umount()
      sees the half-killed dentries in the list of children for as long
      as they are pinning the parent.  They are easily recognized and
      accounted for by select_collect(), so we know we are not done yet.
      
      As the result, we always have the expected ordering for ->d_iput()/->d_release()
      vs. __dentry_kill() of the parent, no exceptions.  Moreover, the current
      rules for shrink lists (one must make sure that shrink_dcache_for_umount()
      won't happen while any dentries from the superblock in question are on
      any shrink lists) are gone - shrink_dcache_for_umount() will do the
      right thing in all cases, taking such dentries out.  Their empty
      husks (memory occupied by struct dentry itself + its external name,
      if any) will remain on the shrink lists, but they are no obstacles
      to filesystem shutdown.  And such husks will get freed as soon as
      shrink_dentry_list() of the list they are on gets to them.
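
       [Editor's note: the calling convention above, seen from a caller;
       sketch only.]

         struct dentry *parent;

         /* dentry->d_lock and its inode's ->i_lock are held at this point */
         parent = __dentry_kill(dentry);
         if (!parent)
                 return;         /* no parent, or parent still has references */
         /* parent comes back locked with its refcount already at zero; the
          * caller decides whether to kill it in turn or unlock and keep it */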
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1c18edd1
    • Al Viro's avatar
      d_prune_aliases(): use a shrink list · b4cc0734
      Al Viro authored
      
      Instead of dropping aliases one by one, restarting, etc., just
      collect them into a shrink list and kill them off in one pass.
      
      We don't really need the restarts - one alias can't pin another
      (directory has only one alias, and couldn't be its own ancestor
      anyway), so collecting everything that is not busy and taking it
      out would take care of everything evictable that had been there
      as we entered the function.  And new aliases added while we'd
      been dropping old ones could just as easily have appeared right
      as we return to caller...
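
       [Editor's note: a sketch of the one-pass scheme described above, using
       the to_shrink_list() helper from the neighbouring commits; an
       approximation, not a verbatim copy of the patch.]

         void d_prune_aliases(struct inode *inode)
         {
                 LIST_HEAD(dispose);
                 struct dentry *dentry;

                 spin_lock(&inode->i_lock);
                 hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
                         spin_lock(&dentry->d_lock);
                         if (!dentry->d_lockref.count)
                                 to_shrink_list(dentry, &dispose);
                         spin_unlock(&dentry->d_lock);
                 }
                 spin_unlock(&inode->i_lock);
                 shrink_dentry_list(&dispose);
         }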
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b4cc0734
    • Al Viro's avatar
      to_shrink_list(): call only if refcount is 0 · c2e5e29f
      Al Viro authored
      
      The only thing it does if refcount is not zero is d_lru_del(); no
      point, IMO, seeing that plain dput() does nothing of that sort...
      
      Note that 2 of 3 current callers are guaranteed that refcount is 0.
      
       Acked-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c2e5e29f
    • Al Viro's avatar
      fold dentry_kill() into dput() · 5e7a5c8d
      Al Viro authored
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5e7a5c8d
    • Al Viro's avatar
      don't try to cut corners in shrink_lock_dentry() · 339e9e13
      Al Viro authored
      
      That is to say, do *not* treat the ->d_inode or ->d_parent changes
      as "it's hard, return false; somebody must have grabbed it, so
      even if has zero refcount, we don't need to bother killing it -
      final dput() from whoever grabbed it would've done everything".
      
      First of all, that is not guaranteed.  It might have been dropped
      by shrink_kill() handling of victim's parent, which would've found
      it already on a shrink list (ours) and decided that they don't need
      to put it on their shrink list.
      
      What's more, dentry_kill() is doing pretty much the same thing,
      cutting its own set of corners (it assumes that dentry can't
      go from positive to negative, so its inode can change but only once
      and only in one direction).
      
       Doing that right allows us to get rid of that not-quite-duplication
      and removes the only reason for re-incrementing refcount before
      the call of dentry_kill().
      
      Replacement is called lock_for_kill(); called under rcu_read_lock
      and with ->d_lock held.  If it returns false, dentry has non-zero
      refcount and the same locks are held.  If it returns true,
      dentry has zero refcount and all locks required by __dentry_kill()
      are taken.
      
      Part of __lock_parent() had been lifted into lock_parent() to
      allow its reuse.  Now it's called with rcu_read_lock already
      held and dentry already unlocked.
      
      Note that this is not the final change - locking requirements for
      __dentry_kill() are going to change later in the series and the
      set of locks taken by lock_for_kill() will be adjusted.  Both
      lock_parent() and __lock_parent() will be gone once that happens.
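
       [Editor's note: the stated contract, from the caller's side; sketch
       only.]

         rcu_read_lock();
         spin_lock(&dentry->d_lock);
         if (!lock_for_kill(dentry)) {
                 /* refcount turned out non-zero; the same locks are still held */
                 spin_unlock(&dentry->d_lock);
                 rcu_read_unlock();
                 return;
         }
         /* refcount is zero and every lock __dentry_kill() needs is held */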
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      339e9e13
    • Al Viro's avatar
      fold the call of retain_dentry() into fast_dput() · f05441c7
      Al Viro authored
      
      Calls of retain_dentry() happen immediately after getting false
      from fast_dput() and getting true from retain_dentry() is
      treated the same way as non-zero refcount would be treated by
      fast_dput() - unlock dentry and bugger off.
      
      Doing that in fast_dput() itself is simpler.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f05441c7
    • Al Viro's avatar
      Call retain_dentry() with refcount 0 · 2f42f1eb
      Al Viro authored
      
      Instead of bumping it from 0 to 1, calling retain_dentry(), then
      decrementing it back to 0 (with ->d_lock held all the way through),
      just leave refcount at 0 through all of that.
      
      It will have a visible effect for ->d_delete() - now it can be
      called with refcount 0 instead of 1 and it can no longer play
      silly buggers with dropping/regaining ->d_lock.  Not that any
      in-tree instances tried to (it's pretty hard to get right).
      
      Any out-of-tree ones will have to adjust (assuming they need any
      changes).
      
      Note that we do not need to extend rcu-critical area here - we have
      verified that refcount is non-negative after having grabbed ->d_lock,
      so nobody will be able to free dentry until they get into __dentry_kill(),
      which won't happen until they manage to grab ->d_lock.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2f42f1eb
    • Al Viro's avatar
      dentry_kill(): don't bother with retain_dentry() on slow path · b06c684d
      Al Viro authored
      
      We have already checked it and dentry used to look not worthy
      of keeping.  The only hard obstacle to evicting dentry is
      non-zero refcount; everything else is advisory - e.g. memory
      pressure could evict any dentry found with refcount zero.
      On the slow path in dentry_kill() we had dropped and regained
      ->d_lock; we must recheck the refcount, but everything else
      is not worth bothering with.
      
       Note that a filesystem cannot count upon ->d_delete() being
       called for a dentry - not even once.  Again, memory pressure
      (as well as d_prune_aliases(), or attempted rmdir() of ancestor,
      or...) will not call ->d_delete() at all.
      
      So from the correctness point of view we are fine doing the
      check only once.  And it makes things simpler down the road.
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b06c684d
    • Al Viro's avatar
      __dentry_kill(): get consistent rules for victim's refcount · ee0c8250
      Al Viro authored
      
      Currently we call it with refcount equal to 1 when called from
      dentry_kill(); all other callers have it equal to 0.
      
      Make it always be called with zero refcount; on this step we
      just decrement it before the calls in dentry_kill().  That is
      safe, since all places that care about the value of refcount
      either do that under ->d_lock or hold a reference to dentry
      in question.  Either is sufficient to prevent observing a
      dentry immediately prior to __dentry_kill() getting called
      from dentry_kill().
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ee0c8250
    • Al Viro's avatar
      make retain_dentry() neutral with respect to refcounting · e9d130d0
      Al Viro authored
      
      retain_dentry() used to decrement refcount if and only if it returned
      true.  Lift those decrements into the callers.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e9d130d0
    • Al Viro's avatar
      __dput_to_list(): do decrement of refcount in the callers · 6511f6be
      Al Viro authored
      
       ... and rename it to to_shrink_list(), seeing that it no longer
       drops any references.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6511f6be
    • Al Viro's avatar
      fast_dput(): new rules for refcount · 15f23734
      Al Viro authored
      
      By now there is only one place in entire fast_dput() where we return
      false; that happens after refcount had been decremented and found (under
      ->d_lock) to be zero.  In that case, just prior to returning false to
      caller, fast_dput() forcibly changes the refcount from 0 to 1.
      
      Lift that resetting refcount to 1 into the callers; later in the series
      it will be massaged out of existence.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      15f23734
    • Al Viro's avatar
      fast_dput(): handle underflows gracefully · 504e08ce
      Al Viro authored
      
      If refcount is less than 1, we should just warn, unlock dentry and
      return true, so that the caller doesn't try to do anything else.
      
      Taking care of that leaves the rest of "lockref_put_return() has
      failed" case equivalent to "decrement refcount and rejoin the
      normal slow path after the point where we grab ->d_lock".
      
      NOTE: lockref_put_return() is strictly a fastpath thing - unlike
      the rest of lockref primitives, it does not contain a fallback.
      Caller (and it looks like fast_dput() is the only legitimate one
      in the entire kernel) has to do that itself.  Reasons for
      lockref_put_return() failures:
      	* ->d_lock held by somebody
      	* refcount <= 0
      	* ... or an architecture not supporting lockref use of
      cmpxchg - sparc, anything non-SMP, config with spinlock debugging...
      
      We could add a fallback, but it would be a clumsy API - we'd have
      to distinguish between:
      	(1) refcount > 1 - decremented, lock not held on return
      	(2) refcount < 1 - left alone, probably no sense to hold the lock
       	(3) refcount is 1, no cmpxchg - decremented, lock held on return
       	(4) refcount is 1, cmpxchg supported - decremented, lock *NOT* held
      	    on return.
      We want to return with no lock held in case (4); that's the whole point of that
      thing.  We very much do not want to have the fallback in case (3) return without
      a lock, since the caller might have to retake it in that case.
      So it wouldn't be more convenient than doing the fallback in the caller and
      it would be very easy to screw up, especially since the test coverage would
      suck - no way to test (3) and (4) on the same kernel build.
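
       [Editor's note: the fallback described above ends up looking roughly
       like this inside fast_dput(); a sketch, not a verbatim copy.]

         int ret = lockref_put_return(&dentry->d_lockref);
         if (unlikely(ret < 0)) {                /* fastpath refused */
                 spin_lock(&dentry->d_lock);
                 if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
                         spin_unlock(&dentry->d_lock);
                         return true;            /* underflow: warn and bail */
                 }
                 /* rejoin the normal slow path past the ->d_lock acquisition */
                 dentry->d_lockref.count--;
         } else if (likely(ret)) {
                 return true;                    /* refcount still positive */
         }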
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      504e08ce
    • Al Viro's avatar
      fast_dput(): having ->d_delete() is not reason to delay refcount decrement · 15220fbf
      Al Viro authored
      
      ->d_delete() is a way for filesystem to tell that dentry is not worth
      keeping cached.  It is not guaranteed to be called every time a dentry
      has refcount drop down to zero; it is not guaranteed to be called before
      dentry gets evicted.  In other words, it is not suitable for any kind
      of keeping track of dentry state.
      
      None of the in-tree filesystems attempt to use it that way, fortunately.
      
      So the contortions done by fast_dput() (as well as dentry_kill()) are
      not warranted.  fast_dput() certainly should treat having ->d_delete()
      instance as "can't assume we'll be keeping it", but that's not different
       from the way we treat e.g. DCACHE_DONTCACHE (which is rather similar
       to making ->d_delete() return true when called).
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      15220fbf
    • Al Viro's avatar
      shrink_dentry_list(): no need to check that dentry refcount is marked dead · cd9f84f3
      Al Viro authored
      
      ... we won't see DCACHE_MAY_FREE on anything that is *not* dead
      and checking d_flags is just as cheap as checking refcount.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cd9f84f3
    • Al Viro's avatar
      centralize killing dentry from shrink list · 3fcf5356
      Al Viro authored
      
      new helper unifying identical bits of shrink_dentry_list() and
       shrink_dcache_for_umount()
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3fcf5356
    • Al Viro's avatar
      dentry: switch the lists of children to hlist · da549bdd
      Al Viro authored
      
       Saves a pointer per struct dentry and actually makes things less
       clumsy.  Cleaned up d_walk() and dcache_readdir() a bit by use
       of hlist_for_... iterators.
      
      A couple of new helpers - d_first_child() and d_next_sibling(),
      to make the expressions less awful.
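
       [Editor's note: a sketch of walking a parent's children after the
       switch; the d_children/d_sib field names are quoted from memory, so
       verify them against the patch.]

         struct dentry *child;

         hlist_for_each_entry(child, &parent->d_children, d_sib) {
                 /* one pointer less per dentry than the old list_head pair */
         }

         /* the new convenience helpers mentioned above: */
         child = d_first_child(parent);
         child = d_next_sibling(child);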
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      da549bdd
  11. Nov 18, 2023
  12. Oct 18, 2023
  13. Sep 11, 2023
  14. Aug 19, 2023