  1. Jul 05, 2024
    • Brian Foster's avatar
      vfs: don't mod negative dentry count when on shrinker list · aabfe57e
      Brian Foster authored
      The nr_dentry_negative counter is intended to only account negative
      dentries that are present on the superblock LRU. Therefore, the LRU
      add, remove and isolate helpers modify the counter based on whether
      the dentry is negative, but the shrinker list related helpers do not
      modify the counter, and the paths that change a dentry between
      positive and negative only do so if DCACHE_LRU_LIST is set.
      
      The problem with this is that a dentry on a shrinker list still has
      DCACHE_LRU_LIST set to indicate ->d_lru is in use. The additional
      DCACHE_SHRINK_LIST flag denotes whether the dentry is on LRU or a
      shrink related list. Therefore if a relevant operation (i.e. unlink)
      occurs while a dentry is present on a shrinker list, and the
      associated codepath only checks for DCACHE_LRU_LIST, then it is
      technically possible to modify the negative dentry count for a
      dentry that is off the LRU. Since the shrinker list related helpers
      do not modify the negative dentry count (because non-LRU dentries
      should not be included in the count) when the dentry is ultimately
      removed from the shrinker list, this can cause the negative dentry
      count to become permanently inaccurate.
      
      This problem can be reproduced via a heavy file create/unlink vs.
      drop_caches workload. On an 80xcpu system, I start 80 tasks each
      running a 1k file create/delete loop, and one task spinning on
      drop_caches. After 10 minutes or so of runtime, the idle/clean cache
      negative dentry count increases from somewhere in the range of 5-10
      entries to several hundred (and increasingly grows beyond
      nr_dentry_unused).
      
      Tweak the logic in the paths that turn a dentry negative or positive
      to filter out the case where the dentry is present on a shrink
      related list. This allows the above workload to maintain an accurate
      negative dentry count.
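
       [Editor's note: a rough sketch of the filtering described above, not
       the literal diff; the helper name d_on_lru_proper() is invented here
       purely for illustration.]

         /* Count a dentry only while it sits on the sb LRU proper,
          * not while it is parked on a shrink list. */
         static inline bool d_on_lru_proper(const struct dentry *dentry)
         {
                 return (dentry->d_flags &
                         (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST;
         }

         /* e.g. on a positive -> negative transition: */
         if (d_on_lru_proper(dentry))
                 this_cpu_inc(nr_dentry_negative);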
      
       Fixes: af0c9af1 ("fs/dcache: Track & report number of negative dentries")
       Signed-off-by: Brian Foster <bfoster@redhat.com>
       Link: https://lore.kernel.org/r/20240703121301.247680-1-bfoster@redhat.com
       Acked-by: Ian Kent <ikent@redhat.com>
       Reviewed-by: Josef Bacik <josef@toxicpanda.com>
       Reviewed-by: Waiman Long <longman@redhat.com>
       Signed-off-by: Christian Brauner <brauner@kernel.org>
      aabfe57e
  2. Jul 02, 2024
    • Christian Brauner's avatar
      fs: better handle deep ancestor chains in is_subdir() · 391b59b0
      Christian Brauner authored
      
      Jan reported that 'cd ..' may take a long time in deep directory
      hierarchies under a bind-mount. If concurrent renames happen it is
      possible to livelock in is_subdir() because it will keep retrying.
      
       Change is_subdir() from simply retrying over and over to retrying once
       and then acquiring the rename lock, to handle deep ancestor chains
       better. The list of alternatives to this approach was less than
       pleasant. Change the scope of the rcu lock to cover the whole walk
       while at it.
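
       [Editor's note: the resulting shape is roughly the following sketch;
       treat it as an approximation of the committed code rather than a
       verbatim copy.]

         bool is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
         {
                 bool subdir;
                 unsigned seq;

                 if (new_dentry == old_dentry)
                         return true;

                 /* the whole walk is now under rcu */
                 rcu_read_lock();
                 seq = read_seqbegin(&rename_lock);
                 subdir = !!d_ancestor(old_dentry, new_dentry);
                 if (read_seqretry(&rename_lock, seq)) {
                         /* a rename raced with us: retry once under the
                          * rename lock instead of spinning indefinitely */
                         read_seqlock_excl(&rename_lock);
                         subdir = !!d_ancestor(old_dentry, new_dentry);
                         read_sequnlock_excl(&rename_lock);
                 }
                 rcu_read_unlock();
                 return subdir;
         }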
      
      A big thanks to Jan and Linus. Both Jan and Linus had proposed
      effectively the same thing just that one version ended up being slightly
      more elegant.
      
       Reported-by: Jan Kara <jack@suse.cz>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Christian Brauner <brauner@kernel.org>
      391b59b0
  3. May 29, 2024
    • Linus Torvalds's avatar
      Revert "vfs: Delete the associated dentry when deleting a file" · 4a4be1ad
      Linus Torvalds authored
      This reverts commit 681ce862.
      
      We gave it a try, but it turns out the kernel test robot did in fact
      find performance regressions for it, so we'll have to look at the more
      involved alternative fixes for Yafang Shao's Elasticsearch load issue.
      
      There were several alternatives discussed, they just weren't as simple
      as this first attempt.
      
      The report is of a -7.4% regression of filebench.sum_operations/s, which
      appears significant enough to trigger my "this patch may get reverted if
      somebody finds a performance regression on some other load" rule.
      
      So it's still the case that we should end up deleting dentries more
      aggressively - or just be better at pruning them later - but it needs a
      bit more finesse than this simple thing.
      
      Link: https://lore.kernel.org/all/202405291318.4dfbb352-oliver.sang@intel.com/
      
      
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a4be1ad
  4. May 22, 2024
    • Yafang Shao's avatar
      vfs: Delete the associated dentry when deleting a file · 681ce862
      Yafang Shao authored
      
      Our applications, built on Elasticsearch[0], frequently create and
      delete files.  These applications operate within containers, some with a
      memory limit exceeding 100GB.  Over prolonged periods, the accumulation
      of negative dentries within these containers can amount to tens of
      gigabytes.
      
      Upon container exit, directories are deleted.  However, due to the
      numerous associated dentries, this process can be time-consuming.  Our
      users have expressed frustration with this prolonged exit duration,
      which constitutes our first issue.
      
      Simultaneously, other processes may attempt to access the parent
      directory of the Elasticsearch directories.  Since the task responsible
      for deleting the dentries holds the inode lock, processes attempting
      directory lookup experience significant delays.  This issue, our second
      problem, is easily demonstrated:
      
        - Task 1 generates negative dentries:
        $ pwd
        ~/test
        $ mkdir es && cd es/ && ./create_and_delete_files.sh
      
        [ After generating tens of GB dentries ]
      
        $ cd ~/test && rm -rf es
      
        [ It will take a long duration to finish ]
      
        - Task 2 attempts to lookup the 'test/' directory
        $ pwd
        ~/test
        $ ls
      
        The 'ls' command in Task 2 experiences prolonged execution as Task 1
        is deleting the dentries.
      
      We've devised a solution to address both issues by deleting associated
      dentry when removing a file.  Interestingly, we've noted that a similar
      patch was proposed years ago[1], although it was rejected citing the
      absence of tangible issues caused by negative dentries.  Given our
      current challenges, we're resubmitting the proposal.  All relevant
      stakeholders from previous discussions have been included for reference.
      
      Some alternative solutions are also under discussion[2][3], such as
      shrinking child dentries outside of the parent inode lock or even
      asynchronously shrinking child dentries.  However, given the
      straightforward nature of the current solution, I believe this approach
      is still necessary.
      
      [ NOTE! This is a pretty fundamental change in how we deal with
        unlinking dentries, and it doesn't change the fact that you can have
        lots of negative dentries from just doing negative lookups.
      
        But the kernel test robot is at least initially happy with this from a
        performance angle, so I'm applying this ASAP just to get more testing
        and as a "known fix for an issue people hit in real life".
      
        Put another way: we should still look at the alternatives, and this
        patch may get reverted if somebody finds a performance regression on
        some other load.       - Linus ]
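
       [Editor's note: a hypothetical fragment showing the effect being
       described, not the applied diff; the point is that the unlinked
       dentry is unhashed instead of being left behind as a negative
       dentry, so the final dput() frees it.]

         spin_lock(&dentry->d_lock);
         __d_drop(dentry);               /* unhash: nothing negative remains */
         spin_unlock(&dentry->d_lock);
         dput(dentry);                   /* last reference gone, dentry freed */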
      
       Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
       Link: https://github.com/elastic/elasticsearch [0]
       Link: https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com [1]
       Link: https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/ [2]
       Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/ [3]
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Wangkai <wangkai86@huawei.com>
      Cc: Colin Walters <walters@verbum.org>
       Tested-by: kernel test robot <oliver.sang@intel.com>
       Link: https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      681ce862
  5. Apr 09, 2024
  6. Feb 27, 2024
  7. Feb 09, 2024
  8. Dec 28, 2023
  9. Dec 12, 2023
    • Nhat Pham's avatar
      list_lru: allow explicit memcg and NUMA node selection · 0a97c01c
      Nhat Pham authored
      Patch series "workload-specific and memory pressure-driven zswap
      writeback", v8.
      
      There are currently several issues with zswap writeback:
      
      1. There is only a single global LRU for zswap, making it impossible to
          perform workload-specific shrinking - a memcg under memory pressure
         cannot determine which pages in the pool it owns, and often ends up
         writing pages from other memcgs. This issue has been previously
         observed in practice and mitigated by simply disabling
         memcg-initiated shrinking:
      
         https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
      
         But this solution leaves a lot to be desired, as we still do not
          have an avenue for a memcg to free up its own memory locked up in
         the zswap pool.
      
      2. We only shrink the zswap pool when the user-defined limit is hit.
         This means that if we set the limit too high, cold data that are
         unlikely to be used again will reside in the pool, wasting precious
         memory. It is hard to predict how much zswap space will be needed
         ahead of time, as this depends on the workload (specifically, on
         factors such as memory access patterns and compressibility of the
         memory pages).
      
      This patch series solves these issues by separating the global zswap LRU
       into per-memcg and per-NUMA LRUs, and performing workload-specific
       (i.e. memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
      shrinker does not have any parameter that must be tuned by the user, and
      can be opted in or out on a per-memcg basis.
      
      As a proof of concept, we ran the following synthetic benchmark: build the
      linux kernel in a memory-limited cgroup, and allocate some cold data in
       tmpfs to see if the shrinker could write them out and improve the overall
      performance.  Depending on the amount of cold data generated, we observe
      from 14% to 35% reduction in kernel CPU time used in the kernel builds.
      
      
      This patch (of 6):
      
       The interface of list_lru is based on the assumption that the list node
       and the data it represents belong to the same object, allocated on the
       correct node/memcg.  While this assumption is valid for existing slab
       object LRUs such as dentries and inodes, it is undocumented, and rather
       inflexible for
      certain potential list_lru users (such as the upcoming zswap shrinker and
      the THP shrinker).  It has caused us a lot of issues during our
      development.
      
       This patch changes the list_lru interface so that the caller must
       explicitly specify the NUMA node and memcg when adding and removing
       objects.  The old
      list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
      list_lru_del_obj(), respectively.
      
      It also extends the list_lru API with a new function, list_lru_putback,
      which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
      does not increment the LRU node count (as list_lru_isolate does not
      decrement the node count).  list_lru_putback also allows for explicit
      memcg and NUMA node selection.
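
       [Editor's note: a sketch of the resulting call shapes; 'lru', 'entry',
       'nid' and 'memcg' are placeholder locals, and the exact prototypes
       should be taken from include/linux/list_lru.h rather than from this
       note.]

         /* Old behaviour, now an explicit "_obj" wrapper: node and memcg
          * are derived from the object's own allocation. */
         list_lru_add_obj(&dentry->d_sb->s_dentry_lru, &dentry->d_lru);

         /* New interface: the caller names the NUMA node and memcg list. */
         list_lru_add(lru, &entry->lru, nid, memcg);

         /* Undo a prior list_lru_isolate() without bumping the node count. */
         list_lru_putback(lru, &entry->lru, nid, memcg);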
      
      Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
      
      
       Signed-off-by: Nhat Pham <nphamcs@gmail.com>
       Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0a97c01c
  10. Nov 25, 2023
    • Vegard Nossum's avatar
      dcache: remove unnecessary NULL check in dget_dlock() · 1b6ae9f6
      Vegard Nossum authored
      dget_dlock() requires dentry->d_lock to be held when called, yet
      contains a NULL check for dentry.
      
      An audit of all calls to dget_dlock() shows that it is never called
      with a NULL pointer (as spin_lock()/spin_unlock() would crash in these
      cases):
      
        $ git grep -W '\<dget_dlock\>'
      
        arch/powerpc/platforms/cell/spufs/inode.c-              spin_lock(&dentry->d_lock);
        arch/powerpc/platforms/cell/spufs/inode.c-              if (simple_positive(dentry)) {
        arch/powerpc/platforms/cell/spufs/inode.c:                      dget_dlock(dentry);
      
        fs/autofs/expire.c-             spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
        fs/autofs/expire.c-             if (simple_positive(child)) {
        fs/autofs/expire.c:                     dget_dlock(child);
      
        fs/autofs/root.c:                       dget_dlock(active);
        fs/autofs/root.c-                       spin_unlock(&active->d_lock);
      
        fs/autofs/root.c:                       dget_dlock(expiring);
        fs/autofs/root.c-                       spin_unlock(&expiring->d_lock);
      
        fs/ceph/dir.c-          if (!spin_trylock(&dentry->d_lock))
        fs/ceph/dir.c-                  continue;
        [...]
        fs/ceph/dir.c:                          dget_dlock(dentry);
      
        fs/ceph/mds_client.c-           spin_lock(&alias->d_lock);
        [...]
        fs/ceph/mds_client.c:                   dn = dget_dlock(alias);
      
        fs/configfs/inode.c-            spin_lock(&dentry->d_lock);
        fs/configfs/inode.c-            if (simple_positive(dentry)) {
        fs/configfs/inode.c:                    dget_dlock(dentry);
      
        fs/libfs.c:                             found = dget_dlock(d);
        fs/libfs.c-                     spin_unlock(&d->d_lock);
      
        fs/libfs.c:             found = dget_dlock(child);
        fs/libfs.c-     spin_unlock(&child->d_lock);
      
        fs/libfs.c:                             child = dget_dlock(d);
        fs/libfs.c-                     spin_unlock(&d->d_lock);
      
        fs/ocfs2/dcache.c:                      dget_dlock(dentry);
        fs/ocfs2/dcache.c-                      spin_unlock(&dentry->d_lock);
      
        include/linux/dcache.h:static inline struct dentry *dget_dlock(struct dentry *dentry)
      
      After taking out the NULL check, dget_dlock() becomes almost identical
      to __dget_dlock(); the only difference is that dget_dlock() returns the
      dentry that was passed in. These are static inline helpers, so we can
      rely on the compiler to discard unused return values. We can therefore
      also remove __dget_dlock() and replace calls to it by dget_dlock().
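
       [Editor's note: the surviving helper is, roughly, the following
       sketch.]

         /* Caller must hold dentry->d_lock. */
         static inline struct dentry *dget_dlock(struct dentry *dentry)
         {
                 dentry->d_lockref.count++;
                 return dentry;
         }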
      
      Also fix up and improve the kerneldoc comments while we're at it.
      
      Al Viro pointed out that we can also clean up some of the callers to
      make use of the returned value and provided a bit more info for the
      kerneldoc.
      
      While preparing v2 I also noticed that the tabs used in the kerneldoc
      comments were causing the kerneldoc to get parsed incorrectly so I also
      fixed this up (including for d_unhashed, which is otherwise unrelated).
      
      Testing: x86 defconfig build + boot; make htmldocs for the kerneldoc
      warning. objdump shows there are code generation changes.
      
      Link: https://lore.kernel.org/all/20231022164520.915013-1-vegard.nossum@oracle.com/
      
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: linux-doc@vger.kernel.org
       Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1b6ae9f6
    • Al Viro's avatar
      kill DCACHE_MAY_FREE · 1b327b5a
      Al Viro authored
      
      With the new ordering in __dentry_kill() it has become redundant -
      it's set if and only if both DCACHE_DENTRY_KILLED and DCACHE_SHRINK_LIST
      are set.
      
      We set it in __dentry_kill(), after having set DCACHE_DENTRY_KILLED
      with the only condition being that DCACHE_SHRINK_LIST is there;
      all of that is done without dropping ->d_lock and the only place
      that checks that flag (shrink_dentry_list()) does so under ->d_lock,
      after having found the victim on its shrink list.  Since DCACHE_SHRINK_LIST
      is set only when placing dentry into shrink list and removed only by
      shrink_dentry_list() itself, a check for DCACHE_DENTRY_KILLED in
      there would be equivalent to check for DCACHE_MAY_FREE.
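
       [Editor's note: in shrink_dentry_list() terms the equivalence above
       means, roughly, the following; sketch only.]

         /* dentry found on our shrink list, checked under ->d_lock */
         bool can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
         /* ...which is exactly what DCACHE_MAY_FREE used to answer */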
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1b327b5a
    • Al Viro's avatar
      __d_unalias() doesn't use inode argument · ef69f050
      Al Viro authored
      
      ... and hasn't since 2015.
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ef69f050
    • Al Viro's avatar
      d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant · f9f677c5
      Al Viro authored
      
      We only search in the damn thing under hlist_bl_lock(); RCU variant
      of insertion was, IIRC, pretty much cargo-culted - mea culpa...
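
       [Editor's note: a sketch of the simplification; 'b' stands for the
       in-lookup hash bucket and the field name is quoted from memory, so
       check the tree before relying on it.]

         hlist_bl_lock(b);
         /* lookups take the same lock, so the plain variant is enough */
         hlist_bl_add_head(&dentry->d_u.d_in_lookup_hash, b);
         hlist_bl_unlock(b);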
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f9f677c5
    • Al Viro's avatar
      get rid of DCACHE_GENOCIDE · 57851607
      Al Viro authored
      
      ... now that we never call d_genocide() other than from kill_litter_super()
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      57851607
    • Al Viro's avatar
       kill d_instantiate_anon(), fold __d_instantiate_anon() into remaining caller · f2824db1
      Al Viro authored
      
      now that the only user of d_instantiate_anon() is gone...
      [braino fix folded - kudos to Dan Carpenter]
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f2824db1
    • Al Viro's avatar
      retain_dentry(): introduce a trimmed-down lockless variant · 6367b491
      Al Viro authored
      
      	fast_dput() contains a small piece of code, preceded by scary
      comments about 5 times longer than it.	What is actually done there is
      a trimmed-down subset of retain_dentry() - in some situations we can
      tell that retain_dentry() would have returned true without ever needing
      ->d_lock and that's what that code checks.  If these checks come true
      fast_dput() can declare that we are done, without bothering with ->d_lock;
      otherwise it has to take the lock and do full variant of retain_dentry()
      checks.
      
      	Trimmed-down variant of the checks is hard to follow and
      it's asking for trouble - if we ever decide to change the rules in
      retain_dentry(), we'll have to remember to update that code.  It turns
      out that an equivalent variant of these checks more obviously parallel
      to retain_dentry() is not just possible, but easy to unify with
      retain_dentry() itself, passing it a new boolean argument ('locked')
      to distinguish between the full semantics and trimmed down one.
      
      	Note that in lockless case true is returned only when locked
      variant would have returned true without ever needing the lock; false
      means "punt to the locking path and recheck there".
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6367b491
    • Al Viro's avatar
      __dentry_kill(): new locking scheme · 1c18edd1
      Al Viro authored
      
      Currently we enter __dentry_kill() with parent (along with the victim
      dentry and victim's inode) held locked.  Then we
      	mark dentry refcount as dead
      	call ->d_prune()
      	remove dentry from hash
      	remove it from the parent's list of children
      	unlock the parent, don't need it from that point on
      	detach dentry from inode, unlock dentry and drop the inode
      (via ->d_iput())
      	call ->d_release()
      	regain the lock on dentry
      	check if it's on a shrink list (in which case freeing its empty husk
      has to be left to shrink_dentry_list()) or not (in which case we can free it
      ourselves).  In the former case, mark it as an empty husk, so that
      shrink_dentry_list() would know it can free the sucker.
      	drop the lock on dentry
      ... and usually the caller proceeds to drop a reference on the parent,
      possibly retaking the lock on it.
      
      That is painful for a bunch of reasons, starting with the need to take locks
      out of order, but not limited to that - the parent of positive dentry can
      change if we drop its ->d_lock, so getting these locks has to be done with
      care.  Moreover, as soon as dentry is out of the parent's list of children,
      shrink_dcache_for_umount() won't see it anymore, making it appear as if
      the parent is inexplicably busy.  We do work around that by having
      shrink_dentry_list() decrement the parent's refcount first and put it on
      shrink list to be evicted once we are done with __dentry_kill() of child,
      but that may in some cases lead to ->d_iput() on child called after the
      parent got killed.  That doesn't happen in cases where in-tree ->d_iput()
      instances might want to look at the parent, but that's brittle as hell.
      
      Solution: do removal from the parent's list of children in the very
      end of __dentry_kill().  As the result, the callers do not need to
      lock the parent and by the time we really need the parent locked,
      dentry is negative and is guaranteed not to be moved around.
      
      It does mean that ->d_prune() will be called with parent not locked.
      It also means that we might see dentries in process of being torn
      down while going through the parent's list of children; those dentries
      will be unhashed, negative and with refcount marked dead.  In practice,
      that's enough for in-tree code that looks through the list of children
      to do the right thing as-is.  Out-of-tree code might need to be adjusted.
      
      Calling conventions: __dentry_kill(dentry) is called with dentry->d_lock
      held, along with ->i_lock of its inode (if any).  It either returns
      the parent (locked, with refcount decremented to 0) or NULL (if there'd
      been no parent or if refcount decrement for parent hadn't reached 0).
      
      lock_for_kill() is adjusted for new requirements - it doesn't touch
      the parent's ->d_lock at all.
      
      Callers adjusted.  Note that for dput() we don't need to bother with
      fast_dput() for the parent - we just need to check retain_dentry()
      for it, since its ->d_lock is still held since the moment when
      __dentry_kill() had taken it to remove the victim from the list of
      children.
      
      The kludge with early decrement of parent's refcount in
      shrink_dentry_list() is no longer needed - shrink_dcache_for_umount()
      sees the half-killed dentries in the list of children for as long
      as they are pinning the parent.  They are easily recognized and
      accounted for by select_collect(), so we know we are not done yet.
      
      As the result, we always have the expected ordering for ->d_iput()/->d_release()
      vs. __dentry_kill() of the parent, no exceptions.  Moreover, the current
      rules for shrink lists (one must make sure that shrink_dcache_for_umount()
      won't happen while any dentries from the superblock in question are on
      any shrink lists) are gone - shrink_dcache_for_umount() will do the
      right thing in all cases, taking such dentries out.  Their empty
      husks (memory occupied by struct dentry itself + its external name,
      if any) will remain on the shrink lists, but they are no obstacles
      to filesystem shutdown.  And such husks will get freed as soon as
      shrink_dentry_list() of the list they are on gets to them.
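
       [Editor's note: the calling convention above, seen from a caller;
       sketch only.]

         struct dentry *parent;

         /* dentry->d_lock and its inode's ->i_lock are held at this point */
         parent = __dentry_kill(dentry);
         if (!parent)
                 return;         /* no parent, or parent still has references */
         /* parent comes back locked with its refcount already at zero; the
          * caller decides whether to kill it in turn or unlock and keep it */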
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1c18edd1
    • Al Viro's avatar
      d_prune_aliases(): use a shrink list · b4cc0734
      Al Viro authored
      
      Instead of dropping aliases one by one, restarting, etc., just
      collect them into a shrink list and kill them off in one pass.
      
      We don't really need the restarts - one alias can't pin another
      (directory has only one alias, and couldn't be its own ancestor
      anyway), so collecting everything that is not busy and taking it
      out would take care of everything evictable that had been there
      as we entered the function.  And new aliases added while we'd
      been dropping old ones could just as easily have appeared right
      as we return to caller...
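
       [Editor's note: a sketch of the one-pass scheme described above, using
       the to_shrink_list() helper from the neighbouring commits; an
       approximation, not a verbatim copy of the patch.]

         void d_prune_aliases(struct inode *inode)
         {
                 LIST_HEAD(dispose);
                 struct dentry *dentry;

                 spin_lock(&inode->i_lock);
                 hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
                         spin_lock(&dentry->d_lock);
                         if (!dentry->d_lockref.count)
                                 to_shrink_list(dentry, &dispose);
                         spin_unlock(&dentry->d_lock);
                 }
                 spin_unlock(&inode->i_lock);
                 shrink_dentry_list(&dispose);
         }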
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b4cc0734
    • Al Viro's avatar
      to_shrink_list(): call only if refcount is 0 · c2e5e29f
      Al Viro authored
      
      The only thing it does if refcount is not zero is d_lru_del(); no
      point, IMO, seeing that plain dput() does nothing of that sort...
      
      Note that 2 of 3 current callers are guaranteed that refcount is 0.
      
       Acked-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c2e5e29f
    • Al Viro's avatar
      fold dentry_kill() into dput() · 5e7a5c8d
      Al Viro authored
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5e7a5c8d
    • Al Viro's avatar
      don't try to cut corners in shrink_lock_dentry() · 339e9e13
      Al Viro authored
      
      That is to say, do *not* treat the ->d_inode or ->d_parent changes
      as "it's hard, return false; somebody must have grabbed it, so
      even if has zero refcount, we don't need to bother killing it -
      final dput() from whoever grabbed it would've done everything".
      
      First of all, that is not guaranteed.  It might have been dropped
      by shrink_kill() handling of victim's parent, which would've found
      it already on a shrink list (ours) and decided that they don't need
      to put it on their shrink list.
      
      What's more, dentry_kill() is doing pretty much the same thing,
      cutting its own set of corners (it assumes that dentry can't
      go from positive to negative, so its inode can change but only once
      and only in one direction).
      
       Doing that right allows us to get rid of that not-quite-duplication
      and removes the only reason for re-incrementing refcount before
      the call of dentry_kill().
      
      Replacement is called lock_for_kill(); called under rcu_read_lock
      and with ->d_lock held.  If it returns false, dentry has non-zero
      refcount and the same locks are held.  If it returns true,
      dentry has zero refcount and all locks required by __dentry_kill()
      are taken.
      
      Part of __lock_parent() had been lifted into lock_parent() to
      allow its reuse.  Now it's called with rcu_read_lock already
      held and dentry already unlocked.
      
      Note that this is not the final change - locking requirements for
      __dentry_kill() are going to change later in the series and the
      set of locks taken by lock_for_kill() will be adjusted.  Both
      lock_parent() and __lock_parent() will be gone once that happens.
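
       [Editor's note: the stated contract, from the caller's side; sketch
       only.]

         rcu_read_lock();
         spin_lock(&dentry->d_lock);
         if (!lock_for_kill(dentry)) {
                 /* refcount turned out non-zero; the same locks are still held */
                 spin_unlock(&dentry->d_lock);
                 rcu_read_unlock();
                 return;
         }
         /* refcount is zero and every lock __dentry_kill() needs is held */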
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      339e9e13
    • Al Viro's avatar
      fold the call of retain_dentry() into fast_dput() · f05441c7
      Al Viro authored
      
      Calls of retain_dentry() happen immediately after getting false
      from fast_dput() and getting true from retain_dentry() is
      treated the same way as non-zero refcount would be treated by
      fast_dput() - unlock dentry and bugger off.
      
      Doing that in fast_dput() itself is simpler.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f05441c7
    • Al Viro's avatar
      Call retain_dentry() with refcount 0 · 2f42f1eb
      Al Viro authored
      
      Instead of bumping it from 0 to 1, calling retain_dentry(), then
      decrementing it back to 0 (with ->d_lock held all the way through),
      just leave refcount at 0 through all of that.
      
      It will have a visible effect for ->d_delete() - now it can be
      called with refcount 0 instead of 1 and it can no longer play
      silly buggers with dropping/regaining ->d_lock.  Not that any
      in-tree instances tried to (it's pretty hard to get right).
      
      Any out-of-tree ones will have to adjust (assuming they need any
      changes).
      
      Note that we do not need to extend rcu-critical area here - we have
      verified that refcount is non-negative after having grabbed ->d_lock,
      so nobody will be able to free dentry until they get into __dentry_kill(),
      which won't happen until they manage to grab ->d_lock.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2f42f1eb
    • Al Viro's avatar
      dentry_kill(): don't bother with retain_dentry() on slow path · b06c684d
      Al Viro authored
      
      We have already checked it and dentry used to look not worthy
      of keeping.  The only hard obstacle to evicting dentry is
      non-zero refcount; everything else is advisory - e.g. memory
      pressure could evict any dentry found with refcount zero.
      On the slow path in dentry_kill() we had dropped and regained
      ->d_lock; we must recheck the refcount, but everything else
      is not worth bothering with.
      
       Note that a filesystem cannot count upon ->d_delete() being
       called for a dentry - not even once.  Again, memory pressure
      (as well as d_prune_aliases(), or attempted rmdir() of ancestor,
      or...) will not call ->d_delete() at all.
      
      So from the correctness point of view we are fine doing the
      check only once.  And it makes things simpler down the road.
      
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b06c684d
    • Al Viro's avatar
      __dentry_kill(): get consistent rules for victim's refcount · ee0c8250
      Al Viro authored
      
      Currently we call it with refcount equal to 1 when called from
      dentry_kill(); all other callers have it equal to 0.
      
      Make it always be called with zero refcount; on this step we
      just decrement it before the calls in dentry_kill().  That is
      safe, since all places that care about the value of refcount
      either do that under ->d_lock or hold a reference to dentry
      in question.  Either is sufficient to prevent observing a
      dentry immediately prior to __dentry_kill() getting called
      from dentry_kill().
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ee0c8250
    • Al Viro's avatar
      make retain_dentry() neutral with respect to refcounting · e9d130d0
      Al Viro authored
      
      retain_dentry() used to decrement refcount if and only if it returned
      true.  Lift those decrements into the callers.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e9d130d0
    • Al Viro's avatar
      __dput_to_list(): do decrement of refcount in the callers · 6511f6be
      Al Viro authored
      
       ... and rename it to to_shrink_list(), seeing that it no longer
       drops any references.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6511f6be
    • Al Viro's avatar
      fast_dput(): new rules for refcount · 15f23734
      Al Viro authored
      
      By now there is only one place in entire fast_dput() where we return
      false; that happens after refcount had been decremented and found (under
      ->d_lock) to be zero.  In that case, just prior to returning false to
      caller, fast_dput() forcibly changes the refcount from 0 to 1.
      
      Lift that resetting refcount to 1 into the callers; later in the series
      it will be massaged out of existence.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      15f23734
    • Al Viro's avatar
      fast_dput(): handle underflows gracefully · 504e08ce
      Al Viro authored
      
      If refcount is less than 1, we should just warn, unlock dentry and
      return true, so that the caller doesn't try to do anything else.
      
      Taking care of that leaves the rest of "lockref_put_return() has
      failed" case equivalent to "decrement refcount and rejoin the
      normal slow path after the point where we grab ->d_lock".
      
      NOTE: lockref_put_return() is strictly a fastpath thing - unlike
      the rest of lockref primitives, it does not contain a fallback.
      Caller (and it looks like fast_dput() is the only legitimate one
      in the entire kernel) has to do that itself.  Reasons for
      lockref_put_return() failures:
      	* ->d_lock held by somebody
      	* refcount <= 0
      	* ... or an architecture not supporting lockref use of
      cmpxchg - sparc, anything non-SMP, config with spinlock debugging...
      
      We could add a fallback, but it would be a clumsy API - we'd have
      to distinguish between:
      	(1) refcount > 1 - decremented, lock not held on return
      	(2) refcount < 1 - left alone, probably no sense to hold the lock
       	(3) refcount is 1, no cmpxchg - decremented, lock held on return
       	(4) refcount is 1, cmpxchg supported - decremented, lock *NOT* held
      	    on return.
      We want to return with no lock held in case (4); that's the whole point of that
      thing.  We very much do not want to have the fallback in case (3) return without
      a lock, since the caller might have to retake it in that case.
      So it wouldn't be more convenient than doing the fallback in the caller and
      it would be very easy to screw up, especially since the test coverage would
      suck - no way to test (3) and (4) on the same kernel build.
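
       [Editor's note: the fallback described above ends up looking roughly
       like this inside fast_dput(); a sketch, not a verbatim copy.]

         int ret = lockref_put_return(&dentry->d_lockref);
         if (unlikely(ret < 0)) {                /* fastpath refused */
                 spin_lock(&dentry->d_lock);
                 if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
                         spin_unlock(&dentry->d_lock);
                         return true;            /* underflow: warn and bail */
                 }
                 /* rejoin the normal slow path past the ->d_lock acquisition */
                 dentry->d_lockref.count--;
         } else if (likely(ret)) {
                 return true;                    /* refcount still positive */
         }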
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      504e08ce
    • Al Viro's avatar
      fast_dput(): having ->d_delete() is not reason to delay refcount decrement · 15220fbf
      Al Viro authored
      
      ->d_delete() is a way for filesystem to tell that dentry is not worth
      keeping cached.  It is not guaranteed to be called every time a dentry
      has refcount drop down to zero; it is not guaranteed to be called before
      dentry gets evicted.  In other words, it is not suitable for any kind
      of keeping track of dentry state.
      
      None of the in-tree filesystems attempt to use it that way, fortunately.
      
      So the contortions done by fast_dput() (as well as dentry_kill()) are
      not warranted.  fast_dput() certainly should treat having ->d_delete()
      instance as "can't assume we'll be keeping it", but that's not different
       from the way we treat e.g. DCACHE_DONTCACHE (which is rather similar
       to making ->d_delete() return true when called).
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      15220fbf
    • Al Viro's avatar
      shrink_dentry_list(): no need to check that dentry refcount is marked dead · cd9f84f3
      Al Viro authored
      
      ... we won't see DCACHE_MAY_FREE on anything that is *not* dead
      and checking d_flags is just as cheap as checking refcount.
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cd9f84f3
    • Al Viro's avatar
      centralize killing dentry from shrink list · 3fcf5356
      Al Viro authored
      
      new helper unifying identical bits of shrink_dentry_list() and
       shrink_dcache_for_umount()
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3fcf5356
    • Al Viro's avatar
      dentry: switch the lists of children to hlist · da549bdd
      Al Viro authored
      
       Saves a pointer per struct dentry and actually makes things less
       clumsy.  Cleaned up d_walk() and dcache_readdir() a bit by use
       of hlist_for_... iterators.
      
      A couple of new helpers - d_first_child() and d_next_sibling(),
      to make the expressions less awful.
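
       [Editor's note: a sketch of walking a parent's children after the
       switch; the d_children/d_sib field names are quoted from memory, so
       verify them against the patch.]

         struct dentry *child;

         hlist_for_each_entry(child, &parent->d_children, d_sib) {
                 /* one pointer less per dentry than the old list_head pair */
         }

         /* the new convenience helpers mentioned above: */
         child = d_first_child(parent);
         child = d_next_sibling(child);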
      
       Reviewed-by: Christian Brauner <brauner@kernel.org>
       Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      da549bdd
  11. Nov 18, 2023
  12. Oct 18, 2023
  13. Sep 11, 2023
  14. Aug 19, 2023