- Jul 26, 2024
-
-
Ryusuke Konishi authored
Syzbot reported that a buffer state inconsistency was detected in nilfs_btnode_create_block(), triggering a kernel bug. It is not appropriate to treat this inconsistency as a bug; it can occur if the argument block address (the buffer index of the newly created block) is a virtual block number and has been reallocated due to corruption of the bitmap used to manage its allocation state. So, modify nilfs_btnode_create_block() and its callers to treat it as a possible filesystem error, rather than triggering a kernel bug. Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com Fixes: a60be987 ("nilfs2: B-tree node cache") Signed-off-by:
Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by:
<syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60 Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 24, 2024
-
-
Joel Granados authored
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - stru...
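To make the transformation concrete, here is a hedged before/after of a typical handler signature in C; the handler name is hypothetical, and only the const qualifier on the first parameter changes:
```
/* Before: the handler could, in principle, modify the table entry. */
static int example_dointvec_handler(struct ctl_table *ctl, int write,
				    void *buffer, size_t *lenp, loff_t *ppos);

/* After: the table is read-only from the handler's point of view, which is
 * what lets the static ctl_table arrays eventually move into .rodata. */
static int example_dointvec_handler(const struct ctl_table *ctl, int write,
				    void *buffer, size_t *lenp, loff_t *ppos);
```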
-
Linus Torvalds authored
Commit e3ec0fe9 ("hostfs: Convert hostfs_read_folio() to use a folio") simplified hostfs_read_folio(), but in the process of converting to using folios natively also mis-used the folio_zero_tail() function due to the very confusing API of that function. Very arguably it's folio_zero_tail() API itself that is buggy, since it would make more sense (and the documentation kind of implies) that the third argument would be the pointer to the beginning of the folio buffer. But no, the third argument to folio_zero_tail() is where we should start zeroing the tail (even if we already also pass in the offset separately as the second argument). So fix the hostfs caller, and we can leave any folio_zero_tail() sanity cleanup for later. Reported-and-tested-by:
Maciej Żenczykowski <maze@google.com> Fixes: e3ec0fe9 ("hostfs: Convert hostfs_read_folio() to use a folio") Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
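For reference, a hedged sketch of the call pattern implied by the fix above; the helper name is hypothetical and the hostfs-specific read is elided. The point is that folio_zero_tail()'s third argument is the mapped address at which zeroing starts, not the base of the folio buffer:
```
/* Sketch: zero the part of a folio that a short read left unfilled.
 * 'bytes_read' is how many bytes were actually read into the folio. */
static void zero_read_tail(struct folio *folio, size_t bytes_read)
{
	void *kaddr = kmap_local_folio(folio, 0);

	/* Both the offset and the address corresponding to that offset are
	 * passed; folio_zero_tail() returns an address for kunmap_local(). */
	kaddr = folio_zero_tail(folio, bytes_read, kaddr + bytes_read);
	kunmap_local(kaddr);
}
```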
-
Christian Brauner authored
In __wait_on_freeing_inode() we warn in case the inode_hash_lock is held but the inode is unhashed. We then release the inode_lock. So using "locked" as parameter name is confusing. Use is_inode_hash_locked as parameter name instead. Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
When using cachefiles, lockdep may emit something similar to the circular locking dependency notice below. The problem appears to stem from the following: (1) Cachefiles manipulates xattrs on the files in its cache when called from ->writepages(). (2) The setxattr() and removexattr() system call handlers get the name (and value) from userspace after taking the sb_writers lock, putting accesses of the vma->vm_lock and mm->mmap_lock inside of that. (3) The afs filesystem uses a per-inode lock to prevent multiple revalidation RPCs and in writeback vs truncate to prevent parallel operations from deadlocking against the server on one side and local page locks on the other. Fix this by moving the getting of the name and value in {get,remove}xattr() outside of the sb_writers lock. This also has the minor benefits that we don't need to reget these in the event of a retry and we never try to take the sb_writers lock in the event we can't pull the name and value into the kernel. Alternative approaches that might fix this include moving the dispatch of a write to the cache off to a workqueue or trying to do without the validation lock in afs. Note that this might also affect other filesystems that use netfslib and/or cachefiles. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-build2+ #956 Not tainted ------------------------------------------------------ fsstress/6050 is trying to acquire lock: ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0 but task is already holding lock: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&vma->vm_lock->lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_write+0x3b/0x50 vma_start_write+0x6b/0xa0 vma_link+0xcc/0x140 insert_vm_struct+0xb7/0xf0 alloc_bprm+0x2c1/0x390 kernel_execve+0x65/0x1a0 call_usermodehelper_exec_async+0x14d/0x190 ret_from_fork+0x24/0x40 ret_from_fork_asm+0x1a/0x30 -> #3 (&mm->mmap_lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 __might_fault+0x7c/0xb0 strncpy_from_user+0x25/0x160 removexattr+0x7f/0x100 __do_sys_fremovexattr+0x7e/0xb0 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (sb_writers#14){.+.+}-{0:0}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 percpu_down_read+0x3c/0x90 vfs_iocb_iter_write+0xe9/0x1d0 __cachefiles_write+0x367/0x430 cachefiles_issue_write+0x299/0x2f0 netfs_advance_write+0x117/0x140 netfs_write_folio.isra.0+0x5ca/0x6e0 netfs_writepages+0x230/0x2f0 afs_writepages+0x4d/0x70 do_writepages+0x1e8/0x3e0 filemap_fdatawrite_wbc+0x84/0xa0 __filemap_fdatawrite_range+0xa8/0xf0 file_write_and_wait_range+0x59/0x90 afs_release+0x10f/0x270 __fput+0x25f/0x3d0 __do_sys_close+0x43/0x70 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&vnode->validate_lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_read+0x95/0x200 afs_writepages+0x37/0x70 do_writepages+0x1e8/0x3e0 filemap_fdatawrite_wbc+0x84/0xa0 filemap_invalidate_inode+0x167/0x1e0 netfs_unbuffered_write_iter+0x1bd/0x2d0 vfs_write+0x22e/0x320 ksys_write+0xbc/0x130 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (mapping.invalidate_lock#3){++++}-{3:3}: check_noncircular+0x119/0x160 check_prev_add+0x195/0x430 __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 
down_read+0x95/0x200 filemap_fault+0x26e/0x8b0 __do_fault+0x57/0xd0 do_pte_missing+0x23b/0x320 __handle_mm_fault+0x2d4/0x320 handle_mm_fault+0x14f/0x260 do_user_addr_fault+0x2a2/0x500 exc_page_fault+0x71/0x90 asm_exc_page_fault+0x22/0x30 other info that might help us debug this: Chain exists of: mapping.invalidate_lock#3 --> &mm->mmap_lock --> &vma->vm_lock->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(&vma->vm_lock->lock); lock(&mm->mmap_lock); lock(&vma->vm_lock->lock); rlock(mapping.invalidate_lock#3); *** DEADLOCK *** 1 lock held by fsstress/6050: #0: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250 stack backtrace: CPU: 0 PID: 6050 Comm: fsstress Not tainted 6.10.0-build2+ #956 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x80 check_noncircular+0x119/0x160 ? queued_spin_lock_slowpath+0x4be/0x510 ? __pfx_check_noncircular+0x10/0x10 ? __pfx_queued_spin_lock_slowpath+0x10/0x10 ? mark_lock+0x47/0x160 ? init_chain_block+0x9c/0xc0 ? add_chain_block+0x84/0xf0 check_prev_add+0x195/0x430 __lock_acquire+0xaf0/0xd80 ? __pfx___lock_acquire+0x10/0x10 ? __lock_release.isra.0+0x13b/0x230 lock_acquire.part.0+0x103/0x280 ? filemap_fault+0x26e/0x8b0 ? __pfx_lock_acquire.part.0+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? lock_acquire+0xd7/0x120 down_read+0x95/0x200 ? filemap_fault+0x26e/0x8b0 ? __pfx_down_read+0x10/0x10 ? __filemap_get_folio+0x25/0x1a0 filemap_fault+0x26e/0x8b0 ? __pfx_filemap_fault+0x10/0x10 ? find_held_lock+0x7c/0x90 ? __pfx___lock_release.isra.0+0x10/0x10 ? __pte_offset_map+0x99/0x110 __do_fault+0x57/0xd0 do_pte_missing+0x23b/0x320 __handle_mm_fault+0x2d4/0x320 ? __pfx___handle_mm_fault+0x10/0x10 handle_mm_fault+0x14f/0x260 do_user_addr_fault+0x2a2/0x500 exc_page_fault+0x71/0x90 asm_exc_page_fault+0x22/0x30 Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/2136178.1721725194@warthog.procyon.org.uk cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <brauner@kernel.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: Gao Xiang <xiang@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-erofs@lists.ozlabs.org cc: linux-fsdevel@vger.kernel.org [brauner: fix minor issues] Signed-off-by:
Christian Brauner <brauner@kernel.org>
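Below is a hedged sketch of the reordering described in the entry above, shown for the removexattr() side; the function name and surrounding details are assumptions rather than the actual patch. The point is that the userspace copy, which may fault and take mm->mmap_lock / vma->vm_lock, now happens before sb_writers is taken via mnt_want_write():
```
/* Sketch only: copy the xattr name in before taking write access. */
static int removexattr_reordered(struct mnt_idmap *idmap,
				 const struct path *path,
				 const char __user *uname)
{
	char kname[XATTR_NAME_MAX + 1];
	int error;

	/* May fault, so do it outside of the sb_writers freeze protection. */
	error = strncpy_from_user(kname, uname, sizeof(kname));
	if (error == 0 || error == sizeof(kname))
		return -ERANGE;
	if (error < 0)
		return error;

	/* Only now take sb_writers; the user copy can no longer nest inside it. */
	error = mnt_want_write(path->mnt);
	if (error)
		return error;
	error = vfs_removexattr(idmap, path->dentry, kname);
	mnt_drop_write(path->mnt);
	return error;
}
```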
-
Jann Horn authored
When I wrote commit 3cad1bc0 ("filelock: Remove locks reliably when fcntl/close race is detected"), I missed that there are two copies of the code I was patching: The normal version, and the version for 64-bit offsets on 32-bit kernels. Thanks to Greg KH for stumbling over this while doing the stable backport... Apply exactly the same fix to the compat path for 32-bit kernels. Fixes: c293621b ("[PATCH] stale POSIX lock handling") Cc: stable@kernel.org Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563 Signed-off-by:
Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20240723-fs-lock-recover-compatfix-v1-1-148096719529@google.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Christian Brauner authored
The counter is unconditionally incremented for each mount allocation. If we set it to 1ULL << 32 we're losing 4294967296 as the first valid non-32 bit mount id. Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-1-834113cab0d2@kernel.org Reviewed-by:
Josef Bacik <josef@toxicpanda.com> Reviewed-by:
Jeff Layton <jlayton@kernel.org> Signed-off-by:
Christian Brauner <brauner@kernel.org>
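A hedged arithmetic illustration of the point above, with a plain counter standing in for the kernel's atomic mount-id counter: with unconditional pre-increment, starting at 1ULL << 32 means the first id handed out is (1ULL << 32) + 1, so 4294967296 itself is never allocated.
```
#include <stdio.h>

int main(void)
{
	/* Stand-in for the mount id counter; the real one is atomic. */
	unsigned long long mnt_id_ctr = 1ULL << 32;

	/* Allocation increments unconditionally, then uses the value. */
	unsigned long long first_id = ++mnt_id_ctr;

	printf("first allocated id: %llu\n", first_id);   /* 4294967297 */
	printf("never allocated:    %llu\n", 1ULL << 32); /* 4294967296 */
	return 0;
}
```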
-
David Howells authored
Set the maximum size of a subrequest that writes to cachefiles to be MAX_RW_COUNT so that we don't overrun the maximum write we can make to the backing filesystem. Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1599005.1721398742@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
When netfslib is performing writeback (ie. ->writepages), it maintains two parallel streams of writes, one to the server and one to the cache, but it doesn't mark either stream of writes as active until it gets some data that needs to be written to that stream. This is done because some folios will only be written to the cache (e.g. copying to the cache on read is done by marking the folios and letting writeback do the actual work) and sometimes we'll only be writing to the server (e.g. if there's no cache). Now, since we don't actually dispatch uploads and cache writes in parallel, but rather flip between the streams, depending on which has the lowest so-far-issued offset, and don't wait for the subreqs to finish before flipping, we can end up in a situation where, say, we issue a write to the server and this completes before we start the write to the cache. But because we only activate a stream when we first add a subreq to it, the result...
-
Christian Brauner authored
The nsproxy structure contains nearly all of the namespaces associated with a task. When a given namespace type is not supported by this kernel, the rules for whether the corresponding pointer in struct nsproxy is NULL or always init_<ns_type>_ns differ per namespace. Ideally, that wouldn't be the case and for all namespace types we'd always set it to init_<ns_type>_ns when the corresponding namespace type isn't supported. Make sure we handle all namespaces where the pointer in struct nsproxy can be NULL when the namespace type isn't supported. Link: https://lore.kernel.org/r/20240722-work-pidfs-e6a83030f63e@brauner Fixes: 5b08bd40 ("pidfs: allow retrieval of namespace file descriptors") # mainline only Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Edward Adam Davis authored
syzbot called pidfd_ioctl() with cmd "PIDFD_GET_TIME_NAMESPACE" on a kernel with CONFIG_TIME_NS disabled; since time_ns is NULL there, this causes a NULL pointer dereference in open_namespace(). Fixes: 5b08bd40 ("pidfs: allow retrieval of namespace file descriptors") # mainline only Reported-and-tested-by:
<syzbot+34a0ee986f61f15da35d@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=34a0ee986f61f15da35d Signed-off-by:
Edward Adam Davis <eadavis@qq.com> Link: https://lore.kernel.org/r/tencent_7FAE8DB725EE0DD69236DDABDDDE195E4F07@qq.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
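A hedged sketch of the kind of guard the fix implies before handing the namespace to open_namespace(); the helper name, error code and refcounting details are assumptions:
```
/* Sketch: PIDFD_GET_TIME_NAMESPACE with CONFIG_TIME_NS=n leaves
 * nsproxy->time_ns NULL, so bail out instead of dereferencing it. */
static struct ns_common *pidfd_time_ns_or_err(struct nsproxy *nsp)
{
	if (!nsp->time_ns)
		return ERR_PTR(-EOPNOTSUPP);
	return &nsp->time_ns->ns;
}
```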
-
Congjie Zhou authored
Correct the comments of the vfs_*() helpers in fs/namei.c, including: 1. vfs_create() 2. vfs_mknod() 3. vfs_mkdir() 4. vfs_rmdir() 5. vfs_symlink() All of them come from the same commit: 6521f891 ("namei: prepare for idmapped mounts"). The @dentry is actually the dentry of the child directory rather than of the base (parent) directory, and thus the description of @dir has to be adjusted to match the change to @dentry. Signed-off-by:
Congjie Zhou <zcjie0802@qq.com> Link: https://lore.kernel.org/r/tencent_2FCF6CC9E10DC8A27AE58A5A0FE4FCE96D0A@qq.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
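As an illustration of the comment fix above, a hedged example of how the corrected vfs_mkdir() kerneldoc might read; the exact wording is an assumption, the point is that @dentry names the child while @dir is the parent's inode:
```
/**
 * vfs_mkdir - create directory
 * @idmap:	idmap of the mount the inode was found from
 * @dir:	inode of the parent directory
 * @dentry:	dentry of the child directory to be created
 * @mode:	mode of the child directory
 */
```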
-
Mateusz Guzik authored
Lockless hash lookup can find and lock the inode after it gets the I_FREEING flag set, at which point it blocks waiting for teardown in evict() to finish. However, the flag is still set even after evict() wakes up all waiters. This results in a race where if the inode lock is taken late enough, it can happen after both hash removal and wakeups, meaning there is nobody to wake the racing thread up. This worked prior to RCU-based lookup because the entire ordeal was synchronized with the inode hash lock. Since unhashing requires the inode lock, we can safely check whether it happened after acquiring it. Link: https://lore.kernel.org/v9fs/20240717102458.649b60be@kernel.org/ Reported-by:
Dominique Martinet <asmadeus@codewreck.org> Fixes: 7180f8d9 ("vfs: add rcu-based find_inode variants for iget ops") Signed-off-by:
Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240718151838.611807-1-mjguzik@gmail.com Reviewed-by: Jan Kara <j...
-
David Howells authored
CONFIG_FSCACHE_DEBUG should have been renamed to CONFIG_NETFS_DEBUG, so do that now. Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1410796.1721333406@warthog.procyon.org.uk cc: Uwe Kleine-König <ukleinek@kernel.org> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
David Howells authored
Revert commit 163eae0f to get back the original operation of the debugging macros. Signed-off-by:
David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240608151352.22860-2-ukleinek@kernel.org Link: https://lore.kernel.org/r/1410685.1721333252@warthog.procyon.org.uk cc: Uwe Kleine-König <ukleinek@kernel.org> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
- Jul 22, 2024
-
-
Kees Cook authored
Move the exec KUnit tests into a separate directory to avoid polluting the local directory namespace. Additionally update MAINTAINERS for the new files. Reviewed-by:
David Gow <davidgow@google.com> Reviewed-by:
SeongJae Park <sj@kernel.org> Acked-by:
Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20240720170310.it.942-kees@kernel.org Signed-off-by:
Kees Cook <kees@kernel.org>
-
Kent Overstreet authored
Reported-by:
<syzbot+f765e51170cf13493f0b@syzkaller.appspotmail.com> Fixes: f12410bb ("bcachefs: Add an error message for insufficient rw journal devs") Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
- Jul 20, 2024
-
-
David Howells authored
A network filesystem needs to implement a netfslib hook to invalidate fscache if it's to be able to use the cache. Fix cifs to implement the cache invalidation hook. Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib") Signed-off-by:
Steve French <stfrench@microsoft.com>
-
- Jul 19, 2024
-
-
Jason A. Donenfeld authored
The vDSO getrandom() implementation works with a buffer allocated with a new system call that has certain requirements:
  - It shouldn't be written to core dumps.
    * Easy: VM_DONTDUMP.
  - It should be zeroed on fork.
    * Easy: VM_WIPEONFORK.
  - It shouldn't be written to swap.
    * Uh-oh: mlock is rlimited.
    * Uh-oh: mlock isn't inherited by forks.
  - It shouldn't reserve actual memory, but it also shouldn't crash when page faulting in memory if none is available.
    * Uh-oh: VM_NORESERVE means segfaults.
It turns out that the vDSO getrandom() function has three really nice characteristics that we can exploit to solve this problem:
  1) Due to being wiped during fork(), the vDSO code is already robust to having the contents of the pages it reads zeroed out midway through the function's execution.
  2) In the absolute worst case of whatever contingency we're coding for, we have the option to fallback to the getrandom() syscall, and everything is fine.
  3) The buffers the function uses are only ever useful for a maximum of 60 seconds -- a sort of cache, rather than a long term allocation.
These characteristics mean that we can introduce VM_DROPPABLE, which has the following semantics:
  a) It never is written out to swap.
  b) Under memory pressure, mm can just drop the pages (so that they're zero when read back again).
  c) It is inherited by fork.
  d) It doesn't count against the mlock budget, since nothing is locked.
  e) If there's not enough memory to service a page fault, it's not fatal, and no signal is sent.
This way, allocations used by vDSO getrandom() can use: VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
And there will be no problem with OOMing, crashing on overcommitment, using memory when not in use, not wiping on fork(), coredumps, or writing out to swap. In order to let vDSO getrandom() use this, expose these via mmap(2) as MAP_DROPPABLE. Note that this involves removing the MADV_FREE special case from sort_folio(), which according to Yu Zhao is unnecessary and will simply result in an extra call to shrink_folio_list() in the worst case. The chunk removed reenables the swapbacked flag, which we don't want for VM_DROPPABLE, and we can't conditionalize it here because there isn't a vma reference available. Finally, the provided self test ensures that this is working as desired. Cc: linux-mm@kvack.org Acked-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
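A hedged userspace sketch of the new mapping type; it assumes headers and a kernel that already provide MAP_DROPPABLE:
```
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;

	/* Droppable anonymous memory: never swapped, may be dropped under
	 * memory pressure (reads back as zeroes), exempt from the mlock
	 * budget, and a failed fault is not fatal. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE | MAP_DROPPABLE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_DROPPABLE)");
		return 1;
	}

	/* Only cache re-derivable data here, as vDSO getrandom() does. */
	memset(p, 0xaa, len);
	munmap(p, len);
	return 0;
}
```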
-
David Howells authored
Add a tracepoint to track the credit changes and server in_flight value involved in the lifetime of a R/W request, logging it against the request/subreq debugging ID. This requires the debugging IDs to be recorded in the cifs_credits struct. The tracepoint can be enabled with: echo 1 >/sys/kernel/debug/tracing/events/cifs/smb3_rw_credits/enable Also add a three-state flag to struct cifs_credits to note if we're interested in determining when the in_flight contribution ends and, if so, to track whether we've decremented the contribution yet. Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Steve French <stfrench@microsoft.com>
-
David Howells authored
At the moment, at the end of a DIO write, cifs calls netfs_resize_file() to adjust the size of the file if it needs it. This will reduce the zero_point (the point above which we assume a read will just return zeros) if it's more than the new i_size, but won't increase it. With DIO writes, however, we definitely want to increase it as we have clobbered the local pagecache and then written some data that's not available locally. Fix cifs to make the zero_point above the end of a DIO or unbuffered write. This fixes corruption seen occasionally with the generic/708 xfs-test. In that case, the read-back of some of the written data is being short-circuited and replaced with zeroes. Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib") Cc: stable@vger.kernel.org Reported-by:
Steve French <sfrench@samba.org> Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Steve French <stfrench@microsoft.com>
-
David Howells authored
In cifs_strict_readv(), the default rc (-EACCES) is accidentally cleared by a successful return from netfs_start_io_direct(), such that if cifs_find_lock_conflict() fails, we don't return an error. Fix this by resetting the default error code. Fixes: 14b1cd25 ("cifs: Fix locking in cifs_strict_readv()") Cc: stable@vger.kernel.org Signed-off-by:
David Howells <dhowells@redhat.com> Reviewed-by:
Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Steve French <stfrench@microsoft.com>
-
David Howells authored
When a subrequest is marked for needing retry, netfs will call cifs_prepare_write() which will make cifs repick the server for the op before renegotiating credits; it then calls cifs_issue_write() which invokes smb2_async_writev() - which re-repicks the server. If a different server is then selected, this causes the increment of server->in_flight to happen against one record and the decrement to happen against another, leading to misaccounting. Fix this by just removing the repick code in smb2_async_writev(). As this is only called from netfslib-driven code, cifs_prepare_write() should always have been called first, and so server should never be NULL and the preparatory step is repeated in the event that we do a retry. The problem manifests as a warning looking something like: WARNING: CPU: 4 PID: 72896 at fs/smb/client/smb2ops.c:97 smb2_add_credits+0x3f0/0x9e0 [cifs] ... RIP: 0010:smb2_add_credits+0x3f0/0x9e0 [cifs] ... smb2_writev_callback+0x334/0x560 [cifs] cifs_demultiplex_thread+0x77a/0x11b0 [cifs] kthread+0x187/0x1d0 ret_from_fork+0x34/0x60 ret_from_fork_asm+0x1a/0x30 Which may be triggered by a number of different xfstests running against an Azure server in multichannel mode. generic/249 seems the most repeatable, but generic/215, generic/249 and generic/308 may also show it. Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib") Cc: stable@vger.kernel.org Reported-by:
Steve French <smfrench@gmail.com> Reviewed-by:
Paulo Alcantara (Red Hat) <pc@manguebit.com> Acked-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Aurelien Aptel <aaptel@suse.com> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Steve French <stfrench@microsoft.com>
-
Steve French authored
There are common cases where copy_file_range can noisily log "source and target of copy not on same server", e.g. the mv command across mounts to two different servers' shares. Change this to informational rather than logging it as an error. A follow-on patch will add dynamic trace points, e.g. for cifs_file_copychunk_range. Cc: stable@vger.kernel.org Reviewed-by:
Shyam Prasad N <sprasad@microsoft.com> Signed-off-by:
Steve French <stfrench@microsoft.com>
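A hedged before/after of the logging change described above; the exact call site is an assumption, but the mechanism is just dropping the cifs_dbg() severity from VFS (error) to FYI (informational):
```
	/* Before (sketch): a routine cross-server mv shows up as an error. */
	cifs_dbg(VFS, "source and target of copy not on same server\n");

	/* After (sketch): logged as informational only. */
	cifs_dbg(FYI, "source and target of copy not on same server\n");
```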
-
Qu Wenruo authored
The BTRFS_MOUNT_* flags now extend beyond 32 bits. This causes compilation errors on some 32-bit systems, as their unsigned long is only 32 bits wide, so the flag BTRFS_MOUNT_IGNORESUPERFLAGS overflows and can lead to errors. Fix the problem by:
  - Migrating all existing BTRFS_MOUNT_* flags to unsigned long long
  - Migrating all mount option related variables to unsigned long long:
    * btrfs_fs_info::mount_opt
    * btrfs_fs_context::mount_opt
    * mount_opt parameter of btrfs_check_options()
    * old_opts parameter of btrfs_remount_begin()
    * old_opts parameter of btrfs_remount_cleanup()
    * mount_opt parameter of btrfs_check_mountopts_zoned()
    * mount_opt and opt parameters of check_ro_option()
Fixes: 32e62165 ("btrfs: introduce new "rescue=ignoresuperflags" mount option") Signed-off-by:
Qu Wenruo <wqu@suse.com> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
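A stand-alone C illustration of the 32-bit issue above (not the btrfs code itself): on a 32-bit target, unsigned long is 32 bits, so a flag defined past bit 31 cannot be represented, while unsigned long long always provides 64 bits:
```
#include <stdio.h>

int main(void)
{
	/* On a 32-bit system, (1UL << 32) shifts by the full width of the
	 * type, which is undefined and is what made a flag like
	 * BTRFS_MOUNT_IGNORESUPERFLAGS break the build there. */
	unsigned long long wide_flag = 1ULL << 32;

	printf("sizeof(unsigned long)      = %zu\n", sizeof(unsigned long));
	printf("sizeof(unsigned long long) = %zu\n", sizeof(unsigned long long));
	printf("1ULL << 32                 = %llu\n", wide_flag);
	return 0;
}
```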
-
- Jul 18, 2024
-
-
Kent Overstreet authored
When we're called via trans commit -> btree split -> allocator, we may already have arbitrarily many btree_paths for the transaction commit we're trying to do; when this happens, the btree_trans_too_many_iters() call causes us to livelock. Since the allocator calls btree_iter_dontneed to release paths as it iterates, this shouldn't cause any problems. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Tavian Barnes authored
Shifting a value by the width of its type or more is undefined. Signed-off-by:
Tavian Barnes <tavianator@tavianator.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
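For context, a minimal stand-alone example of the undefined behaviour being avoided (not the bcachefs code itself): shifting a value by its type's width or more is undefined in C.
```
#include <stdint.h>

uint32_t shift_undefined(uint32_t v, unsigned int n)
{
	/* Undefined if n >= 32: shift count >= width of uint32_t. */
	return v << n;
}

uint32_t shift_guarded(uint32_t v, unsigned int n)
{
	/* One common guard: handle out-of-range counts explicitly. */
	return n >= 32 ? 0 : v << n;
}
```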
-
Kent Overstreet authored
We can't have more updates than paths, so btree_path_idx_t is the correct type to use. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
If a btree_trans is in use it's supposed to be passed to fsck_err so that it can be unlocked if we're waiting on userspace input; but the btree IO paths do call fsck errors where a btree_trans exists on the stack but it's not passed through. But it's ok, because it's unlocked while doing IO. Fixes: a850bde6 ("bcachefs: fsck_err() may now take a btree_trans") Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This causes us to go read-only - need an error message saying why. Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
-
Tavian Barnes authored
Shifting a negative value left is undefined. Signed-off-by:
Tavian Barnes <tavianator@tavianator.com> Signed-off-by:
Kent Overstreet <kent.overstreet@linux.dev>
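Likewise, a minimal stand-alone example of the undefined behaviour in question (not the bcachefs code itself): left-shifting a negative value is undefined in C.
```
#include <stdint.h>

int64_t negative_shift_undefined(int64_t v)
{
	/* Undefined whenever v is negative. */
	return v << 1;
}

int64_t negative_shift_defined(int64_t v)
{
	/* Shift the unsigned representation instead; converting back is
	 * implementation-defined rather than undefined. */
	return (int64_t)((uint64_t)v << 1);
}
```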
-
Christian Brauner authored
Ensure that rcu read lock is given up before returning. Link: https://lore.kernel.org/r/20240716-elixier-fliesen-1ab342151a61@brauner Fixes: ca567df7 ("nsfs: add pid translation ioctls") Reported-by:
<syzbot+a3e82ae343b26b4d2335@syzkaller.appspotmail.com> Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Jeff Johnson authored
Fix the 'make W=1' issue: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/adfs/adfs.o Signed-off-by:
Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240523-md-adfs-v1-1-364268e38370@quicinc.com Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Donet Tom authored
generic_hugetlb_get_unmapped_area() was returning an address less than mmap_min_addr if the mmap argument addr, after alignment, was less than mmap_min_addr, causing mmap to fail. This is because current generic_hugetlb_get_unmapped_area() code does not take into account mmap_min_addr. This patch ensures that generic_hugetlb_get_unmapped_area() always returns an address that is greater than mmap_min_addr. Additionally, similar to generic_get_unmapped_area(), vm_end_gap() checks are included to maintain stack gap.

How to reproduce
================

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (16 * 1024 * 1024)

int main()
{
	void *addr = mmap((void *)-1, HUGEPAGE_SIZE,
			  PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	snprintf((char *)addr, HUGEPAGE_SIZE, "Hello, Huge Pages!");
	printf("%s\n", (char *)addr);
	if (munmap(addr, HUGEPAGE_SIZE) == -1) {
		perror("munmap");
		exit(EXIT_FAILURE);
	}
	return 0;
}

Result without fix
==================
# cat /proc/meminfo |grep -i HugePages_Free
HugePages_Free: 20
# ./test
mmap: Permission denied
#

Result with fix
===============
# cat /proc/meminfo |grep -i HugePages_Free
HugePages_Free: 20
# ./test
Hello, Huge Pages!
#

Link: https://lkml.kernel.org/r/20240710051912.4681-1-donettom@linux.ibm.com Signed-off-by:
Donet Tom <donettom@linux.ibm.com> Reported-by: Pavithra Prakash <pavrampu@linux.vnet.ibm.com> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- Jul 17, 2024
-
-
Christoph Hellwig authored
nfs_read_folio is a bit hard to follow because it mixes high-level logic with the actual data read. Split the latter into a helper and update the comments to be more accurate. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Christoph Hellwig authored
nfs_folio_length is unsafe to use without holding the folio lock and without a check for a NULL ->f_mapping that protects against truncation, and its use can lead to kernel crashes, e.g. when running xfstests generic/065 with all nfs trace points enabled. Follow the model of the XFS trace points and pass in an explicit offset and length. This has the additional benefit that these values can be more accurate, as some of the users touch partial folio ranges. Fixes: eb5654b3 ("NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()") Reported-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Jiri Pirko authored
Since the original virtio_find_vqs() is no longer present, rename virtio_find_vqs_info() back to virtio_find_vqs(). Signed-off-by:
Jiri Pirko <jiri@nvidia.com> Message-Id: <20240708074814.1739223-20-jiri@resnulli.us> Signed-off-by:
Michael S. Tsirkin <mst@redhat.com>
-
Jiri Pirko authored
Instead of passing separate names and callbacks arrays to virtio_find_vqs(), allocate a single array of virtual_queue_info structs and pass it to virtio_find_vqs_info(). Suggested-by:
Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by:
Jiri Pirko <jiri@nvidia.com> Message-Id: <20240708074814.1739223-16-jiri@resnulli.us> Signed-off-by:
Michael S. Tsirkin <mst@redhat.com>
-