  1. Jul 20, 2024
  2. Jul 19, 2024
    • cifs: Add a tracepoint to track credits involved in R/W requests · 519be989
      David Howells authored
      
      Add a tracepoint to track the credit changes and server in_flight value
      involved in the lifetime of a R/W request, logging it against the
      request/subreq debugging ID.  This requires the debugging IDs to be
      recorded in the cifs_credits struct.
      
      The tracepoint can be enabled with:
      
      	echo 1 >/sys/kernel/debug/tracing/events/cifs/smb3_rw_credits/enable
      
      Also add a three-state flag to struct cifs_credits to note if we're
      interested in determining when the in_flight contribution ends and, if so,
      to track whether we've decremented the contribution yet.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: Fix setting of zero_point after DIO write · 61ea6b3a
      David Howells authored
      At the moment, at the end of a DIO write, cifs calls netfs_resize_file() to
      adjust the size of the file if it needs it.  This will reduce the
      zero_point (the point above which we assume a read will just return zeros)
      if it's more than the new i_size, but won't increase it.
      
      With DIO writes, however, we definitely want to increase it as we have
      clobbered the local pagecache and then written some data that's not
      available locally.
      
      Fix cifs to make the zero_point above the end of a DIO or unbuffered write.
      
      This fixes corruption seen occasionally with the generic/708 xfstest.  In
      that case, the read-back of some of the written data is being
      short-circuited and replaced with zeroes.
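
      The intended behaviour can be sketched in user-space C. This is only a
      model under stated assumptions: struct file_state, resize_file() and
      dio_write_done() are hypothetical names, not the cifs or netfslib code.
      It shows a shrink-only resize followed by an explicit raise of
      zero_point to the end of the written range.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-inode state discussed above. */
struct file_state {
    long long i_size;      /* current file size */
    long long zero_point;  /* reads above this offset are assumed to return zeros */
};

/* Shrink-only adjustment, as the message describes for the resize step. */
static void resize_file(struct file_state *f, long long new_size)
{
    f->i_size = new_size;
    if (f->zero_point > new_size)
        f->zero_point = new_size;   /* may reduce, never increases */
}

/* After a DIO or unbuffered write, also raise zero_point past the new data. */
static void dio_write_done(struct file_state *f, long long start, long long len)
{
    long long end = start + len;

    if (end > f->i_size)
        resize_file(f, end);
    if (f->zero_point < end)
        f->zero_point = end;        /* data below 'end' now exists only on the server */
}
```

      Without the last step, a later read of the freshly written range could
      be treated as lying above zero_point and be filled with zeros locally.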
      
      Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib")
      Cc: stable@vger.kernel.org
      Reported-by: Steve French <sfrench@samba.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: Fix missing error code set · d2c5eb57
      David Howells authored
      In cifs_strict_readv(), the default rc (-EACCES) is accidentally cleared by
      a successful return from netfs_start_io_direct(), such that if
      cifs_find_lock_conflict() fails, we don't return an error.
      
      Fix this by resetting the default error code.
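
      The pattern can be illustrated with a small stand-alone C model. The
      helpers start_io_direct() and find_lock_conflict() below are
      hypothetical stand-ins, and the reset_rc switch exists only so both the
      buggy and the fixed flow can be compared; none of this is the actual
      cifs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real helpers. */
static int start_io_direct(void) { return 0; }                      /* always succeeds */
static bool find_lock_conflict(bool conflict) { return conflict; }  /* true = conflict found */

static int strict_read(bool conflict, bool reset_rc)
{
    int rc = -EACCES;           /* default error code */

    rc = start_io_direct();     /* a successful return overwrites the default with 0 */
    if (rc < 0)
        return rc;

    if (reset_rc)
        rc = -EACCES;           /* the fix: restore the default before the check */

    if (!find_lock_conflict(conflict))
        rc = 0;                 /* stands in for actually performing the read */
    return rc;
}
```

      With reset_rc false, a lock conflict returns 0 even though no read was
      done; with reset_rc true, the same case returns -EACCES.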
      
      Fixes: 14b1cd25 ("cifs: Fix locking in cifs_strict_readv()")
      Cc: stable@vger.kernel.org
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: Fix server re-repick on subrequest retry · de40579b
      David Howells authored
      When a subrequest is marked for needing retry, netfs will call
      cifs_prepare_write() which will make cifs repick the server for the op
      before renegotiating credits; it then calls cifs_issue_write() which
      invokes smb2_async_writev() - which re-repicks the server.
      
      If a different server is then selected, this causes the increment of
      server->in_flight to happen against one record and the decrement to happen
      against another, leading to misaccounting.
      
      Fix this by just removing the repick code in smb2_async_writev().  As this
      is only called from netfslib-driven code, cifs_prepare_write() should
      always have been called first, and so server should never be NULL and the
      preparatory step is repeated in the event that we do a retry.
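
      A toy model of the misaccounting, assuming one in_flight counter per
      channel record. struct server here, pick(), prepare_write() and
      issue_and_complete() are hypothetical simplifications, and where
      exactly the real code increments and decrements is not modelled.

```c
#include <assert.h>

struct server { int in_flight; };   /* hypothetical: one record per channel */

/* Stand-in for channel selection; the result may differ between calls. */
static struct server *pick(struct server *chans, int idx)
{
    return &chans[idx];
}

/* Preparatory step: pick a server and account the request against it. */
static struct server *prepare_write(struct server *chans, int idx)
{
    struct server *s = pick(chans, idx);

    s->in_flight++;
    return s;
}

/*
 * Issue and completion. If repick_idx >= 0 the server is picked again, so
 * the decrement can land on a different record than the increment did.
 */
static void issue_and_complete(struct server *chans, struct server *prepared,
                               int repick_idx)
{
    struct server *s = repick_idx >= 0 ? pick(chans, repick_idx) : prepared;

    s->in_flight--;
}
```

      Picking again at issue time leaves one record at +1 and another at -1;
      reusing the server chosen in the preparatory step keeps both at 0.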
      
      The problem manifests as a warning looking something like:
      
       WARNING: CPU: 4 PID: 72896 at fs/smb/client/smb2ops.c:97 smb2_add_credits+0x3f0/0x9e0 [cifs]
       ...
       RIP: 0010:smb2_add_credits+0x3f0/0x9e0 [cifs]
       ...
        smb2_writev_callback+0x334/0x560 [cifs]
        cifs_demultiplex_thread+0x77a/0x11b0 [cifs]
        kthread+0x187/0x1d0
        ret_from_fork+0x34/0x60
        ret_from_fork_asm+0x1a/0x30
      
      The warning may be triggered by a number of different xfstests running
      against an Azure server in multichannel mode.  generic/249 seems the most
      repeatable, but generic/215 and generic/308 may also show it.
      
      Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib")
      Cc: stable@vger.kernel.org
      Reported-by: Steve French <smfrench@gmail.com>
      Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
      Acked-by: Tom Talpey <tom@talpey.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Aurelien Aptel <aaptel@suse.com>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • cifs: fix noisy message on copy_file_range · ae4ccca4
      Steve French authored
      
      There are common cases where copy_file_range can noisily
      log "source and target of copy not on same server",
      e.g. the mv command across mounts of shares on two different servers.
      Change this to informational rather than logging as an error.
      
      A follow-on patch will add dynamic trace points, e.g. for
      cifs_file_copychunk_range.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
  3. Jul 13, 2024
    • cifs: fix setting SecurityFlags to true · d2346e28
      Steve French authored
      
      If you try to set /proc/fs/cifs/SecurityFlags to 1, it
      will set them to CIFSSEC_MUST_NTLMV2, which is no longer
      relevant (the less secure ones like lanman have been removed
      from cifs.ko), is also missing some flags (like for
      signing and encryption), and can even cause mount to fail.
      Change this to set it to Kerberos in this case.
      
      Also change the description of the SecurityFlags to remove mention
      of flags which are no longer supported.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
  4. Jul 11, 2024
    • bcachefs: bch2_gc_btree() should not use btree_root_lock · 1841027c
      Kent Overstreet authored
      
      btree_root_lock is for the root keys in btree_root, not the pointers to
      the nodes themselves; this fixes a lock ordering issue between
      btree_root_lock and btree node locks.
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Set PF_MEMALLOC_NOFS when trans->locked · f236ea4b
      Kent Overstreet authored
      
      proper lock ordering is: fs_reclaim -> btree node locks
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Use trans_unlock_long() when waiting on allocator · f0f3e511
      Kent Overstreet authored
      
      not using unlock_long() blocks key cache reclaim, and the allocator may
      take a while
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT" · aacd897d
      Kent Overstreet authored
      This reverts commit 86d81ec5.
      
      This wasn't tested with memcg enabled; it immediately hits a null ptr
      deref in list_lru_add().
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • btrfs: avoid races when tracking progress for extent map shrinking · 44849405
      Filipe Manana authored
      
      We store the progress (root and inode numbers) of the extent map shrinker
      in fs_info without any synchronization but we can have multiple tasks
      calling into the shrinker during memory allocations when there's enough
      memory pressure for example.
      
      This can result in a task A reading fs_info->extent_map_shrinker_last_ino
      after another task B updates it, and task A reading
      fs_info->extent_map_shrinker_last_root before task B updates it, making
      task A see an odd state that isn't necessarily harmful but may make it
      skip certain inode ranges or do more work than necessary by going over
      the same inodes again. These unprotected accesses would also trigger
      warnings from tools like KCSAN.
      
      So add a lock to protect access to these progress fields.
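
      As a user-space analogy, the two progress fields can be read and
      written as a pair under one lock. This sketch uses a pthread mutex and
      hypothetical names (struct shrinker_progress, progress_store(),
      progress_load()); the lock type and field layout used in btrfs are not
      stated in this message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the shrinker progress kept in fs_info. */
struct shrinker_progress {
    pthread_mutex_t lock;   /* protects both fields below */
    uint64_t last_root;
    uint64_t last_ino;
};

/* Update both fields atomically with respect to other lock holders. */
static void progress_store(struct shrinker_progress *p, uint64_t root, uint64_t ino)
{
    pthread_mutex_lock(&p->lock);
    p->last_root = root;
    p->last_ino = ino;
    pthread_mutex_unlock(&p->lock);
}

/* Read a consistent (root, ino) pair, never a mix of old and new values. */
static void progress_load(struct shrinker_progress *p, uint64_t *root, uint64_t *ino)
{
    pthread_mutex_lock(&p->lock);
    *root = p->last_root;
    *ino = p->last_ino;
    pthread_mutex_unlock(&p->lock);
}
```

      Because both accessors take the same lock, task A can no longer observe
      the new inode number together with the old root number.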
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop extent map shrinker if reschedule is needed · b3ebb9b7
      Filipe Manana authored
      
      The extent map shrinker can be called in a variety of contexts where we
      are under memory pressure, and one of them is when a task is trying to
      allocate memory. For this reason the shrinker is typically called with a
      value of struct shrink_control::nr_to_scan that is much smaller than what
      we return in the nr_cached_objects callback of struct super_operations
      (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
      not take a long time and cause high latencies. However we can still take
      a lot of time in the shrinker even for a limited amount of nr_to_scan:
      
      1) When traversing the red black tree that tracks open inodes in a root,
         as for example with millions of open inodes we get a deep tree which
         takes time searching for an inode;
      
      2) Iterating over the extent map tree, which is a red black tree, of an
         inode when doing the rb_next() calls and when removing an extent map
         from the tree, since often that requires rebalancing the red black
         tree;
      
      3) When trying to write lock an inode's extent map tree we may wait for a
         significant amount of time, because there's either another task about
         to do IO and searching for an extent map in the tree or inserting an
         extent map in the tree, and we can have thousands or even millions of
         extent maps for an inode. Furthermore, there can be concurrent calls
         to the shrinker so the lock might be busy simply because there is
         already another task shrinking extent maps for the same inode;
      
      4) We often reschedule if we need to, which further increases latency.
      
      So improve on this by stopping the extent map shrinking code whenever we
      need to reschedule and make it skip an inode if we can't immediately lock
      its extent map tree.
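
      The two behaviours (stop when a reschedule is needed, skip an inode
      whose tree lock is busy) can be sketched in user space as follows. All
      names are hypothetical, a pthread mutex stands in for the extent map
      tree lock, and the need_resched callback stands in for the kernel's
      reschedule check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct inode_model {            /* hypothetical stand-in for an inode */
    pthread_mutex_t tree_lock;  /* protects its extent map tree */
    int nr_maps;                /* extent maps that could be dropped */
};

static bool never_resched(void)  { return false; }
static bool always_resched(void) { return true; }

/* Returns how many maps were dropped before stopping. */
static int scan_inodes(struct inode_model *inodes, size_t n,
                       bool (*need_resched)(void))
{
    int freed = 0;

    for (size_t i = 0; i < n; i++) {
        if (need_resched())
            break;                      /* stop instead of adding latency */
        if (pthread_mutex_trylock(&inodes[i].tree_lock) != 0)
            continue;                   /* lock is busy: skip this inode */
        freed += inodes[i].nr_maps;
        inodes[i].nr_maps = 0;
        pthread_mutex_unlock(&inodes[i].tree_lock);
    }
    return freed;
}
```

      The scan never sleeps on a contended lock and gives up as soon as the
      caller signals that it should yield, which bounds the time spent per
      invocation.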
      
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use delayed iput during extent map shrinking · 68a3ebd1
      Filipe Manana authored
      When putting an inode during extent map shrinking we're doing a standard
      iput() but that may take a long time in case the inode is dirty and we are
      doing the final iput that triggers eviction - the VFS will have to wait
      for writeback before calling the btrfs evict callback (see
      fs/inode.c:evict()).
      
      This slows down the task running the shrinker which may have been
      triggered while updating some tree for example, meaning locks are held
      as well as an open transaction handle.
      
      Also if the iput() ends up triggering eviction and the inode has no links
      anymore, then we trigger item truncation which requires flushing delayed
      items, space reservation to start a transaction and that may trigger the
      space reclaim task and wait for it, resulting in deadlocks in case the
      reclaim task needs for example to commit a transaction and the shrinker
      is being triggered from a path holding a transaction handle.
      
      Syzbot reported such a case with the following stack traces:
      
         ======================================================
         WARNING: possible circular locking dependency detected
         6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted
         ------------------------------------------------------
         kswapd0/111 is trying to acquire lock:
         ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
      
         but task is already holding lock:
         ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #3 (fs_reclaim){+.+.}-{0:0}:
                __fs_reclaim_acquire mm/page_alloc.c:3783 [inline]
                fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797
                might_alloc include/linux/sched/mm.h:334 [inline]
                slab_pre_alloc_hook mm/slub.c:3890 [inline]
                slab_alloc_node mm/slub.c:3980 [inline]
                kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019
                btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
                alloc_inode+0x5d/0x230 fs/inode.c:261
                iget5_locked fs/inode.c:1235 [inline]
                iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
                btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
                btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
                btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
                create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911
                btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114
                btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373
                __btrfs_balance fs/btrfs/volumes.c:4157 [inline]
                btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534
                btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline]
                btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742
                __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
                join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
                start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
                btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
                btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
                open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
                btrfs_fill_super fs/btrfs/super.c:946 [inline]
                btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
                btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                fc_mount+0x16/0xc0 fs/namespace.c:1125
                btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
                btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                do_new_mount fs/namespace.c:3352 [inline]
                path_mount+0x6e1/0x1f10 fs/namespace.c:3679
                do_mount fs/namespace.c:3692 [inline]
                __do_sys_mount fs/namespace.c:3898 [inline]
                __se_sys_mount fs/namespace.c:3875 [inline]
                __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
                join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314
                start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
                btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
                btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
                open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
                btrfs_fill_super fs/btrfs/super.c:946 [inline]
                btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
                btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                fc_mount+0x16/0xc0 fs/namespace.c:1125
                btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
                btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                do_new_mount fs/namespace.c:3352 [inline]
                path_mount+0x6e1/0x1f10 fs/namespace.c:3679
                do_mount fs/namespace.c:3692 [inline]
                __do_sys_mount fs/namespace.c:3898 [inline]
                __se_sys_mount fs/namespace.c:3875 [inline]
                __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #0 (sb_internal#3){.+.+}-{0:0}:
                check_prev_add kernel/locking/lockdep.c:3134 [inline]
                check_prevs_add kernel/locking/lockdep.c:3253 [inline]
                validate_chain kernel/locking/lockdep.c:3869 [inline]
                __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
                lock_acquire kernel/locking/lockdep.c:5754 [inline]
                lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
                percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
                __sb_start_write include/linux/fs.h:1655 [inline]
                sb_start_intwrite include/linux/fs.h:1838 [inline]
                start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
                btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
                btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
                evict+0x2ed/0x6c0 fs/inode.c:667
                iput_final fs/inode.c:1741 [inline]
                iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
                iput+0x5c/0x80 fs/inode.c:1757
                btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
                btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
                super_cache_scan+0x409/0x550 fs/super.c:227
                do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
                shrink_slab+0x18a/0x1310 mm/shrinker.c:662
                shrink_one+0x493/0x7c0 mm/vmscan.c:4790
                shrink_many mm/vmscan.c:4851 [inline]
                lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
                shrink_node mm/vmscan.c:5910 [inline]
                kswapd_shrink_node mm/vmscan.c:6720 [inline]
                balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
                kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
                kthread+0x2c1/0x3a0 kernel/kthread.c:389
                ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
                ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
         other info that might help us debug this:
      
         Chain exists of:
           sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim
      
          Possible unsafe locking scenario:
      
                CPU0                    CPU1
                ----                    ----
           lock(fs_reclaim);
                                        lock(btrfs_trans_num_extwriters);
                                        lock(fs_reclaim);
           rlock(sb_internal#3);
      
          *** DEADLOCK ***
      
         2 locks held by kswapd0/111:
          #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
          #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline]
          #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196
      
         stack backtrace:
         CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
         Call Trace:
          <TASK>
          __dump_stack lib/dump_stack.c:88 [inline]
          dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
          check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
          check_prev_add kernel/locking/lockdep.c:3134 [inline]
          check_prevs_add kernel/locking/lockdep.c:3253 [inline]
          validate_chain kernel/locking/lockdep.c:3869 [inline]
          __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
          lock_acquire kernel/locking/lockdep.c:5754 [inline]
          lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
          percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
          __sb_start_write include/linux/fs.h:1655 [inline]
          sb_start_intwrite include/linux/fs.h:1838 [inline]
          start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
          btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
          btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
          evict+0x2ed/0x6c0 fs/inode.c:667
          iput_final fs/inode.c:1741 [inline]
          iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
          iput+0x5c/0x80 fs/inode.c:1757
          btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
          btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
          super_cache_scan+0x409/0x550 fs/super.c:227
          do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
          shrink_slab+0x18a/0x1310 mm/shrinker.c:662
          shrink_one+0x493/0x7c0 mm/vmscan.c:4790
          shrink_many mm/vmscan.c:4851 [inline]
          lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
          shrink_node mm/vmscan.c:5910 [inline]
          kswapd_shrink_node mm/vmscan.c:6720 [inline]
          balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
          kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
          kthread+0x2c1/0x3a0 kernel/kthread.c:389
          ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
          </TASK>
      
      So fix this by using btrfs_add_delayed_iput() so that the final iput is
      delegated to the cleaner kthread.
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/
      Reported-by: <syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com>
      Fixes: 956a17d9 ("btrfs: add a shrinker for extent maps")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. Jul 10, 2024
  6. Jul 06, 2024
    • hfsplus: fix uninit-value in copy_name · 0570730c
      Edward Adam Davis authored
      
      [syzbot reported]
      BUG: KMSAN: uninit-value in sized_strscpy+0xc4/0x160
       sized_strscpy+0xc4/0x160
       copy_name+0x2af/0x320 fs/hfsplus/xattr.c:411
       hfsplus_listxattr+0x11e9/0x1a50 fs/hfsplus/xattr.c:750
       vfs_listxattr fs/xattr.c:493 [inline]
       listxattr+0x1f3/0x6b0 fs/xattr.c:840
       path_listxattr fs/xattr.c:864 [inline]
       __do_sys_listxattr fs/xattr.c:876 [inline]
       __se_sys_listxattr fs/xattr.c:873 [inline]
       __x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873
       x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Uninit was created at:
       slab_post_alloc_hook mm/slub.c:3877 [inline]
       slab_alloc_node mm/slub.c:3918 [inline]
       kmalloc_trace+0x57b/0xbe0 mm/slub.c:4065
       kmalloc include/linux/slab.h:628 [inline]
       hfsplus_listxattr+0x4cc/0x1a50 fs/hfsplus/xattr.c:699
       vfs_listxattr fs/xattr.c:493 [inline]
       listxattr+0x1f3/0x6b0 fs/xattr.c:840
       path_listxattr fs/xattr.c:864 [inline]
       __do_sys_listxattr fs/xattr.c:876 [inline]
       __se_sys_listxattr fs/xattr.c:873 [inline]
       __x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873
       x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      [Fix]
      When allocating memory to strbuf, initialize memory to 0.
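
      A user-space analogy of the fix, assuming a zero-filling allocator is
      what is meant: calloc() plays the role that a zeroing kernel allocation
      such as kzalloc() would. alloc_name_buf() is a hypothetical helper, and
      this copy_name() is a simplified bounded copy, not the hfsplus
      function.

```c
#include <assert.h>
#include <stdlib.h>

/* Zero-filled allocation: the buffer is a valid empty string from the start. */
static char *alloc_name_buf(size_t len)
{
    return calloc(1, len);
}

/*
 * Bounded copy that always NUL-terminates. It reads src until a NUL byte or
 * the limit, so src must not contain uninitialized bytes in that range.
 * len must be at least 1.
 */
static size_t copy_name(char *dst, const char *src, size_t len)
{
    size_t n = 0;

    while (n + 1 < len && src[n] != '\0') {
        dst[n] = src[n];
        n++;
    }
    dst[n] = '\0';
    return n;
}
```

      With malloc() in place of calloc(), the loop above would inspect
      indeterminate bytes whenever the buffer was not fully written first,
      which is the kind of read KMSAN flags.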
      
      Reported-and-tested-by: <syzbot+efde959319469ff8d4d7@syzkaller.appspotmail.com>
      Signed-off-by: Edward Adam Davis <eadavis@qq.com>
      Link: https://lore.kernel.org/r/tencent_8BBB6433BC9E1C1B7B4BDF1BF52574BA8808@qq.com
      Reported-and-tested-by: <syzbot+01ade747b16e9c8030e0@syzkaller.appspotmail.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  7. Jul 05, 2024
  8. Jul 04, 2024
  9. Jul 03, 2024
    • btrfs: fix folio refcount in __alloc_dummy_extent_buffer() · a56c85fa
      Boris Burkov authored
      Another improper use of __folio_put() in an error path after freshly
      allocating pages/folios, which are returned with the refcount initialized
      to 1. The refactor from __free_pages() -> __folio_put() (instead of
      folio_put()) removed a refcount decrement found in __free_pages() and
      folio_put() but absent from __folio_put().
      
      Fixes: 13df3775 ("btrfs: cleanup metadata page pointer usage")
      CC: stable@vger.kernel.org # 6.8+
      Tested-by: Ed Tomlinson <edtoml@gmail.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix folio refcount in btrfs_do_encoded_write() · da0386c1
      Boris Burkov authored
      The conversion to folios switched __free_page() to __folio_put() in the
      error path in btrfs_do_encoded_write().
      
      However, this gets the page refcounting wrong. If we do hit that error
      path (I reproduced by modifying btrfs_do_encoded_write to pretend to
      always fail in a way that jumps to out_folios and running the fstests
      case btrfs/281), then we always hit the following BUG freeing the folio:
      
        BUG: Bad page state in process btrfs  pfn:40ab0b
        page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
         flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
        raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
        page dumped because: nonzero _refcount
        Call Trace:
        <TASK>
        dump_stack_lvl+0x3d/0xe0
        bad_page+0xea/0xf0
        free_unref_page+0x8e1/0x900
        ? __mem_cgroup_uncharge+0x69/0x90
        __folio_put+0xe6/0x190
        btrfs_do_encoded_write+0x445/0x780
        ? current_time+0x25/0xd0
        btrfs_do_write_iter+0x2cc/0x4b0
        btrfs_ioctl_encoded_write+0x2b6/0x340
      
      It turns out __free_page() decreases the page reference count while
      __folio_put() does not. Switch __folio_put() to folio_put() which
      decreases the folio reference count first.
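
      The difference between the two calls can be modelled with a plain
      counter. struct folio_model, model_free_no_dec() and model_put() are
      hypothetical; they only mirror the behaviour described above (one path
      frees without decrementing, the other decrements first) and are not the
      mm implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct folio_model {    /* hypothetical stand-in for a folio */
    int refcount;
    bool freed;
    bool bad_state;     /* set when freed with a nonzero refcount */
};

/* Models a free that does NOT drop a reference: expects the count to be 0. */
static void model_free_no_dec(struct folio_model *f)
{
    if (f->refcount != 0)
        f->bad_state = true;    /* analogous to "nonzero _refcount" */
    f->freed = true;
}

/* Models a put: drop one reference and free only when it reaches zero. */
static void model_put(struct folio_model *f)
{
    if (--f->refcount == 0)
        model_free_no_dec(f);
}
```

      A freshly allocated folio starts with a count of 1, so calling the
      no-decrement path directly trips the bad-state check, while the put
      path brings the count to 0 before freeing.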
      
      Fixes: 400b172b ("btrfs: compression: migrate compression/decompression paths to folios")
      Tested-by: Ed Tomlinson <edtoml@gmail.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • nilfs2: fix incorrect inode allocation from reserved inodes · 93aef9ed
      Ryusuke Konishi authored
      If the bitmap block that manages the inode allocation status is corrupted,
      nilfs_ifile_create_inode() may allocate a new inode from the reserved
      inode area where it should not be allocated.
      
      Previous fix commit d325dc6e ("nilfs2: fix use-after-free bug of
      struct nilfs_root") fixed the problem that reserved inodes with inode
      numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
      bitmap corruption.  However, the start number of non-reserved inodes is
      read from the super block and may change, in which case inode allocation
      may occur from the extended reserved inode area.
      
      If that happens, access to that inode will cause an IO error, causing the
      file system to degrade to an error state.
      
      Fix this potential issue by adding a wraparound option to the common
      metadata object allocation routine and by modifying
      nilfs_ifile_create_inode() to disable the option so that it only allocates
      inodes with inode numbers greater than or equal to the inode number read
      in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
      inodes.
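
      The effect of such a wraparound option can be illustrated with a minimal
      userspace sketch.  This is not the nilfs2 code: find_free_slot(), its
      parameters, and the flat byte-array bitmap are hypothetical names chosen
      for illustration.  With wrap disabled, the search never falls back to
      indices below the start position, even if a corrupted bitmap marks them
      as free.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Return the index of the first clear bit in [start, nbits).
 * If none is found and wrap is true, continue the search in [0, start).
 * Return -1 when no clear bit is found in the searched range.
 */
static long find_free_slot(const unsigned char *bitmap, size_t nbits,
                           size_t start, bool wrap)
{
    size_t i;

    /* First pass: from the start position to the end of the bitmap. */
    for (i = start; i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return (long)i;

    /* Without wraparound, slots below start are never candidates. */
    if (!wrap)
        return -1;

    /* Second pass: wrap around to the beginning of the bitmap. */
    for (i = 0; i < start && i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return (long)i;

    return -1;
}
```

      In this sketch, a caller that must stay out of a reserved range passes
      the first non-reserved index as start and wrap = false, and treats -1 as
      "no space" rather than retrying from index 0.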
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: add missing check for inode numbers on directory entries · bb76c6c2
      Ryusuke Konishi authored
      Syzbot reported that mounting and unmounting a specific pattern of
      corrupted nilfs2 filesystem images causes a use-after-free of metadata
      file inodes, which triggers a kernel bug in lru_add_fn().
      
      As Jan Kara pointed out, this is because the link count of a metadata file
      gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
      tries to delete that inode (ifile inode in this case).
      
      The inconsistency occurs because directories containing the inode numbers
      of these metadata files that should not be visible in the namespace are
      read without checking.
      
      Fix this issue by treating the inode numbers of these internal files as
      errors in the sanity check helper when reading directory folios/pages.
      
      Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
      analysis.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
      Reported-by: Jan Kara <jack@suse.cz>
      Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix inode number range checks · e2fec219
      Ryusuke Konishi authored
      Patch series "nilfs2: fix potential issues related to reserved inodes".
      
      This series fixes one use-after-free issue reported by syzbot, caused by
      nilfs2's internal inode being exposed in the namespace on a corrupted
      filesystem, and a couple of flaws that cause problems if the starting
      number of non-reserved inodes written in the on-disk super block is
      intentionally (or corruptly) changed from its default value.

      This patch (of 3):
      
      In the current implementation of nilfs2, "nilfs->ns_first_ino", which
      gives the first non-reserved inode number, is read from the superblock,
      but its lower limit is not checked.
      
      As a result, if a number that overlaps with the inode number range of
      reserved inodes such as the root directory or metadata files is set in the
      super block parameter, the inode number test macros (NILFS_MDT_INODE and
      NILFS_VALID_INODE) will not function properly.
      
      In addition, these test macros use left bit-shift calculations with
      the inode number as the ...
    • cachefiles: add missing lock protection when polling · cf5bb09e
      Jingbo Xu authored
      Add the missing lock protection in the poll routine when iterating the
      xarray.

      Even with the RCU read lock held, only the slot of the radix tree is
      guaranteed to be pinned; the data structure (e.g. struct
      cachefiles_req) stored in the slot has no such guarantee.  The poll
      routine iterates the radix tree and dereferences cachefiles_req
      accordingly, so the RCU read lock is not adequate in this case and a
      spinlock is needed.
      
      Fixes: b817e22b ("cachefiles: narrow the scope of triggering EPOLLIN events in ondemand mode")
      Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Link: https://lore.kernel.org/r/20240628062930.2467993-10-libaokun@huaweicloud.com
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
      Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • cachefiles: cyclic allocation of msg_id to avoid reuse · 19f4f399
      Baokun Li authored
      Reusing the msg_id after a maliciously completed reopen request may cause
      a read request to remain unprocessed and result in a hang, as shown below:
      
             t1       |      t2       |      t3
      -------------------------------------------------
      cachefiles_ondemand_select_req
       cachefiles_ondemand_object_is_close(A)
       cachefiles_ondemand_set_object_reopening(A)
       queue_work(fscache_object_wq, &info->work)
                      ondemand_object_worker
                       cachefiles_ondemand_init_object(A)
                        cachefiles_ondemand_send_req(OPEN)
                          // get msg_id 6
                          wait_for_completion(&req_A->done)
      cachefiles_ondemand_daemon_read
       // read msg_id 6 req_A
       cachefiles_ondemand_get_fd
       copy_to_user
                                      // Malicious completion msg_id 6
                                      copen 6,-1
                                      cachefiles_ondemand_copen
                                       complete(&req_A->done)
                                       // will not set the object to close
                                       // because ondemand_id && fd is valid.
      
                      // ondemand_object_worker() is done
                      // but the object is still reopening.
      
                                      // new open req_B
                                      cachefiles_ondemand_init_object(B)
                                       cachefiles_ondemand_send_req(OPEN)
                                       // reuse msg_id 6
      process_open_req
       copen 6,A.size
       // The expected failed copen was executed successfully
      
      The copen is expected to fail; when it does, the fd is closed, which sets
      the object to close, and the close then triggers reopen again. However,
      because msg_id reuse lets the copen succeed, the anonymous fd is not
      closed until the daemon exits. Read requests waiting for the reopen to
      complete may therefore trigger a hung task.

      To avoid this issue, allocate the msg_id cyclically so that a msg_id is
      not reused within a very short period of time.
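
      The idea of cyclic allocation can be shown with a small userspace sketch.
      It is not the cachefiles implementation: struct id_pool, ID_SPACE,
      id_alloc_cyclic() and id_free() are hypothetical names, and a fixed
      array stands in for whatever ID store the kernel code uses.  The point
      is that the search resumes after the most recently allocated ID, so a
      freed ID is handed out again only once the cursor has gone all the way
      around.

```c
#include <stdbool.h>
#include <stdint.h>

#define ID_SPACE 8 /* hypothetical, deliberately tiny ID space */

struct id_pool {
    bool in_use[ID_SPACE];
    uint32_t next; /* position where the next search starts */
};

/*
 * Allocate an ID, starting the search at p->next and wrapping around
 * at most once.  Return the ID, or -1 if every ID is in use.
 */
static int id_alloc_cyclic(struct id_pool *p)
{
    uint32_t n;

    for (n = 0; n < ID_SPACE; n++) {
        uint32_t id = (p->next + n) % ID_SPACE;

        if (!p->in_use[id]) {
            p->in_use[id] = true;
            /* Advance the cursor past the ID just handed out. */
            p->next = (id + 1) % ID_SPACE;
            return (int)id;
        }
    }
    return -1;
}

/* Release an ID; the cursor is left alone so the ID is not reused at once. */
static void id_free(struct id_pool *p, uint32_t id)
{
    p->in_use[id] = false;
}
```

      With a lowest-free-ID allocator, allocating, freeing, and allocating
      again would return the same ID twice in a row.  In this sketch the
      second allocation returns the next ID instead, which is the property
      the fix relies on to keep a stale completion from matching a new request.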
      
      Fixes: c8383054 ("cachefiles: notify the user daemon when looking up cookie")
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Link: https://lore.kernel.org/r/20240628062930.2467993-9-libaokun@huaweicloud.com
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>