  1. Jul 02, 2024
    • cifs: Fix read-performance regression by dropping readahead expansion · 08f70c0a
      David Howells authored
      cifs_expand_readahead() is causing a performance regression of around 30% by
      causing extra pagecache to be allocated for an inode in the readahead path
      before we begin actually dispatching RPC requests, thereby delaying the
      actual I/O.  The expansion is sized according to the rsize parameter, which
      seems to be 4MiB on my test system; this is a big step up from the first
      requests made by the fio test program.
      
      Simple repro (look at read bandwidth number):
           fio --name=writetest --filename=/xfstest.test/foo --time_based --runtime=60 --size=16M --numjobs=1 --rw=read
      
      Fix this by removing cifs_expand_readahead().  Readahead expansion is
      mostly useful for when we're using the local cache if the local cache has a
      block size greater than PAGE_SIZE, so we can dispense with it when not
      caching.
      
      Fixes: 69c3c023 ("cifs: Implement netfslib hooks")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-mm@kvack.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
  2. Jun 26, 2024
    • xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs · 673cd885
      Darrick J. Wong authored
      xfs_init_new_inode ignores the init_xattrs parameter for filesystems
      that do not have ATTR enabled.  As a result, the first init_xattrs file
      to be created by the kernel will not have an attr fork created to store
      acls.  Storing that first acl will add ATTR to the superblock flags, so
      subsequent files will be created with attr forks.  The overhead of this
      is so small that chances are that nobody has noticed this behavior.
      
      However, this is disastrous on a filesystem with parent pointers because
      it requires that a new linkable file /must/ have a pre-existing attr
      fork, and the parent pointers code uses init_xattrs to create that fork.
      The preproduction version of mkfs.xfs used to set this, but the V5 sb
      verifier only requires ATTR2, not ATTR.  There is no guard for
      filesystems with (PARENT && !ATTR).
      
      It turns out that I misunderstood the two flags -- ATTR means that we at
      some point created an attr fork to store xattrs in a file; ATTR2
      apparently means only that inodes have dynamic fork offsets or that the
      filesystem was mounted with the "attr2" option.
      
      Fixes: 2442ee15 ("xfs: eager inode attr fork init needs attr feature awareness")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
    • xfs: fix direction in XFS_IOC_EXCHANGE_RANGE · dc5e1cba
      Darrick J. Wong authored
      
      The kernel reads userspace's buffer but does not write it back.
      Therefore this is really an _IOW ioctl.  Change this before 6.10 final
      releases.
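
The direction is encoded in the ioctl command number itself, so switching between _IOWR and _IOW changes the number that userspace must pass. The following is a minimal sketch of that effect, assuming Linux userspace headers; `struct demo_args`, the magic character 'X' and the number 129 are hypothetical placeholders, not the real XFS definitions.

```c
#include <linux/ioctl.h>

/* Hypothetical argument block; not the real XFS structure. */
struct demo_args {
	long long offset;
	long long length;
};

/* The kernel only reads the user buffer ("write" from userspace's view). */
#define DEMO_IOC_IOW	_IOW('X', 129, struct demo_args)
/* The kernel reads the buffer and also writes results back into it. */
#define DEMO_IOC_IOWR	_IOWR('X', 129, struct demo_args)

/* Return the direction bits encoded in an ioctl command number. */
static unsigned int demo_ioctl_dir(unsigned int cmd)
{
	return _IOC_DIR(cmd);
}
```

Because the two macros yield different command numbers, a direction change of this kind is an ABI change, which would explain wanting it in before a final release.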
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
    • xfs: allow unlinked symlinks and dirs with zero size · 1ec9307f
      Darrick J. Wong authored
      For a very very long time, inode inactivation has set the inode size to
      zero before unmapping the extents associated with the data fork.
      Unfortunately, commit 3c6f46ea changed the inode verifier to
      prohibit zero-length symlinks and directories.  If an inode happens to
      get logged in this state and the system crashes before freeing the
      inode, log recovery will also fail on the broken inode.
      
      Therefore, allow zero-size symlinks and directories as long as the link
      count is zero; nobody will be able to open these files by handle so
      there isn't any risk of data exposure.
      
      Fixes: 3c6f46ea ("xfs: sanity check directory inode di_size")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
    • xfs: restrict when we try to align cow fork delalloc to cowextsz hints · 288e1f69
      Darrick J. Wong authored
      xfs/205 produces the following failure when always_cow is enabled:
      
        --- a/tests/xfs/205.out	2024-02-28 16:20:24.437887970 -0800
        +++ b/tests/xfs/205.out.bad	2024-06-03 21:13:40.584000000 -0700
        @@ -1,4 +1,5 @@
         QA output created by 205
         *** one file
        +   !!! disk full (expected)
         *** one file, a few bytes at a time
         *** done
      
      This is the result of overly aggressive attempts to align cow fork
      delalloc reservations to the CoW extent size hint.  Looking at the trace
      data, we're trying to append a single fsblock to the "fred" file.
      Trying to create a speculative post-eof reservation fails because
      there's not enough space.
      
      We then set @prealloc_blocks to zero and try again, but the cowextsz
      alignment code triggers, which expands our request for a 1-fsblock
      reservation into a 39-block reservation.  There's not enough space for
      that, so the whole write fails with ENOSPC even though there's
      sufficient space in the filesystem to allocate the single block that we
      need to land the write.
      
      There are two things wrong here -- first, we shouldn't be attempting
      speculative preallocations beyond what was requested when we're low on
      space.  Second, if we've already computed a posteof preallocation, we
      shouldn't bother trying to align that to the cowextsize hint.
      
      Fix both of these problems by adding a flag that only enables the
      expansion of the delalloc reservation to the cowextsize if we're doing a
      non-extending write, and only if we're not doing an ENOSPC retry.  This
      requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.
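
The gating described above can be sketched as follows. This is a simplified model under stated assumptions: `DEMO_USE_COWEXTSZ`, `demo_reserve_flags` and `demo_reserve_len` are hypothetical names, and the rounding shown is a generic widen-to-alignment calculation rather than the kernel's own arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: caller permits widening to the CoW extent size hint. */
#define DEMO_USE_COWEXTSZ	(1u << 0)

/* Permit alignment only for non-extending writes that are not a retry. */
static unsigned int demo_reserve_flags(bool extending_write, bool enospc_retry)
{
	if (extending_write || enospc_retry)
		return 0;
	return DEMO_USE_COWEXTSZ;
}

/*
 * Number of blocks to reserve for a request of 'len' blocks at block
 * offset 'off', given a hint of 'hint' blocks (0 means no hint).
 */
static uint64_t demo_reserve_len(uint64_t off, uint64_t len,
				 uint64_t hint, unsigned int flags)
{
	uint64_t start, end;

	if (hint == 0 || !(flags & DEMO_USE_COWEXTSZ))
		return len;

	/* Widen [off, off + len) outward to hint-aligned boundaries. */
	start = off - (off % hint);
	end = off + len;
	if (end % hint)
		end += hint - (end % hint);
	return end - start;
}
```

In this model a one-block request on an ENOSPC retry stays a one-block request, so it can still succeed when only a little free space is left.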
      
      I probably should have caught this six years ago when 6ca30729 was
      being reviewed, but oh well.  Update the comments to reflect what the
      code does now.
      
      Fixes: 6ca30729 ("xfs: bmap code cleanup")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
    • xfs: fix freeing speculative preallocations for preallocated files · 610b2916
      Christoph Hellwig authored
      
      xfs_can_free_eofblocks returns false for files that have persistent
      preallocations unless the force flag is passed and there are delayed
      blocks.  This means it won't free delalloc reservations for files
      with persistent preallocations unless the force flag is set, and it
      will also free the persistent preallocations if the force flag is
      set and the file happens to have delayed allocations.
      
      Both of these are bad, so do away with the force flag and always free
      only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
      or APPEND flags set.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
  3. Jun 25, 2024
  4. Jun 24, 2024
    • nfs: drop the incorrect assertion in nfs_swap_rw() · 54e7d598
      Christoph Hellwig authored
      Since commit 2282679f ("mm: submit multipage write for SWP_FS_OPS
      swap-space"), we can plug multiple pages and then unplug them all
      together.  That means iov_iter_count(iter) can be much bigger than
      PAGE_SIZE; it actually equals the total size of the
      iov_iter_npages(iter, INT_MAX) pages in the iterator.
      
      Note this issue has nothing to do with large folios as we don't support
      THP_SWPOUT to non-block devices.
      
      [v-songbaohua@oppo.com: figure out the cause and correct the commit message]
      Link: https://lkml.kernel.org/r/20240618065647.21791-1-21cnbao@gmail.com
      Fixes: 2282679f ("mm: submit multipage write for SWP_FS_OPS swap-space")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/
      Reviewed-by: Martin Wege <martin.l.wege@gmail.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Steve French <sfrench@samba.org>
      Cc: Trond Myklebust <trondmy@kernel.or...
    • ocfs2: fix DIO failure due to insufficient transaction credits · be346c1a
      Jan Kara authored
      The code in ocfs2_dio_end_io_write() estimates the number of necessary
      transaction credits using ocfs2_calc_extend_credits().  This however does
      not take into account that the IO could be arbitrarily large and can
      contain an arbitrary number of extents.
      
      Extent tree manipulations do often extend the current transaction but not
      in all of the cases.  For example if we have only single block extents in
      the tree, ocfs2_mark_extent_written() will end up calling
      ocfs2_replace_extent_rec() all the time and we will never extend the
      current transaction and eventually exhaust all the transaction credits if
      the IO contains many single block extents.  Once that happens a
      WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
      jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
      this error.  This was actually triggered by one of our customers on a
      heavily fragmented OCFS2 filesystem.
      
      To fix the issue make sure the transaction always has enough credits for
      one extent insert before each call of ocfs2_mark_extent_written().
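
The difference between a one-off estimate and a per-extent top-up can be modelled in a few lines. This is a userspace sketch only: `struct demo_handle` and the `demo_*` functions are hypothetical and do not correspond to the jbd2 or OCFS2 APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle. */
struct demo_handle {
	int credits;	/* buffer credits still available */
};

/* Top the handle up so that at least 'need' credits are available. */
static void demo_ensure_credits(struct demo_handle *h, int need)
{
	if (h->credits < need)
		h->credits = need;
}

/*
 * Mark 'nr_extents' extents written, each consuming 'per_extent' credits.
 * Returns 0 on success, -1 if the handle ran out of credits.
 */
static int demo_dio_end_io(int initial_credits, size_t nr_extents,
			   int per_extent, bool ensure_first)
{
	struct demo_handle h = { .credits = initial_credits };

	for (size_t i = 0; i < nr_extents; i++) {
		if (ensure_first)
			demo_ensure_credits(&h, per_extent);
		if (h.credits < per_extent)
			return -1;	/* the kernel would warn and abort here */
		h.credits -= per_extent;
	}
	return 0;
}
```

Without the top-up, the model fails as soon as the number of extents exceeds what the initial estimate covered; with it, the loop completes regardless of how fragmented the IO is.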
      
      Heming Zhao said:
      
      ------
      PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"
      
      PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
        #0 machine_kexec at ffffffff8c069932
        #1 __crash_kexec at ffffffff8c1338fa
        #2 panic at ffffffff8c1d69b9
        #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
        #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
        #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
        #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
        #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
        #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
        #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
      #10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
      #11 dio_complete at ffffffff8c2b9fa7
      #12 do_blockdev_direct_IO at ffffffff8c2bc09f
      #13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
      #14 generic_file_direct_write at ffffffff8c1dcf14
      #15 __generic_file_write_iter at ffffffff8c1dd07b
      #16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
      #17 aio_write at ffffffff8c2cc72e
      #18 kmem_cache_alloc at ffffffff8c248dde
      #19 do_io_submit at ffffffff8c2ccada
      #20 do_syscall_64 at ffffffff8c004984
      #21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba
      
      Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz
      Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz
      Fixes: c15471f7 ("ocfs2: fix sparse file & data ordering issue in direct io")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • /proc/pid/smaps: add mseal info for vma · 399ab86e
      Jeff Xu authored
      Add sl in /proc/pid/smaps to indicate vma is sealed
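
A userspace tool could look for the new flag roughly as sketched below. The sketch assumes the flag appears as a two-letter token on the "VmFlags:" line of each smaps entry, alongside the existing flags; `demo_vmflags_has` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if 'line' is a "VmFlags:" line from /proc/<pid>/smaps that
 * contains the token 'flag' (for example "sl" for a sealed vma).
 */
static bool demo_vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	size_t flen = strlen(flag);
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		size_t tok;

		p += strspn(p, " \t\n");	/* skip separators */
		tok = strcspn(p, " \t\n");	/* length of this token */
		if (tok > 0 && tok == flen && strncmp(p, flag, flen) == 0)
			return true;
		p += tok;
	}
	return false;
}
```

Matching whole tokens rather than using a plain substring search avoids false positives if a flag name ever appears inside a longer token.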
      
      Link: https://lkml.kernel.org/r/20240614232014.806352-2-jeffxu@google.com
      Fixes: 8be7258a ("mseal: add mseal syscall")
      Signed-off-by: Jeff Xu <jeffxu@chromium.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stephen Röttger <sroettger@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • btrfs: qgroup: fix quota root leak after quota disable failure · a7e4c6a3
      Filipe Manana authored
      If during the quota disable we fail when cleaning the quota tree or when
      deleting the root from the root tree, we jump to the 'out' label without
      ever dropping the reference on the quota root, resulting in a leak of the
      root since fs_info->quota_root is no longer pointing to the root (we have
      set it to NULL just before those steps).
      
      Fix this by always doing a btrfs_put_root() call under the 'out' label.
      This problem has existed since qgroups were first added in 2012 by
      commit bed92eae ("Btrfs: qgroup implementation and prototypes"), but
      back then what we missed was a kfree on the quota root and
      free_extent_buffer() calls on its root and commit root nodes, since
      roots were not yet reference counted.
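
The shape of the fix, a single release point that every exit path goes through, can be sketched in plain C. All names here (`demo_root`, `demo_fs_info`, `demo_quota_disable`) are hypothetical and the failure steps are simulated; this is not the btrfs code.

```c
#include <stddef.h>

/* Hypothetical reference-counted root. */
struct demo_root {
	int refs;
};

struct demo_fs_info {
	struct demo_root *quota_root;
};

static void demo_put_root(struct demo_root *root)
{
	if (root)
		root->refs--;
}

/*
 * Disable quotas. 'fail_step' selects which step fails: 0 = none,
 * 1 = cleaning the quota tree, 2 = deleting the root item.
 */
static int demo_quota_disable(struct demo_fs_info *fs_info, int fail_step)
{
	struct demo_root *quota_root = fs_info->quota_root;
	int ret = 0;

	/* From here on only the local pointer refers to the root. */
	fs_info->quota_root = NULL;

	if (fail_step == 1) {
		ret = -1;
		goto out;
	}
	if (fail_step == 2) {
		ret = -2;
		goto out;
	}
out:
	/* Drop the reference on every path, success or failure. */
	demo_put_root(quota_root);
	return ret;
}

/* Run one disable attempt and report the references left on the root. */
static int demo_refs_after_disable(int fail_step)
{
	struct demo_root root = { .refs = 1 };
	struct demo_fs_info fs_info = { .quota_root = &root };

	demo_quota_disable(&fs_info, fail_step);
	return root.refs;
}
```

Placing the put under the label means a new early exit added later cannot reintroduce the leak, as long as it jumps to 'out'.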
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: handle RST lookup error correctly · 2c499086
      Qu Wenruo authored
      
      [BUG]
      When running btrfs/060 with the RST feature forced on, it would trigger
      the following ASSERT() inside scrub_read_endio():
      
      	ASSERT(sector_nr < stripe->nr_sectors);
      
      Before that, we would get a tree dump from
      btrfs_get_raid_extent_offset(), as we failed to find the RST entry for
      the range.
      
      [CAUSE]
      Inside scrub_submit_extent_sector_read() every time we allocated a new
      bbio we immediately called btrfs_map_block() to make sure there was some
      RST range covering the scrub target.
      
      But if btrfs_map_block() fails, we immediately call endio for the bbio,
      and because the bbio is newly allocated, it is completely empty.
      
      Then inside scrub_read_endio(), we go through the bvecs to find
      the sector number (as bi_sector is no longer reliable if the bio is
      submitted to lower layers).
      
      And since the bio is empty, the bvec iteration would not find any
      matching sector and would return sector_nr == stripe->nr_sectors,
      triggering the ASSERT().
      
      [FIX]
      Instead of calling btrfs_map_block() after allocating a new bbio, call
      btrfs_map_block() first.
      
      Since our only objective in calling btrfs_map_block() is to update
      stripe_len, there is really no need to do that after btrfs_alloc_bio().
      
      This new timing would avoid the problem of handling empty bbio
      completely, and in fact fixes a possible race window for the old code,
      where if the submission thread is the only owner of the pending_io, the
      scrub would never finish (since we didn't decrease the pending_io
      counter).
      
      The root cause of the RST lookup failure still needs to be addressed,
      though.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix initial free space detection · b9fd2aff
      Naohiro Aota authored
      Creating a new block group calls btrfs_add_new_free_space() to add the
      entire block group range into the free space accounting.
      __btrfs_add_free_space_zoned() checks if size == block_group->length to
      detect the initial free space addition and handles that case properly.
      
      However, if zone_capacity == zone_size and the overwrite speed is fast
      enough, the entire zone can be overwritten within one transaction. That
      confuses __btrfs_add_free_space_zoned() into handling it as an initial
      free space accounting. As a result, that block group ends up in a strange
      state: 0 used bytes, 0 zone_unusable bytes, but alloc_offset ==
      zone_capacity (no allocation is possible anymore).
      
      The initial free space accounting can be detected properly by checking
      alloc_offset too.
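
A minimal sketch of the tightened check follows. It assumes that "checking alloc_offset too" means requiring alloc_offset == 0 for a freshly created block group; that reading, and the name `demo_is_initial_add`, are assumptions rather than a quote of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether adding 'size' bytes of free space to a block group of
 * 'length' bytes is the initial accounting for a brand new block group.
 * 'alloc_offset' is the current allocation offset within the zone.
 */
static bool demo_is_initial_add(uint64_t length, uint64_t alloc_offset,
				uint64_t size)
{
	/* The size check alone also matches a fully overwritten zone. */
	return size == length && alloc_offset == 0;
}
```

A zone that was completely written and then freed within one transaction has size == length but a nonzero alloc_offset, so it is no longer mistaken for a new block group.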
      
      Fixes: 98173255 ("btrfs: zoned: calculate free space from zone capacity")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use NOFS context when getting inodes during logging and log replay · d1825752
      Filipe Manana authored
      
      During inode logging (and log replay too), we are holding a transaction
      handle and we often need to call btrfs_iget(), which will read an inode
      from its subvolume btree if it's not loaded in memory and that results in
      allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode()
      callback - and this may recurse into the filesystem in case we are under
      memory pressure and attempt to commit the current transaction, resulting
      in a deadlock since the logging (or log replay) task is holding a
      transaction handle open.
      
      Syzbot reported this with the following stack traces:
      
        WARNING: possible circular locking dependency detected
        6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Not tainted
        ------------------------------------------------------
        syz-executor.1/9919 is trying to acquire lock:
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline]
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline]
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline]
        ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
      
        but task is already holding lock:
        ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&ei->log_mutex){+.+.}-{3:3}:
               __mutex_lock_common kernel/locking/mutex.c:608 [inline]
               __mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752
               btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
               btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079
               btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
               btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
               vfs_fsync_range+0x141/0x230 fs/sync.c:188
               generic_write_sync include/linux/fs.h:2794 [inline]
               btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
               new_sync_write fs/read_write.c:497 [inline]
               vfs_write+0x6b6/0x1140 fs/read_write.c:590
               ksys_write+0x12f/0x260 fs/read_write.c:643
               do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
               __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
               do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
               entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
        -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
               join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
               start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
               btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170
               close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324
               generic_shutdown_super+0x159/0x3d0 fs/super.c:642
               kill_anon_super+0x3a/0x60 fs/super.c:1226
               btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096
               deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
               deactivate_super+0xde/0x100 fs/super.c:506
               cleanup_mnt+0x222/0x450 fs/namespace.c:1267
               task_work_run+0x14e/0x250 kernel/task_work.c:180
               resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
               exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
               exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
               __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
               syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218
               __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389
               do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
               entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
        -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
               __lock_release kernel/locking/lockdep.c:5468 [inline]
               lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774
               percpu_up_read include/linux/percpu-rwsem.h:99 [inline]
               __sb_end_write include/linux/fs.h:1650 [inline]
               sb_end_intwrite include/linux/fs.h:1767 [inline]
               __btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071
               btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301
               btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
               evict+0x2ed/0x6c0 fs/inode.c:667
               iput_final fs/inode.c:1741 [inline]
               iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
               iput+0x5c/0x80 fs/inode.c:1757
               dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
               __dentry_kill+0x1d0/0x600 fs/dcache.c:603
               dput.part.0+0x4b1/0x9b0 fs/dcache.c:845
               dput+0x1f/0x30 fs/dcache.c:835
               ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132
               ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182
               destroy_inode+0xc4/0x1b0 fs/inode.c:311
               iput_final fs/inode.c:1741 [inline]
               iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
               iput+0x5c/0x80 fs/inode.c:1757
               dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
               __dentry_kill+0x1d0/0x600 fs/dcache.c:603
               shrink_kill fs/dcache.c:1048 [inline]
               shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075
               prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156
               super_cache_scan+0x32a/0x550 fs/super.c:221
               do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
               shrink_slab_memcg mm/shrinker.c:548 [inline]
               shrink_slab+0xa87/0x1310 mm/shrinker.c:626
               shrink_one+0x493/0x7c0 mm/vmscan.c:4790
               shrink_many mm/vmscan.c:4851 [inline]
               lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
               shrink_node mm/vmscan.c:5910 [inline]
               kswapd_shrink_node mm/vmscan.c:6720 [inline]
               balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
               kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
               kthread+0x2c1/0x3a0 kernel/kthread.c:389
               ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
               ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
        -> #0 (fs_reclaim){+.+.}-{0:0}:
               check_prev_add kernel/locking/lockdep.c:3134 [inline]
               check_prevs_add kernel/locking/lockdep.c:3253 [inline]
               validate_chain kernel/locking/lockdep.c:3869 [inline]
               __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
               lock_acquire kernel/locking/lockdep.c:5754 [inline]
               lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
               __fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
               fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
               might_alloc include/linux/sched/mm.h:334 [inline]
               slab_pre_alloc_hook mm/slub.c:3891 [inline]
               slab_alloc_node mm/slub.c:3981 [inline]
               kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
               btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
               alloc_inode+0x5d/0x230 fs/inode.c:261
               iget5_locked fs/inode.c:1235 [inline]
               iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
               btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
               btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
               btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
               add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
               copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
               btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
               log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
               btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
               btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
               btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
               btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
               btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
               vfs_fsync_range+0x141/0x230 fs/sync.c:188
               generic_write_sync include/linux/fs.h:2794 [inline]
               btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
               do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
               vfs_writev+0x36f/0xde0 fs/read_write.c:971
               do_pwritev+0x1b2/0x260 fs/read_write.c:1072
               __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
               __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
               __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
               do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
               __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
               do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
               entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
        other info that might help us debug this:
      
        Chain exists of:
          fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&ei->log_mutex);
                                       lock(btrfs_trans_num_extwriters);
                                       lock(&ei->log_mutex);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
        7 locks held by syz-executor.1/9919:
         #0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072
         #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline]
         #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385
         #2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388
         #3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952
         #4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
         #5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
         #6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
      
        stack backtrace:
        CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
         check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
         check_prev_add kernel/locking/lockdep.c:3134 [inline]
         check_prevs_add kernel/locking/lockdep.c:3253 [inline]
         validate_chain kernel/locking/lockdep.c:3869 [inline]
         __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
         lock_acquire kernel/locking/lockdep.c:5754 [inline]
         lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
         __fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
         fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
         might_alloc include/linux/sched/mm.h:334 [inline]
         slab_pre_alloc_hook mm/slub.c:3891 [inline]
         slab_alloc_node mm/slub.c:3981 [inline]
         kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
         btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
         alloc_inode+0x5d/0x230 fs/inode.c:261
         iget5_locked fs/inode.c:1235 [inline]
         iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
         btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
         btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
         btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
         add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
         copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
         btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
         log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
         btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
         btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
         btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
         btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
         btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
         vfs_fsync_range+0x141/0x230 fs/sync.c:188
         generic_write_sync include/linux/fs.h:2794 [inline]
         btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
         do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
         vfs_writev+0x36f/0xde0 fs/read_write.c:971
         do_pwritev+0x1b2/0x260 fs/read_write.c:1072
         __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
         __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
         __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
         do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
         __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
         do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
         entry_SYSENTER_compat_after_hwframe+0x84/0x8e
        RIP: 0023:0xf7334579
        Code: b8 01 10 06 03 (...)
        RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
        R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Fix this by ensuring we are under a NOFS scope whenever we call
      btrfs_iget() during inode logging and log replay.
      
      Reported-by: <syzbot+8576cfa84070dce4d59b@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/linux-btrfs/000000000000274a3a061abbd928@google.com/
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1825752
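
The idea of a "NOFS scope" in the fix above can be pictured with a toy model. This is only a sketch in Python, not kernel code: the bit values, the `Task` class, and the names `nofs_scope` and `effective_gfp` are all hypothetical. It assumes the general kernel behaviour that, while a task is inside such a scope, the filesystem-reclaim bit is masked out of every allocation request, so a GFP_KERNEL allocation behaves like a GFP_NOFS one.

```python
from contextlib import contextmanager

# Hypothetical bit values for illustration; the real kernel constants differ.
GFP_RECLAIM = 0x1
GFP_IO = 0x2
GFP_FS = 0x4
GFP_KERNEL = GFP_RECLAIM | GFP_IO | GFP_FS
GFP_NOFS = GFP_RECLAIM | GFP_IO


class Task:
    """Stand-in for a task with a per-task 'NOFS' flag (hypothetical)."""

    def __init__(self):
        self.nofs = False


@contextmanager
def nofs_scope(task):
    """Enter a NOFS scope, restoring the previous state on exit.

    Saving and restoring the old value makes nested scopes safe.
    """
    saved = task.nofs
    task.nofs = True
    try:
        yield
    finally:
        task.nofs = saved


def effective_gfp(task, gfp):
    """Return the flags an allocation would really use for this task."""
    return gfp & ~GFP_FS if task.nofs else gfp


if __name__ == "__main__":
    task = Task()
    print(effective_gfp(task, GFP_KERNEL) == GFP_KERNEL)
    with nofs_scope(task):
        # Inside the scope the FS bit is dropped, so reclaim cannot
        # recurse into the filesystem.
        print(effective_gfp(task, GFP_KERNEL) == GFP_NOFS)
```

In this model, an allocation deep in a call chain (such as the inode allocation in the trace) needs no changes of its own; the caller that holds the locks establishes the scope instead.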
    • ftruncate: pass a signed offset · 4b8e88e5
      Arnd Bergmann authored
      The old ftruncate() syscall, which uses the 32-bit off_t, misses a sign
      extension when called in compat mode on 64-bit architectures.  As a
      result, passing a negative length accidentally succeeds in truncating
      the file to a size between 2GiB and 4GiB.
      
      Changing the type of the compat syscall to the signed compat_off_t
      changes the behavior so it instead returns -EINVAL.
      
      The native entry point, the truncate() syscall and the corresponding
      loff_t based variants are all correct already and do not suffer
      from this mistake.
      
      Fixes: 3f6d078d ("fix compat truncate/ftruncate")
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      4b8e88e5
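
The arithmetic behind the ftruncate fix can be shown with a small Python sketch. It is an illustration only: `zero_extend_32`, `sign_extend_32`, and `toy_ftruncate` are hypothetical helpers, not kernel functions. The sketch assumes the 32-bit length arrives as a raw register value; widening it without sign extension turns every negative length into a value in the range from 2GiB up to just under 4GiB, while widening it as a signed value keeps it negative so it can be rejected.

```python
import errno

GIB = 1 << 30


def zero_extend_32(value):
    """Widen a 32-bit value as unsigned (the missing sign extension)."""
    return value & 0xFFFFFFFF


def sign_extend_32(value):
    """Widen a 32-bit value as signed two's complement."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def toy_ftruncate(length):
    """Hypothetical length check: negative lengths are invalid."""
    return -errno.EINVAL if length < 0 else 0


if __name__ == "__main__":
    for requested in (-1, -(1 << 31)):
        raw = requested & 0xFFFFFFFF  # what the 32-bit caller passed
        unsigned_len = zero_extend_32(raw)
        signed_len = sign_extend_32(raw)
        print(requested, unsigned_len / GIB, toy_ftruncate(unsigned_len),
              signed_len, toy_ftruncate(signed_len))
```

The two sample lengths are the extremes: -1 maps to one byte short of 4GiB and the most negative 32-bit value maps to exactly 2GiB, which matches the range given in the commit message.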
  5. Jun 23, 2024
  6. Jun 22, 2024
  7. Jun 21, 2024
  8. Jun 20, 2024
    • cifs: Move the 'pid' from the subreq to the req · 3f591385
      David Howells authored
      
      Move the reference pid from the cifs_io_subrequest struct to the
      cifs_io_request struct as it's the same for all subreqs of a particular
      request.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Paulo Alcantara <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
      3f591385
    • cifs: Only pick a channel once per read request · 969b3010
      David Howells authored
      
      In cifs, pick a channel once when setting up a read request, rather than
      individually for every subrequest, and use that channel for all of the
      subrequests.  This mirrors what the code in v6.9 does.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Steve French <sfrench@samba.org>
      cc: Paulo Alcantara <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
      969b3010
    • cifs: Defer read completion · ce5291e5
      David Howells authored
      Defer read completion from the I/O thread to the cifsiod thread so as not
      to slow down the I/O thread.  This restores the behaviour of v6.9.
      
      Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Paulo Alcantara <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
      ce5291e5
    • bcachefs: fix alignment of VMA for memory mapped files on THP · c6cab97c
      Youling Tang authored
      With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs
      for read-only mmapped files, such as shared libraries. However, the
      kernel makes no attempt to actually align those mappings on 2MB
      boundaries, which makes it impossible to use those THPs most of the
      time. This issue applies to general file mapping THP as well as
      existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily
      fixed by using thp_get_unmapped_area for the unmapped_area function
      in bcachefs, which is what ext2, ext4, fuse, xfs and btrfs all use.
      
      Similar to commit b0c58223 ("btrfs: fix alignment of VMA for
      memory mapped files on THP").
      
      Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      c6cab97c
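
To see why alignment matters for the bcachefs change above, here is a minimal Python sketch of the address arithmetic. It is not how thp_get_unmapped_area is implemented; `align_up`, `is_aligned`, and the sample address are hypothetical, and the 2MB huge page size is taken from the commit message. A mapping can only be backed by a huge page where its start address is a multiple of that size.

```python
HPAGE_SIZE = 2 * 1024 * 1024  # 2MB huge page, as stated in the commit message


def is_aligned(addr, alignment=HPAGE_SIZE):
    """True if addr is a multiple of alignment (a power of two)."""
    return (addr & (alignment - 1)) == 0


def align_up(addr, alignment=HPAGE_SIZE):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    candidate = 0x7F0000123000  # hypothetical page-aligned mmap address
    print(is_aligned(candidate))
    print(hex(align_up(candidate)))
```

An address that is only page-aligned, like the sample one, is very unlikely to fall on a 2MB boundary by chance, which is why the mapping code has to ask for the alignment explicitly.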
    • bcachefs: Fix safe errors by default · 33dfafa9
      Kent Overstreet authored
      
      i.e. the start of automatic self healing:
      
      If errors=continue or fix_safe, we now automatically fix simple errors
      without user intervention.
      
      New error action option: fix_safe
      
      This replaces the existing errors=ro option, which gets a new slot, i.e.
      existing errors=ro users now get errors=fix_safe.
      
      This is currently only enabled for a limited set of errors, initially
      just disk accounting: errors that we would always want to fix, and for
      which we don't want to require user intervention (i.e. to make sure a
      bug report gets filed).
      
      Errors will still be counted in the superblock, so we (developers) will
      still know they've been occurring if a bug report gets filed (as bug
      reports typically include the errors superblock section).
      
      Eventually we'll be enabling this for a much wider set of errors, after
      we've done thorough error injection testing.
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      33dfafa9
  9. Jun 19, 2024