  1. Jun 26, 2024
  2. Jun 18, 2024
    • vfs: generate FS_CREATE before FS_OPEN when ->atomic_open used. · 7d1cf5e6
      NeilBrown authored
      When a file is opened and created with open(..., O_CREAT) we get
      both the CREATE and OPEN fsnotify events and would expect them in that
      order.   For most filesystems we get them in that order because
      open_last_lookups() calls fsnotify_create() and then do_open() (from
      path_openat()) calls vfs_open()->do_dentry_open() which calls
      fsnotify_open().
      
      However when ->atomic_open is used, the
         do_dentry_open() -> fsnotify_open()
      call happens from finish_open() which is called from the ->atomic_open
      handler in lookup_open() which is called *before* open_last_lookups()
      calls fsnotify_create.  So we get the "open" notification before
      "create" - which is backwards.  ltp testcase inotify02 tests this and
      reports the inconsistency.
      
      This patch lifts the fsnotify_open() call out of do_dentry_open() and
      places it higher up the call stack.  There are three callers of
      do_dentry_open().
      
      For vfs_open() and kernel_file_open() the fsnotify_open() is placed
      directly in that caller so there should be no behavioural change.
      
      For finish_open() there are two cases:
       - finish_open is used in ->atomic_open handlers.  For these we add a
         call to fsnotify_open() at open_last_lookups() if FMODE_OPENED is
         set - which means do_dentry_open() has been called.
       - finish_open is used in ->tmpfile() handlers.  For these a similar
         call to fsnotify_open() is added to vfs_tmpfile()
      
      With this patch NFSv3 is restored to its previous behaviour (before
      ->atomic_open support was added) of generating CREATE notifications
      before OPEN, and NFSv4 now has that same correct ordering that it has
      not had before.  I haven't tested other filesystems.
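
The ordering argument above can be illustrated with a toy model. This is a minimal sketch, not kernel code: the function `open_and_create` and its two boolean parameters are hypothetical, and the comments only name the kernel call sites mentioned in the commit message.

```python
from __future__ import annotations


def open_and_create(atomic_open: bool, patched: bool) -> list[str]:
    """Return the fsnotify events, in order, for an open(..., O_CREAT)."""
    events: list[str] = []

    def do_dentry_open() -> None:
        # Before the patch, the OPEN notification was emitted from here.
        if not patched:
            events.append("OPEN")

    opened = False  # stands in for FMODE_OPENED
    if atomic_open:
        # lookup_open() -> ->atomic_open handler -> finish_open()
        do_dentry_open()
        opened = True

    # open_last_lookups(): the file was created, so notify CREATE.
    events.append("CREATE")
    if patched and opened:
        # Call site added by the patch for the ->atomic_open case.
        events.append("OPEN")

    if not opened:
        # do_open() -> vfs_open() -> do_dentry_open()
        do_dentry_open()
        if patched:
            # After the patch, vfs_open() emits the notification itself.
            events.append("OPEN")
    return events
```

In this model only the unpatched `->atomic_open` path yields OPEN before CREATE; the other three combinations yield CREATE first.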
      
      Fixes: 7c6c5249 ("NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.")
      Reported-by: James Clark <james.clark@arm.com>
      Closes: https://lore.kernel.org/all/01c3bf2e-eb1f-4b7f-a54f-d2a05dd3d8c8@arm.com
      Signed-off-by: NeilBrown <neilb@suse.de>
      Link: https://lore.kernel.org/r/171817619547.14261.975798725161704336@noble.neil.brown.name
      Fixes: 7b8c9d7b ("fsnotify: move fsnotify_open() hook into do_dentry_open()")
      Tested-by: James Clark <james.clark@arm.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20240617162303.1596-2-jack@suse.cz
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • fs: don't misleadingly warn during thaw operations · 2ae4db56
      Christian Brauner authored
      The block device may have been frozen before it was claimed by a
      filesystem. Concurrently another process might try to mount that
      frozen block device and has temporarily claimed the block device for
      that purpose causing a concurrent fs_bdev_thaw() to end up here. The
      mounter is already about to abort mounting because they still saw an
      elevated bdev->bd_fsfreeze_count so get_bdev_super() will return
      NULL in that case.
      
      For example, P1 calls dm_suspend() which calls into bdev_freeze() before
      the block device has been claimed by the filesystem. This brings
      bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is
      required.
      
      Now P2 tries to mount that frozen block device. It claims it and checks
      bdev->bd_fsfreeze_count. As it's elevated it aborts mounting.
      
      In the meantime P3 called dm_resume(). P3 sees that the block device is
      already claimed by a filesystem and calls into fs_bdev_thaw().
      
      P3 takes a passive reference and realizes that the filesystem isn't
      ready yet. P3 puts itself to sleep to wait for the filesystem to become
      ready.
      
      P2 now puts the last active reference to the filesystem and marks it as
      dying. P3 gets woken, sees that the filesystem is dying and
      get_bdev_super() fails.
      
      Fixes: 49ef8832 ("bdev: implement freeze and thaw holder operations")
      Cc: <stable@vger.kernel.org>
      Reported-by: Theodore Ts'o <tytso@mit.edu>
      Link: https://lore.kernel.org/r/20240611085210.GA1838544@mit.edu
      Link: https://lore.kernel.org/r/20240613-lackmantel-einsehen-90f0d727358d@brauner
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
  3. Jun 07, 2024
  4. Jun 06, 2024
    • btrfs: protect folio::private when attaching extent buffer folios · f3a5367c
      Qu Wenruo authored
      [BUG]
      Since v6.8, various people have reported rare kernel crashes; the
      common factor is a bad page state error message like this:
      
        BUG: Bad page state in process kswapd0  pfn:d6e840
        page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
        pfn:0xd6e840
        aops:btree_aops ino:1
        flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
        page_type: 0xffffffff()
        raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
        raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: non-NULL mapping
      
      [CAUSE]
      Commit 09e6cef1 ("btrfs: refactor alloc_extent_buffer() to
      allocate-then-attach method") changes the sequence when allocating a new
      extent buffer.
      
      Previously we always called grab_extent_buffer() under
      mapping->i_private_lock, to ensure safe modification of
      folio::private (which is a pointer to the extent buffer for regular
      sectorsize).

      After that commit the call is no longer made under the lock, which can
      lead to the following race:
      
      Thread A is trying to allocate an extent buffer at bytenr X, with 4
      4K pages, meanwhile thread B is trying to release the page at X + 4K
      (the second page of the extent buffer at X).
      
                 Thread A                |                 Thread B
      -----------------------------------+-------------------------------------
                                         | btree_release_folio()
                                         | | This is for the page at X + 4K,
                                         | | Not page X.
                                         | |
      alloc_extent_buffer()              | |- release_extent_buffer()
      |- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
      |  page at bytenr X (the first     | |  |
      |  page).                          | |  |
      |  Which returned -EEXIST.         | |  |
      |                                  | |  |
      |- filemap_lock_folio()            | |  |
      |  Returned the first page locked. | |  |
      |                                  | |  |
      |- grab_extent_buffer()            | |  |
      |  |- atomic_inc_not_zero()        | |  |
      |  |  Returned false               | |  |
      |  |- folio_detach_private()       | |  |- folio_detach_private() for X
      |     |- folio_test_private()      | |     |- folio_test_private()
            |  Returned true             | |     |  Returned true
            |- folio_put()               |       |- folio_put()
      
      Now there are two puts on the same folio X, leading to a refcount
      underflow of folio X and eventually triggering the BUG_ON() on
      page->mapping.
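
The double put can be reproduced in a small deterministic model. This is only a sketch under stated assumptions: `Folio`, `detach_private_steps`, `run_unlocked` and `run_locked` are hypothetical names, a generator stands in for a thread so that the interleaving is fixed, and a Python lock stands in for mapping->i_private_lock.

```python
import threading


class Folio:
    def __init__(self):
        self.private = True   # stands in for the private flag on the folio
        self.refcount = 1     # the reference held on behalf of folio::private


def detach_private_steps(folio):
    """Model of folio_detach_private(), paused inside its race window."""
    has_private = folio.private    # folio_test_private()
    yield                          # window between the test and the clear
    if has_private:
        folio.private = False      # folio_clear_private()
        folio.refcount -= 1        # folio_put()


def run_unlocked():
    """Both callers test the flag before either one clears it."""
    folio = Folio()
    a = detach_private_steps(folio)
    b = detach_private_steps(folio)
    next(a)
    next(b)                        # both observed private == True
    for caller in (a, b):
        for _ in caller:           # run each caller to completion
            pass
    return folio.refcount


def run_locked():
    """Each caller runs test/clear/put as one unit while holding the lock."""
    folio = Folio()
    lock = threading.Lock()        # stands in for mapping->i_private_lock
    for _ in range(2):
        with lock:
            for _ in detach_private_steps(folio):
                pass
    return folio.refcount
```

In the unlocked run the reference count ends below zero, which corresponds to the underflow described above; in the locked run the second caller sees the flag already cleared and does nothing.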
      
      The condition is not that easy to hit:
      
      - The release must be triggered for the middle page of an eb
        If the release is on the same first page of an eb, page lock would kick
        in and prevent the race.
      
      - folio_detach_private() has a very small race window
        It's only between folio_test_private() and folio_clear_private().
      
      That's exactly when mapping->i_private_lock is used to prevent such race,
      and commit 09e6cef1 ("btrfs: refactor alloc_extent_buffer() to
      allocate-then-attach method") screwed that up.
      
      At that time, I thought the page lock would kick in as
      filemap_release_folio() also requires the page to be locked, but forgot
      that filemap_release_folio() only locks one page, not all pages of an
      extent buffer.
      
      [FIX]
      Move all the code requiring i_private_lock into
      attach_eb_folio_to_filemap(), so that everything is done with proper
      lock protection.
      
      Furthermore, to prevent future problems, add an extra
      lockdep_assert_held() to ensure we're holding the proper lock.
      
      A reproducer that is able to hit the race (takes a few minutes with
      instrumented code inserting delays to alloc_extent_buffer()):
      
        #!/bin/sh
        drop_caches () {
      	  while(true); do
      		  echo 3 > /proc/sys/vm/drop_caches
      		  echo 1 > /proc/sys/vm/compact_memory
      	  done
        }
      
        run_tar () {
      	  while(true); do
      		  for x in `seq 1 80` ; do
      			  tar cf /dev/zero /mnt > /dev/null &
      		  done
      		  wait
      	  done
        }
      
        mkfs.btrfs -f -d single -m single /dev/vda
        mount -o noatime /dev/vda /mnt
        # create 200,000 files, 1K each
        ./simoop -n 200000 -E -f 1k /mnt
        drop_caches &
        (run_tar)
      
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
      Reported-by: Toralf Förster <toralf.foerster@gmx.de>
      Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
      Reported-by: <syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com>
      Fixes: 09e6cef1 ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
      CC: stable@vger.kernel.org # 6.8+
      CC: Chris Mason <clm@fb.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. Jun 05, 2024
    • nilfs2: fix nilfs_empty_dir() misjudgment and long loop on I/O errors · 7373a51e
      Ryusuke Konishi authored
      The error handling in nilfs_empty_dir() when a directory folio/page read
      fails is incorrect, as in the old ext2 implementation, and if the
      folio/page cannot be read or nilfs_check_folio() fails, it will falsely
      determine the directory as empty and corrupt the file system.
      
      In addition, since nilfs_empty_dir() does not immediately return on a
      failed folio/page read, but continues to loop, this can cause a long loop
      with I/O if i_size of the directory's inode is also corrupted, causing the
      log writer thread to wait and hang, as reported by syzbot.
      
      Fix these issues by making nilfs_empty_dir() immediately return a false
      value (0) if it fails to get a directory folio/page.
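
The change in error handling can be sketched as follows. This is a simplified model, not the nilfs2 code: a directory is represented as a list of pages, each page is a list of entry names, and `None` is an assumed marker for a page that could not be read or failed its check. Both function names are hypothetical.

```python
def empty_dir_buggy(pages):
    """Old behaviour in the model: unreadable pages are skipped."""
    for page in pages:
        if page is None:
            continue                  # keeps looping over the remaining pages
        for name in page:
            if name not in (".", ".."):
                return False
    return True                       # may wrongly report "empty"


def empty_dir_fixed(pages):
    """Fixed behaviour in the model: a failed read means 'not empty'."""
    for page in pages:
        if page is None:
            return False              # return at once, no further I/O
        for name in page:
            if name not in (".", ".."):
                return False
    return True
```

With a directory whose second page cannot be read, the first variant reports the directory as empty, while the second one refuses to.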
      
      Link: https://lkml.kernel.org/r/20240604134255.7165-1-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+c8166c541d3971bf6c87@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=c8166c541d3971bf6c87
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: fix ksm_zero_pages accounting · c2dc78b8
      Chengming Zhou authored
      We normally do ksm_zero_pages++ in ksmd when a page is merged with the
      zero page, but ksm_zero_pages-- is done from the page tables side, where
      nothing protects access to ksm_zero_pages.

      So in rare cases we can read an unexpected value of ksm_zero_pages,
      such as -1, which is very confusing to users.

      Fix it by changing the counter to atomic_long_t, and do the same for
      mm->ksm_zero_pages.
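
The lost update behind such values can be shown with a fixed interleaving. This is a sketch only: `racy_updates`, `AtomicLong` and `hammer` are hypothetical names, and the Python lock is merely a stand-in for the atomicity that atomic_long_t provides in the kernel.

```python
import threading


def racy_updates():
    """One increment and one decrement that both load before either stores."""
    counter = 1
    inc_loaded = counter          # ksmd side loads 1
    dec_loaded = counter          # page-table side loads 1
    counter = inc_loaded + 1      # ksmd side stores 2
    counter = dec_loaded - 1      # page-table side stores 0, the ++ is lost
    return counter


class AtomicLong:
    """Counter whose read-modify-write cannot be interleaved."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def add(self, delta):
        with self._lock:
            self._value += delta
            return self._value

    def read(self):
        with self._lock:
            return self._value


def hammer(n_threads=4, n_iter=10000):
    """Run paired increments and decrements from several threads."""
    counter = AtomicLong()

    def worker():
        for _ in range(n_iter):
            counter.add(1)
            counter.add(-1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read()
```

Starting from 1, one increment plus one decrement should leave 1, but the interleaved version leaves 0. With the protected counter, balanced updates always return to the starting value.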
      
      Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev
      Fixes: e2942062 ("ksm: count all zero pages placed by KSM")
      Fixes: 6080d19f ("ksm: add ksm zero pages for each process")
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: xu xin <xu.xin16@zte.com.cn>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix potential kernel bug due to lack of writeback flag waiting · a4ca369c
      Ryusuke Konishi authored
      Destructive writes to a block device on which nilfs2 is mounted can cause
      a kernel bug in the folio/page writeback start routine or writeback end
      routine (__folio_start_writeback in the log below):
      
       kernel BUG at mm/page-writeback.c:3070!
       Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
       ...
       RIP: 0010:__folio_start_writeback+0xbaa/0x10e0
       Code: 25 ff 0f 00 00 0f 84 18 01 00 00 e8 40 ca c6 ff e9 17 f6 ff ff
        e8 36 ca c6 ff 4c 89 f7 48 c7 c6 80 c0 12 84 e8 e7 b3 0f 00 90 <0f>
        0b e8 1f ca c6 ff 4c 89 f7 48 c7 c6 a0 c6 12 84 e8 d0 b3 0f 00
       ...
       Call Trace:
        <TASK>
        nilfs_segctor_do_construct+0x4654/0x69d0 [nilfs2]
        nilfs_segctor_construct+0x181/0x6b0 [nilfs2]
        nilfs_segctor_thread+0x548/0x11c0 [nilfs2]
        kthread+0x2f0/0x390
        ret_from_fork+0x4b/0x80
        ret_from_fork_asm+0x1a/0x30
        </TASK>
      
      This is because when the log writer starts a writeback for segment summary
      blocks or a super root block that use the backing device's page cache, it
      does not wait for the ongoing folio/page writeback, resulting in an
      inconsistent writeback state.
      
      Fix this issue by waiting for ongoing writebacks when putting
      folios/pages on the backing device into writeback state.
      
      Link: https://lkml.kernel.org/r/20240530141556.4411-1-konishi.ryusuke@gmail.com
      Fixes: 9ff05123 ("nilfs2: segment constructor")
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • btrfs: fix leak of qgroup extent records after transaction abort · fb33eb2e
      Filipe Manana authored
      
      Qgroup extent records are created when delayed ref heads are created and
      then released after accounting extents at btrfs_qgroup_account_extents(),
      called during the transaction commit path.
      
      If a transaction is aborted we free the qgroup records by calling
      btrfs_qgroup_destroy_extent_records() at btrfs_destroy_delayed_refs(),
      unless we don't have delayed references. We are incorrectly assuming
      that no delayed references means we don't have qgroup extents records.
      
      We can currently have no delayed references because we ran them all
      during a transaction commit and the transaction was aborted after that
      due to some error in the commit path.
      
      So fix this by ensuring we call btrfs_qgroup_destroy_extent_records() at
      btrfs_destroy_delayed_refs() even if we don't have any delayed references.
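
A toy model of the leak and of the fix is given below. It is a sketch under assumptions: the `Transaction` class and both function names are hypothetical, and the early return is only one plausible shape for "unless we don't have delayed references".

```python
class Transaction:
    def __init__(self, delayed_refs, qgroup_records):
        self.delayed_refs = list(delayed_refs)
        self.qgroup_records = list(qgroup_records)


def destroy_delayed_refs_old(trans):
    """Cleanup that is skipped when there are no delayed references."""
    if not trans.delayed_refs:
        return                        # qgroup records below are never freed
    trans.delayed_refs.clear()
    trans.qgroup_records.clear()


def destroy_delayed_refs_fixed(trans):
    """Cleanup that frees the qgroup records unconditionally."""
    trans.delayed_refs.clear()
    trans.qgroup_records.clear()      # runs even with no delayed references
```

The interesting case is a transaction whose delayed references were all run during the commit before the abort: the list of references is empty, but qgroup extent records still exist.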
      
      Reported-by: <syzbot+0fecc032fa134afd49df@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/linux-btrfs/0000000000004e7f980619f91835@google.com/
      Fixes: 81f7eb00 ("btrfs: destroy qgroup extent records on transaction abort")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix crash on racing fsync and size-extending write into prealloc · 9d274c19
      Omar Sandoval authored
      We have been seeing crashes on duplicate keys in
      btrfs_set_item_key_safe():
      
        BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:2620!
        invalid opcode: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
        RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
      
      With the following stack trace:
      
        #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
        #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
        #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
        #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
        #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
        #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
        #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
        #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
        #8  vfs_fsync_range (fs/sync.c:188:9)
        #9  vfs_fsync (fs/sync.c:202:9)
        #10 do_fsync (fs/sync.c:212:9)
        #11 __do_sys_fdatasync (fs/sync.c:225:9)
        #12 __se_sys_fdatasync (fs/sync.c:223:1)
        #13 __x64_sys_fdatasync (fs/sync.c:223:1)
        #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
        #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
        #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
      
      So we're logging a changed extent from fsync, which is splitting an
      extent in the log tree. But this split part already exists in the tree,
      triggering the BUG().
      
      This is the state of the log tree at the time of the crash, dumped with
      drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
      to get more details than btrfs_print_leaf() gives us:
      
        >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
        leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
        leaf 33439744 flags 0x100000000000000
        fs uuid e5bd3946-400c-4223-8923-190ef1f18677
        chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
                item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                        generation 7 transid 9 size 8192 nbytes 8473563889606862198
                        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                        sequence 204 flags 0x10(PREALLOC)
                        atime 1716417703.220000000 (2024-05-22 15:41:43)
                        ctime 1716417704.983333333 (2024-05-22 15:41:44)
                        mtime 1716417704.983333333 (2024-05-22 15:41:44)
                        otime 17592186044416.000000000 (559444-03-08 01:40:16)
                item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                        index 195 namelen 3 name: 193
                item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                        location key (0 UNKNOWN.0 0) type XATTR
                        transid 7 data_len 1 name_len 6
                        name: user.a
                        data a
                item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                        generation 9 type 1 (regular)
                        extent data disk byte 303144960 nr 12288
                        extent data offset 0 nr 4096 ram 12288
                        extent compression 0 (none)
                item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 4096 nr 8192
                item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 8192 nr 4096
        ...
      
      So the real problem happened earlier: notice that items 4 (4k-12k) and 5
      (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
      item 5 starts at i_size.
      
      Here is the state of the filesystem tree at the time of the crash:
      
        >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
        >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
        >>> print_extent_buffer(nodes[0])
        leaf 30425088 level 0 items 184 generation 9 owner 5
        leaf 30425088 flags 0x100000000000000
        fs uuid e5bd3946-400c-4223-8923-190ef1f18677
        chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
        	...
                item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                        generation 7 transid 7 size 4096 nbytes 12288
                        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                        sequence 6 flags 0x10(PREALLOC)
                        atime 1716417703.220000000 (2024-05-22 15:41:43)
                        ctime 1716417703.220000000 (2024-05-22 15:41:43)
                        mtime 1716417703.220000000 (2024-05-22 15:41:43)
                        otime 1716417703.220000000 (2024-05-22 15:41:43)
                item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                        index 195 namelen 3 name: 193
                item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                        location key (0 UNKNOWN.0 0) type XATTR
                        transid 7 data_len 1 name_len 6
                        name: user.a
                        data a
                item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                        generation 9 type 1 (regular)
                        extent data disk byte 303144960 nr 12288
                        extent data offset 0 nr 8192 ram 12288
                        extent compression 0 (none)
                item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                        generation 9 type 2 (prealloc)
                        prealloc data disk byte 303144960 nr 12288
                        prealloc data offset 8192 nr 4096
      
      Item 5 in the log tree corresponds to item 183 in the filesystem tree,
      but nothing matches item 4. Furthermore, item 183 is the last item in
      the leaf.
      
      btrfs_log_prealloc_extents() is responsible for logging prealloc extents
      beyond i_size. It first truncates any previously logged prealloc extents
      that start beyond i_size. Then, it walks the filesystem tree and copies
      the prealloc extent items to the log tree.
      
      If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
      unlocks the tree and does another search. However, while the filesystem
      tree is unlocked, an ordered extent completion may modify the tree. In
      particular, it may insert an extent item that overlaps with an extent
      item that was already copied to the log tree.
      
      This may manifest in several ways depending on the exact scenario,
      including an EEXIST error that is silently translated to a full sync,
      overlapping items in the log tree, or this crash. This particular crash
      is triggered by the following sequence of events:
      
      - Initially, the file has i_size=4k, a regular extent from 0-4k, and a
        prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
        the last item in its B-tree leaf.
      - The file is fsync'd, which copies its inode item and both extent items
        to the log tree.
      - An xattr is set on the file, which sets the
        BTRFS_INODE_COPY_EVERYTHING flag.
      - The range 4k-8k in the file is written using direct I/O. i_size is
        extended to 8k, but the ordered extent is still in flight.
      - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
        calls copy_inode_items_to_log(), which calls
        btrfs_log_prealloc_extents().
      - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
        filesystem tree. Since it starts before i_size, it skips it. Since it
        is the last item in its B-tree leaf, it calls btrfs_next_leaf().
      - btrfs_next_leaf() unlocks the path.
      - The ordered extent completion runs, which converts the 4k-8k part of
        the prealloc extent to written and inserts the remaining prealloc part
        from 8k-12k.
      - btrfs_next_leaf() does a search and finds the new prealloc extent
        8k-12k.
      - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
        the log tree. Note that it overlaps with the 4k-12k prealloc extent
        that was copied to the log tree by the first fsync.
      - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
        extent that was written.
      - This tries to drop the range 4k-8k in the log tree, which requires
        adjusting the start of the 4k-12k prealloc extent in the log tree to
        8k.
      - btrfs_set_item_key_safe() sees that there is already an extent
        starting at 8k in the log tree and calls BUG().
      
      Fix this by detecting when we're about to insert an overlapping file
      extent item in the log tree and truncating the part that would overlap.
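
The idea of trimming before inserting can be sketched on plain byte ranges. This is an illustration only, not the btrfs implementation: the log tree is modelled as a sorted list of half-open `(start, end)` ranges, `log_extent` and `has_overlap` are hypothetical helpers, and the choice to cut back the already-logged items (rather than the new one) is an assumption of the sketch.

```python
K = 1024


def log_extent(log, start, end):
    """Insert [start, end) after cutting back anything logged at or beyond start."""
    kept = []
    for s, e in log:
        if e <= start:
            kept.append((s, e))        # entirely before the new extent
        elif s < start:
            kept.append((s, start))    # trim the overlapping tail
        # items that start at or beyond `start` are dropped
    kept.append((start, end))
    return kept


def has_overlap(log):
    """True if any range ends after the next one starts."""
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(log, log[1:]))
```

Using the ranges from the sequence above, the log tree holds 0-4k and 4k-12k when the 8k-12k prealloc extent is about to be copied. Appending it blindly leaves 4k-12k and 8k-12k overlapping; trimming first leaves 0-4k, 4k-8k and 8k-12k.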
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • bcachefs: Fix trans->locked assert · 319fef29
      Kent Overstreet authored
      
      In bch2_move_data_btree(), we might start with the trans unlocked from a
      previous loop iteration - we need a trans_begin() before iter_init().
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Rereplicate now moves data off of durability=0 devices · fdccb243
      Kent Overstreet authored
      
      This fixes an issue where setting a device to durability=0 after it's
      been used makes it impossible to remove.
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
  6. May 31, 2024
    • cifs: fix creating sockets when using sfu mount options · 518549c1
      Steve French authored
      
      When running fstest generic/423 with sfu mount option, it
      was being skipped due to inability to create sockets:
      
        generic/423  [not run] cifs does not support mknod/mkfifo
      
      which can also be easily reproduced with their af_unix tool:
      
        ./src/af_unix /mnt1/socket-two bind: Operation not permitted
      
      Fix sfu mount option to allow creating and reporting sockets.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
  7. May 29, 2024
  8. May 28, 2024
  9. May 27, 2024
    • xfs: Add cond_resched to block unmap range and reflink remap path · b0c6bcd5
      Ritesh Harjani (IBM) authored
      
      An async dio write to a sparse file can generate a lot of extents,
      and when we unlink this file (using rm), the kernel can be busy unmapping
      and freeing those extents as part of transaction processing.

      Similarly, the xfs reflink remapping path can iterate over a million
      extent entries in xfs_reflink_remap_blocks().

      Since we can busy-loop in these two functions, add cond_resched()
      to avoid softlockup messages like these:
      
      watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
      CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
      Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
      NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
      LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
      Call Trace:
        xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
        xfs_alloc_compute_aligned+0x5c/0x144
        xfs_alloc_ag_vextent_size+0x238/0x8d4
        xfs_alloc_fix_freelist+0x540/0x694
        xfs_free_extent_fix_freelist+0x84/0xe0
        __xfs_free_extent+0x74/0x1ec
        xfs_extent_free_finish_item+0xcc/0x214
        xfs_defer_finish_one+0x194/0x388
        xfs_defer_finish_noroll+0x1b4/0x5c8
        xfs_defer_finish+0x2c/0xc4
        xfs_bunmapi_range+0xa4/0x100
        xfs_itruncate_extents_flags+0x1b8/0x2f4
        xfs_inactive_truncate+0xe0/0x124
        xfs_inactive+0x30c/0x3e0
        xfs_inodegc_worker+0x140/0x234
        process_scheduled_works+0x240/0x57c
        worker_thread+0x198/0x468
        kthread+0x138/0x140
        start_kernel_thread+0x14/0x18
      
      run fstests generic/175 at 2024-02-02 04:40:21
      [   C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
       watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
       CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G X 6.4.0
       NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs]
       LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs]
       Call Trace:
        0xc000000014107c00 (unreliable)
        __xfs_btree_updkeys+0x8c/0x2c0 [xfs]
        xfs_btree_update_keys+0x150/0x170 [xfs]
        xfs_btree_lshift+0x534/0x660 [xfs]
        xfs_btree_make_block_unfull+0x19c/0x240 [xfs]
        xfs_btree_insrec+0x4e4/0x630 [xfs]
        xfs_btree_insert+0x104/0x2d0 [xfs]
        xfs_rmap_insert+0xc4/0x260 [xfs]
        xfs_rmap_map_shared+0x228/0x630 [xfs]
        xfs_rmap_finish_one+0x2d4/0x350 [xfs]
        xfs_rmap_update_finish_item+0x44/0xc0 [xfs]
        xfs_defer_finish_noroll+0x2e4/0x740 [xfs]
        __xfs_trans_commit+0x1f4/0x400 [xfs]
        xfs_reflink_remap_extent+0x2d8/0x650 [xfs]
        xfs_reflink_remap_blocks+0x154/0x320 [xfs]
        xfs_file_remap_range+0x138/0x3a0 [xfs]
        do_clone_file_range+0x11c/0x2f0
        vfs_clone_file_range+0x60/0x1c0
        ioctl_file_clone+0x78/0x140
        sys_ioctl+0x934/0x1270
        system_call_exception+0x158/0x320
        system_call_vectored_common+0x15c/0x2ec
      
      Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
      Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Disha Goel <disgoel@linux.ibm.com>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
    • netfs, 9p: Fix race between umount and async request completion · f89ea63f
      David Howells authored
      
      There's a problem in 9p's interaction with netfslib whereby a crash occurs
      because the p9_fid structs get forcibly destroyed during client teardown
      (without paying attention to their refcounts) before netfslib has finished
      with them.  However, it's not a simple case of deferring the clunking that
      p9_fid_put() does as that requires the p9_client record to still be
      present.
      
      The problem is that netfslib has to unlock pages and clear the IN_PROGRESS
      flag before destroying the objects involved - including the fid - and, in
      any case, nothing checks to see if writeback completed barring looking at
      the page flags.
      
      Fix this by keeping a count of outstanding I/O requests (of any type) and
      waiting for it to quiesce during inode eviction.
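
The counting scheme can be sketched with a condition variable. This is a user-space illustration, not the netfslib code: `IoTracker`, `evict_inode` and `demo` are hypothetical names, and worker threads stand in for asynchronous request completion.

```python
import threading
import time


class IoTracker:
    """Counts outstanding I/O requests and lets a waiter block until zero."""

    def __init__(self):
        self._outstanding = 0
        self._cond = threading.Condition()

    def start_request(self):
        with self._cond:
            self._outstanding += 1

    def end_request(self):
        with self._cond:
            self._outstanding -= 1
            if self._outstanding == 0:
                self._cond.notify_all()   # wake anyone waiting for quiescence

    def wait_for_quiesce(self):
        with self._cond:
            self._cond.wait_for(lambda: self._outstanding == 0)


def evict_inode(tracker, log):
    tracker.wait_for_quiesce()            # teardown only after all I/O is done
    log.append("evicted")


def demo(n_requests=3):
    tracker, log = IoTracker(), []

    def worker():
        time.sleep(0.01)                  # simulated in-flight I/O
        log.append("io done")             # the work finishes first ...
        tracker.end_request()             # ... then the count drops

    threads = []
    for _ in range(n_requests):
        tracker.start_request()           # counted before the worker starts
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
    evict_inode(tracker, log)
    for t in threads:
        t.join()
    return log
```

Because each request is counted before it is issued and uncounted only after its work is finished, eviction in this model can never be recorded ahead of a completion.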
      
      Reported-by: <syzbot+df038d463cca332e8414@syzkaller.appspotmail.com>
      Link: https://lore.kernel.org/all/0000000000005be0aa061846f8d6@google.com/
      Reported-by: <syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com>
      Link: h...
    • xfs: don't open-code u64_to_user_ptr · 95b19e2f
      Darrick J. Wong authored
      
      Don't open-code what the kernel already provides.
      
      Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>