  1. Jul 18, 2024
  2. Jul 15, 2024
  3. Jul 14, 2024
  4. Jul 13, 2024
  5. Jul 12, 2024
  6. Jul 11, 2024
    • bcachefs: bch2_gc_btree() should not use btree_root_lock · 1841027c
      Kent Overstreet authored
      
      btree_root_lock is for the root keys in btree_root, not the pointers to
      the nodes themselves; this fixes a lock ordering issue between
      btree_root_lock and btree node locks.
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Set PF_MEMALLOC_NOFS when trans->locked · f236ea4b
      Kent Overstreet authored
      
      proper lock ordering is: fs_reclaim -> btree node locks
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Use trans_unlock_long() when waiting on allocator · f0f3e511
      Kent Overstreet authored
      
      Not using unlock_long() blocks key cache reclaim, and the allocator may
      take a while.
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT" · aacd897d
      Kent Overstreet authored
      This reverts commit 86d81ec5.
      
      This wasn't tested with memcg enabled; it immediately hits a NULL pointer
      dereference in list_lru_add().
      
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • btrfs: avoid races when tracking progress for extent map shrinking · 44849405
      Filipe Manana authored
      
      We store the progress (root and inode numbers) of the extent map shrinker
      in fs_info without any synchronization, but multiple tasks can call into
      the shrinker concurrently, for example during memory allocations when
      there's enough memory pressure.
      
      This can result in a task A reading fs_info->extent_map_shrinker_last_ino
      after another task B updates it, and task A reading
      fs_info->extent_map_shrinker_last_root before task B updates it, making
      task A see an odd state that isn't necessarily harmful but may make it
      skip certain inode ranges or do more work than necessary by going over
      the same inodes again. These unprotected accesses would also trigger
      warnings from tools like KCSAN.
      
      So add a lock to protect access to these progress fields.
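
      A minimal user-space sketch of the idea, not the patch itself: a pthread
      mutex stands in for the kernel lock, and struct shrinker_progress,
      progress_store() and progress_load() are hypothetical names chosen for
      illustration. Both fields are always written and read as a pair under
      the same lock, so a reader cannot observe a new inode number together
      with an old root number.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the shrinker progress kept in fs_info. */
struct shrinker_progress {
	pthread_mutex_t lock;	/* protects last_root and last_ino as a pair */
	uint64_t last_root;
	uint64_t last_ino;
};

#define SHRINKER_PROGRESS_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

/* Publish both fields under the lock so readers never see a mixed state. */
void progress_store(struct shrinker_progress *p, uint64_t root, uint64_t ino)
{
	assert(p);
	pthread_mutex_lock(&p->lock);
	p->last_root = root;
	p->last_ino = ino;
	pthread_mutex_unlock(&p->lock);
}

/* Read both fields under the same lock to get a consistent snapshot. */
void progress_load(struct shrinker_progress *p, uint64_t *root, uint64_t *ino)
{
	assert(p && root && ino);
	pthread_mutex_lock(&p->lock);
	*root = p->last_root;
	*ino = p->last_ino;
	pthread_mutex_unlock(&p->lock);
}
```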
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop extent map shrinker if reschedule is needed · b3ebb9b7
      Filipe Manana authored
      
      The extent map shrinker can be called in a variety of contexts where we
      are under memory pressure, and one of them is when a task is trying to
      allocate memory. For this reason the shrinker is typically called with a
      value of struct shrink_control::nr_to_scan that is much smaller than what
      we return in the nr_cached_objects callback of struct super_operations
      (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
      not take a long time and cause high latencies. However we can still take
      a lot of time in the shrinker even for a limited amount of nr_to_scan:
      
      1) When traversing the red black tree that tracks open inodes in a root,
         as for example with millions of open inodes we get a deep tree which
         takes time searching for an inode;
      
      2) Iterating over the extent map tree, which is a red black tree, of an
         inode when doing the rb_next() calls and when removing an extent map
         from the tree, since often that requires rebalancing the red black
         tree;
      
      3) When trying to write lock an inode's extent map tree we may wait for a
         significant amount of time, because there's either another task about
         to do IO and searching for an extent map in the tree or inserting an
         extent map in the tree, and we can have thousands or even millions of
         extent maps for an inode. Furthermore, there can be concurrent calls
         to the shrinker so the lock might be busy simply because there is
         already another task shrinking extent maps for the same inode;
      
      4) We often reschedule if we need to, which further increases latency.
      
      So improve on this by stopping the extent map shrinking code whenever we
      need to reschedule and make it skip an inode if we can't immediately lock
      its extent map tree.
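
      The two changes can be sketched in user space roughly as follows. This
      is an illustration under assumptions, not the kernel code: an
      atomic_flag stands in for the extent map tree lock, the should_stop()
      callback stands in for the kernel's reschedule check, and struct
      fake_inode, tree_trylock(), tree_unlock(), shrink_extent_maps() and
      never_stop() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode and its extent map tree lock. */
struct fake_inode {
	atomic_flag tree_busy;	/* set while some task holds the tree lock */
	long nr_extent_maps;
};

/* Try to take the tree lock without waiting; returns true on success. */
bool tree_trylock(struct fake_inode *inode)
{
	return !atomic_flag_test_and_set(&inode->tree_busy);
}

void tree_unlock(struct fake_inode *inode)
{
	atomic_flag_clear(&inode->tree_busy);
}

/*
 * Walk the inodes and drop their extent maps.
 * - Stop as soon as should_stop() returns true (reschedule needed).
 * - Skip any inode whose tree lock cannot be taken immediately.
 * Returns the number of extent maps dropped.
 */
long shrink_extent_maps(struct fake_inode *inodes, size_t nr_inodes,
			bool (*should_stop)(void))
{
	long freed = 0;

	assert(inodes && should_stop);
	for (size_t i = 0; i < nr_inodes; i++) {
		if (should_stop())
			break;
		if (!tree_trylock(&inodes[i]))
			continue;	/* contended: do not wait, move on */
		freed += inodes[i].nr_extent_maps;
		inodes[i].nr_extent_maps = 0;
		tree_unlock(&inodes[i]);
	}
	return freed;
}

/* Trivial policy for the usage example: never ask to stop. */
bool never_stop(void)
{
	return false;
}
```

      With three inodes holding 10 extent maps each and the second one
      locked by someone else, a call with never_stop drops 20 extent maps and
      leaves the locked inode untouched instead of waiting for it.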
      
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use delayed iput during extent map shrinking · 68a3ebd1
      Filipe Manana authored
      When putting an inode during extent map shrinking we're doing a standard
      iput() but that may take a long time in case the inode is dirty and we are
      doing the final iput that triggers eviction - the VFS will have to wait
      for writeback before calling the btrfs evict callback (see
      fs/inode.c:evict()).
      
      This slows down the task running the shrinker which may have been
      triggered while updating some tree for example, meaning locks are held
      as well as an open transaction handle.
      
      Also if the iput() ends up triggering eviction and the inode has no links
      anymore, then we trigger item truncation which requires flushing delayed
      items, space reservation to start a transaction and that may trigger the
      space reclaim task and wait for it, resulting in deadlocks in case the
      reclaim task needs for example to commit a transaction and the shrinker
      is being triggered from a path holding a transaction handle.
      
      Syzbot reported such a case with the following stack traces:
      
         ======================================================
         WARNING: possible circular locking dependency detected
         6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted
         ------------------------------------------------------
         kswapd0/111 is trying to acquire lock:
         ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
      
         but task is already holding lock:
         ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #3 (fs_reclaim){+.+.}-{0:0}:
                __fs_reclaim_acquire mm/page_alloc.c:3783 [inline]
                fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797
                might_alloc include/linux/sched/mm.h:334 [inline]
                slab_pre_alloc_hook mm/slub.c:3890 [inline]
                slab_alloc_node mm/slub.c:3980 [inline]
                kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019
                btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
                alloc_inode+0x5d/0x230 fs/inode.c:261
                iget5_locked fs/inode.c:1235 [inline]
                iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
                btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
                btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
                btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
                create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911
                btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114
                btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373
                __btrfs_balance fs/btrfs/volumes.c:4157 [inline]
                btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534
                btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline]
                btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742
                __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
                join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
                start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
                btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
                btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
                open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
                btrfs_fill_super fs/btrfs/super.c:946 [inline]
                btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
                btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                fc_mount+0x16/0xc0 fs/namespace.c:1125
                btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
                btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                do_new_mount fs/namespace.c:3352 [inline]
                path_mount+0x6e1/0x1f10 fs/namespace.c:3679
                do_mount fs/namespace.c:3692 [inline]
                __do_sys_mount fs/namespace.c:3898 [inline]
                __se_sys_mount fs/namespace.c:3875 [inline]
                __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
                join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314
                start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
                btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
                btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
                open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
                btrfs_fill_super fs/btrfs/super.c:946 [inline]
                btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
                btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                fc_mount+0x16/0xc0 fs/namespace.c:1125
                btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
                btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
                vfs_get_tree+0x8f/0x380 fs/super.c:1780
                do_new_mount fs/namespace.c:3352 [inline]
                path_mount+0x6e1/0x1f10 fs/namespace.c:3679
                do_mount fs/namespace.c:3692 [inline]
                __do_sys_mount fs/namespace.c:3898 [inline]
                __se_sys_mount fs/namespace.c:3875 [inline]
                __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
                do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
                __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
                do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
                entry_SYSENTER_compat_after_hwframe+0x84/0x8e
      
         -> #0 (sb_internal#3){.+.+}-{0:0}:
                check_prev_add kernel/locking/lockdep.c:3134 [inline]
                check_prevs_add kernel/locking/lockdep.c:3253 [inline]
                validate_chain kernel/locking/lockdep.c:3869 [inline]
                __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
                lock_acquire kernel/locking/lockdep.c:5754 [inline]
                lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
                percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
                __sb_start_write include/linux/fs.h:1655 [inline]
                sb_start_intwrite include/linux/fs.h:1838 [inline]
                start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
                btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
                btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
                evict+0x2ed/0x6c0 fs/inode.c:667
                iput_final fs/inode.c:1741 [inline]
                iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
                iput+0x5c/0x80 fs/inode.c:1757
                btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
                btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
                super_cache_scan+0x409/0x550 fs/super.c:227
                do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
                shrink_slab+0x18a/0x1310 mm/shrinker.c:662
                shrink_one+0x493/0x7c0 mm/vmscan.c:4790
                shrink_many mm/vmscan.c:4851 [inline]
                lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
                shrink_node mm/vmscan.c:5910 [inline]
                kswapd_shrink_node mm/vmscan.c:6720 [inline]
                balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
                kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
                kthread+0x2c1/0x3a0 kernel/kthread.c:389
                ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
                ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
         other info that might help us debug this:
      
         Chain exists of:
           sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim
      
          Possible unsafe locking scenario:
      
                CPU0                    CPU1
                ----                    ----
           lock(fs_reclaim);
                                        lock(btrfs_trans_num_extwriters);
                                        lock(fs_reclaim);
           rlock(sb_internal#3);
      
          *** DEADLOCK ***
      
         2 locks held by kswapd0/111:
          #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
          #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline]
          #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196
      
         stack backtrace:
         CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
         Call Trace:
          <TASK>
          __dump_stack lib/dump_stack.c:88 [inline]
          dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
          check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
          check_prev_add kernel/locking/lockdep.c:3134 [inline]
          check_prevs_add kernel/locking/lockdep.c:3253 [inline]
          validate_chain kernel/locking/lockdep.c:3869 [inline]
          __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
          lock_acquire kernel/locking/lockdep.c:5754 [inline]
          lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
          percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
          __sb_start_write include/linux/fs.h:1655 [inline]
          sb_start_intwrite include/linux/fs.h:1838 [inline]
          start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
          btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
          btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
          evict+0x2ed/0x6c0 fs/inode.c:667
          iput_final fs/inode.c:1741 [inline]
          iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
          iput+0x5c/0x80 fs/inode.c:1757
          btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
          btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
          super_cache_scan+0x409/0x550 fs/super.c:227
          do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
          shrink_slab+0x18a/0x1310 mm/shrinker.c:662
          shrink_one+0x493/0x7c0 mm/vmscan.c:4790
          shrink_many mm/vmscan.c:4851 [inline]
          lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
          shrink_node mm/vmscan.c:5910 [inline]
          kswapd_shrink_node mm/vmscan.c:6720 [inline]
          balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
          kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
          kthread+0x2c1/0x3a0 kernel/kthread.c:389
          ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
          </TASK>
      
      So fix this by using btrfs_add_delayed_iput() so that the final iput is
      delegated to the cleaner kthread.
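
      The deferral pattern can be sketched in user space as below. This is a
      single-threaded illustration with no locking, and struct fake_inode,
      struct delayed_puts, add_delayed_put() and run_delayed_puts() are
      hypothetical names, not the btrfs API. The point it shows: the caller
      that must not run eviction only queues the inode when it holds the last
      reference, and a separate worker performs the final put later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted inode. */
struct fake_inode {
	int refs;
	bool evicted;
	struct fake_inode *next_delayed;
};

/* Hypothetical queue of inodes waiting for their final put. */
struct delayed_puts {
	struct fake_inode *head;
};

/*
 * Drop a reference from a context that must not run eviction.
 * If other references remain, just decrement. If this would be the final
 * put, keep the reference and queue the inode for the worker instead.
 */
void add_delayed_put(struct delayed_puts *q, struct fake_inode *inode)
{
	assert(q && inode && inode->refs > 0);
	if (inode->refs > 1) {
		inode->refs--;
		return;
	}
	inode->next_delayed = q->head;
	q->head = inode;
}

/* Worker side: do the final puts (and the eviction) where it is safe. */
size_t run_delayed_puts(struct delayed_puts *q)
{
	size_t n = 0;

	while (q->head) {
		struct fake_inode *inode = q->head;

		q->head = inode->next_delayed;
		inode->next_delayed = NULL;
		if (--inode->refs == 0)
			inode->evicted = true;	/* expensive work happens here */
		n++;
	}
	return n;
}
```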
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/
      Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com
      Fixes: 956a17d9 ("btrfs: add a shrinker for extent maps")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix extent map use-after-free when adding pages to compressed bio · 8e786054
      Filipe Manana authored
      
      At add_ra_bio_pages() we are accessing the extent map to calculate
      'add_size' after we dropped our reference on the extent map, resulting
      in a use-after-free. Fix this by computing 'add_size' before dropping our
      extent map reference.
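
      A small sketch of the ordering rule behind the fix, assuming a
      hypothetical refcounted struct fake_em and helper names that are not
      the btrfs ones. Everything that reads the object is computed first, and
      the reference is dropped last, because the put may free the object.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for an extent map. */
struct fake_em {
	int refs;
	uint64_t start;
	uint64_t len;
};

/* Drop one reference; the object is freed when the count reaches zero. */
void fake_em_put(struct fake_em *em)
{
	assert(em && em->refs > 0);
	if (--em->refs == 0)
		free(em);
}

/*
 * Return how many bytes of [cur, cur + page_size) fall inside the extent,
 * then drop the caller's reference. Assumes cur lies inside the extent.
 * All reads of *em happen BEFORE the put.
 */
uint64_t bytes_to_add(struct fake_em *em, uint64_t cur, uint64_t page_size)
{
	uint64_t em_end = em->start + em->len;
	uint64_t page_end = cur + page_size;
	uint64_t add_size = (em_end < page_end ? em_end : page_end) - cur;

	fake_em_put(em);	/* em must not be touched after this line */
	return add_size;
}
```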
      
      Reported-by: syzbot+853d80cba98ce1157ae6@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000038144061c6d18f2@google.com/
      Fixes: 6a404910 ("btrfs: subpage: make add_ra_bio_pages() compatible")
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • affs: struct slink_front: Replace 1-element array with flexible array · 0aef1d41
      Kees Cook authored
      Replace the deprecated[1] use of a 1-element array in
      struct slink_front with a modern flexible array.
      
      No binary differences are present after this conversion.
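
      To illustrate the conversion, here is a sketch with a simplified,
      hypothetical layout (struct slink_old and struct slink_new are not the
      real AFFS structures, and slink_alloc() is made up for the example).
      The offset of the trailing member is the same in both forms, which is
      consistent with seeing no binary differences when the code only indexes
      into the tail and never relies on sizeof() of the struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Old style: a 1-element array used as a variable-length tail. */
struct slink_old {
	uint32_t ptype;
	uint8_t symname[1];
};

/* New style: a C99 flexible array member; the tail adds no storage. */
struct slink_new {
	uint32_t ptype;
	uint8_t symname[];
};

/* The tail starts at the same offset in both layouts. */
static_assert(offsetof(struct slink_old, symname) ==
	      offsetof(struct slink_new, symname),
	      "tail offset must not change");

/* Allocate a record with room for a NUL-terminated name in the tail. */
struct slink_new *slink_alloc(const char *name)
{
	size_t n = strlen(name) + 1;
	struct slink_new *s = malloc(sizeof(*s) + n);

	if (s) {
		s->ptype = 0;
		memcpy(s->symname, name, n);
	}
	return s;
}
```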
      
      Link: https://github.com/KSPP/linux/issues/79 [1]
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Kees Cook <kees@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • affs: struct affs_data_head: Replace 1-element array with flexible array · e5f5ee82
      Kees Cook authored
      Replace the deprecated[1] use of a 1-element array in
      struct affs_data_head with a modern flexible array.
      
      No binary differences are present after this conversion.
      
      Link: https://github.com/KSPP/linux/issues/79 [1]
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Kees Cook <kees@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • affs: struct affs_head: Replace 1-element array with flexible array · 38a381a0
      Kees Cook authored
      AFFS uses struct affs_head's "table" array as a flexible array. Switch
      this to a proper flexible array[1]. There are no sizeof() uses; struct
      affs_head is only ever used via direct casts. No binary output
      differences were found after this change.
      
      Link: https://github.com/KSPP/linux/issues/79 [1]
      Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Kees Cook <kees@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix bitmap leak when loading free space cache on duplicate entry · 320d8dc6
      Filipe Manana authored
      
      If we failed to link a free space entry because there's already a
      conflicting entry for the same offset, we free the free space entry but
      we don't free the associated bitmap that we had just allocated before.
      Fix that by freeing the bitmap before freeing the entry.
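
      A user-space sketch of the error path, under assumptions: struct
      fs_entry, struct fs_table, link_entry() and load_bitmap_entry() are
      hypothetical, and the tracked_alloc()/tracked_free() wrappers exist
      only so the example can count outstanding allocations. On a duplicate
      offset, the bitmap is freed before the entry that points to it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ENTRIES 8

struct fs_entry {
	uint64_t offset;
	unsigned long *bitmap;
};

struct fs_table {
	struct fs_entry *slots[MAX_ENTRIES];
	size_t nr;
};

int live_allocs;	/* outstanding allocations, for the usage example */

void *tracked_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		live_allocs++;
	return p;
}

void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Link an entry; fails with -EEXIST if the offset is already present. */
int link_entry(struct fs_table *t, struct fs_entry *e)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->slots[i]->offset == e->offset)
			return -EEXIST;
	if (t->nr == MAX_ENTRIES)
		return -ENOSPC;
	t->slots[t->nr++] = e;
	return 0;
}

/* Allocate an entry plus its bitmap and link it; clean up both on failure. */
int load_bitmap_entry(struct fs_table *t, uint64_t offset)
{
	struct fs_entry *e = tracked_alloc(sizeof(*e));
	int ret;

	assert(t);
	if (!e)
		return -ENOMEM;
	e->offset = offset;
	e->bitmap = tracked_alloc(64);
	if (!e->bitmap) {
		tracked_free(e);
		return -ENOMEM;
	}

	ret = link_entry(t, e);
	if (ret) {
		tracked_free(e->bitmap);	/* without this the bitmap leaks */
		tracked_free(e);
	}
	return ret;
}
```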
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() · a3948437
      Qu Wenruo authored
      
      Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(), as
      we expected all involved folios to be still locked, thus no folio should be
      missing.
      
      However for extent_range_clear_dirty_for_io() itself, we can skip the
      missing folio and handle the remaining ones, and return an error if
      there is anything wrong.
      
      Remove the BUG_ON() and let the caller handle the error.
      In the caller we do not have a quick way to clean up after the error, but
      all the compression routines would handle the missing folio as an error
      and properly error out, so we only need to do an ASSERT() for developers,
      while for non-debug builds the compression routines would handle the
      error correctly.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move extent_range_clear_dirty_for_io() into inode.c · af61081f
      Qu Wenruo authored
      
      The function is only used inside inode.c by compress_file_range(),
      so move it to inode.c and unexport it.
      
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: enhance compression error messages · be9438f0
      David Sterba authored
      
      Add more verbose and specific messages to all main error points in
      compression code for all algorithms. Currently there's no way to know
      which inode is affected or where in the data errors happened.
      
      The messages follow a common format:
      
      - what happened
      - error code if relevant
      - root and inode
      - additional data like offsets or lengths
      
      There's no helper for the messages as they differ in some details and
      that would be cumbersome to generalize to a single function. As all the
      errors are of the "almost never happens" kind, they carry unlikely
      annotations, since compression is a hot path.
      
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix data race when accessing the last_trans field of a root · ca84529a
      Filipe Manana authored
      
      KCSAN complains about a data race when accessing the last_trans field of a
      root:
      
        [  199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
      
        [  199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
        [  199.555210]  btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
        [  199.555999]  start_transaction+0x154/0xcd8 [btrfs]
        [  199.556780]  btrfs_join_transaction+0x44/0x60 [btrfs]
        [  199.557559]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
        [  199.558339]  btrfs_update_time+0x8c/0xb0 [btrfs]
        [  199.559123]  touch_atime+0x16c/0x1e0
        [  199.559151]  pipe_read+0x6a8/0x7d0
        [  199.559179]  vfs_read+0x466/0x498
        [  199.559204]  ksys_read+0x108/0x150
        [  199.559230]  __s390x_sys_read+0x68/0x88
        [  199.559257]  do_syscall+0x1c6/0x210
        [  199.559286]  __do_syscall+0xc8/0xf0
        [  199.559318]  system_call+0x70/0x98
      
        [  199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
        [  199.559464]  record_root_in_trans+0x196/0x228 [btrfs]
        [  199.560236]  btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
        [  199.561097]  start_transaction+0x154/0xcd8 [btrfs]
        [  199.561927]  btrfs_join_transaction+0x44/0x60 [btrfs]
        [  199.562700]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
        [  199.563493]  btrfs_update_time+0x8c/0xb0 [btrfs]
        [  199.564277]  file_update_time+0xb8/0xf0
        [  199.564301]  pipe_write+0x8ac/0xab8
        [  199.564326]  vfs_write+0x33c/0x588
        [  199.564349]  ksys_write+0x108/0x150
        [  199.564372]  __s390x_sys_write+0x68/0x88
        [  199.564397]  do_syscall+0x1c6/0x210
        [  199.564424]  __do_syscall+0xc8/0xf0
        [  199.564452]  system_call+0x70/0x98
      
      This is because we update and read last_trans concurrently without any
      type of synchronization. This should be generally harmless and in the
      worst case it can make us do extra locking (btrfs_record_root_in_trans()),
      trigger some warnings at ctree.c or do extra work during relocation - this
      would probably only happen in case of load or store tearing.
      
      So fix this by always reading and updating the field using READ_ONCE()
      and WRITE_ONCE(). This silences KCSAN and prevents load and store tearing.
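
      As a rough user-space approximation of the pattern (not the kernel
      macros, which do more), the sketch below uses volatile accesses so the
      compiler emits exactly one access per use instead of splitting or
      re-reading it. READ_ONCE_SKETCH(), WRITE_ONCE_SKETCH(), struct
      fake_root and record_root() are hypothetical names, and __typeof__ is a
      GCC/Clang extension.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins: force a single volatile load or store. */
#define READ_ONCE_SKETCH(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical root with a last_trans field that is read locklessly. */
struct fake_root {
	uint64_t last_trans;
};

/*
 * Record that the root was used in transaction transid.
 * Returns false on the lockless fast path (already recorded), true if the
 * field had to be updated.
 */
bool record_root(struct fake_root *root, uint64_t transid)
{
	assert(root);
	if (READ_ONCE_SKETCH(root->last_trans) == transid)
		return false;
	WRITE_ONCE_SKETCH(root->last_trans, transid);
	return true;
}
```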
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array() · 0fbf6cbd
      Qu Wenruo authored
      
      There is only one caller utilizing the @extra_gfp parameter,
      alloc_eb_folio_array().  And in that case the extra_gfp is only assigned
      to __GFP_NOFAIL.
      
      Rename the @extra_gfp parameter to @nofail to indicate that.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array() · fea91134
      Qu Wenruo authored
      
      The function btrfs_alloc_folio_array() is only utilized in
      btrfs_submit_compressed_read() and no other location, and the only
      caller is not utilizing the @extra_gfp parameter.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce new "rescue=ignoresuperflags" mount option · 32e62165
      Qu Wenruo authored
      
      This new mount option allows the kernel to skip the super flags check.
      It is mostly meant to allow the kernel to do a rescue mount of an
      interrupted checksum conversion.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce new "rescue=ignoremetacsums" mount option · 169aaaf2
      Qu Wenruo authored
      
      Introduce "rescue=ignoremetacsums" to ignore metadata csums, all the
      other metadata sanity checks are still kept as is.
      
      This new mount option is mostly to allow the kernel to mount an
      interrupted checksum conversion (at the metadata csum overwrite stage).
      
      And since the main part of metadata sanity checks is inside
      tree-checker, we shouldn't lose much safety. As the new mount option is
      a rescue mount option, it requires a full read-only mount.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: output the unrecognized super block flags as hex · cf31b271
      Qu Wenruo authored
      
      Most of the extra super block flags are beyond 32 bits (from
      CHANGING_FSID_V2 to CHANGING_*_CSUMS), thus output using %llu is not
      only too long but also pretty hard to read. Output them as hex instead.
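
      A short sketch of the difference in readability, assuming a
      hypothetical flag value with only bit 36 set; format_flags_dec() and
      format_flags_hex() are illustrative helpers, not kernel functions. In
      decimal the value is an opaque run of digits, while in hex the single
      set bit is visible at a glance.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Format the flags as an unsigned decimal number. */
int format_flags_dec(char *buf, size_t len, uint64_t flags)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "%" PRIu64, flags);
}

/* Format the flags as hex with a 0x prefix. */
int format_flags_hex(char *buf, size_t len, uint64_t flags)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "0x%" PRIx64, flags);
}
```

      For the value 1ULL << 36, the decimal form is 68719476736 and the hex
      form is 0x1000000000.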
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused Opt enums · 14114c98
      Qu Wenruo authored
      
      The following three Opt_* enums haven't been utilized since the port to
      new mount API:
      
      - Opt_ignorebadroots
      - Opt_ignoredatacsums
      - Opt_rescue_all
      
      All those enums are from the old days when we had dedicated mount
      options. Nowadays they have been moved to the "rescue=" mount option
      group, and there are no more global tokens for them.
      
      So we can safely remove them now.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check · 5fc070a9
      Qu Wenruo authored
      
      This is to ensure that non-compressed file extents (both regular and
      prealloc) have matching ram_bytes and disk_num_bytes.
      
      This only applies to the CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT case.
      Furthermore, this will not return an error, just emit a kernel warning to
      inform developers.
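
      The rule being checked can be written down as a small predicate. This
      is a sketch with a hypothetical struct fake_file_extent and a made-up
      FAKE_COMPRESS_NONE constant, not the tree-checker code; in line with
      the description above, a caller would only warn on a mismatch rather
      than fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_COMPRESS_NONE 0	/* hypothetical "not compressed" marker */

/* Hypothetical subset of a file extent item. */
struct fake_file_extent {
	uint8_t compression;
	uint64_t ram_bytes;
	uint64_t disk_num_bytes;
};

/*
 * Return true if a non-compressed extent has ram_bytes that differ from
 * disk_num_bytes. Compressed extents are exempt, since their in-memory
 * size legitimately differs from their on-disk size.
 */
bool ram_bytes_mismatch(const struct fake_file_extent *fe)
{
	assert(fe);
	return fe->compression == FAKE_COMPRESS_NONE &&
	       fe->ram_bytes != fe->disk_num_bytes;
}
```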
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix the ram_bytes assignment for truncated ordered extents · 896c8b92
      Qu Wenruo authored
      
      [HICCUP]
      After adding extra checks on btrfs_file_extent_item::ram_bytes to
      tree-checker, running fsstress leads to tree-checker warnings at write
      time, as we create file extent items with an invalid ram_bytes.
      
      All those offending file extents have offset 0, and ram_bytes matching
      num_bytes, and smaller than disk_num_bytes.
      
      This would also trigger the recently enhanced btrfs-check, which catches
      such mismatches and reports them as minor errors.
      
      [CAUSE]
      When a folio/page is invalidated and it is part of a submitted OE, we
      mark the OE truncated just to the beginning of the folio/page.
      
      And for a truncated OE, we insert the file extent item with an incorrect
      value for ram_bytes (using num_bytes instead of the usual value).
      
      This is not a big deal for end users, as we do not utilize the ram_bytes
      field for regular non-compressed extents.
      This mismatch is just a small violation of the on-disk format.
      
      [FIX]
      Fix it by removing the override on btrfs_file_extent_item::ram_bytes.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make validate_extent_map() catch ram_bytes mismatch · 1b87d26a
      Qu Wenruo authored
      
      Previously validate_extent_map() only caught bugs related to
      extent_map member cleanups.
      
      But with the recent btrfs-check enhancement to catch ram_bytes mismatches
      with disk_num_bytes, it would be much better to catch such extent maps
      earlier.
      
      So this patch adds extra ram_bytes validation for extent maps.
      
      Please note that older filesystems with such a mismatch won't trigger
      this error:
      
      - extent_map::ram_bytes is already fixed
        The previous patch has already fixed the ram_bytes for affected file
        extents.
      
      So this enhanced sanity check should not affect end users.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes · 88e2e6d7
      Qu Wenruo authored
      
      [HICCUP]
      Kernels can create file extent items with incorrect ram_bytes like this:
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
      		generation 7 type 1 (regular)
      		extent data disk byte 13631488 nr 32768
      		extent data offset 0 nr 4096 ram 4096
      		extent compression 0 (none)
      
      Thankfully the kernel can handle them properly, as in that case
      ram_bytes is not utilized at all.
      
      [ENHANCEMENT]
      Since the hiccup is not going to cause any data loss and is only a minor
      violation of the on-disk format, here we only need to ignore the
      incorrect ram_bytes value, and use the correct one from
      btrfs_file_extent_item::disk_num_bytes.
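      The idea can be sketched as follows, using the numbers from the dump
      above (disk nr 32768, ram 4096, no compression). This is an
      illustration only: struct example_file_extent and
      example_effective_ram_bytes() are hypothetical names, and compression
      value 0 is assumed to mean "none".

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the file extent item fields. */
struct example_file_extent {
	uint8_t compression;     /* assumed: 0 means no compression */
	uint64_t disk_num_bytes; /* size of the extent on disk */
	uint64_t ram_bytes;      /* value stored on disk, possibly incorrect */
};

/*
 * Pick the ram_bytes value to use in memory. For non-compressed extents
 * the on-disk ram_bytes is ignored and disk_num_bytes is used instead;
 * compressed extents keep their stored ram_bytes.
 */
static uint64_t example_effective_ram_bytes(const struct example_file_extent *fi)
{
	if (fi->compression == 0)
		return fi->disk_num_bytes;
	return fi->ram_bytes;
}
```

      With the item shown above, the stored value 4096 is replaced by 32768,
      so the in-memory view is consistent even though the on-disk item is not.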
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map() · 0edeb6ea
      Qu Wenruo authored
      
      [HICCUP]
      Before commit 85de2be7129c ("btrfs: remove extent_map::block_start
      member"), we utilized @bytenr variable inside
      btrfs_extent_item_to_extent_map() to calculate block_start.
      
      But since that commit removed block_start completely, we have no need
      to advance @bytenr at all.
      
      [ENHANCEMENT]
      - Rename @bytenr as @disk_bytenr
      - Only declare @disk_bytenr inside the if branch
      - Make @disk_bytenr const and remove the modification on it
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typo in error message in btrfs_validate_super() · 0102ab54
      Mark Harmstone authored
      
      There's a typo in an error message when checking the block group tree
      feature: it mentions fres-space-tree instead of free-space-tree. Fix
      that.
      
      Signed-off-by: Mark Harmstone <maharmstone@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the direct IO code into its own file · 9aa29a20
      Filipe Manana authored
      
      The direct IO code is over a thousand lines and it's currently spread
      between file.c and inode.c, which makes some parts of it hard to
      locate. Also inode.c is about 11 thousand lines and file.c about
      4 thousand lines, both too big. So move all the direct IO code into a
      dedicated file, so that it's easy to locate all its code and reduce the
      sizes of inode.c and file.c.
      
      This is a pure move of code without any other changes, except exporting
      a couple of functions from inode.c (get_extent_allocation_hint() and
      create_io_em()) because they are used in both inode.c and the new
      direct-io.c file, and a couple of functions from file.c
      (btrfs_buffered_write() and btrfs_write_check()) because they are used
      both in file.c and in the new direct-io.c file.
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass a btrfs_inode to btrfs_set_prop() · 0d9b7e16
      David Sterba authored
      
      Pass a struct btrfs_inode to btrfs_set_prop() as it's an
      internal interface, which allows removing some uses of BTRFS_I.
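      The general pattern behind this kind of cleanup can be sketched as
      below. All names here (struct example_vfs_inode, struct
      example_fs_inode, EXAMPLE_I() and the two getters) are hypothetical
      stand-ins, not the btrfs definitions; the sketch assumes the private
      inode embeds the generic one and is recovered from it by pointer
      arithmetic.

```c
#include <stddef.h>

/* Hypothetical stand-in for the generic VFS inode. */
struct example_vfs_inode {
	unsigned long i_ino;
};

/* Hypothetical filesystem-private inode embedding the generic one. */
struct example_fs_inode {
	unsigned long long flags;
	struct example_vfs_inode vfs_inode;
};

/* Recover the private inode from a pointer to its embedded generic inode. */
static struct example_fs_inode *EXAMPLE_I(struct example_vfs_inode *inode)
{
	return (struct example_fs_inode *)
		((char *)inode - offsetof(struct example_fs_inode, vfs_inode));
}

/* Before: an internal helper takes the generic inode and converts it. */
static unsigned long long example_get_flags_old(struct example_vfs_inode *inode)
{
	return EXAMPLE_I(inode)->flags;
}

/* After: the helper takes the private inode, so no conversion is needed. */
static unsigned long long example_get_flags_new(const struct example_fs_inode *inode)
{
	return inode->flags;
}
```

      Callers that already hold the private inode can then call the helper
      directly instead of converting back and forth.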
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass a btrfs_inode to btrfs_compress_heuristic() · e2877c2a
      David Sterba authored
      
      Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an
      internal interface, which allows removing some uses of BTRFS_I.
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode · a1f4e3d7
      David Sterba authored
      
      The structure is internal so we should use struct btrfs_inode for that,
      which allows removing some uses of BTRFS_I.
      
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>