  1. Jul 26, 2024
  2. Jul 13, 2024
  3. Jul 11, 2024
    • erofs: avoid refcounting short-lived pages · 1001042e
      Gao Xiang authored
      
      LZ4 always reuses the decompressed buffer as its LZ77 sliding window
      (dynamic dictionary) for optimal performance.  However, in specific
      cases, the output buffer may not fully contain valid page cache pages,
      resulting in the use of short-lived pages for temporary purposes.
      
      Due to the limited sliding window size, LZ4 shortlived bounce pages can
      also be reused in a sliding manner, so each bounce page can be vmapped
      multiple times in different relative positions by design.  In order to
      avoid double frees, reuse counts are currently recorded via the page
      refcount, but that field can no longer be used as-is in the future
      world of Memdescs.
      
      Just maintain a lookup table to check if a shortlived page is reused.
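
As a rough user-space illustration of that idea (not the erofs code itself), the sketch below releases an array of page pointers in which the same bounce page may be listed several times. A small lookup table of pointers already seen takes the place of a per-page reference count. The names `seen_before` and `release_pages_once` are hypothetical, and `malloc`/`free` stand in for page allocation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Return true if 'page' already appears in table[0..n). */
static bool seen_before(void *const *table, size_t n, const void *page)
{
	for (size_t i = 0; i < n; i++)
		if (table[i] == page)
			return true;
	return false;
}

/*
 * Free every distinct pointer in pages[0..count) exactly once, even if a
 * pointer is listed several times. Returns the number of pages freed,
 * or -1 if the lookup table cannot be allocated.
 */
static int release_pages_once(void **pages, size_t count)
{
	void **table = calloc(count ? count : 1, sizeof(*table));
	size_t used = 0;

	if (!table)
		return -1;
	for (size_t i = 0; i < count; i++) {
		if (!pages[i] || seen_before(table, used, pages[i]))
			continue;	/* empty slot or reused page: skip */
		table[used++] = pages[i];
	}
	/* Free only after deduplication so no freed pointer is compared. */
	for (size_t i = 0; i < used; i++)
		free(table[i]);
	free(table);
	return (int)used;
}
```

A linear scan is enough here because the number of bounce pages in flight is bounded by the sliding window size.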
      
      Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20240711053659.1364989-1-hsiangkao@linux.alibaba.com
  4. Jul 10, 2024
  5. Jul 09, 2024
  6. Jul 08, 2024
  7. Jul 03, 2024
    • btrfs: fix folio refcount in __alloc_dummy_extent_buffer() · a56c85fa
      Boris Burkov authored
      Another improper use of __folio_put() in an error path after freshly
      allocating pages/folios which returns them with the refcount initialized
      to 1. The refactor from __free_pages() -> __folio_put() (instead of
      folio_put) removed a refcount decrement found in __free_pages() and
      folio_put but absent from __folio_put().
      
      Fixes: 13df3775 ("btrfs: cleanup metadata page pointer usage")
      CC: stable@vger.kernel.org # 6.8+
      Tested-by: Ed Tomlinson <edtoml@gmail.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix folio refcount in btrfs_do_encoded_write() · da0386c1
      Boris Burkov authored
      The conversion to folios switched __free_page() to __folio_put() in the
      error path in btrfs_do_encoded_write().
      
      However, this gets the page refcounting wrong. If we do hit that error
      path (I reproduced by modifying btrfs_do_encoded_write to pretend to
      always fail in a way that jumps to out_folios and running the fstests
      case btrfs/281), then we always hit the following BUG freeing the folio:
      
        BUG: Bad page state in process btrfs  pfn:40ab0b
        page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
         flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
        raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
        raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
        page dumped because: nonzero _refcount
        Call Trace:
        <TASK>
        dump_stack_lvl+0x3d/0xe0
        bad_page+0xea/0xf0
        free_unref_page+0x8e1/0x900
        ? __mem_cgroup_uncharge+0x69/0x90
        __folio_put+0xe6/0x190
        btrfs_do_encoded_write+0x445/0x780
        ? current_time+0x25/0xd0
        btrfs_do_write_iter+0x2cc/0x4b0
        btrfs_ioctl_encoded_write+0x2b6/0x340
      
      It turns out __free_page() decreases the page reference count while
      __folio_put() does not. Switch __folio_put() to folio_put() which
      decreases the folio reference count first.
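
The difference between the two calls can be shown with a toy user-space model. This is only a sketch of the refcount rule described above, not the kernel's implementation: `struct model_folio`, `model_free_unref()` and `model_folio_put()` are hypothetical names, and the "bad page state" report is reduced to a `false` return value.

```c
#include <stdbool.h>

/* Toy stand-in for a folio: only the reference count is modelled. */
struct model_folio {
	int refcount;
	bool freed;
};

/*
 * Models the low-level free step: it does not drop a reference and
 * expects the count to be zero already. Returns false ("bad page
 * state") if a reference is still held.
 */
static bool model_free_unref(struct model_folio *f)
{
	if (f->refcount != 0)
		return false;
	f->freed = true;
	return true;
}

/* Models a put: drop one reference, free only when it was the last. */
static bool model_folio_put(struct model_folio *f)
{
	if (--f->refcount == 0)
		return model_free_unref(f);
	return true;
}
```

A freshly allocated folio starts with a count of 1, so calling the free step directly fails the check, while the put path drops the count to zero first and then frees.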
      
      Fixes: 400b172b ("btrfs: compression: migrate compression/decompression paths to folios")
      Tested-by: Ed Tomlinson <edtoml@gmail.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • nilfs2: fix incorrect inode allocation from reserved inodes · 93aef9ed
      Ryusuke Konishi authored
      If the bitmap block that manages the inode allocation status is corrupted,
      nilfs_ifile_create_inode() may allocate a new inode from the reserved
      inode area where it should not be allocated.
      
      The previous fix, commit d325dc6e ("nilfs2: fix use-after-free bug of
      struct nilfs_root"), addressed the case where reserved inodes with inode
      numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
      bitmap corruption.  However, the start number of non-reserved inodes is
      read from the super block and may differ from that default, in which case
      inode allocation may still occur from the extended reserved inode area.
      
      If that happens, access to that inode will cause an IO error, causing the
      file system to degrade to an error state.
      
      Fix this potential issue by adding a wraparound option to the common
      metadata object allocation routine and by modifying
      nilfs_ifile_create_inode() to disable the option so that it only allocates
      inodes with inode numbers greater than or equal to the inode number read
      in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
      inodes.
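
The effect of such a wraparound option can be sketched in user space as follows. This is an illustration under simplifying assumptions, not the nilfs2 allocator: the bitmap is a plain `bool` array and `find_free_slot()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Find the first free slot (false entry) in used[0..n), starting at
 * 'start'. With 'wrap' set, the search continues from slot 0 up to
 * 'start' when nothing is free at or after 'start'. Returns the slot
 * index, or -1 if no slot may be allocated.
 */
static long find_free_slot(const bool *used, size_t n, size_t start, bool wrap)
{
	for (size_t i = start; i < n; i++)
		if (!used[i])
			return (long)i;
	if (wrap)
		for (size_t i = 0; i < start && i < n; i++)
			if (!used[i])
				return (long)i;
	return -1;
}
```

With wraparound disabled and `start` set to the first non-reserved number, a corrupted bitmap that marks a reserved slot as free can no longer cause that slot to be handed out.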
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: add missing check for inode numbers on directory entries · bb76c6c2
      Ryusuke Konishi authored
      Syzbot reported that mounting and unmounting a specific pattern of
      corrupted nilfs2 filesystem images causes a use-after-free of metadata
      file inodes, which triggers a kernel bug in lru_add_fn().
      
      As Jan Kara pointed out, this is because the link count of a metadata file
      gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
      tries to delete that inode (ifile inode in this case).
      
      The inconsistency occurs because directories containing the inode numbers
      of these metadata files that should not be visible in the namespace are
      read without checking.
      
      Fix this issue by treating the inode numbers of these internal files as
      errors in the sanity check helper when reading directory folios/pages.
      
      Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
      analysis.
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: <syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
      Reported-by: Jan Kara <jack@suse.cz>
      Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix inode number range checks · e2fec219
      Ryusuke Konishi authored
      Patch series "nilfs2: fix potential issues related to reserved inodes".
      
      This series fixes one use-after-free issue reported by syzbot, caused by
      nilfs2's internal inode being exposed in the namespace on a corrupted
      filesystem, and a couple of flaws that cause problems if the starting
      number of non-reserved inodes written in the on-disk super block is
      intentionally (or corruptly) changed from its default value.

      This patch (of 3):
      
      In the current implementation of nilfs2, "nilfs->ns_first_ino", which
      gives the first non-reserved inode number, is read from the superblock,
      but its lower limit is not checked.
      
      As a result, if a number that overlaps with the inode number range of
      reserved inodes such as the root directory or metadata files is set in the
      super block parameter, the inode number test macros (NILFS_MDT_INODE and
      NILFS_VALID_INODE) will not function properly.
      
      In addition, these test macros use left bit-shift calculations with
      the inode number as the shift count via the BIT macro, but the result of a
      shift calculation that exceeds the bit width of an integer is undefined in
      the C specification, so if "ns_first_ino" is set to a large value other
      than the default value NILFS_USER_INO (=11), the macros may potentially
      malfunction depending on the environment.
      
      Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
      by preventing bit shifts equal to or greater than the NILFS_USER_INO
      constant in the inode number test macros.
      
      Also, change the type of "ns_first_ino" from signed integer to unsigned
      integer to avoid the need for type casting in comparisons such as the
      lower bound check introduced this time.
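
A minimal sketch of the two guards, written as user-space C rather than the actual nilfs2 macros: `EX_USER_INO`, `ex_ino_in_mask()` and `ex_first_ino_ok()` are hypothetical names. Only the value 11 is taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_USER_INO	11u	/* default first non-reserved inode number */

/*
 * Test whether 'ino' is one of the reserved inodes selected by 'mask'
 * (bit n set means inode number n is in the set). The range check comes
 * first, so the shift count is always below EX_USER_INO and therefore
 * well inside the width of 'unsigned int'.
 */
static bool ex_ino_in_mask(uint64_t ino, unsigned int mask)
{
	return ino < EX_USER_INO && (mask & (1u << ino)) != 0;
}

/* Reject a first non-reserved inode number below the reserved range. */
static bool ex_first_ino_ok(uint64_t first_ino)
{
	return first_ino >= EX_USER_INO;
}
```

Because `&&` short-circuits, the shift is never evaluated for an out-of-range inode number, which is what keeps the expression free of undefined behaviour.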
      
      Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com
      Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. Jul 02, 2024
    • cifs: Fix read-performance regression by dropping readahead expansion · 08f70c0a
      David Howells authored
      cifs_expand_readahead() is causing a performance regression of around 30% by
      causing extra pagecache to be allocated for an inode in the readahead path
      before we begin actually dispatching RPC requests, thereby delaying the
      actual I/O.  The expansion is sized according to the rsize parameter, which
      seems to be 4MiB on my test system; this is a big step up from the first
      requests made by the fio test program.
      
      Simple repro (look at read bandwidth number):
           fio --name=writetest --filename=/xfstest.test/foo --time_based --runtime=60 --size=16M --numjobs=1 --rw=read
      
      Fix this by removing cifs_expand_readahead().  Readahead expansion is
      mostly useful for when we're using the local cache if the local cache has a
      block size greater than PAGE_SIZE, so we can dispense with it when not
      caching.
      
      Fixes: 69c3c023 ("cifs: Implement netfslib hooks")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      cc: linux-fsdevel@vger.kernel.org
      cc: linux-mm@kvack.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
    • fs: better handle deep ancestor chains in is_subdir() · 391b59b0
      Christian Brauner authored
      
      Jan reported that 'cd ..' may take a long time in deep directory
      hierarchies under a bind-mount. If concurrent renames happen it is
      possible to livelock in is_subdir() because it will keep retrying.
      
      Change is_subdir() from simply retrying over and over to retrying once
      and then acquiring the rename lock to handle deep ancestor chains better.
      The list of alternatives to this approach was less than pleasant. Change
      the scope of the rcu lock to cover the whole walk while at it.
      
      A big thanks to Jan and Linus. Both Jan and Linus had proposed
      effectively the same thing just that one version ended up being slightly
      more elegant.
      
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • filelock: Remove locks reliably when fcntl/close race is detected · 3cad1bc0
      Jann Horn authored
      When fcntl_setlk() races with close(), it removes the created lock with
      do_lock_file_wait().
      However, LSMs can allow the first do_lock_file_wait() that created the lock
      while denying the second do_lock_file_wait() that tries to remove the lock.
      In theory (but AFAIK not in practice), posix_lock_file() could also fail to
      remove a lock due to GFP_KERNEL allocation failure (when splitting a range
      in the middle).
      
      After the bug has been triggered, use-after-free reads will occur in
      lock_get_status() when userspace reads /proc/locks. This can likely be used
      to read arbitrary kernel memory, but can't corrupt kernel memory.
      This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in
      enforcing mode and only works from some security contexts.
      
      Fix it by calling locks_remove_posix() instead, which is designed to
      reliably get rid of POSIX locks associated with the given file and
      files_struct and is also used by filp_flush().
      
      Fixes: c293621b ("[PATCH] stale POSIX lock handling")
      Cc: stable@kernel.org
      Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
      Signed-off-by: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Christian Brauner <brauner@kernel.org>
    • btrfs: fix uninitialized return value in the ref-verify tool · 9da45c88
      Filipe Manana authored
      
      In the ref-verify tool, when processing the inline references of an extent
      item, we may end up returning with an uninitialized return value, because:
      
      1) The 'ret' variable is not initialized if there are no inline extent
         references ('ptr' == 'end' before the while loop starts);
      
      2) If we find an extent owner inline reference we don't initialize 'ret'.
      
      So fix these cases by initializing 'ret' to 0 when declaring the variable
      and setting it to -EINVAL if we find an extent owner inline reference and
      simple quotas are not enabled (as well as printing an error message).
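
The shape of the fix can be illustrated with a small stand-alone function. This is a hedged sketch, not the btrfs ref-verify code: `enum ex_ref_type` and `ex_process_inline_refs()` are hypothetical, and the loop over an array stands in for walking from 'ptr' to 'end'.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum ex_ref_type { EX_REF_DATA, EX_REF_OWNER };

/*
 * Walk 'n' inline references. 'ret' starts at 0, so an empty list
 * returns success instead of an indeterminate value, and the owner
 * reference case sets 'ret' explicitly when it has to fail.
 */
static int ex_process_inline_refs(const enum ex_ref_type *refs, size_t n,
				  bool simple_quota)
{
	int ret = 0;

	for (size_t i = 0; i < n && ret == 0; i++) {
		switch (refs[i]) {
		case EX_REF_OWNER:
			if (!simple_quota)
				ret = -EINVAL;	/* owner ref without simple quotas */
			break;
		case EX_REF_DATA:
			break;		/* nothing to verify in this sketch */
		}
	}
	return ret;
}
```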
      
      Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always do the basic checks for btrfs_qgroup_inherit structure · 724d8042
      Qu Wenruo authored
      [BUG]
      Syzbot reports the following regression detected by KASAN:
      
        BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
        Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171
      
        CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
         print_address_description mm/kasan/report.c:377 [inline]
         print_report+0x169/0x550 mm/kasan/report.c:488
         kasan_report+0x143/0x180 mm/kasan/report.c:601
         btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
         create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854
         create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922
         btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382
         create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875
         btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029
         btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075
         __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340
         btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422
         btrfs_ioctl+0x99e/0xc60
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:907 [inline]
         __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7fcbf1992509
        RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509
        RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003
        RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660
        R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e
         </TASK>
      
      And it also pinned it down to commit b5357cb2 ("btrfs: qgroup: do not
      check qgroup inherit if qgroup is disabled").
      
      [CAUSE]
      That offending commit skips the whole qgroup inherit check if qgroup is
      not enabled.
      
      But that also skips the very basic checks like
      num_ref_copies/num_excl_copies and the structure size checks.
      
      Meaning if a qgroup enable/disable race is happening in the background,
      and we pass a btrfs_qgroup_inherit structure when the qgroup is
      disabled, the check would be completely skipped.

      Then at transaction commit time, qgroup is re-enabled and
      btrfs_qgroup_inherit() uses the unchecked structure, causing the above
      KASAN error.
      
      [FIX]
      Make btrfs_qgroup_check_inherit() only skip the source qgroup checks,
      so that an invalid btrfs_qgroup_inherit structure is still rejected
      no matter if qgroup is enabled or not.
      
      Furthermore we do already have an extra safety inside
      btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources,
      so even if we only skip the qgroup source check we're still safe.
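
The ordering described in [FIX] can be sketched as below. This is an illustration only: `struct ex_inherit` is a simplified, hypothetical layout rather than the real UAPI structure, `EX_MAX_ENTRIES` is an arbitrary cap chosen for the example, and the `sources_exist` flag stands in for the source qgroup lookups.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layout; not the real UAPI structure. */
struct ex_inherit {
	uint64_t num_qgroups;
	uint64_t num_ref_copies;
	uint64_t num_excl_copies;
	uint64_t qgroups[];	/* ids, then (src, dst) pairs for the copies */
};

#define EX_MAX_ENTRIES	4096u	/* arbitrary cap, keeps the math overflow-free */

/*
 * Basic checks that do not depend on whether qgroups are enabled:
 * the counts must be sane and the buffer must cover every array entry.
 */
static bool ex_inherit_basic_ok(const struct ex_inherit *in, size_t size)
{
	uint64_t entries;

	if (size < sizeof(*in))
		return false;
	if (in->num_qgroups > EX_MAX_ENTRIES ||
	    in->num_ref_copies > EX_MAX_ENTRIES ||
	    in->num_excl_copies > EX_MAX_ENTRIES)
		return false;
	entries = in->num_qgroups +
		  2 * (in->num_ref_copies + in->num_excl_copies);
	return size == sizeof(*in) + entries * sizeof(uint64_t);
}

/* The enable state gates only the source checks, never the basic ones. */
static bool ex_check_inherit(const struct ex_inherit *in, size_t size,
			     bool qgroup_enabled, bool sources_exist)
{
	if (!ex_inherit_basic_ok(in, size))
		return false;
	return !qgroup_enabled || sources_exist;
}
```

The point of the ordering is that a structure whose counts do not match its size is rejected before the enabled/disabled state is even consulted, so a later state change cannot expose an unchecked buffer.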
      
      Reported-by: <syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com>
      Fixes: b5357cb2 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled")
      Reviewed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Jeongjun Park <aha310510@gmail.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix calc_available_free_space() for zoned mode · 64d2c847
      Naohiro Aota authored
      calc_available_free_space() returns the total size of metadata (or
      system) block groups, which can be allocated from unallocated disk
      space. The logic is wrong on zoned mode in two places.
      
      First, the calculation of data_chunk_size is wrong. We always allocate
      one zone as one chunk, and never allocate a zone partially. So, we
      should use zone_size (= data_sinfo->chunk_size) as it is.
      
      Second, the result "avail" may not be zone aligned. Since we always
      allocate one zone as one chunk on zoned mode, returning non-zone size
      aligned bytes will result in less pressure on the async metadata reclaim
      process.
      
      This is serious for the nearly full state with a large zone size device.
      Allowing too much over-commit will result in less async reclaim work and
      end up in ENOSPC. We can align down to the zone size to avoid that.
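
The align-down step is plain integer arithmetic. The helper below is a hypothetical user-space sketch (`ex_align_down_to_zone()` is not a kernel function) that uses a modulo so it also works for zone sizes that are not a power of two.

```c
#include <stdint.h>

/* Round 'bytes' down to a multiple of 'zone_size' (must be non-zero). */
static uint64_t ex_align_down_to_zone(uint64_t bytes, uint64_t zone_size)
{
	return bytes - (bytes % zone_size);
}
```

For example, with 256 MiB zones, 700 MiB of nominally available space rounds down to 512 MiB, and anything below 256 MiB rounds down to 0, since no whole zone could be allocated from it.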
      
      Fixes: cb6cbab7 ("btrfs: adjust overcommit logic when very close to full")
      CC: stable@vger.kernel.org # 6.9
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. Jul 01, 2024
    • btrfs: fix adding block group to a reclaim list and the unused list during reclaim · 48f091fd
      Naohiro Aota authored
      
      The retry path in btrfs_reclaim_bgs_work() can add a block group to the
      reclaim list while it is concurrently being added to the unused list.
      Since the block group has been removed from the reclaim list and is
      under relocation work, it can be added to the unused list in parallel.
      When that happens, adding it to the reclaim list corrupts the list head
      and triggers list corruption like below.
      
      Fix it by taking fs_info->unused_bgs_lock.
      
        [177.504][T2585409] BTRFS error (device nullb1): error relocating chunk 2415919104
        [177.514][T2585409] list_del corruption. next->prev should be ff11000344b119c0, but was ff11000377e87c70. (next=ff110002390cd9c0)
        [177.529][T2585409] ------------[ cut here ]------------
        [177.537][T2585409] kernel BUG at lib/list_debug.c:65!
        [177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
        [177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G        W          6.10.0-rc5-kts #1
        [177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
        [177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work [btrfs]
        [177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
        [177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
        [177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX: 0000000000000000
        [177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI: ffe21c006efd0f40
        [177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09: ffe21c006efd0f08
        [177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12: ff110002390cd9c0
        [177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15: dffffc0000000000
        [177.687][T2585409] FS:  0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
        [177.700][T2585409] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4: 0000000000771ef0
        [177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [177.742][T2585409] PKRU: 55555554
        [177.748][T2585409] Call Trace:
        [177.753][T2585409]  <TASK>
        [177.759][T2585409]  ? __die_body.cold+0x19/0x27
        [177.766][T2585409]  ? die+0x2e/0x50
        [177.772][T2585409]  ? do_trap+0x1ea/0x2d0
        [177.779][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.788][T2585409]  ? do_error_trap+0xa3/0x160
        [177.795][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.805][T2585409]  ? handle_invalid_op+0x2c/0x40
        [177.812][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.820][T2585409]  ? exc_invalid_op+0x2d/0x40
        [177.827][T2585409]  ? asm_exc_invalid_op+0x1a/0x20
        [177.834][T2585409]  ? __list_del_entry_valid_or_report.cold+0x70/0x72
        [177.843][T2585409]  btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
      
      There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
      safe, AFAICS. Since the block group was in the unused list, the used bytes
      should be 0 when it was added to the unused list. Then, it checks
      block_group->{used,reserved,pinned} are still 0 under the
      block_group->lock. So, they should be still eligible for the unused list,
      not the reclaim list.
      
      The reason it is safe there is that we're holding
      space_info->groups_sem in write mode.

      That means no other task can allocate from the block group, so while we
      are in btrfs_delete_unused_bgs() it's not possible for other tasks to
      allocate and deallocate extents from the block group, so it can't be
      added to the unused list or the reclaim list by anyone else.
      
      The bug can be reproduced by btrfs/166 after a few rounds. In practice
      this can be hit when relocation cannot find more chunk space and ends
      with ENOSPC.
      
      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
      Fixes: 4eb4e85c ("btrfs: retry block group reclaim without infinite loop")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. Jun 29, 2024
  11. Jun 26, 2024
  12. Jun 25, 2024