  1. Aug 21, 2023
  2. Apr 17, 2023
  3. Mar 15, 2023
  4. Jan 11, 2023
    • btrfs: zoned: enable metadata over-commit for non-ZNS setup · 85e79ec7
      Naohiro Aota authored
      The commit 79417d04 ("btrfs: zoned: disable metadata overcommit for
      zoned") disabled the metadata over-commit to track active zones properly.
      
      However, it also introduced a heavy overhead by allocating new metadata
      block groups and/or flushing dirty buffers to release the space
      reservations. Specifically, the performance of a write-only workload
      (without any sync operations) dropped from 343.77 MB/sec (v5.19) to
      182.89 MB/sec (v6.0).
      
      The performance is still bad on current misc-next which is 187.95 MB/sec.
      And, with this patch applied, it improves back to 326.70 MB/sec (+73.82%).
      
      This patch introduces a new fs_info->flag BTRFS_FS_NO_OVERCOMMIT to
      indicate it needs to disable the metadata over-commit. The flag is enabled
      when a device with max active zones limit is loaded into a file-system.
      
      Fixes: 79417d04 ("btrfs: zoned: disable metadata overcommit for zoned")
      CC: stable@vger.kernel.org # 6.0+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      85e79ec7
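A minimal userspace sketch of the idea described above, not the kernel code: when a "no over-commit" flag is set on the filesystem, unallocated device space is not counted toward what a metadata reservation may use. The struct names, field names, and the `oc_` prefix are hypothetical stand-ins; only the flag's purpose comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fs-wide flag described in the commit. */
enum { OC_FS_NO_OVERCOMMIT = 1u << 0 };

struct oc_fs {
	unsigned int flags;    /* OC_FS_NO_OVERCOMMIT set for devices with a max active zones limit */
	uint64_t unallocated;  /* bytes not yet assigned to any block group */
};

struct oc_space {
	uint64_t total;        /* bytes in already allocated metadata block groups */
	uint64_t used;         /* bytes used or already reserved */
};

/*
 * Return true if a reservation of 'bytes' may proceed.  With over-commit
 * enabled, unallocated space counts as available; with the flag set, only
 * space in existing block groups counts.
 */
static bool oc_can_overcommit(const struct oc_fs *fs,
			      const struct oc_space *si, uint64_t bytes)
{
	uint64_t avail = (fs->flags & OC_FS_NO_OVERCOMMIT) ? 0 : fs->unallocated;

	return si->used + bytes < si->total + avail;
}
```

In this model a non-ZNS zoned device leaves the flag clear, so reservations can again lean on unallocated space instead of forcing new block group allocation or flushing.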
  5. Dec 05, 2022
    • btrfs: simplify percent calculation helpers, rename div_factor · 428c8e03
      David Sterba authored
      
      The div_factor* helpers calculate fraction or percentage fraction. The
      name is a bit confusing, we use it only for percentage calculations and
      there are two helpers.
      
      There's a helper mult_frac that's for general fractions, that tries to
      be accurate but we multiply and divide by small numbers so we can use
      the div_u64 helper.
      
      Rename the div_factor* helpers and use 1..100 percentage range, also drop
      the case checking for percentage == 100, it's never hit.
      
      The conversions:
      
      * div_factor calculates tenths and the numbers need to be adjusted
      * div_factor_fine is direct replacement
      
      Signed-off-by: David Sterba <dsterba@suse.com>
      428c8e03
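The conversion described above can be illustrated with a small userspace sketch. The helper names `percent_of` and `tenths_of` are hypothetical and only model the arithmetic: a helper that works in tenths maps to a percentage helper by multiplying the factor by 10.

```c
#include <stdint.h>

/* Percentage helper: returns num * percent / 100, percent in 1..100. */
static uint64_t percent_of(uint64_t num, uint32_t percent)
{
	return num * percent / 100;
}

/* Old-style helper for comparison: factor is in tenths (1..10). */
static uint64_t tenths_of(uint64_t num, uint32_t factor)
{
	return num * factor / 10;
}
```

A call such as `tenths_of(x, 9)` becomes `percent_of(x, 90)`, while a caller that already passed a percentage needs no adjustment.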
    • btrfs: update function comments · 43dd529a
      David Sterba authored
      
      Update, reformat or reword function comments. This also removes the kdoc
      marker so we don't get reports when the function name is missing.
      
      Changes made:
      
      - remove kdoc markers
      - reformat the brief description to be a proper sentence
      - reword to imperative voice
      - align parameter list
      - fix typos
      
      Signed-off-by: David Sterba <dsterba@suse.com>
      43dd529a
    • btrfs: move extent-tree helpers into their own header file · a0231804
      Josef Bacik authored
      
      Move all the extent tree related prototypes to extent-tree.h out of
      ctree.h, and then go include it everywhere needed so everything
      compiles.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0231804
    • btrfs: move btrfs_account_ro_block_groups_free_space into space-info.c · e2f13b34
      Josef Bacik authored
      
      This was prototyped in ctree.h and the code existed in extent-tree.c,
      but it's space-info related so move it into space-info.c.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e2f13b34
    • btrfs: move accessor helpers into accessors.h · 07e81dc9
      Josef Bacik authored
      
      This is a large patch, but because they're all macros it's impossible to
      split up.  Simply copy all of the item accessors in ctree.h and paste
      them in accessors.h, and then update any files to include the header so
      everything compiles.
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments, style fixups ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      07e81dc9
    • btrfs: move fs wide helpers out of ctree.h · c7f13d42
      Josef Bacik authored
      
      We have several fs wide related helpers in ctree.h.  The bulk of these
      are the incompat flag test helpers, but there are things such as
      btrfs_fs_closing() and the read only helpers that also aren't directly
      related to the ctree code.  Move these into a fs.h header, which will
      serve as the location for file system wide related helpers.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7f13d42
    • btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY · 765c3fe9
      Josef Bacik authored
      
      Inside of FB, as well as some user reports, we've had a consistent
      problem of occasional ENOSPC transaction aborts.  Inside FB we were
      seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
      low occurrence rate given the size of our fleet, but it's not nothing.
      
      There are two causes of this particular problem.
      
      First is delayed allocation.  The reservation system for delalloc
      assumes that contiguous dirty ranges will result in 1 file extent item.
      However if there is memory pressure that results in fragmented writeout,
      or there is fragmentation in the block groups, this won't necessarily be
      true.  Consider the case where we do a single 256MiB write to a file and
      then close it.  We will have 1 reservation for the inode update, the
      reservations for the checksum updates, and 1 reservation for the file
      extent item.  At some point later we decide to write this entire range
      out, but we're so fragmented that we break this into 100 different file
      extents.  Since we've already closed the file and are no longer writing
      to it there's nothing to trigger a refill of the delalloc block rsv to
      satisfy the 99 new file extent reservations we need.  At this point we
      exhaust our delalloc reservation, and we begin to steal from the global
      reserve.  If you have enough of these cases going in parallel you can
      easily exhaust the global reserve, get an ENOSPC at
      btrfs_alloc_tree_block() time, and then abort the transaction.
      
      The other case is the delayed refs reserve.  The delayed refs reserve
      updates its size based on outstanding delayed refs and dirty block
      groups.  However we only refill this block reserve when returning
      excess reservations and when we call btrfs_start_transaction(root, X).
      We will reserve 2*X credits at transaction start time, and fill in X
      into the delayed refs reserve to make sure it stays topped off.
      Generally this works well, but clearly has downsides.  If we do a
      particularly delayed ref heavy operation we may never catch up in our
      reservations.  Additionally running delayed refs generates more delayed
      refs, and at that point we may be committing the transaction and have no
      way to trigger a refill of our delayed refs rsv.  Then a similar thing
      occurs with the delalloc reserve.
      
      Generally speaking we well over-reserve in all of our block rsvs.  If we
      reserve 1 credit we're usually reserving around 264k of space, but we'll
      often not use any of that reservation, or use a few blocks of that
      reservation.  We can be reasonably sure that as long as you were able to
      reserve space up front for your operation you'll be able to find space
      on disk for that reservation.
      
      So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY.  This
      gets used in the case that we've exhausted our reserve and the global
      reserve.  It simply forces a reservation if we have enough actual space
      on disk to make the reservation, which is almost always the case.  This
      keeps us from hitting ENOSPC aborts in these odd occurrences where we've
      not kept up with the delayed work.
      
      Fixing this in a complete way is going to be relatively complicated and
      time consuming.  This patch is what I discussed with Filipe earlier this
      year, and what I put into our kernels inside FB.  With this patch we're
      down to 1-2 ENOSPC aborts per week, which is a significant reduction.
      This is a decent stop gap until we can work out a more holistic
      solution to these two corner cases.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      765c3fe9
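The emergency state described above can be modelled roughly as follows. This is a simplified userspace sketch, not the kernel implementation: the struct, the enum, and the `em_` names are hypothetical. The normal path counts outstanding reservations against the space, while the emergency path only asks whether the bytes would fit given what is actually on disk.

```c
#include <stdbool.h>
#include <stdint.h>

struct em_space {
	uint64_t total;    /* bytes in allocated block groups */
	uint64_t used;     /* bytes really consumed on disk */
	uint64_t may_use;  /* outstanding reservations, mostly over-reserved */
};

enum em_flush { EM_FLUSH_NORMAL, EM_FLUSH_EMERGENCY };

/* Try to reserve 'bytes'; on success the reservation is recorded in may_use. */
static bool em_reserve(struct em_space *si, uint64_t bytes, enum em_flush mode)
{
	/* Normal path: existing reservations count as if they were used. */
	if (si->used + si->may_use + bytes <= si->total) {
		si->may_use += bytes;
		return true;
	}

	/* Emergency path: ignore may_use and check real space only. */
	if (mode == EM_FLUSH_EMERGENCY && si->used + bytes <= si->total) {
		si->may_use += bytes;
		return true;
	}

	return false;
}
```

In the 256MiB delalloc example, one file extent reservation exists but 100 extents are written, so 99 reservations are missing; in this model those would fall through to the emergency path once the regular and global reserves are exhausted.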
  6. Sep 29, 2022
  7. Sep 26, 2022
    • btrfs: remove useless used space increment during space reservation · b0b47a38
      Filipe Manana authored
      At space-info.c:__reserve_bytes(), we increment the 'used' variable, but
      then we don't use the variable anymore, making the increment pointless.
      The increment became useless with commit 2e294c60 ("btrfs: simplify
      the logic in need_preemptive_flushing"), so just remove it.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b0b47a38
    • btrfs: dump all space infos if we abort transaction due to ENOSPC · 8e327b9c
      Qu Wenruo authored
      
      We have hit some transaction aborts due to -ENOSPC internally.

      Normally we should always reserve enough metadata space for every
      transaction, so hitting -ENOSPC really indicates a case we didn't
      expect.

      Unfortunately, the current error reporting only gives a kernel
      warning and a stack trace, which doesn't really help to debug what's
      causing the problem.

      The mount option debug_enospc can only help when the user can reproduce
      the problem, but in most cases such a transaction abort by -ENOSPC is
      really hard to reproduce.
      
      So this patch will dump all space infos (data, metadata, system) when we
      abort the first transaction with -ENOSPC.
      
      This should at least provide some clue to us.
      
      The example of a dump would look like this:
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
        <call trace skipped>
        ---[ end trace 0000000000000000 ]---
        BTRFS info (device dm-1: state A): dumping space info:
        BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
        BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
        BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
        BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
        BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
        BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
        BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
        BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
        BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
        BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
        BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
        BTRFS info (device dm-1: state EA): forced readonly
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8e327b9c
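When reading such a dump, the "has N free" figure in the sample output is consistent with subtracting every accounted counter from the total. A small sketch under that assumption; the struct and function names are hypothetical and simply mirror the field names printed in the dump:

```c
#include <stdint.h>

/* Fields as they appear in a "space_info total=..." dump line. */
struct space_dump {
	uint64_t total, used, pinned, reserved, may_use, readonly, zone_unusable;
};

/* Free bytes implied by one dump line: total minus all accounted bytes. */
static uint64_t space_dump_free(const struct space_dump *d)
{
	return d->total - (d->used + d->pinned + d->reserved +
			   d->may_use + d->readonly + d->zone_unusable);
}
```

Applied to the METADATA line above (total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536), this gives 257114112, matching the reported free value.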
    • btrfs: output human readable space info flag · 25a860c4
      Qu Wenruo authored
      
      For btrfs_space_info, its flags field has only 4 possible values:
      
      - BTRFS_BLOCK_GROUP_SYSTEM
      - BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA
      - BTRFS_BLOCK_GROUP_METADATA
      - BTRFS_BLOCK_GROUP_DATA
      
      Make the output more human readable, now it looks like:
      
        BTRFS info (device dm-1: state A): space_info METADATA has 251494400 free, is not full
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25a860c4
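Because only four flag combinations are possible, the mapping to a readable name can be a plain switch. The sketch below is illustrative: the `BG_*` macros are local stand-ins for the BTRFS_BLOCK_GROUP_* type bits, and the function name and the exact output strings for the mixed and unknown cases are assumptions.

```c
#include <stdint.h>

/* Local stand-ins for the block group type bits. */
#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)

/* Map the four valid space_info flag values to a printable name. */
static const char *space_flag_name(uint64_t flags)
{
	switch (flags) {
	case BG_SYSTEM:
		return "SYSTEM";
	case BG_METADATA | BG_DATA:
		return "DATA+METADATA";
	case BG_DATA:
		return "DATA";
	case BG_METADATA:
		return "METADATA";
	default:
		return "UNKNOWN";
	}
}
```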
    • btrfs: convert block group bit field to use bit helpers · 3349b57f
      Josef Bacik authored
      
      We use a bit field in the btrfs_block_group for different flags, however
      this is awkward because we have to hold the block_group->lock for any
      modification of any of these fields, and it makes the code clunky for a
      few of these flags.  Convert these to a proper flags setup so we can
      utilize the bit helpers.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3349b57f
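The motivation can be shown with a userspace analogue. C bit-field members share a storage unit, so concurrent updates need a lock, whereas a single flags word with atomic bit operations does not. The sketch below uses C11 atomics in place of the kernel's bit helpers; the enum values, struct, and function names are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers for a block group. */
enum bg_flag_bit { BG_FLAG_DEMO_REMOVED, BG_FLAG_DEMO_TO_COPY };

struct bg_model {
	atomic_ulong runtime_flags;  /* one word, one bit per flag */
};

/* Atomically set a flag; no lock needed. */
static void bg_set_bit(struct bg_model *bg, unsigned int bit)
{
	atomic_fetch_or(&bg->runtime_flags, 1UL << bit);
}

/* Atomically clear a flag. */
static void bg_clear_bit(struct bg_model *bg, unsigned int bit)
{
	atomic_fetch_and(&bg->runtime_flags, ~(1UL << bit));
}

/* Read a flag. */
static bool bg_test_bit(struct bg_model *bg, unsigned int bit)
{
	return atomic_load(&bg->runtime_flags) & (1UL << bit);
}

/* Set a flag and report whether it was already set. */
static bool bg_test_and_set_bit(struct bg_model *bg, unsigned int bit)
{
	unsigned long old = atomic_fetch_or(&bg->runtime_flags, 1UL << bit);

	return old & (1UL << bit);
}
```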
    • btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info · 723de71d
      Josef Bacik authored
      
      We previously had the pattern of
      
      	btrfs_update_space_info(all, the, bg, fields, &space_info);
      	link_block_group(bg);
      	bg->space_info = space_info;
      
      Now that we're passing the bg into btrfs_add_bg_to_space_info we can do
      the linking in that function, transforming this to simply
      
      	btrfs_add_bg_to_space_info(fs_info, bg);
      
      and put the link_block_group() and bg->space_info assignment directly in
      btrfs_add_bg_to_space_info.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      723de71d
    • btrfs: simplify arguments of btrfs_update_space_info and rename · 9d4b0a12
      Josef Bacik authored
      
      This function has grown a bunch of new arguments, and it just boils down
      to passing in all the block group fields as arguments.  Simplify this by
      passing in the block group itself and updating the space_info fields
      based on the block group fields directly.
      
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9d4b0a12
  8. Sep 06, 2022
    • btrfs: fix the max chunk size and stripe length calculation · 5da431b7
      Qu Wenruo authored
      [BEHAVIOR CHANGE]
      Since commit f6fca391 ("btrfs: store chunk size in space-info
      struct"), btrfs can no longer create data chunks larger than 1G:
      
        mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
        mount $dev1 $mnt
      
        btrfs balance start --full $mnt
        btrfs balance start --full $mnt
        umount $mnt
      
        btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
      
      Before that offending commit, what we got is a 4G data chunk:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
      		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      Now what we got is only 1G data chunk:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
      		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      This multiplies the number of data chunks by the number of devices,
      which not only increases system chunk usage but also greatly increases
      mount time.
      
      Without a proper reason, we should not change the max chunk size.
      
      [CAUSE]
      Previously, we set max data chunk size to 10G, while max data stripe
      length to 1G.
      
      Commit f6fca391 ("btrfs: store chunk size in space-info struct")
      completely ignored the 10G limit and used the 1G max stripe limit
      instead, causing the above shrink in max data chunk size.
      
      [FIX]
      Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
      we limit stripe_size to 1G manually.
      
      This should only affect data chunks, as for metadata chunks we always
      set the max stripe size the same as max chunk size (256M or 1G
      depending on fs size).
      
      Now the same script produces the same old result:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
      		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Fixes: f6fca391 ("btrfs: store chunk size in space-info struct")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5da431b7
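The arithmetic behind the two dump-tree results can be sketched as below. This is a simplified model that ignores alignment and device free space; the function name and parameters are hypothetical. The per-device stripe is capped first, then shrunk if the resulting chunk would exceed the chunk limit.

```c
#include <stdint.h>

#define DEMO_SZ_1G (1ULL << 30)

/*
 * Length of a striped data chunk given the number of data stripes, the
 * per-stripe size cap and the whole-chunk size cap.
 */
static uint64_t demo_data_chunk_len(unsigned int data_stripes,
				    uint64_t max_stripe, uint64_t max_chunk)
{
	uint64_t stripe = max_stripe;

	/* Shrink the stripe if the full chunk would exceed the chunk cap. */
	if (stripe * data_stripes > max_chunk)
		stripe = max_chunk / data_stripes;

	return stripe * data_stripes;
}
```

With 4 RAID0 devices, a 1G stripe cap and a 10G chunk cap, the model yields 4294967296 bytes, the 4G chunk from the old and fixed behavior. If the chunk cap itself is 1G, as in the regression, the same call yields 1073741824 bytes.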
  9. Jul 25, 2022
  10. May 16, 2022
  11. Mar 14, 2022
  12. Jan 07, 2022
    • btrfs: fix argument list that the kdoc format and script verified · be8d1a2a
      Yang Li authored
      
      The warnings were reported by scripts/kernel-doc, which is
      run when building with 'make W=1'.
      
      fs/btrfs/extent_io.c:3210: warning: Function parameter or member
      'bio_ctrl' not described in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
      description in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter
      'prev_bio_flags' description in 'btrfs_bio_add_page'
      fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
      description in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1602: warning: Function parameter or member
      'fs_info' not described in 'btrfs_reserve_metadata_bytes'
      
      Note: this is fixing only the warnings regarding parameter list, the
      first line is not strictly conforming to the kdoc format as the btrfs
      codebase does not stick to that and keeps the first line more free form
      (because it's only for internal use).
      
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      be8d1a2a
  13. Jan 03, 2022