  May 30, 2024
    • blk-throttle: Fix incorrect display of io.max · 0a751df4
      Waiman Long authored
      Commit bf20ab53 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
      attempts to revert the code change introduced by commit cd5ab1b0
      ("blk-throttle: add .low interface").  However, it leaves behind the
      bps_conf[] and iops_conf[] fields in the throtl_grp structure which
      aren't set anywhere in the new blk-throttle.c code but are still being
      used by tg_prfill_limit() to display the limits in io.max. Now io.max
      always displays the following values if a block queue is used:
      
      	<m>:<n> rbps=0 wbps=0 riops=0 wiops=0
      
      Fix this problem by removing bps_conf[] and iops_conf[] and using bps[]
      and iops[] instead to complete the revert.
      
      Fixes: bf20ab53 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
      Reported-by: Justin Forbes <jforbes@redhat.com>
      Closes: https://github.com/containers/podman/issues/22701#issuecomment-2120627789
      
      
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Lin...
    • block: Fix zone write plugging handling of devices with a runt zone · 29459c3e
      Damien Le Moal authored
      A zoned device may have a last sequential write required zone that is
      smaller than other zones. However, all tests to check if a zone write
      plug write offset exceeds the zone capacity use the same capacity
      value stored in the gendisk zone_capacity field. This is incorrect for a
      zoned device with a last runt (smaller) zone.
      
      Add the new field last_zone_capacity to struct gendisk to store the
      capacity of the last zone of the device. blk_revalidate_seq_zone() and
      blk_revalidate_conv_zone() are both modified to get this value when
      disk_zone_is_last() returns true. Similarly to zone_capacity, the value
      is first stored using the last_zone_capacity field of struct
      blk_revalidate_zone_args. Once zone revalidation of all zones is done,
      this is used to set the gendisk last_zone_capacity field.
      
      The checks to determine if a zone is full or if a sector offset in a
      zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
      disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
      and blk_zone_wplug_prepare_bio() are modified to use the new helper
      functions disk_zone_is_full() and disk_zone_wplug_is_full().
      disk_zone_is_full() uses the zone index to determine if the zone being
      tested is the last one of the disk and uses either the disk
      zone_capacity or last_zone_capacity accordingly.
      
      Fixes: dd291d77 ("block: Introduce zone write plugging")
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Niklas Cassel <cassel@kernel.org>
      Link: https://lore.kernel.org/r/20240530054035.491497-4-dlemoal@kernel.org
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Fix validation of zoned device with a runt zone · cd639993
      Damien Le Moal authored
      Commit ecfe43b1 ("block: Remember zone capacity when revalidating
      zones") introduced checks to ensure that the capacity of the zones of
      a zoned device is constant for all zones. However, this check ignores
      the possibility that a zoned device has a smaller last zone with a size
      not equal to the capacity of other zones. Such devices correspond in
      practice to SMR drives with a smaller last zone, where all zones have a
      capacity equal to the zone size, leading to the last zone's capacity
      being different from the capacity of the other zones.
      
      Correctly handle such devices by fixing the check for the constant zone
      capacity in blk_revalidate_seq_zone() using the new helper function
      disk_zone_is_last(). This helper function is also used in
      blk_revalidate_zone_cb() when checking the zone size.
      
      Fixes: ecfe43b1 ("block: Remember zone capacity when revalidating zones")
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Niklas Cassel <cassel@kernel.org>
      Link: https://lore.kernel.org/r/20240530054035.491497-3-dlemoal@kernel.org
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  May 09, 2024
    • blk-throttle: delay initialization until configuration · a3166c51
      Yu Kuai authored
      
      Other cgroup policies like bfq and iocost are lazily initialized when
      they are first configured for the device, but blk-throttle is
      initialized unconditionally from blkcg_init_disk().
      
      Delay initialization of blk-throttle as well, to save some cpu and
      memory overhead if it's not configured.
      
      Note that once it is initialized, it can't be destroyed until disk
      removal, even if it is disabled.
      
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240509121107.3195568-3-yukuai1@huaweicloud.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW · bf20ab53
      Yu Kuai authored
      
      On the one hand, it has been marked EXPERIMENTAL since 2017, and there
      appear to be no users, no testers, and no developers since then; it is
      simply not active at all.
      
      On the other hand, even if the config is disabled, there are still many
      fields in throtl_grp and throtl_data and many functions that are only
      used for throtl low.
      
      Finally, blk-throtl is currently initialized during disk initialization
      and destroyed during disk removal, and it exposes many functions that
      are called directly from the block layer.
      
      Remove throtl low to make the code much cleaner and follow-up work much
      easier.
      
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix that util can be greater than 100% · 7be83569
      Yu Kuai authored
      
      util is the percentage of time that the disk has IO in flight, and
      theoretically it should not be greater than 100%. However, there is a
      gap for rq-based disks:
      
      io_ticks is updated when the rq is allocated; however, until such an rq
      is dispatched to the driver, it is not accounted as inflight from
      blk_mq_start_request(), hence diskstats_show()/part_stat_show() will
      not update io_ticks. For example:
      
      1) at t0, a new IO is issued, the rq is allocated, and
      blk_account_io_start() updates io_ticks;
      
      2) something goes wrong in the driver, and the rq can't be dispatched;
      
      3) at t0 + 10s, the driver recovers, and the rq is dispatched and
      completed; io_ticks is updated;
      
      Then, if the user is monitoring "util" with "iostat 1", between t0 and
      t0+9s util will be zero, and between t0+9s and t0+10s util will be 1000%.
      
      Fix this problem by updating io_ticks from diskstats_show() and
      part_stat_show() if there are rqs allocated.
      
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: support to account io_ticks precisely · 99dc4223
      Yu Kuai authored
      Currently, io_ticks is accounted based on sampling: update_io_ticks()
      always accounts io_ticks by 1 jiffy from
      bdev_start_io_acct()/blk_account_io_start(), and the result can be
      inaccurate. For example (HZ is 250):
      
      Test script:
      fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
      
      Test result: util is about 90%, while the disk is really idle.
      
      This behaviour was introduced by commit 5b18b5a7 ("block: delete
      part_round_stats and switch to less precise counting"); however, a key
      point was missed: that patch also improved performance a lot:
      
      Before the commit:
      part_round_stats:
        if (part->stamp != now)
         stats |= 1;
      
        part_in_flight()
        -> there can be lots of task here in 1 jiffies.
        part_round_stats_single()
         __part_stat_add()
        part->stamp = now;
      
      After the commit:
      update_io_ticks:
        stamp = part->bd_stamp;
        if (time_after(now, stamp))
         if (try_cmpxchg())
          __part_stat_add()
          -> only one task can reach here in 1 jiffies.
      
      Hence, in order to account io_ticks precisely, we only need to know
      whether there is IO inflight, at most once per jiffy. Note that for
      rq-based devices, iterating tags should not be used here because
      'tags->lock' is grabbed in blk_mq_find_and_get_req(); hence
      part_stat_lock_inc/dec() and part_in_flight() are used to trace
      inflight. The additional overhead is quite small:
      
       - per cpu add/dec for each IO for rq-based device;
       - per cpu sum for each jiffies;
      
      And it is verified with null-blk that there is no performance
      degradation under heavy IO pressure.
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add plug while submitting IO · 060406c6
      Yu Kuai authored
      
      So that if the caller didn't use a plug, for example
      __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(), the block
      layer can still benefit from caching the nsec time in the plug.
      
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com
      
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>