- Jun 12, 2024
-
-
Anuj Gupta authored
The user mapped intergity is copied back and unpinned by bio_integrity_free which is a low-level routine. Do it via the submitter rather than doing it in the low-level block layer code, to split the submitter side from the consumer side of the bio. Signed-off-by:
Anuj Gupta <anuj20.g@samsung.com> Signed-off-by:
Kanchan Joshi <joshi.k@samsung.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240610111144.14647-1-anuj20.g@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Chengming Zhou authored
Friedrich Weber reported a kernel crash problem and bisected to commit 81ada09c ("blk-flush: reuse rq queuelist in flush state machine"). The root cause is that we use "list_move_tail(&rq->queuelist, pending)" in the PREFLUSH/POSTFLUSH sequences. But rq->queuelist.next == xxx since it's popped out from plug->cached_rq in __blk_mq_alloc_requests_batch(). We don't initialize its queuelist just for this first request, although the queuelist of all later popped requests will be initialized. Fix it by changing to use "list_add_tail(&rq->queuelist, pending)" so rq->queuelist doesn't need to be initialized. It should be ok since rq can't be on any list when PREFLUSH or POSTFLUSH, has no move actually. Please note the commit 81ada09c ("blk-flush: reuse rq queuelist in flush state machine") also has another requirement that no drivers would touch rq->queuelist after blk_mq_end_request() since we will reuse it to add rq to the post-flush pending list in POSTFLUSH. If this is not true, we will have to revert that commit IMHO. This updated version adds "list_del_init(&rq->queuelist)" in flush rq callback since the dm layer may submit request of a weird invalid format (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH), which causes double list_add if without this "list_del_init(&rq->queuelist)". The weird invalid format problem should be fixed in dm layer. Reported-by:
Friedrich Weber <f.weber@proxmox.com> Closes: https://lore.kernel.org/lkml/14b89dfb-505c-49f7-aebb-01c54451db40@proxmox.com/ Closes: https://lore.kernel.org/lkml/c9d03ff7-27c5-4ebd-b3f6-5a90d96f35ba@proxmox.com/ Fixes: 81ada09c ("blk-flush: reuse rq queuelist in flush state machine") Cc: Christoph Hellwig <hch@lst.de> Cc: ming.lei@redhat.com Cc: bvanassche@acm.org Tested-by:
Friedrich Weber <f.weber@proxmox.com> Signed-off-by:
Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240608143115.972486-1-chengming.zhou@linux.dev Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
For zoned block devices using zone write plugging, an rcu_barrier() call is needed in disk_free_zone_resources() to synchronize freeing of zone write plugs and the destrution of the mempool used to allocate the plugs. The barrier call does slow down a little teardown of zoned block devices but should not affect teardown of regular block devices or zoned block devices that do not use zone write plugging (e.g. zoned DM devices that do not require zone append emulation). Modify disk_free_zone_resources() to return early if we do not have a mempool to start with, that is, if the device does not use zone write plugging. This avoids the costly rcu_barrier() and speeds up disk teardown. Reported-by:
Mikulas Patocka <mpatocka@redhat.com> Fixes: dd291d77 ("block: Introduce zone write plugging") Signed-off-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Mikulas Patocka <mpatocka@redhat.com> Reviewed-by:
Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240607002126.104227-1-dlemoal@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Su Hui authored
Clang static checker (scan-build) warning: block/sed-opal.c:line 317, column 3 Value stored to 'ret' is never read. Fix this problem by returning the error code when keyring_search() failed. Otherwise, 'key' will have a wrong value when 'kerf' stores the error code. Fixes: 3bfeb612 ("block: sed-opal: keyring support for SED keys") Signed-off-by:
Su Hui <suhui@nfschina.com> Link: https://lore.kernel.org/r/20240611073659.429582-1-suhui@nfschina.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 30, 2024
-
-
Waiman Long authored
Commit bf20ab53 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") attempts to revert the code change introduced by commit cd5ab1b0 ("blk-throttle: add .low interface"). However, it leaves behind the bps_conf[] and iops_conf[] fields in the throtl_grp structure which aren't set anywhere in the new blk-throttle.c code but are still being used by tg_prfill_limit() to display the limits in io.max. Now io.max always displays the following values if a block queue is used: <m>:<n> rbps=0 wbps=0 riops=0 wiops=0 Fix this problem by removing bps_conf[] and iops_conf[] and use bps[] and iops[] instead to complete the revert. Fixes: bf20ab53 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Reported-by:
Justin Forbes <jforbes@redhat.com> Closes: https://github.com/containers/podman/issues/22701#issuecomment-2120627789 Signed-off-by:
Waiman Long <longman@redhat.com> Acked-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Yu Kuai <yukuai3@huawei.com> Lin...
-
Damien Le Moal authored
A zoned device may have a last sequential write required zone that is smaller than other zones. However, all tests to check if a zone write plug write offset exceeds the zone capacity use the same capacity value stored in the gendisk zone_capacity field. This is incorrect for a zoned device with a last runt (smaller) zone. Add the new field last_zone_capacity to struct gendisk to store the capacity of the last zone of the device. blk_revalidate_seq_zone() and blk_revalidate_conv_zone() are both modified to get this value when disk_zone_is_last() returns true. Similarly to zone_capacity, the value is first stored using the last_zone_capacity field of struct blk_revalidate_zone_args. Once zone revalidation of all zones is done, this is used to set the gendisk last_zone_capacity field. The checks to determine if a zone is full or if a sector offset in a zone exceeds the zone capacity in disk_should_remove_zone_wplug(), disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(), and blk_zone_wplug_prepare_bio() are modified to use the new helper functions disk_zone_is_full() and disk_zone_wplug_is_full(). disk_zone_is_full() uses the zone index to determine if the zone being tested is the last one of the disk and uses the either the disk zone_capacity or last_zone_capacity accordingly. Fixes: dd291d77 ("block: Introduce zone write plugging") Signed-off-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240530054035.491497-4-dlemoal@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Commit ecfe43b1 ("block: Remember zone capacity when revalidating zones") introduced checks to ensure that the capacity of the zones of a zoned device is constant for all zones. However, this check ignores the possibility that a zoned device has a smaller last zone with a size not equal to the capacity of other zones. Such device correspond in practice to an SMR drive with a smaller last zone and all zones with a capacity equal to the zone size, leading to the last zone capacity being different than the capacity of other zones. Correctly handle such device by fixing the check for the constant zone capacity in blk_revalidate_seq_zone() using the new helper function disk_zone_is_last(). This helper function is also used in blk_revalidate_zone_cb() when checking the zone size. Fixes: ecfe43b1 ("block: Remember zone capacity when revalidating zones") Signed-off-by:
Damien Le Moal <dlemoal@kernel.org> Reviewed-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240530054035.491497-3-dlemoal@kernel.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 28, 2024
-
-
Hannes Reinecke authored
The logical block size need to be smaller than the max_hw_sector setting, otherwise we can't even transfer a single LBA. Signed-off-by:
Hannes Reinecke <hare@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
John Garry <john.g.garry@oracle.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The max_user_sectors is one of the three factors determining the actual max_sectors limit for READ/WRITE requests. Because of that it needs to be stacked at least for the device mapper multi-path case where requests are directly inserted on the lower device. For SCSI disks this is important because the sd driver actually sets it's own advisory limit that is lower than max_hw_sectors based on the block limits VPD page. While this is a bit odd an unusual, the same effect can happen if a user or udev script tweaks the value manually. Fixes: 4f563a64 ("block: add a max_user_discard_sectors queue limit") Reported-by:
Mike Snitzer <snitzer@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Mike Snitzer <snitzer@kernel.org> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240523182618.602003-3-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 27, 2024
-
-
hexue authored
blk_stats_alloc_enable was used for block hybrid poll, the related function definition was removed by patch: commit 54bdd67d ("blk-mq: remove hybrid polling") but the function declaration was not deleted. Signed-off-by:
hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20240527084533.1485210-1-xue01.he@samsung.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 22, 2024
-
-
Dr. David Alan Gilbert authored
'avg_latency_bucket' is unused since commit bf20ab53 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") Remove it. Signed-off-by:
Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20240522172458.334173-1-linux@treblig.org Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 21, 2024
-
-
Yu Kuai authored
With the following two conditions, bio will be lost: 1) blk plug is not enabled, for example, __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(); 2) bio plug is enabled, for example write IO for raid1/raid10 while bitmap is enabled; Root cause is that blk_finish_plug() will add the bio to curent->bio_list, while such bio will not be handled: __submit_bio_noacct current->bio_list = bio_list_on_stack; blk_start_plug do { dm_submit_bio md_handle_request raid10_write_request -> generate new bio for underlying disks raid1_add_bio_to_plug -> bio is added to plug } while ((bio = bio_list_pop(&bio_list_on_stack[0]))) -> previous bio are all handled blk_finish_plug raid10_unplug raid1_submit_write submit_bio_noacct if (current->bio_list) bio_list_add(¤t->bio_list[0], bio) -> add new bio current->bio_list = NULL -> new bio is lost Fix the problem by moving the plug into the while loop, so that current->bio_list will still be handled after blk_finish_plug(). By the way, enable plug for raid1/raid10 in this case will also prevent delay IO handling into daemon thread, which should also improve IO performance. Fixes: 060406c6 ("block: add plug while submitting IO") Reported-by:
Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+Xsmzy2G9YuEatfMT6qv1M--YdOCQ0g7z7OVmcTbBxQAg@mail.gmail.com/ Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Tested-by:
Changhui Zhong <czhong@redhat.com> Link: https://lore.kernel.org/r/20240521200308.983986-1-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 20, 2024
-
-
Jeff Johnson authored
Fix the allmodconfig 'make W=1' issue: WARNING: modpost: missing MODULE_DESCRIPTION() in block/t10-pi.o Signed-off-by:
Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240516-md-t10-pi-v1-1-44a3469374aa@quicinc.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 17, 2024
-
-
Ming Lei authored
Commit a46c2702 ("blk-mq: don't schedule block kworker on isolated CPUs") rules out isolated CPUs from hctx->cpumask, and hctx->cpumask should only be used for scheduling kworker. Add helper blk_mq_cpu_mapped_to_hctx() and apply it into cpuhp handlers. This patch avoids to forget clearing INACTIVE of hctx state in case that one isolated CPU becomes online, and fixes hang issue when allocating request from this hctx's tags. Cc: Raju Cheerla <rcheerla@redhat.com> Fixes: a46c2702 ("blk-mq: don't schedule block kworker on isolated CPUs") Signed-off-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240517020514.149771-1-ming.lei@redhat.com Tested-by:
Raju Cheerla <rcheerla@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 15, 2024
-
-
Waiman Long authored
During a cgroup_rstat_flush() call, the lowest level of nodes are flushed first before their parents. Since commit 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()"), iostat propagation was still done to the parent. Grandparent, however, may not get the iostat update if the parent has no blkg_iostat_set queued in its lhead lockless list. Fix this iostat propagation problem by queuing the parent's global blkg->iostat into one of its percpu lockless lists to make sure that the delta will always be propagated up to the grandparent and so on toward the root blkcg. Note that successive calls to __blkcg_rstat_flush() are serialized by the cgroup_rstat_lock. So no special barrier is used in the reading and writing of blkg->iostat.lqueued. Fixes: 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()") Reported-by:
Dan Schatzberg <schatzberg.dan@gmail.com> Closes: https://lore.kernel.org/lkml/ZkO6l%2FODzadSgdhC@dschatzberg-fedora-PF3DHTB...
-
Ming Lei authored
__blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start is being executed. If WRITE of `->lqueued` is re-ordered with READ of 'bisc->lnode.next' in the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one stat instance being added in blk_cgroup_bio_start(), then the local list in __blkcg_rstat_flush() could be corrupted. Fix the issue by adding one barrier. Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Fixes: 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()") Signed-off-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240515013157.443672-3-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Since commit 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()"), each iostat instance is added to blkcg percpu list, so blkcg_reset_stats() can't reset the stat instance by memset(), otherwise the llist may be corrupted. Fix the issue by only resetting the counter part. Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Jay Shin <jaeshin@redhat.com> Fixes: 3b8cc629 ("blk-cgroup: Optimize blkcg_rstat_flush()") Signed-off-by:
Ming Lei <ming.lei@redhat.com> Acked-by:
Tejun Heo <tj@kernel.org> Reviewed-by:
Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240515013157.443672-2-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 09, 2024
-
-
Yu Kuai authored
Other cgroup policy like bfq, iocost are lazy-initialized when they are configured for the first time for the device, but blk-throttle is initialized unconditionally from blkcg_init_disk(). Delay initialization of blk-throttle as well, to save some cpu and memory overhead if it's not configured. Noted that once it's initialized, it can't be destroyed until disk removal, even if it's disabled. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509121107.3195568-3-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
One the one hand, it's marked EXPERIMENTAL since 2017, and looks like there are no users since then, and no testers and no developers, it's just not active at all. On the other hand, even if the config is disabled, there are still many fields in throtl_grp and throtl_data and many functions that are only used for throtl low. At last, currently blk-throtl is initialized during disk initialization, and destroyed during disk removal, and it exposes many functions to be called directly from block layer. Remove throtl low to make code much more cleaner and follow up work much easier. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Acked-by:
Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240509121107.3195568-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
util means the percentage that disk has IO, and theoretically it should not be greater than 100%. However, there is a gap for rq-based disk: io_ticks will be updated when rq is allocated, however, before such rq dispatch to driver, it will not be account as inflight from blk_mq_start_request() hence diskstats_show()/part_stat_show() will not update io_ticks. For example: 1) at t0, issue a new IO, rq is allocated, and blk_account_io_start() update io_ticks; 2) something is wrong with drivers, and the rq can't be dispatched; 3) at t0 + 10s, drivers recovers and rq is dispatched and done, io_ticks is updated; Then if user is using "iostat 1" to monitor "util", between t0 - t0+9s, util will be zero, and between t0+9s - t0+10s, util will be 1000%. Fix this problem by updating io_ticks from diskstats_show() and part_stat_show() if there are rq allocated. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-3-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
Currently, io_ticks is accounted based on sampling, specifically update_io_ticks() will always account io_ticks by 1 jiffies from bdev_start_io_acct()/blk_account_io_start(), and the result can be inaccurate, for example(HZ is 250): Test script: fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms Test result: util is about 90%, while the disk is really idle. This behaviour is introduced by commit 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting"), however, there was a key point that is missed that this patch also improve performance a lot: Before the commit: part_round_stats: if (part->stamp != now) stats |= 1; part_in_flight() -> there can be lots of task here in 1 jiffies. part_round_stats_single() __part_stat_add() part->stamp = now; After the commit: update_io_ticks: stamp = part->bd_stamp; if (time_after(now, stamp)) if (try_cmpxchg()) __part_stat_add() -> only one task can reach here in 1 jiffies. Hence in order to account io_ticks precisely, we only need to know if there are IO inflight at most once in one jiffies. Noted that for rq-based device, iterating tags should not be used here because 'tags->lock' is grabbed in blk_mq_find_and_get_req(), hence part_stat_lock_inc/dec() and part_in_flight() is used to trace inflight. The additional overhead is quite little: - per cpu add/dec for each IO for rq-based device; - per cpu sum for each jiffies; And it's verified by null-blk that there are no performance degration under heavy IO pressure. Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting") Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123717.3223892-2-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Yu Kuai authored
So that if caller didn't use plug, for example, __blkdev_direct_IO_simple() and __blkdev_direct_IO_async(), block layer can still benefit from caching nsec time in the plug. Signed-off-by:
Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240509123825.3225207-1-yukuai1@huaweicloud.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 07, 2024
-
-
Matthew Wilcox (Oracle) authored
Several modules use __bio_add_page() today and may need to be converted to bio_add_folio_nofail(). Reviewed-by:
Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
David Sterba <dsterba@suse.com> Signed-off-by:
David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Discards can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Open code blkdev_issue_discard in the BLKDISCARD ioctl handler and check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Reported-by:
Conrad Meyer <conradmeyer@meta.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-7-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
Add a helper to wait for an entire chain of bios to complete. [hch: split from a larger patch, moved and changed the name now that it is non-static] Signed-off-by:
Keith Busch <kbusch@kernel.org> Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-6-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Factor out a helper from __blkdev_issue_discard that chews off as much as possible from a discard range and allocates a bio for it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-5-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
This is basically blk_next_bio just with the bio allocation moved to the caller to allow for more flexible bio handling in the caller. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-4-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Most bio operations get basic sanity checking in submit_bio and anything more complicated than that is done in the callers. Discards are a bit different from that in that a lot of checking is done in __blkdev_issue_discard, and the specific errnos for that are returned to userspace. Move the checks that require specific errnos to the ioctl handler instead, and just leave the basic sanity checking in submit_bio for the other handlers. This introduces two changes in behavior: 1) the logical block size alignment check of the start and len is lost for non-ioctl callers. This matches what is done for other operations including reads and writes. We should probably verify this for all bios, but for now make discards match the normal flow. 2) for non-ioctl callers all errors are reported on I/O completion now instead of synchronously. Callers in general mostly ignore or log errors so this will actually simplify the code once cleaned up Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
We now set a default granularity in the queue limits API, so don't bother with this extra check. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-2-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Justin Stitt authored
Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit 557f8c58 ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by:
Justin Stitt <justinstitt@google.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 06, 2024
-
-
Ming Lei authored
For devices with virt_boundary limit, the driver may provide zero max segment size, we have to set it as UINT_MAX at default. Otherwise, it may cause warning in driver when handling sglist. Fix it by setting default max segment size as UINT_MAX. Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@kernel.org> Fixes: b561ea56 ("block: allow device to have both virt_boundary_mask and max segment size") Tested-by:
Geert Uytterhoeven <geert+renesas@glider.be> Reported-by:
Geert Uytterhoeven <geert+renesas@glider.be> Closes: https://lore.kernel.org/linux-block/7e38b67c-9372-a42d-41eb-abdce33d3372@linux-m68k.org/ Signed-off-by:
Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240424134722.2584284-1-ming.lei@redhat.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- May 03, 2024
-
-
INAGAKI Hiroshi authored
Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(), which makes the code simpler. Before commit 146afeb2 ("block: use strscpy() to instead of strncpy()"), we used a strncpy() to copy a block device name and partition names. The commit simply replaced a strncpy() and NULL termination with a strscpy(). It did not update calculations of length passed to strscpy(). While the length passed to strncpy() is just a length of valid characters without NULL termination ('\0'), strscpy() takes it as a length of the destination buffer, including a NULL termination. Since the source buffer is not necessarily NULL terminated, the current code copies "length - 1" characters and puts a NULL character in the destination buffer. It replaces the last character with NULL and breaks the parsing. As an example, that buffer will be passed to parse_parts() and breaks parsing sub-partitions due to the missing ')' at the end, like the following. example (Check Point V-80 & OpenWrt): - Linux Kernel 6.6 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.884016] mmc1: new HS200 MMC card at address 0001 [ 0.889951] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.895043] cmdline partition format is invalid. [ 0.895704] mmcblk1: p1 [ 0.903447] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.908667] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.913765] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0) 1. "48M@10M(kernel-1),..." is passed to strscpy() with length=17 from parse_parts() 2. strscpy() returns -E2BIG and the destination buffer has "48M@10M(kernel-1\0" 3. "48M@10M(kernel-1\0" is passed to parse_subpart() 4. parse_subpart() fails to find ')' when parsing a partition name, and returns error - Linux Kernel 6.1 [ 0.000000] Kernel command line: console=ttyS0,115200 earlycon=uart8250,mmio32,0xf0512000 crashkernel=30M mvpp2x.queue_mode=1 blkdevparts=mmcblk1:48M@10M(kernel-1),1M(dtb-1),720M(rootfs-1),48M(kernel-2),1M(dtb-2),720M(rootfs-2),300M(default_sw),650M(logs),1M(preset_cfg),1M(adsl),-(storage) maxcpus=4 ... [ 0.953142] mmc1: new HS200 MMC card at address 0001 [ 0.959114] mmcblk1: mmc1:0001 004GA0 3.69 GiB [ 0.964259] mmcblk1: p1(kernel-1) p2(dtb-1) p3(rootfs-1) p4(kernel-2) p5(dtb-2) 6(rootfs-2) p7(default_sw) p8(logs) p9(preset_cfg) p10(adsl) p11(storage) [ 0.979174] mmcblk1boot0: mmc1:0001 004GA0 2.00 MiB [ 0.984674] mmcblk1boot1: mmc1:0001 004GA0 2.00 MiB [ 0.989926] mmcblk1rpmb: mmc1:0001 004GA0 512 KiB, chardev (248:0 By the way, strscpy() takes a length of destination buffer and it is often confusing when copying characters with a specified length. Using strsep() helps to separate the string by the specified character. Then, we can use strscpy() naturally with the size of the destination buffer. Separating the string on the fly is also useful to omit the redundant string copy, reducing memory usage and improve the code readability. Fixes: 146afeb2 ("block: use strscpy() to instead of strncpy()") Suggested-by:
Naohiro Aota <naota@elisp.net> Signed-off-by:
INAGAKI Hiroshi <musashino.open@gmail.com> Reviewed-by:
Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/20240421074005.565-1-musashino.open@gmail.com Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
blkdev_iomap_begin rounds down the offset to the logical block size before stashing it in iomap->offset and checking that it still is inside the inode size. Check the i_size check to the raw pos value so that we don't try a zero size write if iter->pos is unaligned. Fixes: 487c607d ("block: use iomap for writes to block devices") Reported-by:
<syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Tested-by:
<syzbot+0a3683a0a6fecf909244@syzkaller.appspotmail.com> Link: https://lore.kernel.org/r/20240503081042.2078062-1-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Userspace had been unknowingly relying on a non-stable interface of kernel internals to determine if partition scanning is enabled for a given disk. Provide a stable interface for this purpose instead. Cc: stable@vger.kernel.org # 6.3+ Depends-on: 140ce28d ("block: add a disk_has_partscan helper") Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/ Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de [axboe: add links and commit message from Keith] Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Add a helper to check if partition scanning is enabled instead of open coding the check in a few places. This now always checks for the hidden flag even if all but one of the callers are never reachable for hidden gendisks. Signed-off-by:
Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Here we know that bdevfs inodes are coallocated with struct block_device and we can get to ->bd_inode value without any dereferencing. Introduce an inlined helper (static, *not* exported, purely internal for bdev.c) that gets an associated inode by block_device - BD_INODE(bdev). NOTE: leave it static; nobody outside of block/bdev.c has any business playing with that. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-6-viro@zeniv.linux.org.uk Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Al Viro authored
Just the low-hanging fruit... Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by:
Christian Brauner <brauner@kernel.org>
-
Al Viro authored
points to ->i_data of coallocated inode. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-1-viro@zeniv.linux.org.uk Signed-off-by:
Christian Brauner <brauner@kernel.org>
-