Re: [PATCH -next RFC] block: support to account io_ticks precisely

From: Yu Kuai
Date: Wed Jan 03 2024 - 01:12:09 EST


Hi, Ming!

在 2024/01/03 12:02, Ming Lei 写道:
On Tue, Dec 05, 2023 at 05:37:43PM +0800, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>

Currently, io_ticks is accounted based on sampling, specifically
update_io_ticks() will always account io_ticks by 1 jiffies from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example(HZ is 250):

Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms

Test result: util is about 90%, while the disk is really idle.

In order to account io_ticks precisely, update_io_ticks() must know if
there are IO inflight already, and this requires overhead slightly,
hence precise io accounting is disabled by default, and user can enable
it through sysfs entry.

Yeah, the trouble is from commit 5b18b5a73760 ("block: delete part_round_stats and
switch to less precise counting"), and real reason is that IO inflight
info is too expensive to maintain in fast path, and RH have got several customer
complaint in this area too.


Noted that for rq-based devcie, part_stat_local_inc/dec() and
part_in_flight() is used to track inflight instead of iterating tags,
which is not supposed to be used in fast path because 'tags->lock' is
grabbed in blk_mq_find_and_get_req().

You can iterate over static requests via BT_TAG_ITER_STATIC_RQS, then
tags->lock can be bypassed, but new helper is needed.

But given it is only run once for each tick, I guess percpu counting
might be fine too even in case of big machine.


Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
Documentation/ABI/stable/sysfs-block | 8 ++++--
block/blk-core.c | 17 ++++++++----
block/blk-mq.c | 18 ++++++++++---
block/blk-sysfs.c | 40 ++++++++++++++++++++++++++--
block/blk.h | 4 ++-
block/genhd.c | 6 ++---
include/linux/blk-mq.h | 1 +
include/linux/blkdev.h | 3 +++
8 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..e5fedecf7bdf 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -358,8 +358,12 @@ What: /sys/block/<disk>/queue/iostats
Date: January 2009
Contact: linux-block@xxxxxxxxxxxxxxx
Description:
- [RW] This file is used to control (on/off) the iostats
- accounting of the disk.
+ [RW] This file is used to control the iostats accounting of the
+ disk. If this value is 0, iostats accounting is disabled; If
+ this value is 1, iostats accounting is enabled, but io_ticks is
+ accounted by sampling and the result is not accurate; If this
+ value is 2, iostats accounting is enabled and io_ticks is
+ accounted precisely, but there will be slightly overhead.

IMO, this approach looks fine.

What: /sys/block/<disk>/queue/logical_block_size
diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..405883d606cd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -935,14 +935,20 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
}
EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
-void update_io_ticks(struct block_device *part, unsigned long now, bool end)
+void update_io_ticks(struct block_device *part, unsigned long now, bool end,
+ bool precise)
{
unsigned long stamp;
again:
stamp = READ_ONCE(part->bd_stamp);
- if (unlikely(time_after(now, stamp))) {
- if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
+ if (unlikely(time_after(now, stamp)) &&
+ likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
+ if (precise) {
+ if (end || part_in_flight(part))
+ __part_stat_add(part, io_ticks, now - stamp);

Strictly speaking, `end` isn't need any more, but it can be thought
as one optimization, given part_in_flight() is supposed to be non-zero
in case of account_done.

+ } else {
__part_stat_add(part, io_ticks, end ? now - stamp : 1);
+ }
}
if (part->bd_partno) {
part = bdev_whole(part);
@@ -954,7 +960,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
unsigned long start_time)
{
part_stat_lock();
- update_io_ticks(bdev, start_time, false);
+ update_io_ticks(bdev, start_time, false,
+ blk_queue_precise_io_stat(bdev->bd_queue));

blk_queue_precise_io_stat() can be moved into update_io_ticks()
directly, and it should be fine given it is just done once in each
tick.

Thanks for reviewing this patch! I'll update your suggestion in v2.

Kuai


Thanks,
Ming

.