Re: [PATCH for-6.8/block] block: support to account io_ticks precisely

From: Yu Kuai
Date: Tue Jan 16 2024 - 05:00:50 EST


Hi, Ming

在 2024/01/15 19:54, Yu Kuai 写道:
Hi,

在 2024/01/15 19:38, Ming Lei 写道:
On Tue, Jan 09, 2024 at 03:13:32PM +0800, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>

Currently, io_ticks is accounted based on sampling, specifically
update_io_ticks() will always account io_ticks by 1 jiffies from
bdev_start_io_acct()/blk_account_io_start(), and the result can be
inaccurate, for example(HZ is 250):

Test script:
fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms

Test result: util is about 90%, while the disk is really idle.

Just be curious, what is result with this patch? 0%?

I ran some tests on null_blk and found out that this patch actually
doesn't have any performance overhead, and I finially figure out that
it's true even in theroy.

Notice that there is a cmpxchg() in update_ticks():

update_io_ticks:
stamp = part->bd_stamp;
if (time_after(now, stamp))
if (try_cmpxchg())
__part_stat_add()

Hence only one task can pass cmpxchg in 1 jiffies, and part_stat_add() be called at most once in 1 jiffies. Which means with this patch,
part_in_flight() will also be called at most once in 1 jiffies(not per
IO). And part_in_flight() once per jiffies really doesn't affect IO
performance at all.

Befor this cmpxchg():

part_round_stats:
if (part->stamp != now)
stats |= 1;

part_in_flight()
-> there can be lots of task here in 1 jiffies.
part_round_stats_single()
__part_stat_add()
part->stamp = now;

By the way, this cmpxchg() is added by commit 5b18b5a73760 ("block:
delete part_round_stats and switch to less precise counting"), hence
actually there is no need to switch to less precise counting in the
first place.

So I think I can just remove the switch and switch to precise io
accounting by default in the next version.

Please let me know what you think!

Thanks,
Kuai


No, it's not 0%, this actually depends on how many IO really start from
one jiffies and complete at the next jiffies. Given that the probability
is related to IO latency, so the result should be relatively
accurate(Around 10% in my environment). I think we can live with that
unless we improve time precision from jiffies to ns.


In order to account io_ticks precisely, update_io_ticks() must know if
there are IO inflight already, and this requires overhead slightly,
hence precise io accounting is disabled by default, and user can enable
it through sysfs entry.

Noted that for rq-based devcie, part_stat_local_inc/dec() and
part_in_flight() is used to track inflight instead of iterating tags,
which is not supposed to be used in fast path because 'tags->lock' is
grabbed in blk_mq_find_and_get_req().

Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
Changes from RFC v1:
  - remove the new parameter for update_io_ticks();
  - simplify update_io_ticks();
  - use swith in queue_iostats_store();
  - add missing part_stat_local_dec() in blk_account_io_merge_request();
Changes from RFC v2:
  - fix that precise is ignored for the first io in update_io_ticks();

  Documentation/ABI/stable/sysfs-block |  8 ++++--
  block/blk-core.c                     | 10 +++++--
  block/blk-merge.c                    |  3 ++
  block/blk-mq-debugfs.c               |  2 ++
  block/blk-mq.c                       | 11 +++++++-
  block/blk-sysfs.c                    | 42 ++++++++++++++++++++++++++--
  block/blk.h                          |  1 +
  block/genhd.c                        |  2 +-
  include/linux/blk-mq.h               |  1 +
  include/linux/blkdev.h               |  3 ++
  10 files changed, 74 insertions(+), 9 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..79027bf2661a 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -358,8 +358,12 @@ What:        /sys/block/<disk>/queue/iostats
  Date:        January 2009
  Contact:    linux-block@xxxxxxxxxxxxxxx
  Description:
-        [RW] This file is used to control (on/off) the iostats
-        accounting of the disk.
+        [RW] This file is used to control the iostats accounting of the
+        disk. If this value is 0, iostats accounting is disabled; If
+        this value is 1, iostats accounting is enabled, but io_ticks is
+        accounted by sampling and the result is not accurate; If this
+        value is 2, iostats accounting is enabled and io_ticks is
+        accounted precisely, but there will be slightly more overhead.
  What:        /sys/block/<disk>/queue/logical_block_size
diff --git a/block/blk-core.c b/block/blk-core.c
index 9520ccab3050..c70dc311e3b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -954,11 +954,15 @@ EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
  void update_io_ticks(struct block_device *part, unsigned long now, bool end)
  {
      unsigned long stamp;
+    bool precise = blk_queue_precise_io_stat(part->bd_queue);
  again:
      stamp = READ_ONCE(part->bd_stamp);
-    if (unlikely(time_after(now, stamp))) {
-        if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
-            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
+    if (unlikely(time_after(now, stamp)) &&
+        likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
+        if (end || (precise && part_in_flight(part)))
+            __part_stat_add(part, io_ticks, now - stamp);
+        else if (!precise)
+            __part_stat_add(part, io_ticks, 1);

It should be better or readable to move 'bool precise' into the above branch,
given we only need to read the flag once in each tick.

Otherwise, this patch looks fine.

Thanks for your advice, will change that in next version.

Kuai

Thanks,
Ming

.


.