Re: [PATCH v2] io_uring: Statistics of the true utilization of sq threads.

From: Pavel Begunkov
Date: Thu Nov 09 2023 - 11:16:06 EST


On 11/8/23 08:07, Xiaobing Li wrote:
Since the sq thread has a while(1) structure, during this process, there
may be a lot of time that is not processing IO but does not exceed the
timeout period, therefore, the sqpoll thread will keep running and will
keep occupying the CPU. Obviously, the CPU is wasted at this time;Our
goal is to count the part of the time that the sqpoll thread actually
processes IO, so as to reflect the part of the CPU it uses to process
IO, which can be used to help improve the actual utilization of the CPU
in the future.

Let's pull the elephant out of the room, what's the use case? "Improve
in the future" doesn't sound too convincing. If it's a future kernel
change you have in mind, it has to go together with this patch. If it's
a userspace application, it'd be interesting to hear what that is,
especially if you have numbers ready.

And another classic question, why can't it be done with bpf?


Signed-off-by: Xiaobing Li <xiaobing.li@xxxxxxxxxxx>

v1 -> v2: Added method to query data.

...
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index bd6c2c7959a5..c821273406bd 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -224,6 +224,7 @@ static int io_sq_thread(void *data)
struct io_ring_ctx *ctx;
unsigned long timeout = 0;
char buf[TASK_COMM_LEN];
+ unsigned long start, begin, end;

start and begin used for just slightly different accounting,
it'll get confused anyone.

DEFINE_WAIT(wait);
snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
@@ -235,6 +236,7 @@ static int io_sq_thread(void *data)
set_cpus_allowed_ptr(current, cpu_online_mask);
mutex_lock(&sqd->lock);
+ start = jiffies;
while (1) {
bool cap_entries, sqt_spin = false;
@@ -245,12 +247,18 @@ static int io_sq_thread(void *data)
}
cap_entries = !list_is_singular(&sqd->ctx_list);
+ begin = jiffies;

There can be {hard,soft}irq in between jiffies reads, and it can even
be scheduled out in favour of another process, so it'd collect a lot
of garbage. There should be a per-task stat for system time you can
use:

start = get_system_time(current);
do_io_part();
sq->total_time += get_system_time(current) - start;
wait();

...

list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
int ret = __io_sq_thread(ctx, cap_entries);
if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
sqt_spin = true;
}
+ end = jiffies;
+ sqd->total = end - start;

...and then you don't need to track total at all, it'd be your

total = get_system_time(sq_thread /* current */);

at any given point it time.


+ if (sqt_spin == true)
+ sqd->work += end - begin;

It should go after the io_run_task_work() below, task_work is a major
part of request execution.

+
if (io_run_task_work())
sqt_spin = true;
diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
index 8df37e8c9149..0aa4e2efa4db 100644
--- a/io_uring/sqpoll.h
+++ b/io_uring/sqpoll.h
@@ -16,6 +16,8 @@ struct io_sq_data {
pid_t task_pid;
pid_t task_tgid;
+ unsigned long work;
+ unsigned long total;
unsigned long state;
struct completion exited;
};

--
Pavel Begunkov