Re: fuse uring / wake_up on the same core

From: Bernd Schubert
Date: Fri Apr 28 2023 - 17:55:02 EST


On 4/28/23 03:44, Hillf Danton wrote:
> On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@xxxxxxx>
>> Btw, a very hackish way to 'solve' the issue is this
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index cd7aa679c3ee..dd32effb5010 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>> int err;
>> int prev_cpu = task_cpu(current);
>>
>> + /* When running over uring with core-affined userspace threads, we
>> + * do not want to let the request-submitting process migrate away.
>> + * The issue is that even after waking up on the right core, processes
>> + * that have submitted requests might get migrated away, because the
>> + * ring thread is still doing a bit of work or is in the process of
>> + * going to sleep. The assumption here is that processes are started
>> + * on the right core (i.e. idle cores) and can then stay on that core
>> + * when they come and do file system requests.
>> + * An alternative would be to set SCHED_IDLE for ring threads, but
>> + * that would be a problem if there are other processes keeping the
>> + * cpu busy.
>> + * SCHED_IDLE or this hack here result in about a factor of 3.5 for
>> + * max meta request performance.
>> + *
>> + * Ideal would be to tell the scheduler that ring threads are not
>> + * disturbing, so that migration away from them should very rarely happen.
>> + */
>> + if (fc->ring.ready)
>> + migrate_disable();
>> +
>> if (!fc->no_interrupt) {
>> /* Any signal may interrupt this */
>> err = wait_event_interruptible(req->waitq,
>>
> If I understand it correctly, the seesaw workload hint to the scheduler
> looks like the diff below, leaving the scheduler free to pull the two
> players apart across CPUs and to migrate either one.

Thanks a lot, Hillf! I had a day off / family day today; the kernel is
now finally compiling.

>
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
> /* acquire extra reference, since request is still needed
> after fuse_request_end() */
> __fuse_get_request(req);
> + current->seesaw = 1;
> queue_request_and_unlock(fiq, req);
>
> request_wait_answer(req);
> @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
> fc->max_write))
> return -EINVAL;
>
> + current->seesaw = 1;

fuse_dev_do_read is the plain /dev/fuse path (read/write); there we
don't know on which cores these IO threads are running, nor which of
them to wake up when an application comes with a request.

There is a patch that uses __wake_up_sync to wake the IO thread, with
reports that it helps performance, but I don't see the effect, and I
think Miklos doesn't either. For direct-io reads I had also already
tested disabling migration - it didn't show any effect - so we'd better
not set current->seesaw = 1 in fuse_dev_do_read for now.
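
(For reference, a minimal sketch of such a sync wakeup, assuming it is
applied in fuse's fuse_dev_wake_and_unlock(); the sync hint tells the
scheduler that the waker is about to sleep, so the wakee may be placed
on the waker's CPU:)

static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
	__releases(fiq->lock)
{
	/* sync variant instead of a plain wake_up(&fiq->waitq) */
	wake_up_interruptible_sync(&fiq->waitq);
	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
	spin_unlock(&fiq->lock);
}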

With my fuse-uring patches (https://lwn.net/Articles/926773/) things are
clearer: there is one IO thread per core, and the libfuse side binds
each of these threads to a single core only.
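
(Illustration only, not the actual libfuse code: pinning a ring thread
to a single core could look like the sketch below, with
pin_ring_thread() being a hypothetical helper:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Sketch: bind one ring thread to exactly one core, so the kernel
 * side can rely on a fixed thread <-> core mapping. */
static int pin_ring_thread(pthread_t thread, int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}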

nproc   /dev/fuse    /dev/fuse     fuse uring   fuse uring
        migrate on   migrate off   migrate on   migrate off
    1        2023         1652          1151         3998
    2        3375         2805          2221         7950
    4        3823         4193          4540        15022
    8        7796         8161          7846        22591
   16        8520         8518         12235        27864
   24        8361         8084          9415        27864
   32        8361         8084          9124        12971

(all values in MiB/s)

So core affinity really matters, and with core affinity fuse-uring is
always faster than the existing code.

For single-threaded metadata workloads (file create/stat/unlink) the
difference between migrate on/off is rather similar. I'm going to run
with multiple processes during the next days.

For paged (async) IO it behaves a bit differently, as there uring can
show its strength and multiple requests can be combined during CQE
processing - so it is better to choose an idle ring thread on another
core. I actually have a question about that as well - later.
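
(To illustrate the CQE batching idea - a minimal liburing sketch, not
the fuse-uring code; the per-request handling is left as a placeholder:)

#include <liburing.h>

/* Sketch: drain all currently completed CQEs in one batch, so
 * several fuse requests are handled per ring-thread wakeup. */
static void drain_cqes(struct io_uring *ring)
{
	struct io_uring_cqe *cqes[32];
	unsigned int i, n;

	n = io_uring_peek_batch_cqe(ring, cqes, 32);
	for (i = 0; i < n; i++) {
		/* complete the fuse request behind cqes[i]
		 * (hypothetical per-request handler) */
	}
	io_uring_cq_advance(ring, n);
}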


> restart:
> for (;;) {
> spin_lock(&fiq->lock);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,7 @@ struct task_struct {
> /* delay due to memory thrashing */
> unsigned in_thrashing:1;
> #endif
> + unsigned seesaw:1;
>
> unsigned long atomic_flags; /* Flags requiring atomic access. */
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
> if (wake_flags & WF_TTWU) {
> record_wakee(p);
>
> + if (p->seesaw && current->seesaw)
> + return cpu;
> if (sched_energy_enabled()) {
> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> if (new_cpu >= 0)


Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests
whether cpu is in p->cpus_ptr? The combination of both patches results in

if (p->seesaw && current->seesaw)
return cpu;

if ((wake_flags & WF_CURRENT_CPU) &&
cpumask_test_cpu(cpu, p->cpus_ptr))
return cpu;



While writing this mail the kernel compilation finished, but it got
late; I will test in the morning.


Thanks again,
Bernd