Re: fuse uring / wake_up on the same core

From: Bernd Schubert
Date: Wed May 03 2023 - 13:05:47 EST


On 5/2/23 02:33, Hillf Danton wrote:
> On 1 May 2023 21:44:48 +0000 Bernd Schubert <bschubert@xxxxxxx>
>> I'm not familiar at all with scheduler code,
>> given this works perfectly this suggests the same function is also
>> called without explicit waitq, when the scheduler preempts a task?
>
> Please see comment below in the select_task function.
>>
>> I think there might be side effects - what is if multiple
>> applications are on one core and another core would be available?
>> With this flag they would stay on the same core? Maybe better two flags?
>
> Gooddd question. Given multiple seesaws,
>
> Player i1 Player j1
> | |
> Seesaw i Seesaw j
> | |
> P i2 P j2
>
> what is ideal is run Seesaws i and j on different CPU cores.
>
> We can do that by replacing the seesaw flag in the task struct with
> seesaw id for instance. But I have no idea how complex it will become
> to set up multiple seesaw workloads on the fuse side, by grouping and
> binding players to different seesaws,
>
> - unsigned seesaw:1;
> + unsigned int seesaw;>
> while the corresponding change to scheduler looks like.
>
> + if (p->seesaw && p->seesaw == current->seesaw &&
> + cpumask_test_cpu(cpu, p->cpus_ptr))
> + return cpu;


Hmm, how is the seesaw id assigned and assuming two processes landed
on the same core but later on another core is available, how does it
dynamically change the ID?
My idea with two bits was that there is a fuse ring thread bound to a
specific core - it is the IO processor and gets the seesaw_proc bit.
Application is submitting requests and get the seesaw_req bit set.
Two applications running on the same core won't disturb each other that way.

As addition, if the application is not submitting requests anymore, but
let's say is busy doing computations, we want to have a timeout to let
it move away if another core is more suitable. What do you think about
the new patch version at the end of the mail? It uses two bits and jiffies.
Just tested it and it works fine. The exact timeout period is
certainly debatable. I also feel a bit bad
to take so many bits in struct task. If this approach is acceptable,
the jiffies parameter could be probably an u16.

>
> Even after job done for fuse, the emerging question is, how to set up
> seesaw workloads for crypto for example, if no more than the seesaw id
> hint to scheduler is preferred.
>
> And it makes me wonder why Prateek's work stalled for quite a while,
> as more is needed for userspace hint to scheduler to work for
> workloads other than seesaw.

Just quickly went over these a bit, assuming seesaw doesn't get accepted
and we need these, I think it would need a bit modification for fuse


> + /*
> + * Handle the case where a hint is set and the current CPU
> + * and the previous CPU where task ran don't share caches.
> + */
> + if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {

I'm testing on and older Xeon system (E5-2650) and tried different settings
with numa binding the application (fuse ring thread is bound anyway)


governor conservative performance
(default)
application cpu 0, ring cpu 0 creates/s ~9200 ~9500
application cpu 0. ring cpu 16 creates/s ~4200 ~8000
application cpu 0. ring cpu 1 creates/s ~4200 ~8500


No idea why cpu 1 gives better results in performance mode than cpu 16, might be within
accuracy. CPU frequency definitely has the largest effect here - the cpus_share_cache()
condition is not ideal for fuse. And I guess asking users to use cpu performance mode
for fuse is also too much asked - other file systems don't have that requirement...
So far your seesaw idea works best (the modified version in combination with
wake-on-same-core).

>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 63d242164b1a..07783ddaec5c 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,8 @@ struct task_struct {
>> /* delay due to memory thrashing */
>> unsigned in_thrashing:1;
>> #endif
>> + unsigned seesaw_req:1;
>> + unsigned seesaw_io:1;
>>
>> unsigned long atomic_flags; /* Flags requiring atomic access. */
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9d6ed7585c6..474bf3657ef0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> if (wake_flags & WF_TTWU) {
>> record_wakee(p);
>
> Seesaw does not work without WF_TTWU as per define.

What does WF_TTWU actually mean? Something like work flow time to wake unit?

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..6da0de4ae9ca 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req)
* Wait it out.
*/
wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+
+ /*
+ * __wake_up_on_current_cpu ensures we wake up on the right core,
+ * after that we still want to stay on the same core, shared with
+ * a ring thread to submit next request to it. Issue without seesaw
+ * is that the while the ring thread is on its way to wait, it disturbs
+ * the application and application might get migrated away
+ */
+ if (fc->ring.per_core_queue) {
+ current->seesaw_req = 1;
+ current->seesaw_jiffies = jiffies;
+ }
+
+
out:
if (prev_cpu != task_cpu(current))
pr_devel("%s cpu switch from=%d to=%d\n",
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..73adc2b16778 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
/* XXX error injection or test with malicious daemon */
}

+ /* In combination with requesting process (application) seesaw
+ * setting (see request_wait_answer), the application will
+ * stay on the same core.
+ */
+ if (fc->ring.per_core_queue)
+ current->seesaw_proc = 1;
+
ret = fuse_uring_fetch(ring_ent, cmd);
break;
case FUSE_URING_REQ_COMMIT_AND_FETCH:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..53d9c77672b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,13 @@ struct task_struct {
/* delay due to memory thrashing */
unsigned in_thrashing:1;
#endif
+ /* requesting task */
+ unsigned seesaw_req:1;
+ /* request processing task */
+ unsigned seesaw_proc:1;
+
+ /* limit seesaw time slot */
+ unsigned long seesaw_jiffies;

unsigned long atomic_flags; /* Flags requiring atomic access. */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..a14161e6e456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (wake_flags & WF_TTWU) {
record_wakee(p);

+ /*
+ * Current is handling requests on behalf of the waking process,
+ * both want to run on the same core in seeswaw manner.
+ * Typically current is bound to one core.'and only p is allowed
+ * to freely move.
+ */
+ if (p->seesaw_req && current->seesaw_proc &&
+ time_after(jiffies, p->seesaw_jiffies + 10),
+ cpumask_test_cpu(cpu, p->cpus_ptr))
+ return cpu;
+
if ((wake_flags & WF_CURRENT_CPU) &&
cpumask_test_cpu(cpu, p->cpus_ptr))
return cpu;


Thanks,
Bernd