Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()

From: Peter Newman
Date: Thu Dec 08 2022 - 17:31:09 EST


Hi Reinette,

On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:
> On 12/7/2022 2:58 AM, Peter Newman wrote:
> >>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >>>     switch stores new task_curr() and task_cpu() values.
> >>
> >> This scenario is not clear to me. Could you please provide more detail about it?
> >> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> >> From what I can tell rq->curr, as used by task_curr() is set before
> >> even context_switch() is called ... and since the next task is picked from
> >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> >> a runqueue) it seems to me that the value used by task_cpu() would also
> >> be set early (before context_switch() is called). It is thus not clear to
> >> me how the above reordering could occur so an example would help a lot.
> >
> > Perhaps in both scenarios I didn't make it clear that reordering in the
> > CPU can cause the incorrect behavior rather than the program order. In
> > this explanation, items 1. and 2. are supposed to be completing the
> > sentence ending with a ':' at the end of paragraph 3, so I thought that
> > would keep focus on the CPU.
>
> You did make it clear that the cause is reordering in the CPU. I am just
> not able to see where the reordering is occurring in your scenario (2).

It all comes down to whether the CPU can get from the rq->curr update to
the task_struct::{closid,rmid} loads without encountering a full memory
barrier.

I'll go into the details below.

> Please do. Could you start by highlighting which resctrl_sched_in()
> you are referring to? I am trying to dissect (2) with the given information:
> Through "the calling context switch" the scenario is written to create
> understanding that it refers to:
> context_switch()->switch_to()->resctrl_sched_in() - so the calling context
> switch is the first in the above call path ... where does it (context_switch())
> store the new task_curr() and task_cpu() values and how does that reorder with
> resctrl_sched_in() further down in call path?

Yes, the rq->curr update is actually in __schedule(). I was probably
still thinking it was in prepare_task_switch() (called from
context_switch()) because in older kernels __rdtgroup_move_task() still
reads task_struct::on_cpu.

There is an interesting code comment under the rq->curr update site in
__schedule():

        /*
         * RCU users of rcu_dereference(rq->curr) may not see
         * changes to task_struct made by pick_next_task().
         */
        RCU_INIT_POINTER(rq->curr, next);
        /*
         * The membarrier system call requires each architecture
         * to have a full memory barrier after updating
         * rq->curr, before returning to user-space.
         *
         * Here are the schemes providing that barrier on the
         * various architectures:
         * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
         *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
         * - finish_lock_switch() for weakly-ordered
         *   architectures where spin_unlock is a full barrier,
         * - switch_to() for arm64 (weakly-ordered, spin_unlock
         *   is a RELEASE barrier),
         */

Based on this, I believe (2) doesn't happen on x86 because switch_mm()
provides the required ordering.

On arm64, it won't happen as long as __switch_to() calls its
resctrl_sched_in() after the dsb(ish), which seems to be the case:

https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/arch/arm64/kernel/process.c?h=mpam/snapshot/v6.0-rc1#n561

Given that, I'll just sketch out the first scenario below and drop
(2) from the changelog. It also implies that the group update cases
can use a single smp_mb() to provide all the necessary ordering,
because there is a full barrier on context switch for it to pair with,
so I no longer need to broadcast an IPI. I don't know whether
task_call_func() is faster than an smp_mb(); I'll take some
measurements to see.
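
For the group update path, roughly the following is what I have in mind
(untested sketch against the rdt_move_group_tasks() loop; the exact
placement and surrounding code are assumptions on my part):

  for_each_process_thread(p, t) {
          if (!from || is_closid_match(t, from) ||
              is_rmid_match(t, from)) {
                  WRITE_ONCE(t->closid, to->closid);
                  WRITE_ONCE(t->rmid, to->mon.rmid);

                  /*
                   * Order the closid/rmid stores above before the
                   * task_curr() load below, pairing with the full
                   * barrier between the rq->curr update and
                   * resctrl_sched_in() on context switch.
                   */
                  smp_mb();

                  /*
                   * If the task was observed running, interrupt its
                   * CPU so it reloads the new closid/rmid.
                   */
                  if (mask && task_curr(t))
                          cpumask_set_cpu(task_cpu(t), mask);
          }
  }

With the smp_mb() in place, either the moved task's next context switch
observes the new closid/rmid, or the mover observes the task as current
and IPIs its CPU.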

The presumed behavior is that __rdtgroup_move_task() not yet observing
t1 as current implies that t1 observes the updated values when it is
switched in:

CPU 0                                   CPU 1
-----                                   -----
(t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)

__rdtgroup_move_task():
  t1->{closid,rmid} <- {2,2}
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {2,2}
  if (curr == t1) // false
      IPI(t1->cpu)

In (1), CPU 0 is allowed to store {closid,rmid} after reading whether t1
is current:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
  t1->{closid,rmid} <- {2,2}
  if (curr == t1) // false
      IPI(t1->cpu)
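
This is just the store-buffering pattern, which even x86 permits, with
t1's {closid,rmid} pair and rq->curr each collapsed into a single
variable. As a litmus test (the variable names are mine; 0 stands for
the old closid and for t0 being current):

  C rdtgroup-move-vs-context-switch

  {}

  P0(int *closid, int *rq_curr)
  {
          int r0;

          /* __rdtgroup_move_task(): store the new closid... */
          WRITE_ONCE(*closid, 1);
          /* ...then check whether t1 is current. */
          r0 = READ_ONCE(*rq_curr);
  }

  P1(int *closid, int *rq_curr)
  {
          int r1;

          /* __schedule(): make t1 current... */
          WRITE_ONCE(*rq_curr, 1);
          /* ...then resctrl_sched_in() loads the closid. */
          r1 = READ_ONCE(*closid);
  }

  exists (0:r0=0 /\ 1:r1=0)

Without a full barrier between the store and the load on each side, the
"exists" outcome is allowed: the mover decides not to IPI while t1 keeps
running with the stale closid.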

Please let me know if these diagrams clarify things.

-Peter