Re: [PATCH v2 0/3] x86/intel_rdt: task_work vs task_struct rmid/closid write race

From: Reinette Chatre
Date: Wed Nov 25 2020 - 12:20:57 EST


Hi Valentin,

On 11/25/2020 7:01 AM, Valentin Schneider wrote:
On 24/11/20 21:37, Reinette Chatre wrote:
On 11/22/2020 6:24 PM, Valentin Schneider wrote:
This is a small cleanup + a fix for a race I stumbled upon while staring at
resctrl stuff.

Briefly tested on a Xeon Gold 5120 (m2.xlarge.x86 on Equinix) by bouncing
a few tasks around control groups.


...

Thank you very much for taking this on. Unfortunately this race is one
of a few issues with the way in which tasks moving to a new resource
group is handled.

Other issues are:

1.
Until the queued work is run, the moved task runs with old (and even
invalid in the case when its original resource group has been removed)
closid and rmid.


For a userspace task, that queued work should be run as soon as possible
(& relevant). If said task is currently running, then task_work_add() will
lead to an IPI;
the other cases (task moving itself or not currently
running) are covered by the return to userspace path.

At this time the work is added with the TWA_RESUME flag so the running task does not get a signal. I tried to follow the task_work_add() path if there is a change to use TWA_SIGNAL instead and (I may have misunderstanding) it seems to me that a sleeping task will be woken (if it is TASK_INTERRUPTIBLE)? That is unnecessary. The goal of this work is only to change the CPU register to indicate the active closid/rmid so it is unnecessary to wake a process to do that, it only needs to be done next time the task is scheduled in (which is already done with the resctrl_sched_in() call in __switch_to()). If a task is not running all that is needed is to change the closid/rmid in its task_struct to be used next time it is scheduled in.

In the new solution, after updating closid/rmid in the task_struct, the CPU register is updated via smp_call_function_single() on a CPU the task is running. Nothing is done for tasks not running, next time they are scheduled in the CPU's register will be updated to reflect the task's closid/rmid. Moving to the smp_call_function_xxx() API would also bring this update in line with how other register updates are already done in resctrl.

Kernel threads however are a prickly matter because they quite explicitly
don't have this return to userspace - they only run their task_work
callbacks on exit. So we currently have to wait for those kthreads to go
through a context switch to update the relevant register, but I don't
see any other alternative that wouldn't involve interrupting every other
CPU (the kthread could move between us triggering some remote work and its
previous CPU receiving the IPI).

This seems ok? In the new solution the closid/rmid would be updated in task_struct and a smp_call_function_single() attempted on the CPU where the kthread is running. If the kthread is no longer running at the time the function is called the CPU register will not be changed. I assume the kthread move would include a context switch that would result in the register change (__switch_to()->resctrl_sched_in()) for the kthread to run with its new closid/rmid after the move.

2.
Work to update the PQR_ASSOC register is queued every time the user
writes a task id to the "tasks" file, even if the task already belongs
to the resource group and in addition to any other pending work for that
task. This could result in multiple pending work items associated with a
single task even if they are all identical and even though only a single
update with most recent values is needed. This could result in
significant system resource waste, especially on tasks sleeping for a
long time.

Fenghua solved these issues by replacing the callback with a synchronous
update, similar to how tasks are currently moved when a resource group
is deleted. We plan to submit this work next week.

This new solution will make patch 1 and 2 of this series unnecessary. As
I understand it patch 3 would still be a welcome addition but would
require changes. As you prefer you could either submit patch 3 on its
own for the code as it is now and we will rework the task related
changes on top of that, or you could wait for the task related changes
to land first?


Please do Cc me on those - I'll re-evaluate the need for patch 3 then.

Will do. I appreciate you taking a look.

Thank you very much

Reinette