Re: [PATCH] cgroup/cpuset: Optimize update_tasks_nodemask()

From: Haifeng Xu
Date: Thu Nov 24 2022 - 21:14:30 EST




On 2022/11/25 07:00, Waiman Long wrote:
> On 11/24/22 02:49, Haifeng Xu wrote:
>>
>> On 2022/11/24 12:24, Waiman Long wrote:
>>> On 11/23/22 22:33, Haifeng Xu wrote:
>>>> On 2022/11/24 04:23, Waiman Long wrote:
>>>>> On 11/23/22 03:21, haifeng.xu wrote:
>>>>>> When change the 'cpuset.mems' under some cgroup, system will hung
>>>>>> for a long time. From the dmesg, many processes or theads are
>>>>>> stuck in fork/exit. The reason is show as follows.
>>>>>>
>>>>>> thread A:
>>>>>> cpuset_write_resmask /* takes cpuset_rwsem */
>>>>>>      ...
>>>>>>        update_tasks_nodemask
>>>>>>          mpol_rebind_mm /* waits mmap_lock */
>>>>>>
>>>>>> thread B:
>>>>>> worker_thread
>>>>>>      ...
>>>>>>        cpuset_migrate_mm_workfn
>>>>>>          do_migrate_pages /* takes mmap_lock */
>>>>>>
>>>>>> thread C:
>>>>>> cgroup_procs_write /* takes cgroup_mutex and
>>>>>> cgroup_threadgroup_rwsem */
>>>>>>      ...
>>>>>>        cpuset_can_attach
>>>>>>          percpu_down_write /* waits cpuset_rwsem */
>>>>>>
>>>>>> Once update the nodemasks of cpuset, thread A wakes up thread B to
>>>>>> migrate mm. But when thread A iterates through all tasks, including
>>>>>> child threads and group leader, it has to wait the mmap_lock which
>>>>>> has been take by thread B. Unfortunately, thread C wants to migrate
>>>>>> tasks into cgroup at this moment, it must wait thread A to release
>>>>>> cpuset_rwsem. If thread B spends much time to migrate mm, the
>>>>>> fork/exit which acquire cgroup_threadgroup_rwsem also need to
>>>>>> wait for a long time.
>>>>>>
>>>>>> There is no need to migrate the mm of child threads which is
>>>>>> shared with group leader. Just iterate through the group
>>>>>> leader only.
>>>>>>
>>>>>> Signed-off-by: haifeng.xu <haifeng.xu@xxxxxxxxxx>
>>>>>> ---
>>>>>>     kernel/cgroup/cpuset.c | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>>>> index 589827ccda8b..43cbd09546d0 100644
>>>>>> --- a/kernel/cgroup/cpuset.c
>>>>>> +++ b/kernel/cgroup/cpuset.c
>>>>>> @@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset
>>>>>> *cs)
>>>>>>               cpuset_change_task_nodemask(task, &newmems);
>>>>>>     +        if (!thread_group_leader(task))
>>>>>> +            continue;
>>>>>> +
>>>>>>             mm = get_task_mm(task);
>>>>>>             if (!mm)
>>>>>>                 continue;
>>>>> Could you try the attached test patch to see if it can fix your
>>>>> problem?
>>>>> Something along the line of this patch will be more acceptable.
>>>>>
>>>>> Thanks,
>>>>> Longman
>>>>>
>>>> Hi, Longman.
>>>> Thanks for your patch, but there are still some problems.
>>>>
>>>> 1)
>>>>     (group leader, node: 0,1)
>>>>            cgroup0
>>>>            /     \
>>>>           /       \
>>>>       cgroup1   cgroup2
>>>>      (threads)  (threads)
>>>>
>>>> If set node 0 in cgroup1 and node 1 in cgroup2, both of them will
>>>> update
>>>> the mm. And the nodemask of mm depends on who set the node last.
>>> Yes, that is the existing behavior. It was not that well defined in the
>>> past and so it is somewhat ambiguous as to what we need to do about it.
>>>
>> The test patch works if the child threads are in same cpuset with group
>> leader which has same logic with my patch. But if they are in different
>> cpusets, the test patch will fail because the contention of mmap_lock
>> still exsits and seems similar to the original logic.
>
> That is true. I am thinking about adding a nodemask to mm_struct so that
> we can figure out if we need to propagate the changes down to all the
> VMAs and do the migration. That will enable us to avoid doing wasteful
> work.
>
> Current node mask handling isn't that efficient especially for distros
> that have a relatively large NODES_SHIFT value. Some work may also be
> need in this area.
>
>>> BTW, cgroup1 has a memory_migrate flag which will force page migration
>>> if set. I guess you may have it set in your case as it will introduce a
>>> lot more delay as page migration takes time. That is probably the reason
>>> why you are seeing a long delay. So one possible solution is to turn
>>> this flag off. Cgroup v2 doesn't have this flag.
>>>
>> Dou you mean 'CS_MEMORY_MIGRATE'? This flag can be turn off in Cgroup
>> v1, but it has been set in Cgroup v2 (cpuset_css_alloc) in default and
>> couldn't be changed.
> You are right. Cgroup v2 has CS_MEMORY_MIGRATE enabled by default and
> can't be turned off.
>>
>>>> 2)
>>>>      (process, node: 0,1)
>>>>            cgroup0
>>>>            /     \
>>>>           /       \
>>>>       cgroup1   cgroup2
>>>>      (node: 0)  (node: 1)
>>>>
>>>> If migrate thread from cgroup0 to cgroup1 or cgroup2, cpuset_attach
>>>> won't update the mm. So the nodemask of thread, including mems_allowed
>>>> and mempolicy(updated in cpuset_change_task_nodemask), is different
>>>> from
>>>> the vm_policy in vma(updated in mpol_rebind_mm).
>>> Yes, that can be the case.
>>>
>>>>
>>>> In a word, if threads have different cpusets with different
>>>> nodemask, it
>>>> will cause inconsistent memory behavior.
>>> So do you have suggestion of what we need to do going forward?
>> Should we prevent thread from migrating to those cgroups which have
>> different nodemask with the cgroup that contains the group leader?
>>
>> In addition, the group leader and child threads should be in same cgroup
>> tree, also the level of cgroup containes group leader must be higher
>> than these cgroups contain child threads, so update_nodemask will work.
>>
>> Or just disable thread migration in cpuset?It's easy to achieve but will
>> affect cpu bind.
>
> As said above, my current inclination is to add a nodemask to mm_struct
> and revise the way nodemask is being handled. That will take some time>
> Cheers,
> Longman
>

OK, thanks.