Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure

From: Huang, Ying
Date: Mon Feb 19 2024 - 22:31:27 EST


Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> writes:

> Hi
>
> On 2/19/24 08:33, Huang, Ying wrote:
>> Hi, Daniel,
>>
>> Thanks a lot for your great patchset!
>>
>> We have a similar starvation issue in the mm subsystem too. Details are
>> in the patch description of the commit below. In short, task A busy
>> loops on some event, while task B signals the event only after doing
>> some work. If the priority of task A is higher than that of task B,
>> task B may be starved.
>
> ok...
>
>>
>> IIUC, if task A is an RT task while task B is a fair task, then your
>> patchset will solve the issue.
>
> This patch set will not solve the issue. It will mitigate the effect of the
> problem. Still, the system will perform very poorly...

I don't think it's common (or even reasonable) for real-time tasks to
use swap, so, IMHO, performance isn't very important here. But we do
need to avoid the live-lock anyway, and I think your patchset solves
that.

>> If both task A and task B are RT tasks, is there some way to solve the
>> issue?
>
> I would say reworking the swap algorithm, as it is not meant to be used when
> real-time tasks are in place.
>
> As an exercise, let's say that we add a server per priority on FIFO, with a default
> 50ms/1s runtime/period. Your "real-time" workload would suffer up to 950ms of
> latency, busy looping in vain.

If the target is only live-lock avoidance, would it be possible to let
lower-priority runnable tasks run for a short while once we have been
spinning in the busy loop for long enough? Something along the lines of
the sketch below.
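
To make that concrete, here is a rough, untested sketch of what I mean,
written against the retry loop in __read_swap_cache_async(). It is only
an illustration: the loop shape is simplified, and SWAPIN_SPIN_LIMIT is
a made-up threshold, not an existing symbol.

	/*
	 * Hypothetical sketch only: keep using cond_resched() for the
	 * first iterations, and fall back to sleeping for a tick only
	 * when we have clearly been spinning without progress, so that
	 * a lower-priority task gets a chance to clear SWAP_HAS_CACHE.
	 */
	unsigned int spins = 0;

	while (swapcache_prepare(entry) == -EEXIST) {
		if (++spins < SWAPIN_SPIN_LIMIT) {	/* made-up threshold */
			/* A no-op if we are the highest-priority task. */
			cond_resched();
			continue;
		}
		spins = 0;
		/* Sleep for one tick so lower-priority tasks can run. */
		schedule_timeout_uninterruptible(1);
	}

Whether an extra knob like that is worth it compared with just sleeping
unconditionally, as we do now, I am not sure.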

> Then one would say: let's lower the parameters, so the granularity of
> the server would provide lower latencies. The same problem would still
> exist, as it exists with sched fair...
>
> So, the solution is not in the scheduler. Busy-loop waiting is not right when you
> have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
> spin_locks that do busy waiting. spin_locks do not have the problem you
> show because they disable preemption... but disabling preemption is not
> the best solution either.
>
> So, a first duct-tape attempt would be to use (local_?) locks, as in
> PREEMPT_RT, to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>>
>> Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue; is there something better?
>
> I am not a swap/mm expert... my guesses would all be on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server or any type of starvation avoidance does not
> seem to be a solution for your problem.

Yes. To improve performance, we need something else.
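
If we do look into the sleepable-locking direction, one naive idea would
be to let the looping task sleep on a wait queue until SWAP_HAS_CACHE is
dropped, instead of polling for it. Below is a very rough, hypothetical
sketch only: the wait queue, the swap_has_cache() helper, the retry
label and the wake-up site are all made up for illustration, and I am
ignoring the thundering-herd and hot-path cost questions.

	/* Hypothetical: a global wait queue for swapin races. */
	static DECLARE_WAIT_QUEUE_HEAD(swapcache_update_wq);

	/* In the __read_swap_cache_async() retry path, instead of spinning: */
	err = swapcache_prepare(entry);
	if (err == -EEXIST) {
		/*
		 * Sleep until the task that set SWAP_HAS_CACHE has added
		 * its page to the swap cache (or dropped the flag) and
		 * woken us up.
		 */
		wait_event(swapcache_update_wq, !swap_has_cache(entry));
		goto retry;
	}

	/* ... and on the path that clears SWAP_HAS_CACHE: */
	wake_up_all(&swapcache_update_wq);

That trades the busy wait for a real sleep, at the cost of extra
wake-ups on a hot path, so it may well be a bad trade; I mostly want to
understand whether something in that direction is what you had in mind.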

--
Best Regards,
Huang, Ying

> -- Daniel
>
>
>> Best Regards,
>> Huang, Ying
>>
>> --------------------------8<---------------------------------------
>> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
>> Author: Guo Ziliang <guo.ziliang@xxxxxxxxxx>
>> Date: Wed Mar 16 16:15:03 2022 -0700
>>
>> mm: swap: get rid of livelock in swapin readahead
>>
>> In our testing, a livelocked task was found. Through sysrq printing, the
>> same stack was found every time, as follows:
>>
>> __swap_duplicate+0x58/0x1a0
>> swapcache_prepare+0x24/0x30
>> __read_swap_cache_async+0xac/0x220
>> read_swap_cache_async+0x58/0xa0
>> swapin_readahead+0x24c/0x628
>> do_swap_page+0x374/0x8a0
>> __handle_mm_fault+0x598/0xd60
>> handle_mm_fault+0x114/0x200
>> do_page_fault+0x148/0x4d0
>> do_translation_fault+0xb0/0xd4
>> do_mem_abort+0x50/0xb0
>>
>> The reason for the livelock is that swapcache_prepare() always returns
>> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>> cannot jump out of the loop. We suspect that the task that clears the
>> SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
>> priority of the task stuck in a livelock so that the task that clears
>> the SWAP_HAS_CACHE flag will run. The results show that the system
>> returns to normal after the priority is lowered.
>>
>> In our testing, multiple real-time tasks are bound to the same core, and
>> the task in the livelock is the highest priority task of the core, so
>> the livelocked task cannot be preempted.
>>
>> Although cond_resched() is used by __read_swap_cache_async(), it is an
>> empty function on a preemptible kernel and cannot achieve the purpose
>> of releasing the CPU. A high-priority task cannot release the CPU
>> unless it is preempted by a higher-priority task. But when this task
>> is already the highest-priority task on this core, other tasks will
>> not get a chance to be scheduled. So we think we should replace
>> cond_resched() with schedule_timeout_uninterruptible(1):
>> schedule_timeout_uninterruptible() calls set_current_state() first to
>> set the task state, so the task is removed from the run queue, giving
>> up the CPU and preventing it from running in kernel mode for too long.
>>
>> (akpm: ugly hack becomes uglier. But it fixes the issue in a
>> backportable-to-stable fashion while we hopefully work on something
>> better)
>>
>> Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@xxxxxxxxx
>> Signed-off-by: Guo Ziliang <guo.ziliang@xxxxxxxxxx>
>> Reported-by: Zeal Robot <zealci@xxxxxxxxxx>
>> Reviewed-by: Ran Xiaokai <ran.xiaokai@xxxxxxxxxx>
>> Reviewed-by: Jiang Xuexin <jiang.xuexin@xxxxxxxxxx>
>> Reviewed-by: Yang Yang <yang.yang29@xxxxxxxxxx>
>> Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>
>> Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Minchan Kim <minchan@xxxxxxxxxx>
>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
>> Cc: Roger Quadros <rogerq@xxxxxxxxxx>
>> Cc: Ziliang Guo <guo.ziliang@xxxxxxxxxx>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 8d4104242100..ee67164531c0 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>> * in swap_map, but not yet added its page to swap cache.
>> */
>> - cond_resched();
>> + schedule_timeout_uninterruptible(1);
>> }
>>
>> /*