Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure

From: Daniel Bristot de Oliveira
Date: Mon Feb 19 2024 - 05:23:52 EST


Hi

On 2/19/24 08:33, Huang, Ying wrote:
> Hi, Daniel,
>
> Thanks a lot for your great patchset!
>
> We have a similar starvation issue in mm subsystem too. Details are in
> the patch description of the below commit. In short, task A is busy
> looping on some event, while task B will signal the event after some
> work. If the priority of task A is higher than that of task B, task B
> may be starved.

ok...
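
Just so we are talking about the same pattern, a minimal sketch of that kind
of busy wait (hypothetical code, not the actual mm path; "done" stands for
whatever flag/event is involved):

static int done;

/* task A: SCHED_FIFO, highest priority on its CPU */
static void task_a(void)
{
        while (!READ_ONCE(done))
                cpu_relax();    /* spins forever; nothing can preempt it */
}

/* task B: lower priority, pinned to the same CPU */
static void task_b(void)
{
        /* ... the work that produces the event ... */
        WRITE_ONCE(done, 1);    /* never reached while task A spins */
}

With A at a higher RT priority on the same CPU, B never runs, so the flag is
never set and A spins forever.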

>
> IIUC, if task A is RT task while task B is fair task, then your patchset
> will solve the issue.

This patch set will not solve the issue; it will only mitigate the effects of
the problem. The system will still perform very poorly...

> If both task A and task B are RT tasks, is there
> some way to solve the issue?

I would say rework the swap algorithm, as it is not meant to be used when
real-time tasks are in place.

As an exercise, let's say we add a server per FIFO priority, each with the
default 50ms/1s runtime/period. Your "real-time" workload would suffer up to
950ms of latency, busy looping in vain.
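
To spell out the arithmetic of that exercise:

        worst-case delay = period - runtime = 1000ms - 50ms = 950ms

i.e., the server only guarantees the starved task 50ms of runtime per 1s
period, and the busy loop burns the other 950ms for nothing.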

Then one could say: let's lower the parameters, so the finer granularity of
the server provides lower latencies. The same problem would still exist, just
as it exists with sched fair...

So, the solution is not in the scheduler. Busy-loop waiting is not right when
you have RT tasks. That is why PREEMPT_RT reworks the locking patterns to
remove spin_locks that do busy waiting. spin_locks do not have the problem you
show because they disable preemption... but disabling preemption is not
the best solution either.

So, a first duct-tape attempt would be using (local_?) locks like in
PREEMPT_RT to make things sleepable...
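
Just to illustrate the "sleep instead of spin" direction (a sketch only,
using a waitqueue rather than a lock, with a made-up waitqueue and helper;
the actual fix belongs to the mm folks):

/* hypothetical waitqueue for SWAP_HAS_CACHE changes */
static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);

/* waiter side, instead of busy looping until swapcache_prepare() succeeds: */
wait_event(swapcache_wq, !swap_has_cache(entry)); /* swap_has_cache() is made up */

/* on the path that clears SWAP_HAS_CACHE: */
wake_up(&swapcache_wq);

This way the high-priority task actually leaves the CPU until the flag is
cleared, instead of eating the time the other task needs to clear it.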

AFAICS, this was already discussed in the previous link, right?

>
> Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
> resolve the issue; is there something better?

I am not a swap/mm expert... my guesses would all be on sleepable locking.
But I know there are many smart people on the mm side with better guesses...

It is just that the DL server or any type of starvation avoidance does not
seem to be a solution for your problem.

-- Daniel


> Best Regards,
> Huang, Ying
>
> --------------------------8<---------------------------------------
> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
> Author: Guo Ziliang <guo.ziliang@xxxxxxxxxx>
> Date: Wed Mar 16 16:15:03 2022 -0700
>
> mm: swap: get rid of livelock in swapin readahead
>
> In our testing, a livelock task was found. Through sysrq printing, same
> stack was found every time, as follows:
>
> __swap_duplicate+0x58/0x1a0
> swapcache_prepare+0x24/0x30
> __read_swap_cache_async+0xac/0x220
> read_swap_cache_async+0x58/0xa0
> swapin_readahead+0x24c/0x628
> do_swap_page+0x374/0x8a0
> __handle_mm_fault+0x598/0xd60
> handle_mm_fault+0x114/0x200
> do_page_fault+0x148/0x4d0
> do_translation_fault+0xb0/0xd4
> do_mem_abort+0x50/0xb0
>
> The reason for the livelock is that swapcache_prepare() always returns
> EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
> cannot jump out of the loop. We suspect that the task that clears the
> SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
> priority of the task stuck in a livelock so that the task that clears
> the SWAP_HAS_CACHE flag will run. The results show that the system
> returns to normal after the priority is lowered.
>
> In our testing, multiple real-time tasks are bound to the same core, and
> the task in the livelock is the highest priority task of the core, so
> the livelocked task cannot be preempted.
>
> Although cond_resched() is used by __read_swap_cache_async, it is an
> empty function in a preemptible kernel and cannot achieve the purpose
> of releasing the CPU. A high-priority task cannot release the CPU
> unless preempted by a higher-priority task. But when this task is
> already the highest-priority task on this core, other tasks will not be
> able to be scheduled. So we think we should replace cond_resched() with
> schedule_timeout_uninterruptible(1); schedule_timeout_uninterruptible()
> calls set_current_state() first to set the task state, so the task is
> removed from the run queue, thus giving up the CPU and preventing it
> from running in kernel mode for too long.
>
> (akpm: ugly hack becomes uglier. But it fixes the issue in a
> backportable-to-stable fashion while we hopefully work on something
> better)
>
> Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@xxxxxxxxx
> Signed-off-by: Guo Ziliang <guo.ziliang@xxxxxxxxxx>
> Reported-by: Zeal Robot <zealci@xxxxxxxxxx>
> Reviewed-by: Ran Xiaokai <ran.xiaokai@xxxxxxxxxx>
> Reviewed-by: Jiang Xuexin <jiang.xuexin@xxxxxxxxxx>
> Reviewed-by: Yang Yang <yang.yang29@xxxxxxxxxx>
> Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Minchan Kim <minchan@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Roger Quadros <rogerq@xxxxxxxxxx>
> Cc: Ziliang Guo <guo.ziliang@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8d4104242100..ee67164531c0 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> * in swap_map, but not yet added its page to swap cache.
> */
> - cond_resched();
> + schedule_timeout_uninterruptible(1);
> }
>
> /*