Re: [PATCH] debugobject: don't wake up kswapd from fill_pool()

From: Tetsuo Handa
Date: Fri May 12 2023 - 06:58:13 EST


On 2023/05/12 12:44, Andrew Morton wrote:
> On Thu, 11 May 2023 22:47:32 +0900 Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
>
>> syzbot is reporting lockdep warning in fill_pool(), for GFP_ATOMIC is
>> (__GFP_HIGH | __GFP_KSWAPD_RECLAIM) which wakes up kswapd.
>> Since fill_pool() might be called with arbitrary locks held,
>> fill_pool() should not assume that holding pgdat->kswapd_wait is safe.
>
> hm. But many GFP_ATOMIC allocation attempts are made with locks held.
> Why aren't all such callers buggy, by trying to wake kswapd with locks
> held? What's special about this one?

Because debugobject cannot know what locks are held when fill_pool() does
GFP_ATOMIC allocation.

syzbot is reporting base->lock => pgdat->kswapd_wait dependency

add_timer() {
__mod_timer() {
base = lock_timer_base(timer, &flags);
new_base = get_target_base(base, timer->flags);
if (base != new_base) {
raw_spin_unlock(&base->lock);
base = new_base;
raw_spin_lock(&base->lock);
}
debug_timer_activate(timer) {
debug_object_activate(timer, &timer_debug_descr) {
debug_objects_fill_pool() {
fill_pool() {
kmem_cache_zalloc(GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN) {
// wakes kswapd
}
}
}
}
}
raw_spin_unlock_irqrestore(&base->lock, flags);
}
}

when pgdat->kswapd_wait => p->pi_lock dependency

__alloc_pages() {
get_page_from_freelist() {
rmqueue() {
wakeup_kswapd() {
wake_up_interruptible(&pgdat->kswapd_wait) {
__wake_up_common_lock() {
spin_lock_irqsave(&pgdat->kswapd_wait.lock, flags);
__wake_up_common() {
autoremove_wake_function() {
try_to_wake_up() {
raw_spin_lock_irqsave(&p->pi_lock, flags);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}
}
}
spin_unlock_irqrestore(&pgdat->kswapd_wait.lock, flags);
}
}
}
}
}
}

and p->pi_lock => rq->__lock => base->lock dependency

wake_up_new_task() {
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
rq = __task_rq_lock(p, &rf); // acquires rq->lock
activate_task(rq, p, ENQUEUE_NOCLOCK) {
enqueue_task() {
psi_enqueue() {
psi_task_change() {
queue_delayed_work_on() {
__queue_delayed_work() {
add_timer() {
__mod_timer() {
base = lock_timer_base(timer, &flags); // acquires base->lock
debug_timer_activate(timer); // possible base->lock => pgdat->kswapd_wait => p->pi_lock dependency
raw_spin_unlock_irqrestore(&base->lock, flags);
}
}
}
}
}
}
}
}
task_rq_unlock(rq, p, &rf);
}

exists.

All GFP_ATOMIC allocation users are supposed to be aware of what locks
are held, and are supposed to explicitly remove __GFP_KSWAPD_RECLAIM
if waking up kswapd can cause deadlock. But reality is that we can't be
careful enough to error-free. Who would imagine GFP_ATOMIC allocation
while base->lock is held can form circular locking dependency?

>
>> Also, __GFP_NORETRY is pointless for !__GFP_DIRECT_RECLAIM allocation

__GFP_NORETRY is not checked by !__GFP_DIRECT_RECLAIM allocation.
GFP_ATOMIC - __GFP_KSWAPD_RECLAIM is __GFP_HIGH.

>>
>> @@ -126,7 +126,7 @@ static const char *obj_states[ODEBUG_STATE_MAX] = {
>>
>> static void fill_pool(void)
>> {
>> - gfp_t gfp = GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN;
>> + gfp_t gfp = __GFP_HIGH | __GFP_NOWARN;
>
> Does this weaken fill_pool()'s allocation attempt more than necessary?
> We can still pass __GFP_HIGH?

What do you mean? I think that killing base->lock => pgdat->kswapd_wait
by removing __GFP_KSWAPD_RECLAIM is the right fix. This weakening is
needed for avoiding base->lock => pgdat->kswapd_wait dependency from
debugobject code.

For locking dependency safety, I wish that GFP_ATOMIC / GFP_NOWAIT do not imply
__GFP_KSWAPD_RECLAIM. Such allocations should not try to allocate as many pages
as even __GFP_HIGH fails. And if such allocations try to allocate as many pages
as even __GFP_HIGH fails, they likely already failed before background kswapd
reclaim finds some reusable pages....