Re: [PATCH v1] PM: sleep: Restore asynchronous device resume optimization

From: Konrad Dybcio
Date: Sat Feb 10 2024 - 09:04:31 EST




On 2/7/24 11:59, Rafael J. Wysocki wrote:
On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
<m.szyprowski@xxxxxxxxxxx> wrote:

Dear All,

On 09.01.2024 17:59, Rafael J. Wysocki wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
system-wide PM code"), the resume of devices that were allowed to resume
asynchronously was scheduled before starting the resume of the other
devices, so the former did not have to wait for the latter unless
functional dependencies were present.

Commit 7839d0078e0d removed that optimization in order to address a
correctness issue, but it can be restored with the help of a new device
power management flag, so do that now.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
---

This patch finally landed in linux-next some time ago as 3e999770ac1c
("PM: sleep: Restore asynchronous device resume optimization"). Recently
I found that it causes a non-trivial interaction with commit
5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
system suspend/resume fails (board doesn't wake up) on my old Samsung
Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
stable for last years.

My further investigations confirmed that the mentioned commits are
responsible for this issue. Each of them separately (3e999770ac1c and
5797b1c18919) doesn't trigger any problems. Reverting any of them on top
of linux-next (with some additional commit due to code dependencies)
also fixes/hides the problem.

Let me know if You need more information or tests on the hardware. I'm
open to help debugging this issue.

So I found this report:

https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@xxxxxxxxxx/

which appears to be about the same issue.

Strictly speaking, the regression is introduced by 5797b1c18919,
because it is not a mainline commit yet, but the information regarding
the interaction of it with commit 3e999770ac1c is valuable.

Essentially, what commit 3e999770ac1c does is to schedule the
execution of all of the async resume callbacks in a given phase
upfront, so they can run without waiting for the sync ones to complete
(except when there is a parent-child or supplier-consumer dependency -
the latter represented by a device link).

What seems to be happening after commit 5797b1c18919 is that one (or
more) of the async callbacks get blocked in the workqueue for some
reason.

I would try to replace system_unbound_wq in
__async_schedule_node_domain() with a dedicated workqueue that is not
unbound and see what happens.

Thanks Rafael for helping connect the dots!

I did the laziest things imaginable to switch to a WQ that's
not unbound:

diff --git a/kernel/async.c b/kernel/async.c
index 97f224a5257b..37f1204ab4e9 100644
--- a/kernel/async.c
+++ b/kernel/async.c
@@ -174,7 +174,7 @@ static async_cookie_t __async_schedule_node_domain(async_func_t func,
spin_unlock_irqrestore(&async_lock, flags);
/* schedule for execution */
- queue_work_node(node, system_unbound_wq, &entry->work);
+ queue_work_node(node, system_wq, &entry->work);
.and the system can now suspend/resume all day long again!

Konrad