Re: [PATCH] soc: qcom: qmi: Signal the txn completion after releasing the mutex

From: Chris Lew
Date: Mon Jul 31 2023 - 20:36:29 EST




On 7/31/2023 8:19 AM, Pavan Kondeti wrote:
On Mon, Jul 31, 2023 at 06:37:55PM +0530, Praveenkumar I wrote:
txn is in #1 stack

Worker #1 Worker #2
******** *********

qmi_txn_wait(txn) qmi_handle_message
| |
| |
wait_for_complete(txn->complete) ....
| mutex_lock(txn->lock)
| |
mutex_lock(txn->lock) |
..... complete(txn->lock)
| mutex_unlock(txn->lock)
|
mutex_unlock(txn->lock)

In this case above, while #2 is doing the mutex_unlock(txn->lock),
in between releasing lock and doing other lock related wakeup, #2 gets
scheduled out. As a result #1, acquires the lock, unlocks, also
frees the txn also (where the lock resides)

Now #2, gets scheduled again and tries to do the rest of the lock
related wakeup, but lock itself is invalid because txn itself is gone.

Fixing this, by doing the mutex_unlock(txn->lock) first and then
complete(txn->lock) in #2

Fixes: 3830d0771ef6 ("soc: qcom: Introduce QMI helpers")
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Sricharan Ramabadhran <quic_srichara@xxxxxxxxxxx>
Signed-off-by: Praveenkumar I <quic_ipkumar@xxxxxxxxxxx>
---
drivers/soc/qcom/qmi_interface.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/qcom/qmi_interface.c b/drivers/soc/qcom/qmi_interface.c
index 78d7361fdcf2..92e29db97359 100644
--- a/drivers/soc/qcom/qmi_interface.c
+++ b/drivers/soc/qcom/qmi_interface.c
@@ -505,12 +505,13 @@ static void qmi_handle_message(struct qmi_handle *qmi,
pr_err("failed to decode incoming message\n");
txn->result = ret;
- complete(&txn->completion);
} else {
qmi_invoke_handler(qmi, sq, txn, buf, len);
}
mutex_unlock(&txn->lock);
+ if (txn->dest && txn->ei)
+ complete(&txn->completion);
} else {
/* Create a txn based on the txn_id of the incoming message */
memset(&tmp_txn, 0, sizeof(tmp_txn));

What happens in a remote scenario where the waiter gets timed out at the
very same time you are releasing the mutex but before calling
complete()? The caller might end up freeing txn structure and it results
in the same issue you are currently facing.

Thanks,
Pavan

I think downstream we had various attempts of moving the signal around trying to avoid this, but hit scenarios like the one Pavan described.

We eventually settled on removing the txn->lock and treating the qmi->txn_lock as a big lock. This remedied the issue where the txn->lock goes out of scope since qmi->txn_lock is tied to the qmi handle.