Re: [PATCH v4 07/13] firmware: arm_scmi: Add notification dispatch and delivery

From: Lukasz Luba
Date: Wed Mar 18 2020 - 04:26:15 EST


Hi Cristian,

On 3/16/20 2:46 PM, Cristian Marussi wrote:
On Thu, Mar 12, 2020 at 09:43:31PM +0000, Lukasz Luba wrote:


On 3/12/20 6:34 PM, Cristian Marussi wrote:
On 12/03/2020 13:51, Lukasz Luba wrote:
Hi Cristian,

Hi Lukasz

just one comment below...
[snip]
+ eh.timestamp = ts;
+ eh.evt_id = evt_id;
+ eh.payld_sz = len;
+ kfifo_in(&r_evt->proto->equeue.kfifo, &eh, sizeof(eh));
+ kfifo_in(&r_evt->proto->equeue.kfifo, buf, len);
+ queue_work(r_evt->proto->equeue.wq,
+ &r_evt->proto->equeue.notify_work);

Is it safe to ignore the return value from the queue_work here?


Yes, in fact we deliberately do not care: it returns true or false depending
on whether that specific work item was already queued, and we rely on exactly
this behavior to kick the worker only when needed while never having more
than one instance of it per queue (so that there is only one reader, the wq
worker, and one writer, scmi_notify() here). To explain better:

1. we push an event (hdr+payld) to the protocol queue if we found that there
was enough space on the queue

2a. if at the time of the kfifo_in() the worker was already running
(queue not empty), it will process our new event sooner or later; here
queue_work() will return false, but we do not care ... we only
tried to kick it just in case

2b. if instead at the time of the kfifo_in() the queue was empty, the worker
will probably have already gone to sleep, so this queue_work() will return
true and this time it will actually wake up the worker to process our items

The important thing here is that we are sure to wake up the worker when needed,
but we are equally sure that we never cause the scheduling of more than one
worker thread consuming from the same queue (that would break the one
reader/one writer assumption which lets us use the fifo in a lockless manner).
This is possible because queue_work() checks whether the given work item is
already pending and, in that case, backs out returning false; and we have one
work_item (notify_work) defined per protocol, and so per queue.
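
To make the "kick at most once" rule above concrete, here is a minimal
userspace sketch in C11. It only models the return convention as described in
this thread, not the actual workqueue implementation; struct model_work,
model_work_init(), model_queue_work() and model_work_done() are hypothetical
names invented for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one work item; one exists per protocol queue. */
struct model_work {
	atomic_bool pending;	/* stand-in for the "already queued" state */
	int kicks;		/* how many times the worker was really woken */
};

static void model_work_init(struct model_work *w)
{
	atomic_init(&w->pending, false);
	w->kicks = 0;
}

/*
 * Mimics the convention relied upon above: returns true only when this
 * call is the one that queued the work, false when it was already pending.
 */
static bool model_queue_work(struct model_work *w)
{
	/* atomic_exchange() returns the previous value of the flag */
	if (atomic_exchange(&w->pending, true))
		return false;	/* already pending: back out */
	w->kicks++;
	return true;
}

/* Assumption of this model: called once the work is no longer pending. */
static void model_work_done(struct model_work *w)
{
	atomic_store(&w->pending, false);
}
```

Calling model_queue_work() twice in a row kicks the worker only once; a
further kick succeeds only after model_work_done(). That is the property
that keeps a single reader on each queue.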

I see. That's a good assumption: one work_item per protocol, which simplifies
the locking. But what about an edge case in which the consumer (work_item)
has handled the last item (it got NULL from scmi_process_event_header()),
while in the meantime scmi_notify put a new event into the fifo but could not
kick queue_work? Would the event stay there until the next IRQ triggers
queue_work, which then consumes two events (one potentially a bit old)? Or
can we ignore such a race, assuming that clearing the work item is instant
and kfifo_in is slow?


In fact, this is a very good point: between the moment the worker
determines that the queue is empty and the moment the worker actually
exits (and is marked as no longer pending by the kernel cmwq) there is a
window for a race in which the ISR could fill the queue with one more event
and then fail to kick with queue_work(), since the work is still nominally
marked as pending from the point of view of the kernel cmwq, as below:

ISR (core N)          | WQ (core N+1)                  cmwq flags        queued events
----------------------------------------------------------------------------------------
                      | if (queue_is_empty)            - WORK_PENDING    0 events queued
                      +     ...                        - WORK_PENDING    0 events queued
                      + } while (scmi_process_event_payload);
                      + } // worker function exit
kfifo_in()            +     ...cmwq backing out        - WORK_PENDING    1 events queued
kfifo_in()            +     ...cmwq backing out        - WORK_PENDING    1 events queued
queue_work()          +     ...cmwq backing out        - WORK_PENDING    1 events queued
  -> FALSE (pending)  +     ...cmwq backing out        - WORK_PENDING    1 events queued
                      +     ...cmwq backing out        - WORK_PENDING    1 events queued
                      +     ...cmwq backing out        - WORK_PENDING    1 events queued
                      | ---- WORKER THREAD EXIT        - !WORK_PENDING   1 events queued
                      |                                - !WORK_PENDING   1 events queued
kfifo_in()            |                                - !WORK_PENDING   2 events queued
kfifo_in()            |                                - !WORK_PENDING   2 events queued
queue_work()          |                                - !WORK_PENDING   2 events queued
  -> TRUE             | --- WORKER ENTER               - WORK_PENDING    2 events queued
                      |                                - WORK_PENDING    2 events consumed

where effectively the last event queued won't be consumed till the next
iteration once another event is queued.
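
The interleaving described here can be replayed deterministically with a small
single-threaded model in C. This is only a sketch of the scenario as discussed
in this thread: the pending flag is assumed to stay set until the worker has
fully exited, and struct model_queue together with the model_*() helpers are
hypothetical names, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical model of one protocol queue plus its single work item. */
struct model_queue {
	int queued;	/* events currently sitting in the FIFO */
	int consumed;	/* events already delivered by the worker */
	bool pending;	/* stand-in for the WORK_PENDING flag above */
};

/* ISR side: push one event (header + payload), then try to kick. */
static bool model_isr_notify(struct model_queue *q)
{
	q->queued++;
	if (q->pending)
		return false;	/* the kick backs out: work still pending */
	q->pending = true;
	return true;
}

/* Worker side, step 1: consume everything until the queue looks empty. */
static void model_worker_drain(struct model_queue *q)
{
	q->consumed += q->queued;
	q->queued = 0;
}

/* Worker side, step 2: only the final exit clears the pending flag. */
static void model_worker_exit(struct model_queue *q)
{
	q->pending = false;
}

/* Replays the racy ordering; returns how many events are left stranded. */
static int model_replay_race(struct model_queue *q)
{
	model_isr_notify(q);	/* first event: worker is kicked */
	model_worker_drain(q);	/* worker finds the queue empty ... */
	model_isr_notify(q);	/* ... ISR slips in: kick is refused */
	model_worker_exit(q);	/* worker leaves, one event still queued */
	return q->queued;
}
```

Starting from an empty, idle queue, model_replay_race() leaves one event
queued with nobody scheduled to consume it; it is delivered only when a later
model_isr_notify() manages to kick the worker again.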

Given that the ISR and the dedicated WQ on an SMP system effectively run
in parallel, I unfortunately do not think we can simply count on the worker
exit being faster than the kfifo_in() calls, fast enough to close the race
window (even if hitting it is rare).

On the other hand, considering the impact of such a scenario, it is not just
a matter of delayed delivery: if the delayed event is actually the last one
ever, it would remain undelivered forever. This is particularly worrying when
that last event is an important one: imagine a system shutdown where a final
system-power-off notification remains undelivered.

Agree, another example could be a thermal notification for some critical
trip point.


As a consequence I think this rare race condition should be addressed somehow.

Looking at this scenario, it seems the classic situation in which you want to
use some sort of completion to avoid missing event delivery, BUT in our
use case:

- placing the workers loaned from cmwq into an unbounded wait_for_completion()
once the queue is empty does not seem the best use of resources (and is
probably frowned upon); using a few dedicated kernel threads just to let them
idle waiting most of the time seems equally frowned upon (I could be wrong...)
- the needed complete() in the ISR would introduce a spinlock_irqsave into the
interrupt path (there's already one inside queue_work in fact), so it is not
desirable, at least not if used on a regular basis (for each event notified)

So I was thinking of trying to significantly reduce the above race window,
rather than eliminating it completely, by adding an early flag to be checked
under specific conditions in order to retry the queue_work a few times when
the race is hit, something like:

ISR (core N)              | WQ (core N+1)
----------------------------------------------------------------------------------------------
                          | atomic_set(&exiting, 0);
                          |
                          | do {
                          |     ...
                          |     if (queue_is_empty)          - WORK_PENDING    0 events queued
                          +         atomic_set(&exiting, 1)  - WORK_PENDING    0 events queued
static int cnt=3          |         --> breakout of while    - WORK_PENDING    0 events queued
kfifo_in()                |     ....
                          | } while (scmi_process_event_payload);
kfifo_in()                |
exiting = atomic_read()   |     ...cmwq backing out          - WORK_PENDING    1 events queued
do {                      |     ...cmwq backing out          - WORK_PENDING    1 events queued
    ret = queue_work()    |     ...cmwq backing out          - WORK_PENDING    1 events queued
    if (ret || !exiting)  |     ...cmwq backing out          - WORK_PENDING    1 events queued
        break;            |     ...cmwq backing out          - WORK_PENDING    1 events queued
    mdelay(5);            |     ...cmwq backing out          - WORK_PENDING    1 events queued
    exiting =             |     ...cmwq backing out          - WORK_PENDING    1 events queued
      atomic_read;        |     ...cmwq backing out          - WORK_PENDING    1 events queued
} while (--cnt);          |     ...cmwq backing out          - WORK_PENDING    1 events queued
                          | ---- WORKER EXIT                 - !WORK_PENDING   0 events queued

like down below between the scissors.

Not tested or tried....I could be missing something...and the mdelay is horrible (and not
the cleanest thing you've ever seen probably :D)...I'll have a chat with Sudeep too.
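
The bounded-retry idea sketched above can be modelled in plain userspace C as
follows. This is an untested-in-kernel illustration only: struct model_queue,
its exit_delay field (how many delay ticks the exiting worker still needs) and
the model_*() helpers are hypothetical, and model_delay_tick() merely stands in
for the mdelay() in the proposal.

```c
#include <stdbool.h>

/* Hypothetical model of one queue, its work item and the early flag. */
struct model_queue {
	int queued;	/* events sitting in the FIFO */
	bool pending;	/* stand-in for WORK_PENDING */
	bool exiting;	/* set by the worker once it saw an empty queue */
	int exit_delay;	/* assumed ticks the worker still needs to exit */
};

/* Returns true only if this call scheduled the worker. */
static bool model_queue_work(struct model_queue *q)
{
	if (q->pending)
		return false;
	q->pending = true;
	q->exiting = false;	/* simplification: worker entry resets flag */
	return true;
}

/* Stand-in for mdelay(): each tick lets an exiting worker make progress. */
static void model_delay_tick(struct model_queue *q)
{
	if (q->exiting && q->exit_delay > 0 && --q->exit_delay == 0)
		q->pending = false;	/* worker has finally exited */
}

/*
 * ISR side: push one event, then kick with at most max_tries attempts
 * (max_tries >= 1). Returns true if the worker was scheduled by us.
 */
static bool model_isr_notify_retry(struct model_queue *q, int max_tries)
{
	int cnt = max_tries;

	q->queued++;
	do {
		if (model_queue_work(q))
			return true;
		if (!q->exiting)
			return false;	/* worker still draining: no retry */
		model_delay_tick(q);
	} while (--cnt);

	return false;	/* retries exhausted: window narrowed, not closed */
}
```

With three attempts the kick succeeds when the worker needs two ticks to exit,
but still fails when it needs five, which is why this approach only reduces
the window rather than removing it.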

Indeed, it looks more complicated. If you like, I can join your offline
discussion when Sudeep is back.

Regards,
Lukasz