Re: [PATCH v4 07/13] firmware: arm_scmi: Add notification dispatch and delivery

From: Lukasz Luba
Date: Wed May 20 2020 - 06:23:26 EST


Hi Cristian,

On 5/20/20 8:09 AM, Cristian Marussi wrote:
On Mon, Mar 16, 2020 at 02:46:05PM +0000, Cristian Marussi wrote:
On Thu, Mar 12, 2020 at 09:43:31PM +0000, Lukasz Luba wrote:



Hi Lukasz,

I went back and looked deeper into the possible race issue you pointed out a
while ago, and I now understand it a bit better; see below.

On 3/12/20 6:34 PM, Cristian Marussi wrote:
On 12/03/2020 13:51, Lukasz Luba wrote:
Hi Cristian,

Hi Lukasz

just one comment below...
[snip]
+	eh.timestamp = ts;
+	eh.evt_id = evt_id;
+	eh.payld_sz = len;
+	kfifo_in(&r_evt->proto->equeue.kfifo, &eh, sizeof(eh));
+	kfifo_in(&r_evt->proto->equeue.kfifo, buf, len);
+	queue_work(r_evt->proto->equeue.wq,
+		   &r_evt->proto->equeue.notify_work);

Is it safe to ignore the return value from queue_work() here?


In fact yes, we do not need to care: it returns true or false depending on
whether the specific work was already queued or not, and we rely on this
behavior to kick the worker only when needed, while never kicking more than
one instance of it per queue (so that there is only one reader, the wq, and
one writer, here in scmi_notify). Explaining better:

1. we push an event (hdr+payld) to the protocol queue if we find that there is
enough space on the queue

2a. if at the time of the kfifo_in() the worker was already running
(queue not empty), it will process our new event sooner or later, and here
queue_work() will return false; but we do not care, in fact ... we
tried to kick it just in case

2b. if instead at the time of the kfifo_in() the queue was empty, the worker
will probably have already gone to sleep, and this queue_work() will return
true, so this time it will effectively wake up the worker to process our items

The important thing here is that we are sure to wake up the worker when needed,
but equally sure that we never cause the scheduling of more than one worker
thread consuming from the same queue (because that would break the
one-reader/one-writer assumption which lets us use the fifo in a lockless
manner): this is possible because queue_work() checks whether the required work
item is already pending and, in such a case, backs out returning false; and we
have one work item (notify_work) defined per protocol, and so per queue.
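
As a minimal sketch of the producer side of this pattern (the types, field
names and the queue_event() helper below are simplified assumptions for
illustration, not the actual driver code):

	#include <linux/kfifo.h>
	#include <linux/workqueue.h>

	struct event_header {
		u64	timestamp;
		u8	evt_id;
		size_t	payld_sz;
	};

	struct events_queue {
		struct kfifo		kfifo;		/* lockless SPSC fifo      */
		struct work_struct	notify_work;	/* ONE work item per queue */
		struct workqueue_struct	*wq;
	};

	/* ISR side: the only writer for this queue */
	static void queue_event(struct events_queue *equeue,
				struct event_header *eh, const void *payld)
	{
		kfifo_in(&equeue->kfifo, eh, sizeof(*eh));
		kfifo_in(&equeue->kfifo, payld, eh->payld_sz);

		/*
		 * The return value is deliberately ignored: false only
		 * means the work was already pending, i.e. the worker
		 * will pick up our event anyway.
		 */
		queue_work(equeue->wq, &equeue->notify_work);
	}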

I see. That's a good assumption: one work_item per protocol, and it simplifies
the locking. What if there were an edge-case scenario in which the
consumer (work_item) has handled the last item (there was NULL from
scmi_process_event_header()), while in the meantime scmi_notify put a new
event into the fifo but couldn't kick the queue_work. Would it stay there
till the next IRQ, which triggers queue_work to consume two events (one
potentially a bit old)? Or can we ignore such a race situation, assuming
that cleaning of the work item is instant while kfifo_in is slow?


In fact, this is a very good point: between the moment the worker
determines that the queue is empty and the moment in which the worker
effectively exits (and is marked as no longer pending by the kernel cmwq),
there is a window of opportunity for a race in which the ISR could fill
the queue with one more event and then fail to kick with queue_work(), since
the work is in fact still nominally marked as pending from the point of view
of the kernel cmwq, as below:

ISR (core N)        | WQ (core N+1)                          cmwq flags     queued events
--------------------+---------------------------------------------------------------------
                    | if (queue_is_empty)                    WORK_PENDING   0 events queued
                    |   ...                                  WORK_PENDING   0 events queued
                    | } while (scmi_process_event_payload);
                    | } // worker function exit
kfifo_in()          |   ... cmwq backing out                 WORK_PENDING   1 event  queued
kfifo_in()          |   ... cmwq backing out                 WORK_PENDING   1 event  queued
queue_work()        |   ... cmwq backing out                 WORK_PENDING   1 event  queued
 -> FALSE (pending) |   ... cmwq backing out                 WORK_PENDING   1 event  queued
                    |   ... cmwq backing out                 WORK_PENDING   1 event  queued
                    | ---- WORKER THREAD EXIT                !WORK_PENDING  1 event  queued
                    |                                        !WORK_PENDING  1 event  queued
kfifo_in()          |                                        !WORK_PENDING  2 events queued
kfifo_in()          |                                        !WORK_PENDING  2 events queued
queue_work()        |                                        !WORK_PENDING  2 events queued
 -> TRUE            | ---- WORKER ENTER                      WORK_PENDING   2 events queued
                    |                                        WORK_PENDING   2 events consumed

where effectively the last event queued won't be consumed until the next
iteration, once another event is queued.
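
For reference, a hypothetical sketch of the consumer side (reusing the
simplified events_queue and event_header types from the sketch above;
MAX_PAYLD and process_event_payload() are assumed placeholders):

	/* Worker side: the only reader for this queue */
	static void notify_work_func(struct work_struct *work)
	{
		struct events_queue *equeue =
			container_of(work, struct events_queue, notify_work);
		struct event_header eh;
		u8 payld[MAX_PAYLD];

		/* drain until the fifo is seen as empty ... */
		while (kfifo_out(&equeue->kfifo, &eh, sizeof(eh)) == sizeof(eh)) {
			kfifo_out(&equeue->kfifo, payld, eh.payld_sz);
			process_event_payload(equeue, &eh, payld);
		}

		/*
		 * ... the suspected race window is here: between this
		 * point and the worker being fully retired by cmwq, the
		 * ISR may push a new event and see queue_work() fail.
		 */
	}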


In summary, looking more closely at the kernel cmwq code, my explanation above
about how the possible race could be exposed by a particularly tricky limit
condition and by the values assumed by WORK_STRUCT_PENDING_BIT was ... bullshit :D

In fact there's no race at all, because the kernel cmwq takes care to clear the
above PENDING flag BEFORE the user-provided worker function finally starts to
run: such a flag is active only while a work instance is queued pending
execution, and it is cleared just before execution effectively starts.

kernel/workqueue.c:process_one_work()

	/* PENDING is cleared here, before the worker function runs ... */
	set_work_pool_and_clear_pending(work, pool->id);
	....
	/* ... so a queue_work() issued from now on queues a new instance */
	worker->current_func(work);

As a consequence, in the racy scenario above, where the ISR pushes events onto
the queue after the worker has already determined the queue to be empty but
while the worker function is still being retired in terms of kernel cmwq
internal handling, there is no problem: the worker, while running, is already
NO longer marked pending, so the queue_work() succeeds and a new work instance
will simply be queued and run once the current one terminates fully and is
removed from the pool.
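
In the same notation as the timeline above, the corrected sequence (cmwq
internals elided) would look like:

ISR (core N)        | WQ (core N+1)                 cmwq flags     queued events
--------------------+-----------------------------------------------------------
                    | ---- WORKER ENTER             !WORK_PENDING  0 events queued
                    |   (PENDING already cleared
                    |    before current_func runs)
kfifo_in()          |   ... worker func running     !WORK_PENDING  1 event  queued
kfifo_in()          |   ... worker func running     !WORK_PENDING  1 event  queued
queue_work()        |   ... worker func running     WORK_PENDING   1 event  queued
 -> TRUE            |   ... worker func running     WORK_PENDING   1 event  queued
                    | ---- WORKER THREAD EXIT       WORK_PENDING   1 event  queued
                    | ---- WORKER ENTER (new run)   !WORK_PENDING  1 event  consumed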

Sounds good, thanks for digging into this workqueue code and figuring it out.


On the other side, in the normal non-racy scenario, when the worker is normally
processing a non-empty queue, we'll end up anyway queueing new items and a new
work instance from the ISR, even though the currently executing one will in
fact already consume the queued items naturally: this results (it's what I
observe, in fact) in a final unneeded quick worker activation/deactivation
processing zero items (empty queue), which is harmless.

Basically, the racy condition is taken care of by the kernel cmwq itself; in
fact, there is also an extensive explanation of the barriers employed to
properly realize this in the comments around set_work_pool_and_clear_pending().

I'll add a comment in v8 just to note this behaviour.

Great research.

Regards,
Lukasz


Thanks

Cristian