Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO

From: Peter Hurley
Date: Mon Apr 25 2016 - 21:22:15 EST


On 04/25/2016 08:48 AM, Tejun Heo wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>> CPU#6 CPU#2
>> reqeust ffff884000343600 inserted
>> hctx marked as pended
>> kblockd_schedule...() returns 1
>> <schedule to kblockd workqueue>
>> *** WORK_STRUCT_PENDING_BIT is cleared ***
>> flush_busy_ctxs() is executed
>> reqeust ffff884000343cc0 inserted
>> hctx marked as pended
>> kblockd_schedule...() returns 0
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> WTF?
>>
>> As a result ffff884000343cc0 request pended forever.
>>
>> According to the trace output I see that another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was
>> cleared on another CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more, but atomic_long_set(), which is <mov> instruction. This
>> function:
>>
>> static inline void set_work_data()
>>
>> In attempt to "fix" the mystery I replaced atomic_long_set() call with
>> atomic_long_xchg(), which is <xchg> instruction.
>>
>> The problem has gone.
>>
>> For me it looks that test_and_set_bit() (<lock btsl> instruction) does
>> not require flush of all CPU caches, which can be dirty after executing
>> of <mov> on another CPU. But <xchg> really updates the memory and the
>> following execution of <lock btsl> observes that bit was cleared.
>>
>> As a conculusion I can say, that I am lucky enough and can reproduce
>> this bug in several minutes on a specific load (I tried many other
>> simple loads using fio, even using btrecord/btreplay, no success).
>> And that easy reproduction on a specific load gives me some freedom
>> to test and then to be sure, that problem has gone.
>
> Heh, excellent debugging. I wonder how old this bug is.

This is the same bug I wrote about 2 yrs ago (but with the wrong fix).

http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html

Unfortunately I didn't have a reproducer at all :/


> cc'ing David
> Howells who ISTR to have reported a similar issue. The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have proper memory barrier
> after it.
>
> A B
>
> queue (test_and_sets PENDING)
> dispatch (clears PENDING)
> execute queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails
> indicating that A should. Can you please test whether the following
> patch fixes the issue?
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
> */
> smp_wmb();
> set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> + smp_mb();
> }

The atomic_long_xchg() patch has several benefits over the naked barrier:

1. set_work_pool_and_clear_pending() has the same requirements as
clear_work_data(); note that both require write barrier before and
full barrier after.

2. xchg() et al imply full barrier before and full barrier after.

3. The naked barriers could be removed, while improving efficiency.
On x86, mov + mfence => xchg

4. Maybe fixes other hidden bugs.
For example, I'm wondering if reordering with set_work_pwq/list_add_tail
would be a problem; ie., what if work is visible on the worklist _before_
data is initialized by set_work_pwq()?

Regards,
Peter Hurley