Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

From: Alexey Kardashevskiy
Date: Thu Sep 01 2016 - 08:16:29 EST


On 01/09/16 11:48, Alexey Kardashevskiy wrote:
> On 31/08/16 17:28, Peter Zijlstra wrote:
>> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>>>>
>>>>>
>>>>> The origin of the issue I've seen seems to be related to
>>>>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>>>>> following state
>>>>
>>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>>> rwsems.
>>>>
>>>> This is true for pretty much every blocking wait loop out there; they
>>>> all do:
>>>>
>>>> 	for (;;) {
>>>> 		current->state = UNINTERRUPTIBLE;
>>>> 		smp_mb();
>>>> 		if (cond)
>>>> 			break;
>>>> 		schedule();
>>>> 	}
>>>> 	current->state = RUNNING;
>>>>
>>>> Which, if the wakeup is spurious, is just the pattern you need.
>>>
>>> Yes, true! My bad; Alexey had seen the same basic pattern, and I should have
>>> been clearer in my commit log. Should I resend the patch?
>>
>> Yes please.
>>
>>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>>
>>>
>>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a
>>> different thread.
>>
>> Right, although even without that, there is sufficient ordering, as the
>> rq unlock from the wakeup, coupled with the rq lock from the schedule
>> already form a load-store barrier.
>>
>>>> Now, this has been present for a fair while, I suspect ever since we
>>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>>> hit it.
>>>>
>>>
>>> Yes, I just hit it a week or two back and I needed to collect data to
>>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>>> I needed a system with a large number of threads and limited memory running
>>> stress-ng. Reproducing the problem takes an unpredictable amount of time.
>>
>> What hardware do you see this on? Is it shiny new Power8 chips which have
>> never before seen deep queues or something, or is it 'regular' old
>> Power7-like stuff?
>
> I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
> devices with vhost enabled; all virtio-net devices are connected to the
> same virtual bridge on the host (via /dev/tap*) and are doing lots of
> traffic, just between these 2 guests.
>
> I remember doing the same test on POWER7 more than 2 years ago and finding
> missing barriers in virtio, but nothing like this one. But POWER7 is
> seriously slower than POWER8, so it seems that nobody has bothered to load
> it that heavily.
>
> I wonder how to reproduce the bug more quickly, as sometimes it runs for
> days with no fault but sometimes it fails within the first 30 minutes (the
> backtraces from the 2 stuck CPUs are the same though). Does anyone have an
> idea (kernel hacks, taskset, type of traffic, etc.)? As Nick suggested, I
> changed cpus_share_cache() to return "false" so that ttwu_queue() would
> always go via the ttwu_queue_remote() path, but this did not make any
> difference.
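
For reference, the hack was essentially the following (a sketch from memory,
not a proper patch; the original body of cpus_share_cache() compares the
per-CPU LLC ids):

bool cpus_share_cache(int this_cpu, int that_cpu)
{
	/*
	 * Debugging hack only: pretend no two CPUs ever share a cache so
	 * that ttwu_queue() never takes the shared-LLC shortcut and always
	 * goes through ttwu_queue_remote().
	 */
	return false;
}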


Below are the backtraces of the 2 stuck CPUs (they could not be stopped in
xmon like the others; I got these via the BMC). I am also adding Michael to
cc:, just in case...

0:mon> t 0xc000000fd4ab79b0
[c000000fd4ab79b0] c0000000001330c4 do_raw_spin_lock+0x1f4/0x260
[c000000fd4ab79f0] c000000000a368e8 _raw_spin_lock_irqsave+0x98/0xd0
[c000000fd4ab7a30] c00000000011ecbc remove_wait_queue+0x2c/0x70
[c000000fd4ab7a70] d0000000166e0364 vhost_poll_stop+0x34/0x60 [vhost]
[c000000fd4ab7aa0] d0000000167a08c8 handle_rx+0x168/0x8d0 [vhost_net]
[c000000fd4ab7c80] d0000000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c000000fd4ab7d00] c0000000000ebad8 kthread+0x118/0x140
[c000000fd4ab7e30] c0000000000098e8 ret_from_kernel_thread+0x5c/0x74

0:mon> t 0xc000001fff41b5a0
[c000001fff41b5a0] c0000000000fc3ac try_to_wake_up+0x5c/0x6c0
[c000001fff41b630] d0000000166e0fcc vhost_work_queue+0x6c/0x90 [vhost]
[c000001fff41b660] d0000000166e205c vhost_poll_wakeup+0x3c/0x50 [vhost]
[c000001fff41b680] c00000000011e6d4 __wake_up_common+0x84/0xf0
[c000001fff41b6e0] c00000000011ee98 __wake_up_sync_key+0x68/0xa0
[c000001fff41b730] c0000000008defd0 sock_def_readable+0x80/0x180
[c000001fff41b760] d0000000167438ac tun_net_xmit+0x52c/0x5d0 [tun]
[c000001fff41b7b0] c000000000906dd8 dev_hard_start_xmit+0x178/0x450
[c000001fff41b870] c000000000938d6c sch_direct_xmit+0x11c/0x250
[c000001fff41b910] c000000000907840 __dev_queue_xmit+0x560/0x8c0
[c000001fff41b9c0] d000000016553c80 br_dev_queue_push_xmit+0xb0/0x250 [bridge]
[c000001fff41ba00] d000000016553e54 br_forward_finish+0x34/0xe0 [bridge]
[c000001fff41ba80] d000000016554044 __br_forward.isra.0+0x144/0x220 [bridge]
[c000001fff41baf0] d000000016555e78 br_handle_frame_finish+0x148/0x570 [bridge]
[c000001fff41bb80] d00000001655654c br_handle_frame+0x24c/0x400 [bridge]
[c000001fff41bc10] c0000000008fed28 __netif_receive_skb_core+0x498/0xc60
[c000001fff41bcf0] c000000000904314 process_backlog+0x114/0x220
[c000001fff41bd60] c00000000090343c net_rx_action+0x36c/0x520
[c000001fff41be70] c0000000000c5068 __do_softirq+0x258/0x5b0
[c000001fff41bf90] c000000000028bd0 call_do_softirq+0x14/0x24
[c000000fc675b950] c0000000000155f8 do_softirq_own_stack+0x58/0xa0
[c000000fc675b990] c0000000000c4cb4 do_softirq.part.3+0x74/0xa0
[c000000fc675b9c0] c0000000008fe674 netif_rx_ni+0x1d4/0x230
[c000000fc675ba00] d0000000167415f8 tun_get_user+0x408/0xbc0 [tun]
[c000000fc675baf0] d000000016741e04 tun_sendmsg+0x54/0x90 [tun]
[c000000fc675bb30] d0000000167a1f68 handle_tx+0x298/0x5f0 [vhost_net]
[c000000fc675bc80] d0000000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c000000fc675bd00] c0000000000ebad8 kthread+0x118/0x140
[c000000fc675be30] c0000000000098e8 ret_from_kernel_thread+0x5c/0x74
0:mon>
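
PS. For context, my (sketched) understanding of the pairing Peter describes
above; this is the generic condition/wakeup pattern, not the rwsem or
scheduler code itself:

	/* waker side, pairs with the wait loop quoted above */
	cond = 1;		/* publish the condition the sleeper tests */
	wake_up_process(p);	/*
				 * the wakeup path has to order the store to
				 * cond before it inspects p->state/p->on_rq;
				 * that ordering is what this thread is about
				 */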




--
Alexey