Re: corruption causing crash in __queue_work

From: Nikolay Borisov
Date: Thu Dec 10 2015 - 04:28:15 EST




On 12/09/2015 06:27 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 09, 2015 at 06:23:15PM +0200, Nikolay Borisov wrote:
>> I think we are seeing this at least daily on at least 1 server (we have
>> multiple servers like that). So adding printk's would likely be the way
>> to go, anything in particular you might be interested in knowing? I see
>> RCU stuff around so might be tricky race condition.
>
> Printing out the workqueue's pointer, name, pwq's pointer, the node
> being installed for and the installed pointer should give us enough
> clues. There's RCU involved but the pointers shouldn't be becoming
> NULLs unless we're installing NULL ptrs.

So the debug patch has been rolled out on 1 server, and several more
are in progress. Here is what it prints:

WQ: ffff88046f00ba00 (events_unbound) old_pwq: (null) new_pwq: ffff88046f00d300 node: 0
WQ: ffff88046f00be00 (events_power_efficient) old_pwq: (null) new_pwq: ffff88046f00d400 node: 0
WQ: ffff88046d71c000 (events_freezable_power_) old_pwq: (null) new_pwq: ffff88046f00d500 node: 0
WQ: ffff88046ce9ca00 (khelper) old_pwq: (null) new_pwq: ffff88046f00d600 node: 0
WQ: ffff88046ce9c000 (netns) old_pwq: (null) new_pwq: ffff88046f00d700 node: 0
WQ: ffff88046ce9d400 (perf) old_pwq: (null) new_pwq: ffff88046f00d800 node: 0
WQ: ffff88046c408000 (writeback) old_pwq: (null) new_pwq: ffff88046c800000 node: 0
WQ: ffff88046c409200 (kacpi_hotplug) old_pwq: (null) new_pwq: ffff88046c42e200 node: 0
WQ: ffff880468455600 (scsi_tmf_0) old_pwq: (null) new_pwq: ffff88046c801f00 node: 0
WQ: ffff8804687f4400 (scsi_tmf_1) old_pwq: (null) new_pwq: ffff88046caa6700 node: 0
WQ: ffff8804687f4c00 (scsi_tmf_2) old_pwq: (null) new_pwq: ffff88046caa6900 node: 0
WQ: ffff8804687f5400 (scsi_tmf_3) old_pwq: (null) new_pwq: ffff88046caa6b00 node: 0
WQ: ffff8804687f5c00 (scsi_tmf_4) old_pwq: (null) new_pwq: ffff88046caa6d00 node: 0
WQ: ffff8804687f6400 (scsi_tmf_5) old_pwq: (null) new_pwq: ffff88046caa7000 node: 0
WQ: ffff8804687f6c00 (scsi_tmf_6) old_pwq: (null) new_pwq: ffff88046caa7300 node: 0
WQ: ffff880467964000 (kdmremove) old_pwq: (null) new_pwq: ffff880467a3c800 node: 0
WQ: ffff880467965000 (deferwq) old_pwq: (null) new_pwq: ffff880467a3c100 node: 0
WQ: ffff8804669bc600 (ib_addr) old_pwq: (null) new_pwq: ffff88046845a600 node: 0
WQ: ffff88007d167e00 (qib0_0) old_pwq: (null) new_pwq: ffff880466c19800 node: 0
WQ: ffff88007d165a00 (qib0_1) old_pwq: (null) new_pwq: ffff880466c18e00 node: 0
WQ: ffff88007d165200 (ib_mad1) old_pwq: (null) new_pwq: ffff880466c19d00 node: 0
WQ: ffff8804665d2000 (ib_mad2) old_pwq: (null) new_pwq: ffff880466c18a00 node: 0
WQ: ffff8804667d7600 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880469806100 node: 0
WQ: ffff880079a9fc00 (edac-poller) old_pwq: (null) new_pwq: ffff88007d5ebf00 node: 0
WQ: ffff88046b47cc00 (kvm-irqfd-cleanup) old_pwq: (null) new_pwq: ffff8804651f0f00 node: 0
WQ: ffff8804694baa00 (kloopd0) old_pwq: (null) new_pwq: ffff88046949d100 node: 0
WQ: ffff880079a9cc00 (kloopd1) old_pwq: (null) new_pwq: ffff8804698cb900 node: 0
WQ: ffff88046809dc00 (kloopd2) old_pwq: (null) new_pwq: ffff88046957aa00 node: 0
WQ: ffff88046809c000 (kloopd3) old_pwq: (null) new_pwq: ffff8804650acc00 node: 0
WQ: ffff880466f3b000 (kloopd4) old_pwq: (null) new_pwq: ffff880469575900 node: 0
WQ: ffff88046809e800 (kloopd5) old_pwq: (null) new_pwq: ffff880469888200 node: 0
WQ: ffff88046809de00 (kloopd6) old_pwq: (null) new_pwq: ffff880469827400 node: 0
WQ: ffff88007d5f1c00 (dm_bufio_cache) old_pwq: (null) new_pwq: ffff8804673dda00 node: 0
WQ: ffff88046c42a400 (dm-thin) old_pwq: (null) new_pwq: ffff880079955100 node: 0
WQ: ffff8804672d0800 (dm-thin) old_pwq: (null) new_pwq: ffff88046baed800 node: 0
WQ: ffff88046993fa00 (dm-thin) old_pwq: (null) new_pwq: ffff8804650ff100 node: 0
WQ: ffff88046993d400 (dm-thin) old_pwq: (null) new_pwq: ffff88046949d600 node: 0
WQ: ffff88046993e400 (dm-thin) old_pwq: (null) new_pwq: ffff88046b833000 node: 0
WQ: ffff880466466400 (dm-thin) old_pwq: (null) new_pwq: ffff88007da60d00 node: 0
WQ: ffff88046b3eb200 (dm-thin) old_pwq: (null) new_pwq: ffff88046633d200 node: 0
WQ: ffff8804672d0600 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880079955400 node: 0
WQ: ffff88046b3eb600 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880465684900 node: 0
WQ: ffff88046c42a400 (dm-thin) old_pwq: (null) new_pwq: ffff8800799ee900 node: 0
WQ: ffff880466f39a00 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880469849e00 node: 0
WQ: ffff880467b0cc00 (dm-thin) old_pwq: (null) new_pwq: ffff88007d52fa00 node: 0
WQ: ffff8804672d4e00 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff88046ca07f00 node: 0
WQ: ffff880079a9ca00 (dm-thin) old_pwq: (null) new_pwq: ffff8802d1be9e00 node: 0
WQ: ffff880466175000 (dm-thin) old_pwq: (null) new_pwq: ffff8802d8efec00 node: 0
WQ: ffff880403f28400 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff8802e224dd00 node: 0
WQ: ffff880403f29a00 (dm-thin) old_pwq: (null) new_pwq: ffff880465685300 node: 0
WQ: ffff8804672d6c00 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff880466d69300 node: 0
WQ: ffff880466f3ba00 (dm-thin) old_pwq: (null) new_pwq: ffff880469576500 node: 0
WQ: ffff8804672d4600 (dm-thin) old_pwq: (null) new_pwq: ffff8802d1a1ee00 node: 0
WQ: ffff8803ccf5c200 (ext4-rsv-conversion) old_pwq: (null) new_pwq: ffff8804657b3200 node: 0

Is this format OK? I have also observed the exact same crash
on a machine running a 4.1.12 kernel.
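
As the log grows across more servers, a small script may help with
scanning it. The sketch below is only an illustration: it assumes the
exact line format printed above, and the names parse_log and
repeated_workqueues are made up for this example. It groups entries by
WQ pointer and reports any pointer that is installed more than once; in
the output above, ffff88046c42a400 (dm-thin) shows up twice with
different new_pwq values. The log alone does not say why that happens.

```python
import re
from collections import defaultdict

# Pattern for one line of the debug output, assuming the format shown above.
LINE_RE = re.compile(
    r"WQ: (?P<wq>[0-9a-f]+) \((?P<name>[^)]*)\) "
    r"old_pwq: (?P<old_pwq>\S+) new_pwq: (?P<new_pwq>\S+) node: (?P<node>-?\d+)"
)


def parse_log(text):
    """Return one dict per matching log line; non-matching lines are skipped."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            records.append(m.groupdict())
    return records


def repeated_workqueues(records):
    """Map each WQ pointer seen more than once to its records, in log order."""
    by_wq = defaultdict(list)
    for rec in records:
        by_wq[rec["wq"]].append(rec)
    return {wq: recs for wq, recs in by_wq.items() if len(recs) > 1}


# Three lines copied from the output above.
SAMPLE = """\
WQ: ffff88046c42a400 (dm-thin) old_pwq: (null) new_pwq: ffff880079955100 node: 0
WQ: ffff8804672d0800 (dm-thin) old_pwq: (null) new_pwq: ffff88046baed800 node: 0
WQ: ffff88046c42a400 (dm-thin) old_pwq: (null) new_pwq: ffff8800799ee900 node: 0
"""

if __name__ == "__main__":
    for wq, recs in repeated_workqueues(parse_log(SAMPLE)).items():
        print(wq, [r["new_pwq"] for r in recs])
```

Feeding it the full dmesg output instead of SAMPLE should work the
same way, provided the printk format is unchanged.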

>
> Thanks.
>