RE: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock when scheduling workqueue elements")

From: Dexuan Cui
Date: Tue Feb 14 2017 - 10:54:43 EST


> From: hch@xxxxxx [mailto:hch@xxxxxx]
> Sent: Tuesday, February 14, 2017 22:51
> To: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Cc: hch@xxxxxx; Jens Axboe <axboe@xxxxxxxxx>; Bart Van Assche
> <Bart.VanAssche@xxxxxxxxxxx>; hare@xxxxxxxx; hare@xxxxxxx; Martin K.
> Petersen <martin.petersen@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-block@xxxxxxxxxxxxxxx; jth@xxxxxxxxxx; Nick Meier
> <Nick.Meier@xxxxxxxxxxxxx>; Alex Ng (LIS) <alexng@xxxxxxxxxxxxx>; Long Li
> <longli@xxxxxxxxxxxxx>; Adrian Suhov (Cloudbase Solutions SRL) <v-
> adsuho@xxxxxxxxxxxxx>; Chris Valean (Cloudbase Solutions SRL) <v-
> chvale@xxxxxxxxxxxxx>
> Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock
> when scheduling workqueue elements")
>
> On Tue, Feb 14, 2017 at 02:46:41PM +0000, Dexuan Cui wrote:
> > > From: hch@xxxxxx [mailto:hch@xxxxxx]
> > > Sent: Tuesday, February 14, 2017 22:29
> > > To: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > > Subject: Re: Boot regression (was "Re: [PATCH] genhd: Do not hold event lock
> > > when scheduling workqueue elements")
> > >
> > > Ok, thanks for testing. Can you try the patch below? It fixes a
> > > clear problem which was partially papered over before the commit
> > > you bisected to, although it can't explain why blk-mq still works.
> >
> > Still bad luck. :-(
> >
> > BTW, I'm using the first "bad" commit (scsi: allocate scsi_cmnd structures as
> > part of struct request) + the 2 patches you provided today.
> >
> > I suppose I don't need to test the 2 patches on the latest linux-next repo.
>
> I'd love a test on that repo actually. We had a few other for sense
> handling since then I think.

I tested today's linux-next (next-20170214) + the 2 patches just now and got
a weird result:
sometimes the VM still hung with a new calltrace (BUG: spinlock bad
magic), but sometimes the VM did boot up despite the new calltrace!

Attached is the log of a "good" boot.

It looks like we have a memory corruption issue somewhere...

Actually previously I saw the "BUG: spinlock bad magic" message once, but I
couldn't repro it later, so I didn't mention it to you.

The good news is that now I can repro the "spinlock bad magic" message
every time.
I tried to dig into this by enabling Kernel hacking -> Memory debugging,
but didn't find anything abnormal.
Is it possible that the SCSI layer passes a wrong memory address?

Thanks,
-- Dexuan

Attachment: dmesg.log