Re: Regression of v4.6-rc vs. v4.5 bisected: a98ee79317b4 "drm/i915/fbc: enable FBC by default on HSW and BDW"

From: Zanoni, Paulo R
Date: Thu May 05 2016 - 14:50:23 EST


On Thu, 2016-05-05 at 19:45 +0200, Stefan Richter wrote:
> On Apr 30 Stefan Richter wrote:
> >
> > On Apr 29 Stefan Richter wrote:
> > >
> > > > On Apr 26 Stefan Richter wrote:
> > > >
> > > > v4.6-rc solidly hangs after a short while after boot, login to
> > > > X11, and
> > > > doing nothing much remarkable on the just brought up X desktop.
> > > >
> > > > Hardware: x86-64, E3-1245 v3 (Haswell),
> > > >           mainboard Supermicro X10SAE,
> > > >           using integrated Intel graphics (HD P4600, i915
> > > > driver),
> > > >           C226 PCH's AHCI and USB 2/3, ASMedia ASM1062 AHCI,
> > > >           Intel LAN (i217, igb driver),
> > > >           several IEEE 1394 controllers, some of them behind
> > > >           PCIe bridges (IDT, PLX) or PCIe-to-PCI bridges (TI,
> > > > Tundra)
> > > >           and one PCI-to-CardBus bridge (Ricoh)
> > > >
> > > > kernel.org kernel, Gentoo Linux userland
> > > >
> > > > 1. known good:  v4.5-rc5 (gcc 4.9.3)
> > > >    known bad:   v4.6-rc2 (gcc 4.9.3), only tried one time
> > > >
> > > > 2. known good:  v4.5.2 (gcc 5.2.0)
> > > >    known bad:   v4.6-rc5 (gcc 5.2.0), only tried one time
> > > >
> > > > I will send my linux-4.6-rc5/.config in a follow-up message.
> >  .config: http://www.spinics.net/lists/kernel/msg2243444.html
> >    lspci: http://www.spinics.net/lists/kernel/msg2243447.html
> >
> > Some userland package versions, in case these have any bearing:
> > x11-base/xorg-drivers-1.17
> > x11-base/xorg-server-1.17.4
> > x11-base/xorg-x11-7.4-r2
> Furthermore, there is a single display hooked up via DisplayPort.
>
> >
> > >
> > > After it proved impossible to capture an oops through netconsole,
> > > I
> > > started git bisect.  This will apparently take almost a week, as
> > > git
> > > estimated 13 bisection steps and I will be allowing about 12
> > > hours of
> > > uptime as a sign for a good kernel.  (In my four or five tests of
> > > bad
> > > kernels before I started bisection, they hung after 3
> > > minutes...5.5 hours
> > > uptime, with no discernible difference in workload.  Maybe 12 h
> > > cutoff is
> > > even too short...)
> I took at least 18 hours uptime (usually 24 hours) as a sign for good
> kernels.  During the bisection, bad kernels hung after 3 h, 2 h, 9
> min,
> 45 min, and 4 min uptime.  Thus I arrived at a98ee79317b4
> "drm/i915/fbc:
> enable FBC by default on HSW and BDW" as the point where the hangs
> are
> introduced.
>
> Quoting the changelog of the commit:

Thanks for following the instructions on the commit message! :)

>
>     Oh, and in case you - the person reading this commit message -
> found
>     this commit through git bisect, please do the following:
>      - Check your dmesg and see if there are error messages
> mentioning
>        underruns around the time your problem started happening.
>
> Well, I always had the following lines in dmesg:
> [drm:intel_set_cpu_fifo_underrun_reporting] *ERROR* uncleared fifo
> underrun on pipe A
> [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe A FIFO
> underrun

Oh, well... I had a patch that would just disable FBC in case we saw a
FIFO underrun, but it was rejected. Maybe this is the time to think
about it again? Otherwise, I can't think of much besides disabling FBC
on HSW until all the underrun and watermark regressions are fixed
for good.

>
> I always got these when I switch on the DisplayPort attached monitor.
> Recently I changed userland from kdm to sddm and noticed that I
> apparently get these when sddm shuts down.  I am not aware of whether
> or not this also already happened with kdm.
>
> However, "around the time your problem started happening" there is
> nothing in dmesg, because "your problem" is a complete hang without
> possibility of disk IO and without netconsole output.
>
>      - Download intel-gpu-tools, compile it, and run:
>        $ sudo ./tests/kms_frontbuffer_tracking --run-subtest '*fbc-*' 2>&1 | tee fbc.txt
>        Then send us the fbc.txt file, especially if you get a
> failure.
>        This will really maximize your chances of getting the bug
> fixed
>        quickly.
>
> Do you need this while FBC is enabled, or can I run it while FBC is
> disabled?

FBC enabled. Considering your description, my hope is that maybe some
specific subtest will be able to hang your machine, so testing this
again will require only running the specific subtest instead of waiting
18 hours.
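
One possible way to narrow this down is to run the FBC subtests one at a
time and flush a marker to disk before each one, so that the last name in
the log points at the subtest that was running when the machine froze.
This is only a sketch: the function name and the log file name are made up
for illustration, it assumes the test binary prints one subtest name per
line for --list-subtests, and it has to run as root.

```shell
# Run each FBC subtest separately and record its name before it starts.
# $1: path to the kms_frontbuffer_tracking binary
# $2: log file (hypothetical name; pick any path on a local disk)
run_fbc_subtests() {
    bin=$1
    log=$2
    # Assumption: --list-subtests prints one subtest name per line.
    "$bin" --list-subtests | grep 'fbc-' | while read -r name; do
        echo "starting $name" >> "$log"
        sync    # push the marker to disk in case the machine hangs next
        # stdin is redirected so the test cannot eat the list of names
        "$bin" --run-subtest "$name" < /dev/null >> "$log" 2>&1
    done
}
```

For example: run_fbc_subtests ./tests/kms_frontbuffer_tracking fbc-by-subtest.txt
After a hang and reboot, the last "starting ..." line in the log names the
suspect subtest.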

>
>      - Try to find a reliable way to reproduce the problem, and tell
> us.
>
> The reliable way is to just wait for the kernel to hang after about
> 3 minutes to 5.5 hours.  I have not identified any special activity
> which would trigger the hang.
>
>      - Boot with drm.debug=0xe, reproduce the problem, then send us
> the
>        dmesg file.
>
> I can try this, but I am skeptical about getting any useful kernel
> messages from before the hang.

Agree.

>
> PS:
> I am mentioning the following just in case that it has any
> relationship
> with the FBC related kernel freezes.  Maybe it doesn't...  There is
> another recent regression on this PC, but I have not yet figured out
> whether it was introduced by any particular kernel version.  The
> regression is:  When switching from X11 to text console by
> [Ctrl][Alt][Fx]
> or by shutting down sddm, I often only get a blank screen.  I suspect
> that this regression was introduced when I replaced kdm by sddm, but
> I am not sure about that.

Maybe there is some relationship, since this operation involves a mode
change. You can also try checking dmesg to see if there are underruns
right when you do the change.
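
A small helper along these lines could make that check easier. It is just
a sketch with a made-up function name: it counts the log lines that
mention a FIFO underrun, so the count can be compared before and after
the switch to the text console.

```shell
# Count kernel log lines on stdin that mention a FIFO underrun.
# Case-insensitive, because both "fifo underrun" and "FIFO underrun"
# appear in the messages quoted earlier in this thread.
count_underruns() {
    # grep -c still prints 0 when nothing matches, but exits non-zero;
    # "|| true" keeps that from looking like a failure.
    grep -ci 'fifo underrun' || true
}
```

For example, run "dmesg | count_underruns" once before and once after
pressing [Ctrl][Alt][Fx]. A higher number the second time would suggest
the mode change triggered an underrun.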


If you don't want to keep carrying a manual revert, you can just boot
with i915.enable_fbc=0 for now (or write an /etc/modprobe.d file). Also,
it would be good to know whether you still somehow see the machine
hang even with FBC disabled.
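
For the modprobe.d variant, something like the following should do. The
function name and the file name in the usage line are only examples; any
*.conf file under /etc/modprobe.d should work. If i915 is loaded from an
initramfs on your setup, the initramfs probably needs to be regenerated
for the option to take effect.

```shell
# Write a modprobe options file that disables FBC for the i915 module.
# $1: target path, e.g. a *.conf file under /etc/modprobe.d
#     (the file name in the usage line is only an example)
write_fbc_conf() {
    printf 'options i915 enable_fbc=0\n' > "$1"
}
```

For example, as root: write_fbc_conf /etc/modprobe.d/i915-fbc.conf
After the next boot, /sys/module/i915/parameters/enable_fbc should read 0
if the option was picked up.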

Thanks,
Paulo