Re: pinctrl_cherryview: massive error messages (handle_bad_irq)

From: Christian Steiner
Date: Thu Oct 20 2016 - 11:36:40 EST


On 20.10.2016 12:08, Mika Westerberg wrote:
> +LinusW
>
> On Wed, Oct 19, 2016 at 08:40:25PM +0000, Christian Steiner wrote:
>> On 19.10.2016 21:22, Mika Westerberg wrote:
>>> On Wed, Oct 19, 2016 at 06:32:57PM +0000, Christian Steiner wrote:
>>>> Hello,
>>>>
>>>> since Kernel 4.8 I see a lot of the following error messages after
>>>> resuming from suspend to RAM:
>>>>
>>>>> irq 117, desc: ffff88017a61e600, depth: 1, count: 0, unhandled: 0
>>>>> ->handle_irq(): ffffffff8109b613, handle_bad_irq+0x0/0x1e0
>>>>> ->irq_data.chip(): ffffffffa0020180, chv_pinctrl_exit+0x2d84/0x12 [pinctrl_cherryview]
>>>>> ->action(): (null)
>>>>> IRQ_NOPROBE set
>>>>
>>>> The messages appear a few thousand times and mess up the log. Despite
>>>> the messages everything seems to work fine.
>>>> I see the messages with Kernel 4.8.2 and 4.9-rc1. The do not appear with
>>>> Kernel 4.7.8 or earlier.
>>>
>>> We changed the driver not to mask all IRQs when it starts. See commit
>>> bcb48cca23ec ("pinctrl: cherryview: Do not mask all interrupts in
>>> probe").
>>>
>>>> I have an Acer TravelMate B116-M laptop with an Intel Pentium N3700
>>>> processor.
>>>>
>>>> Here are the full log messages of a suspend to RAM and resume cycle with
>>>> Kernel 4.7.8:
>>>>
>>>>> [ 22.605418] PM: Syncing filesystems ... done.
>>>>> [ 22.613640] PM: Preparing system for sleep (mem)
>>>>> [ 22.615108] Freezing user space processes ... (elapsed 0.001 seconds) done.
>>>>> [ 22.616930] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
>>>>> [ 22.616956] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>>>>> [ 22.618611] PM: Suspending system (mem)
>>>>> [ 22.618640] Suspending console(s) (use no_console_suspend to debug)
>>>>> [ 22.619773] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>> [ 22.626173] sd 0:0:0:0: [sda] Stopping disk
>>>>> [ 23.670076] PM: suspend of devices complete after 1051.008 msecs
>>>>> [ 23.691058] PM: late suspend of devices complete after 20.971 msecs
>>>>> [ 23.692104] r8169 0000:03:00.0: System wakeup enabled by ACPI
>>>>> [ 23.693087] xhci_hcd 0000:00:14.0: System wakeup enabled by ACPI
>>>>> [ 23.711687] PM: noirq suspend of devices complete after 20.624 msecs
>>>>> [ 23.711726] ACPI: Preparing to enter system sleep state S3
>>>>> [ 23.712295] ACPI : EC: EC stopped
>>>>> [ 23.712297] PM: Saving platform NVS memory
>>>>> [ 23.712306] Disabling non-boot CPUs ...
>>>>> [ 23.712590] Cannot set affinity for irq 191
>>>>> [ 23.713647] smpboot: CPU 1 is now offline
>>>>> [ 23.752643] smpboot: CPU 2 is now offline
>>>>> [ 23.802930] smpboot: CPU 3 is now offline
>>>>> [ 23.831827] ACPI: Low-level resume complete
>>>>> [ 23.831938] ACPI : EC: EC started
>>>>> [ 23.831939] PM: Restoring platform NVS memory
>>>>> [ 23.832600] Enabling non-boot CPUs ...
>>>>> [ 23.882819] x86: Booting SMP configuration:
>>>>> [ 23.882821] smpboot: Booting Node 0 Processor 1 APIC 0x2
>>>>> [ 23.885308] cache: parent cpu1 should not be sleeping
>>>>> [ 23.885757] CPU1 is up
>>>>> [ 23.932612] smpboot: Booting Node 0 Processor 2 APIC 0x4
>>>>> [ 23.935171] cache: parent cpu2 should not be sleeping
>>>>> [ 23.935674] CPU2 is up
>>>>> [ 23.982749] smpboot: Booting Node 0 Processor 3 APIC 0x6
>>>>> [ 23.985324] cache: parent cpu3 should not be sleeping
>>>>> [ 23.985840] CPU3 is up
>>>>> [ 23.986370] ACPI: Waking up from system sleep state S3
>>>>> [ 24.012307] xhci_hcd 0000:00:14.0: System wakeup disabled by ACPI
>>>>> [ 24.023417] PM: noirq resume of devices complete after 20.911 msecs
>>>>> [ 25.052523] PM: early resume of devices complete after 1029.024 msecs
>>>>> [ 25.053082] r8169 0000:03:00.0: System wakeup disabled by ACPI
>>>>> [ 25.053124] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
>>>>> [ 25.054222] rtc_cmos 00:04: System wakeup disabled by ACPI
>>>>> [ 25.055645] sd 0:0:0:0: [sda] Starting disk
>>>>> [ 25.075031] r8169 0000:03:00.0 eth0: link down
>>>>> [ 25.374179] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>>>>> [ 25.377616] ata1.00: configured for UDMA/133
>>>>> [ 26.025736] PM: resume of devices complete after 973.238 msecs
>>>>> [ 26.026109] PM: Finishing wakeup.
>>>>> [ 26.026111] Restarting tasks ... done.
>>>>> [ 27.767611] r8169 0000:03:00.0 eth0: link up
>>>>
>>>> And this is how it looks with Kernel 4.8.2:
>>>>
>>>>> [ 46.928231] PM: Syncing filesystems ... done.
>>>>> [ 46.930391] PM: Preparing system for sleep (mem)
>>>>> [ 46.932002] Freezing user space processes ... (elapsed 0.001 seconds) done.
>>>>> [ 46.933721] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
>>>>> [ 46.933749] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>>>>> [ 46.935457] PM: Suspending system (mem)
>>>>> [ 46.935486] Suspending console(s) (use no_console_suspend to debug)
>>>>> [ 46.936492] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>>>>> [ 46.936598] sd 0:0:0:0: [sda] Stopping disk
>>>>> [ 48.026353] PM: suspend of devices complete after 1090.448 msecs
>>>>> [ 48.057844] PM: late suspend of devices complete after 31.480 msecs
>>>>> [ 48.059086] r8169 0000:03:00.0: System wakeup enabled by ACPI
>>>>> [ 48.059240] xhci_hcd 0000:00:14.0: System wakeup enabled by ACPI
>>>>> [ 48.117234] PM: noirq suspend of devices complete after 59.386 msecs
>>>>> [ 48.117269] ACPI: Preparing to enter system sleep state S3
>>>>> [ 48.117836] ACPI : EC: EC stopped
>>>>> [ 48.117837] PM: Saving platform NVS memory
>>>>> [ 48.117846] Disabling non-boot CPUs ...
>>>>> [ 48.118130] Cannot set affinity for irq 191
>>>>> [ 48.119186] smpboot: CPU 1 is now offline
>>>>> [ 48.179126] smpboot: CPU 2 is now offline
>>>>> [ 48.229450] smpboot: CPU 3 is now offline
>>>>> [ 48.278338] ACPI: Low-level resume complete
>>>>> [ 48.278415] ACPI : EC: EC started
>>>>> [ 48.278416] PM: Restoring platform NVS memory
>>>>> [ 48.279036] irq 117, desc: ffff88017a61e600, depth: 1, count: 0, unhandled: 0
>>>>> [ 48.279044] ->handle_irq(): ffffffff8109b613, handle_bad_irq+0x0/0x1e0
>>>>> [ 48.279052] ->irq_data.chip(): ffffffffa0020180, chv_pinctrl_exit+0x2d84/0x12 [pinctrl_cherryview]
>>>>> [ 48.279054] ->action(): (null)
>>>>> [ 48.279055] IRQ_NOPROBE set
>>>>> [ 48.279057] unexpected IRQ trap at vector 75
>>>>>
>>>> [...]
>>>>>
>>>>> [ 48.659033] irq 117, desc: ffff88017a61e600, depth: 1, count: 0, unhandled: 0
>>>>> [ 48.659037] ->handle_irq(): ffffffff8109b613, handle_bad_irq+0x0/0x1e0
>>>>> [ 48.659041] ->irq_data.chip(): ffffffffa0020180, chv_pinctrl_exit+0x2d84/0x12 [pinctrl_cherryview]
>>>>> [ 48.659042] ->action(): (null)
>>>>> [ 48.659043] IRQ_NOPROBE set
>>>>> [ 49.698698] PM: early resume of devices complete after 1089.068 msecs
>>>>> [ 49.699601] r8169 0000:03:00.0: System wakeup disabled by ACPI
>>>>> [ 49.699633] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
>>>>> [ 49.708346] sd 0:0:0:0: [sda] Starting disk
>>>>> [ 49.708581] rtc_cmos 00:04: System wakeup disabled by ACPI
>>>>> [ 49.731492] r8169 0000:03:00.0 eth0: link down
>>>>> [ 50.039718] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>>>>> [ 50.041544] ata1.00: configured for UDMA/133
>>>>> [ 50.381833] PM: resume of devices complete after 683.151 msecs
>>>>> [ 50.385539] PM: Finishing wakeup.
>>>>> [ 50.385541] Restarting tasks ... done.
>>>>
>>>> Are the messages just informational or is this something serious that
>>>> should be fixed?
>>>
>>> Basically it means the machine has an interrupt line active an nothing
>>> is handling it. It is not serious but rather annoying.
>>
>> Thank you. Good to hear it is not critical. I would ignore the messages,
>> but it is annoying that almost 4MB of log messages are generated on
>> every resume after standby. Maybe it is possible to define a limit for
>> the maximum amount of the same messages?
>
> The kernel will disable that line after 100000 spuriuos interrupts but I
> guess it is still rather annoying for the user.
>
>>> I've seen this happening on proto hardware but it was fixed with
>>> firmware update.
>>
>> Currently I am using BIOS v1.22. Recently Acer released v1.23. I could
>> test if the BIOS update solves the problem.
>
> Please try that first.

Updating the BIOS to v1.23 solved the problem. Thank you very much for
your advice!

>
> If that does not work, then I we may be able to mask all interrupts up
> to nirqs - 1 but I first would like to check if the BIOS upgrade helps.
>
>>>> If you need more information or if I should test a patch, please let me
>>>> know.
>>>
>>> Can you send me output of /proc/interrupts and
>>> /sys/kernel/debug/pinctrl/INT33FF:*/pins.
>>
>> Of course:
>>
>>> $ cat /proc/interrupts
>>> CPU0 CPU1 CPU2 CPU3
>>> 0: 1388 0 0 0 IO-APIC 2-edge timer
>>> 1: 17 3 4 3 IO-APIC 1-edge i8042
>>> 8: 1 0 0 0 IO-APIC 8-edge rtc0
>>> 9: 28 23 22 23 IO-APIC 9-fasteoi acpi
>>> 12: 76 36 36 36 IO-APIC 12-edge i8042
>>> 18: 0 0 0 0 IO-APIC 18-fasteoi i801_smbus
>>> 45: 13 17 9 13 IO-APIC 45-fasteoi mmc0
>>> 47: 0 0 0 0 IO-APIC 47-fasteoi mmc1
>>> 115: 7816 7632 7694 7549 PCI-MSI 311296-edge ahci[0000:00:13.0]
>>> 116: 43793 44661 43670 44402 PCI-MSI 1572864-edge eth0
>>> 117: 10759 0 0 0 chv-gpio 0
>>> 191: 0 0 0 0 chv-gpio 50 80860F14:01 cd
>>> 196: 35 37 40 41 PCI-MSI 1048576-edge iwlwifi
>>> 197: 24359 22517 23054 22852 PCI-MSI 327680-edge xhci_hcd
>>> 198: 174515 168035 165519 167706 PCI-MSI 32768-edge i915
>>> 199: 20 18 18 16 PCI-MSI 442368-edge snd_hda_intel:card0
>>> NMI: 0 0 0 0 Non-maskable interrupts
>>> LOC: 409748 405619 425239 440549 Local timer interrupts
>>> SPU: 0 0 0 0 Spurious interrupts
>>> PMI: 0 0 0 0 Performance monitoring interrupts
>>> IWI: 0 0 0 0 IRQ work interrupts
>>> RTR: 0 0 0 0 APIC ICR read retries
>>> RES: 74598 69435 68734 60007 Rescheduling interrupts
>>> CAL: 23842 28885 29873 26859 Function call interrupts
>>> TLB: 17407 23763 25060 21008 TLB shootdowns
>>> TRM: 0 0 0 0 Thermal event interrupts
>>> THR: 0 0 0 0 Threshold APIC interrupts
>>> MCE: 0 0 0 0 Machine check exceptions
>>> MCP: 45 44 44 44 Machine check polls
>>> ERR: 0
>>> MIS: 0
>>> PIN: 0 0 0 0 Posted-interrupt notification event
>>> PIW: 0 0 0 0 Posted-interrupt wakeup event

FYI: IRQ 117 is no longer listed in /proc/interrupts with BIOS v1.23.

>>
>>> $ cat /sys/kernel/debug/pinctrl/INT33FF:*/pin
>>> registered pins: 24
>>> pin 0 (PMU_SLP_S3_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 1 (PMU_BATLOW_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 2 (SUS_STAT_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 3 (PMU_SLP_S0IX_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 4 (PMU_AC_PRESENT) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 5 (PMU_PLTRST_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 6 (PMU_SUSCLK) mode 1 ctrl0 0x00010000 ctrl1 0x04c00000
>>> pin 7 (PMU_SLP_LAN_B) mode 1 ctrl0 0x00010000 ctrl1 0x04c00000
>>> pin 8 (PMU_PWRBTN_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 9 (PMU_SLP_S4_B) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
>>> pin 10 (PMU_WAKE_B) mode 1 ctrl0 0x00c10001 ctrl1 0x04c00000
>>> pin 11 (PMU_WAKE_LAN_B) GPIO ctrl0 0x00018300 ctrl1 0x04c00000
>>> pin 15 (MF_ISH_GPIO_3) mode 1 ctrl0 0x00010000 ctrl1 0x04c00000
>>> pin 16 (MF_ISH_GPIO_7) GPIO ctrl0 0x00018102 ctrl1 0x04c00000
>>> pin 17 (MF_ISH_I2C1_SCL) GPIO ctrl0 0x00018102 ctrl1 0x04c00000
>>> pin 18 (MF_ISH_GPIO_1) mode 1 ctrl0 0x00010000 ctrl1 0x04c00000
>>> pin 19 (MF_ISH_GPIO_5) GPIO ctrl0 0x10008201 ctrl1 0x04c00004
>
> It most probably is this one. Active high and the line is high so this
> is generating interrupts all the time if not masked.

Looks you were right. Here is a diff of the changes (-: BIOS v1.22, +:
BIOS v1.23):

-pin 19 (MF_ISH_GPIO_5) GPIO ctrl0 0x10008201 ctrl1 0x04c00004
+pin 19 (MF_ISH_GPIO_5) GPIO ctrl0 0x10008200 ctrl1 0x04c00004
-pin 49 (MF_LPC_CLKOUT1) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000
+pin 49 (MF_LPC_CLKOUT1) mode 1 ctrl0 0x00010000 ctrl1 0x04c00000
-pin 51 (MF_LPC_CLKOUT0) mode 1 ctrl0 0x00010000 ctrl1 0x04c00000
+pin 51 (MF_LPC_CLKOUT0) mode 1 ctrl0 0x00010001 ctrl1 0x04c00000

By the way, is the message
> Cannot set affinity for irq 191
in the above kernel logs something to care about?

Thanks,
Christian