Re: [PATCH v2] PCI: Mark LSI FW643 to avoid bus reset

From: edmund.raile
Date: Thu Mar 28 2024 - 14:35:56 EST


So Bjorn Helgaas beat me to it,

I'm monitoring kernel messages with
```
sudo journalctl --system -f
```

Indeed all that is necessary to generate the trace from the patch is to unbind the FW643 from ohci on my system (unpatched kernel):
```
su -c 'echo -n "0000:03:00.0" > /sys/bus/pci/drivers/firewire_ohci/unbind'
```

In combination with the kernel parameters intel_iommu=on iommu=pt, I can then bind it to vfio-pci
```
su -c 'echo -n "0000:03:00.0" > /sys/bus/pci/drivers/vfio-pci/bind'
```

And then I'd attach it to a qemu Windows VM using `-device vfio-pci,host=03:00.0 \`.
The OS finds the firewire card straightaway and the RME FireFace driver connects on booting just like it would when booting the guest on bare metal.

> $ echo 1 > sudo tee -a /sys/devices/pci0000:00/0000:00:01.2/0000:03:00.0/0000:04:02.0/0000:06:00.0/reset
> (nothing happens)

Replicating the reset line Takashi sent on my system, there is no error in the kernel log but playback doesn't stop either, leading me to believe that permissions of sudo may be insufficient (root-only)?
```
echo 1 > sudo tee -a /sys/devices/pci0000\:00/0000\:00\:1c.1/0000\:03\:00.0/reset
```

So instead I ran this:
```
su -c 'echo 1 > /sys/devices/pci0000\:00/0000\:00\:1c.1/0000\:03\:00.0/reset'
```
Playback stopped immediately and could not be resumed.

Then I received this trace:

INFO: task alsa-sink-Firef:4110 blocked for more than 245 seconds.
Tainted: G W OE 6.6.10-1-MANJARO #1
task:alsa-sink-Firef state:D stack:0 pid:4110 ppid:2657 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x3e7/0x1410
? tlb_batch_pages_flush+0x3d/0x70
schedule+0x5e/0xd0
schedule_timeout+0x151/0x160
wait_for_completion+0x8a/0x160
fw_run_transaction+0xe5/0x120 [firewire_core d9ff4eaf1ffb23a203d413e851f405323b49fec7]
? __pfx_split_transaction_timeout_callback+0x10/0x10 [firewire_core d9ff4eaf1ffb23a203d413e851f405323b49fec7]
? __pfx_transmit_complete_callback+0x10/0x10 [firewire_core d9ff4eaf1ffb23a203d413e851f405323b49fec7]
? __pfx_transaction_callback+0x10/0x10 [firewire_core d9ff4eaf1ffb23a203d413e851f405323b49fec7]
snd_fw_transaction+0x70/0x110 [snd_firewire_lib 30b43a591db389bbc6be51459cb243ba1fe1e662]
ff800_finish_session+0x43/0x70 [snd_fireface 5f7f3f556960f4838886792be8e9c18aa5089b0a]
snd_ff_stream_stop_duplex+0x39/0x70 [snd_fireface 5f7f3f556960f4838886792be8e9c18aa5089b0a]
pcm_hw_free+0x3c/0x50 [snd_fireface 5f7f3f556960f4838886792be8e9c18aa5089b0a]
snd_pcm_common_ioctl+0xe28/0x12b0 [snd_pcm 24933227879438b755ef98bc4844113025f38cdf]
? __seccomp_filter+0x32c/0x510
? __vm_munmap+0xbb/0x150
snd_pcm_ioctl+0x2e/0x50 [snd_pcm 24933227879438b755ef98bc4844113025f38cdf]
__x64_sys_ioctl+0x94/0xd0
do_syscall_64+0x5d/0x90
? syscall_exit_to_user_mode+0x2b/0x40
? do_syscall_64+0x6c/0x90
? do_syscall_64+0x6c/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Now of course yanking the device from underneath streaming snd_firewire is not an intended usecase and I get that this has good cause to produce a trace but it shows that root privileges are necessary here.

Doing this without running playback yields the same result.

In both instances, I had to REISUO the system as it would keep waiting for that thread to finish when shutting down.

When I instead turned the FireFace off, I got this kernel message:
snd_fireface fw1.0: transaction failed: bus reset
snd_fireface fw1.0: transaction failed: bus reset
snd_fireface fw1.0: transaction failed: bus reset
Then running the same command, this time no threads went into D state and start producing spinlock messages/traces but by that point, BUT I wasn't able to use the FireFace after powering it back on again.
At least shutdown worked normally this time.

Are there other steps necessary to get the FW643 back working again after this?

System:
MSI PRO Z690-A DDR4(MS-7D25)
StarTech FW800 PCIe card (LSI FW643)

A peculiarity about this system that may or may not be of relevance here:
When rebooting, the FW643 can not be seen by the system any more.
Its root port also won't show up, regardless of slot used.
I have to perform a full shutdown for it to be recognized again.
This behavior is OS-independent, but it never happened on my previous Z68 system with this card (or any), there reboot worked as you'd expect.
The original MSI BIOS was buggy in this regard (and many others) in that it would not even recognize the card half the time booting from power-off.
MSI support did not care, said it was an "old device". How they can claim PCIe compliance is beyond me.
This is why I'm running Dasharo (coreboot), it always picks up the card from full power-off and behaves very predictably in other regards.
It might even be an issue with modern Intel since I've read of very similar issues on other manufacturer's Z690 and even Z790 boards (missing PCIe devices after reboot).
I've learned to live with this as I don't know how to solve it and I'm
stuck on this platform. Should of bought AMD again this time around.

> Can you both collect the output of "sudo lspci -vvv" so we can try to figure out the difference?
> This is my 1394 OHCI hardware.
```
sudo lspci -vvv
03:00.0 FireWire (IEEE 1394): LSI Corporation FW643 [TrueFire] PCIe 1394b Controller (rev 08) (prog-if 10 [OHCI])
Subsystem: Device 5901:1101
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 136
IOMMU group: 13
Region 0: Memory at 50800000 (64-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME+
Capabilities: [4c] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00358 Data: 0000
Capabilities: [60] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 2048 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable- ID=0 ArbSelect=Fixed TC/VC=00
Status: NegoPending- InProgress-
Capabilities: [170 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: firewire_ohci
Kernel modules: firewire_ohci
```
This does not change whether I boot patched or unpatched kernel.

> > > Can you collect the output of:
> > >
> > > $ find /sys/devices -name reset_method | xargs grep .
> Edmund, could you run this command, too?

with patch applied:
```
sudo find /sys/devices -name reset_method | xargs grep .
/sys/devices/pci0000:00/0000:00:1c.0/0000:02:00.0/reset_method:pm bus
/sys/devices/pci0000:00/0000:00:1c.0/reset_method:pm
/sys/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/reset_method:pm
/sys/devices/pci0000:00/0000:00:1c.1/reset_method:pm
/sys/devices/pci0000:00/0000:00:1a.0/0000:01:00.0/reset_method:flr bus
/sys/devices/pci0000:00/0000:00:1a.0/reset_method:pm
/sys/devices/pci0000:00/0000:00:1d.0/reset_method:pm
/sys/devices/pci0000:00/0000:00:1d.0/0000:05:00.0/reset_method:flr bus
/sys/devices/pci0000:00/0000:00:02.0/reset_method:flr pm
/sys/devices/pci0000:00/0000:00:1c.2/reset_method:pm
/sys/devices/pci0000:00/0000:00:1c.2/0000:04:00.0/reset_method:flr bus
```
without applied patch only bus reset method is added, everything else
stays the same:
```
/sys/devices/pci0000:00/0000:00:1c.1/0000:03:00.0/reset_method:pm bus
```

This is the root port it is currently connected to:
```
sudo lspci -vvv
00:1c.1 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #2 (rev 11) (prog-if 00 [Normal decode])
Subsystem: Micro-Star International Co., Ltd. [MSI] Alder Lake-S PCH PCI Express Root Port
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 124
IOMMU group: 7
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
I/O behind bridge: f000-0fff [disabled] [16-bit]
Memory behind bridge: 50800000-508fffff [size=1M] [32-bit]
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #2, Speed 8GT/s, Width x1, ASPM not supported
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1
TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #1, PowerLimit 10W; Interlock- NoCompl+
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet- LinkState+
RootCap: CRSVisible-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 2
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled, ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00278 Data: 0000
Capabilities: [98] Subsystem: Micro-Star International Co., Ltd. [MSI] Alder Lake-S PCH PCI Express Root Port
Capabilities: [a0] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [220 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
Capabilities: [150 v1] Precision Time Measurement
PTMCap: Requester:- Responder:+ Root:+
PTMClockGranularity: 4ns
PTMControl: Enabled:+ RootSelected:+
PTMEffectiveGranularity: Unknown
Capabilities: [a30 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [a90 v1] Data Link Feature <?>
Kernel driver in use: pcieport
```

`lspci -tv` reveals that it is connected to root port #2:
+-1c.1-[03]----00.0 LSI Corporation FW643 [TrueFire] PCIe 1394b Controller

I hope this information is of use to you.
Maybe there is a better solution?

Kind regards,
Edmund Raile