Re: ASMedia ASM1062 (AHCI) hang after "ahci 0000:28:00.0: Using 64-bit DMA addresses"

From: Lennert Buytenhek
Date: Wed Jan 17 2024 - 16:14:48 EST


On Tue, Jan 16, 2024 at 03:20:23PM +0100, Niklas Cassel wrote:

> Hello Lennert,

Hi Niklas,

Thanks for your reply!


> > On kernel 6.6.x, with an ASMedia ASM1062 (AHCI) controller, on an

Minor correction to this: lspci says that this is an ASM1062, but it's
actually an ASM1061. I think that the two parts share a PCI device ID,
and I've submitted a PCI ID DB change here:

https://admin.pci-ids.ucw.cz/read/PC/1b21/0612


> > ASUSTeK Pro WS WRX80E-SAGE SE WIFI mainboard, PCI ID 1b21:0612 and
> > subsystem ID 1043:858d, I got a total apparent controller hang,
> > rendering the two attached SATA devices unavailable, that was
> > immediately preceded by the following kernel messages:
> >
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: Using 64-bit DMA addresses
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00000 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00300 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00380 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00400 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00680 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00700 flags=0x0000]
> >
> > It seems as if the controller has problems with 64-bit DMA addresses,
> > and the comments around the source of the message in
> > drivers/iommu/dma-iommu.c seem to point into that same direction:
> >
> >         /*
> >          * Try to use all the 32-bit PCI addresses first. The original SAC vs.
> >          * DAC reasoning loses relevance with PCIe, but enough hardware and
> >          * firmware bugs are still lurking out there that it's safest not to
> >          * venture into the 64-bit space until necessary.
> >          *
> >          * If your device goes wrong after seeing the notice then likely either
> >          * its driver is not setting DMA masks accurately, the hardware has
> >          * some inherent bug in handling >32-bit addresses, or not all the
> >          * expected address bits are wired up between the device and the IOMMU.
> >          */
> >         if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
> >                 iova = alloc_iova_fast(iovad, iova_len,
> >                                        DMA_BIT_MASK(32) >> shift, false);
> >                 if (iova)
> >                         goto done;
> >
> >                 dev->iommu->pci_32bit_workaround = false;
> >                 dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
> >         }
>
> The DMA mask is set here:
> https://github.com/torvalds/linux/blob/v6.7/drivers/ata/ahci.c#L967
>
> And should be called using:
> hpriv->cap & HOST_CAP_64
> https://github.com/torvalds/linux/blob/v6.7/drivers/ata/ahci.c#L1929
>
> Where hpriv->cap is capabilities reported by the AHCI controller itself.
> So it definitely seems like your controller supports 64-bit addressing.

Perhaps, or maybe it's misreporting its capabilities: it is an old part
(from 2011 or before), and it doesn't seem to support 64-bit MSI
addressing either, which would be an odd restriction for a part with a
64-bit DMA engine:

# lspci -s 28:00.0 -vv | grep -A1 MSI:
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee00000 Data: 0000
#

(I checked the available datasheets, but there is no mention of whether
or not the part supports 64-bit DMA.)
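
For anyone who wants to repeat the MSI check on other controllers, here is
a minimal sketch of how the "64bit" flag could be pulled out of lspci
output. The sample text is the output shown above; the function name
msi_is_64bit is made up for this illustration. Note that this flag only
describes MSI addressing, so it is a hint about the part's design, not
proof of what its DMA engine can do.

```python
import re

# Sample text copied from the lspci output shown above.
LSPCI_SAMPLE = """\
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
Address: fee00000 Data: 0000
"""


def msi_is_64bit(lspci_text):
    """Return True/False for the MSI '64bit+/-' flag, or None if absent.

    Hypothetical helper: it only looks at the textual flag that
    'lspci -vv' prints on the MSI capability line.
    """
    match = re.search(r"MSI:.*\b64bit([+-])", lspci_text)
    if match is None:
        return None
    return match.group(1) == "+"


print(msi_is_64bit(LSPCI_SAMPLE))
```

On a live system the input would be the output of "lspci -s 28:00.0 -vv"
rather than the embedded sample.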


> I guess it could be some problem with your BIOS.
> Have you tried updating your BIOS?

The machine is running the latest BIOS available from the vendor at
the time of this writing, version 1201:

# dmidecode | grep -A2 "^BIOS Information"
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 1201
#

Per:

https://www.asus.com/motherboards-components/motherboards/workstation/pro-ws-wrx80e-sage-se-wifi/helpdesk_bios?model2Name=Pro-WS-WRX80E-SAGE-SE-WIFI

However, some Googling suggests that the ASM106x loads its own firmware
from a directly attached SPI flash chip, and there are several versions
of this firmware available in the wild, with different versions of the
firmware apparently available for legacy IDE mode and for AHCI mode. If
(some of) the AHCI logic is indeed contained inside the firmware, I
could see a firmware bug leading to the controller incorrectly presenting
itself as being 64-bit DMA capable.
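
To make concrete why a single misreported capability bit would matter:
as Niklas pointed out, the driver derives the DMA mask from
hpriv->cap & HOST_CAP_64. The following is a simplified model of that
decision, not the kernel's actual code. HOST_CAP_64 is bit 31 of the AHCI
CAP register; the two CAP values below are hypothetical and chosen only
to toggle that bit, and the function names are made up for this sketch.

```python
# Illustrative model only; the real logic lives in drivers/ata/ahci.c.
HOST_CAP_64 = 1 << 31  # AHCI CAP bit 31: controller claims 64-bit addressing


def dma_bit_mask(bits):
    """Rough equivalent of the kernel's DMA_BIT_MASK(bits), 1 <= bits <= 64."""
    return (1 << bits) - 1


def ahci_dma_mask(cap):
    """Pick a DMA mask purely from the controller's self-reported CAP value."""
    using_dac = bool(cap & HOST_CAP_64)
    return dma_bit_mask(64 if using_dac else 32)


# Hypothetical CAP register values (not read from real hardware).
cap_claims_64bit = 0x80000001
cap_32bit_only = 0x00000001

print(hex(ahci_dma_mask(cap_claims_64bit)))
print(hex(ahci_dma_mask(cap_32bit_only)))
```

The point is that the driver has no independent way of verifying the
claim: if the firmware sets that bit incorrectly, a 64-bit mask is used
regardless of what the hardware can actually address.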

Some poking around in the BIOS image suggests that there is no copy of
the ASM106x firmware inside the BIOS image. In other words, it could be
that, even though the machine is running the latest available BIOS, the
ASM1061 might be running an older firmware version.

The ASM1061 firmware does not seem to be readable from software via a
ROM BAR, and it doesn't seem to be readable from software in general (the
vendor-supplied DOS .exe updater tool only allows you to erase or
update the SPI flash), so I can't check which firmware version it is
currently using.


> If that does not work, perhaps you could try this (completely untested) patch:
> (You might need to modify the strings to match the exact strings reported by
> your BIOS.)

Thanks for the patch!

I will do some tests with PCI passthrough to a VM, to see whether the
controller mangles DMA addresses and, if so, exactly how.
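
As a starting point for that comparison, here is a rough sketch that
extracts the faulting addresses from the AMD-Vi messages and reports how
many significant bits each one has. The two log lines are taken from the
report quoted above (timestamps dropped); the function name
fault_addresses is made up for this illustration. This only shows which
addresses the IOMMU saw, not which addresses the driver had actually
mapped, so by itself it cannot say how (or whether) an address was
altered.

```python
import re

# Two of the IO_PAGE_FAULT lines from the log quoted earlier in this mail.
LOG = """\
ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00000 flags=0x0000]
ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00300 flags=0x0000]
"""


def fault_addresses(log_text):
    """Return the faulting addresses from AMD-Vi IO_PAGE_FAULT lines as ints."""
    pattern = r"IO_PAGE_FAULT .*?address=(0x[0-9a-fA-F]+)"
    return [int(addr, 16) for addr in re.findall(pattern, log_text)]


for addr in fault_addresses(LOG):
    above_4g = addr > 0xFFFFFFFF
    print(f"{addr:#x}: {addr.bit_length()} bits, above 4 GiB: {above_4g}")
```

Both sample addresses need 43 bits, so they are well outside the 32-bit
range that the IOVA allocator tries first.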

I've also ordered a discrete PCIe card with an ASM1061 chip on it, and I
will perform similar tests with that card, to see exactly where the issue
is, i.e. whether it is specific to this mainboard or not.

I will follow up once I have more information.

Kind regards,
Lennert