Re: data corruption with nvidia chipsets and IDE/SATA drives //memory hole mapping related bug?!

From: Karsten Weiss
Date: Wed Dec 13 2006 - 15:34:58 EST


On Wed, 13 Dec 2006, Chris Wedgwood wrote:

> > Any ideas why iommu=disabled in the bios does not solve the issue?
>
> The kernel will still use the IOMMU if the BIOS doesn't set it up if
> it can, check your dmesg for IOMMU strings, there might be something
> printed to this effect.

FWIW: As far as I understand the Linux kernel code (I am no kernel
developer, so please correct me if I am wrong), the PCI DMA mapping code is
abstracted by struct dma_mapping_ops, and there are currently four
possible implementations for x86_64 (see linux-2.6/arch/x86_64/kernel/):

1. pci-nommu.c : no IOMMU at all (e.g. because you have < 4 GB memory)
   Kernel boot message: "PCI-DMA: Disabling IOMMU."

2. pci-gart.c : (AMD) Hardware-IOMMU.
   Kernel boot message: "PCI-DMA: using GART IOMMU" (this message
   first appeared in 2.6.16)

3. pci-swiotlb.c : Software-IOMMU (used e.g. if there is no hw iommu)
   Kernel boot message: "PCI-DMA: Using software bounce buffering
   for IO (SWIOTLB)"

4. pci-calgary.c : Calgary HW-IOMMU from IBM; used in xSeries servers.
   This HW-IOMMU supports dma address mapping with memory protection,
   etc.
   Kernel boot message: "PCI-DMA: Using Calgary IOMMU" (since 2.6.18!)

What all this means is that you can use "dmesg|grep ^PCI-DMA:" to see
which implementation your kernel is currently using.
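
For anyone who wants to collect this from several machines, here is a
rough Python sketch of the same check. It only knows the four boot
messages quoted above. The function names and the optional
"[timestamp]" prefix handling are my own assumptions and not anything
the kernel provides, so treat it as an illustration rather than a
definitive tool.

```python
import re

# Boot messages quoted in this mail, mapped to the source file that
# prints them. Keys are lower-case because the messages differ in
# capitalisation ("using GART IOMMU" vs. "Using Calgary IOMMU").
PCI_DMA_MESSAGES = {
    "pci-dma: disabling iommu": "pci-nommu.c",
    "pci-dma: using gart iommu": "pci-gart.c",
    "pci-dma: using software bounce buffering": "pci-swiotlb.c",
    "pci-dma: using calgary iommu": "pci-calgary.c",
}

# Assumption: some kernels prefix dmesg lines with "[   12.345678] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def classify_line(line):
    """Return the implementation file for one dmesg line, or None."""
    text = TIMESTAMP.sub("", line.strip()).lower()
    for prefix, impl in PCI_DMA_MESSAGES.items():
        if text.startswith(prefix):
            return impl
    return None


def detect_implementations(dmesg_text):
    """Return the implementations mentioned in a dmesg dump, in order."""
    found = []
    for line in dmesg_text.splitlines():
        impl = classify_line(line)
        if impl is not None and impl not in found:
            found.append(impl)
    return found
```

Feeding the saved output of dmesg into detect_implementations() gives
a list such as ["pci-gart.c"]. An empty list only means that none of
the four messages above was found in the text.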

As far as our problem machines are concerned, the "PCI-DMA: using GART
IOMMU" case is broken (data corruption). But both "PCI-DMA: Disabling
IOMMU" (triggered with mem=2g) and "PCI-DMA: Using software bounce
buffering for IO (SWIOTLB)" (triggered with iommu=soft) are stable.

BTW: It would be really great if this area of the kernel got more and
better documentation. The information in
linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had to
read the code to get a *rough* idea of what all the "iommu=" options
actually do and how they interact.

> > 1) And does this now mean that there's an error in the hardware
> > (chipset or CPU/memcontroller)?
>
> My guess is it's a kernel bug, I don't know for certain. Perhaps we
> should start making a more comprehensive list of affected kernels &
> CPUs?

BTW: Did someone already open an official bug at
http://bugzilla.kernel.org ?

Best regards,
Karsten

--
__________________________________________creating IT solutions
Dipl.-Inf. Karsten Weiss science + computing ag
phone: +49 7071 9457 452 Hagellocher Weg 73
teamline: +49 7071 9457 681 72070 Tuebingen, Germany
email: knweiss@xxxxxxxxxxxxxxxxxxxx www.science-computing.de
