Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

From: Kani, Toshi
Date: Fri Mar 02 2018 - 11:22:37 EST


On Fri, 2018-03-02 at 09:34 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 14:31 -0800, Linus Torvalds wrote:
> > On Thu, Mar 1, 2018 at 2:06 PM, Benjamin Herrenschmidt <benh@xxxxxxxxxxx> wrote:
> > >
> > > Could be that x86 has the smarts to do the right thing, still trying to
> > > untangle the code :-)
> >
> > Afaik, x86 will not cache PCI unless the system is misconfigured, and
> > even then it's more likely to just raise a machine check exception
> > than cache things.
> >
> > The last-level cache is going to do fills and spills directly to the
> > memory controller, not to the PCIe side of things.
> >
> > (I guess you *can* do things differently, and I wouldn't be surprised
> > if some people inside Intel did try to do things differently with
> > trying nvram over PCIe, but in general I think the above is true)
> >
> > You won't find it in the kernel code either. It's in hardware with
> > firmware configuration of what addresses are mapped to the memory
> > controllers (and _how_ they are mapped) and which are not.
>
> Ah thanks ! Thanks explains. We can fix that on ppc64 in our linear
> mapping code by checking the address vs. memblocks to chose the right
> page table attributes.

FWIW, this thing is called MTRRs on x86, which are initialized by BIOS.
These registers effectively overwrite page table setups. Intel SDM
defines the effect as follows. 'PAT Entry Value' is the page table
setup.

MTRR Memory Type PAT Entry Value Effective Memory Type
--------------------------------------------------------
UC UC UC
UC WC WC
UC WT UC
UC WB UC
UC WP UC

On my system, BIOS sets MTRRs to cover the entire MMIO ranges with UC.
Other BIOSes may simply set the MTRR default type to UC, i.e. uncovered
ranges become UC.

# cat /proc/mtrr
:
reg01: base=0xc0000000000 (12582912MB), size=2097152MB, count=1:
uncachable
:

# cat /proc/iomem | grep 'PCI Bus'
:
c0000000000-c3fffffffff : PCI Bus 0000:00
c4000000000-c7fffffffff : PCI Bus 0000:11
c8000000000-cbfffffffff : PCI Bus 0000:36
cc000000000-cffffffffff : PCI Bus 0000:5b
d0000000000-d3fffffffff : PCI Bus 0000:80
d4000000000-d7fffffffff : PCI Bus 0000:85
d8000000000-dbfffffffff : PCI Bus 0000:ae
dc000000000-dffffffffff : PCI Bus 0000:d7

-Toshi