Re: [PATCH] cxl/pci: Fix disabling CXL memory for zero-based addressing

From: Robert Richter
Date: Tue Feb 13 2024 - 14:10:04 EST


On 13.02.24 10:40:07, Dan Williams wrote:
> Robert Richter wrote:
> > Dan,
> >
> > On 09.02.24 12:22:01, Dan Williams wrote:
> > > Robert Richter wrote:
> >
> > > > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > > > index 569354a5536f..3a36a2f0c94f 100644
> > > > --- a/drivers/cxl/core/pci.c
> > > > +++ b/drivers/cxl/core/pci.c
> > > > @@ -466,6 +466,18 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> > > > for (i = 0, allowed = 0; info->mem_enabled && i < info->ranges; i++) {
> > > > struct device *cxld_dev;
> > > >
> > > > + /*
> > > > + * Handle zero-based hardware addresses
> > > > + */
> > > > + if (!info->dvsec_range[i].start &&
> > > > + info->dvsec_range[i].end != CXL_RESOURCE_NONE &&
> > > > + info->dvsec_range[i].end) {
> > > > + dev_dbg(dev, "Zero-based hardware range found [%#llx - %#llx]\n",
> > > > + info->dvsec_range[i].start, info->dvsec_range[i].end);
> > > > + allowed++;
> > > > + continue;
> > > > + }
> > > > +
> > >
> > > I am not comfortable with this. It should be checking a platform
> > > specific quirk, or similar for the possibility of HPA != SPA. The
> > > entirety of the Linux CXL subsystem is built on the assumption that HPA
> > > == SPA, and if a platform wants to inject an offset between those Linux
> > > needs some way to enumerate that it is running in that new world. Yes,
> > > nothing in the CXL specification precludes HPA != SPA, but Linux has
> > > long since shipped the opposite assumption.
> >
> > this check prevents the memory from disabling an enabled decoder. So it
> > just keeps everything as it comes out of firmware.
> >
> > Can you explain the motivation why active memory is disabled?
>
> It is a sanity check that Linux is operating in a CXL world that it
> understands. The fundamental assumption is that the CFMWS correctly
> conveys the CXL space, and that the HW decoder resources match platform
> expectations match Linux resource management.

It would be sane to just not use CXL if assumptions on it are not
valid and not to break system to boot.

>
> > This may take system memory offline and could lead to a kernel hang.
>
> Yes, that is not an unreasonable result when Linux fundamental
> assumptions are violated.

BUG_ON(fw_table_broken)? If at all, it is not mandatory to have a
CFMWS. Btw, the check is more strict and also checks memory
attributes. It is very likely something can break.

>
> > The same could happen if the CEDT is broken or just missing (it is not
> > mandatory for 1.1). Such systems just die when booting. So the check
> > to take memory offline should be changed in a way that it will be safe
> > to disable it.
> >
> > A platform check would fix that only for certain systems.
>
> I am not worried about platforms that accidentally break the CEDT, those
> mistakes are typically caught pre-production. Otherwise, if the platform
> is designed to break assumptions then Linux needs explicit enabling for
> the address translation sitting between the endpoint and the platform
> address map.

As said, if assumptions break, just let cxl init fail.

Decoders should just be disabled if the address mapping is invalid.
There is no point to tear down the system.

A platform check is fine to me, but in the end this looks like a
whitelist to let systems boot.

Thanks,

-Robert