Re: [PATCH] cxl/pci: Fix disabling CXL memory for zero-based addressing

From: Dan Williams
Date: Tue Feb 13 2024 - 13:40:29 EST


Robert Richter wrote:
> Dan,
>
> On 09.02.24 12:22:01, Dan Williams wrote:
> > Robert Richter wrote:
>
> > > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > > index 569354a5536f..3a36a2f0c94f 100644
> > > --- a/drivers/cxl/core/pci.c
> > > +++ b/drivers/cxl/core/pci.c
> > > @@ -466,6 +466,18 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> > > for (i = 0, allowed = 0; info->mem_enabled && i < info->ranges; i++) {
> > > struct device *cxld_dev;
> > >
> > > + /*
> > > + * Handle zero-based hardware addresses
> > > + */
> > > + if (!info->dvsec_range[i].start &&
> > > + info->dvsec_range[i].end != CXL_RESOURCE_NONE &&
> > > + info->dvsec_range[i].end) {
> > > + dev_dbg(dev, "Zero-based hardware range found [%#llx - %#llx]\n",
> > > + info->dvsec_range[i].start, info->dvsec_range[i].end);
> > > + allowed++;
> > > + continue;
> > > + }
> > > +
> >
> > I am not comfortable with this. It should be checking a platform
> > specific quirk, or similar for the possibility of HPA != SPA. The
> > entirety of the Linux CXL subsystem is built on the assumption that HPA
> > == SPA, and if a platform wants to inject an offset between those Linux
> > needs some way to enumerate that it is running in that new world. Yes,
> > nothing in the CXL specification precludes HPA != SPA, but Linux has
> > long since shipped the opposite assumption.
>
> this check prevents the memory from disabling an enabled decoder. So it
> just keeps everything as it comes out of firmware.
>
> Can you explain the motivation why active memory is disabled?

It is a sanity check that Linux is operating in a CXL world that it
understands. The fundamental assumption is that the CFMWS correctly
conveys the CXL space, and that the HW decoder resources match platform
expectations match Linux resource management.

> This may take system memory offline and could lead to a kernel hang.

Yes, that is not an unreasonable result when Linux fundamental
assumptions are violated.

> The same could happen if the CEDT is broken or just missing (it is not
> mandatory for 1.1). Such systems just die when booting. So the check
> to take memory offline should be changed in a way that it will be safe
> to disable it.
>
> A platform check would fix that only for certain systems.

I am not worried about platforms that accidentally break the CEDT, those
mistakes are typically caught pre-production. Otherwise, if the platform
is designed to break assumptions then Linux needs explicit enabling for
the address translation sitting between the endpoint and the platform
address map.