Re: [RFC PATCH 2/9] memory: scrub: sysfs: Add Documentation entries for set of scrub attributes

From: Jonathan Cameron
Date: Fri Sep 22 2023 - 06:20:16 EST


On Thu, 21 Sep 2023 17:07:04 -0700
Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:

> On Fri, Sep 15, 2023 at 10:29 AM <shiju.jose@xxxxxxxxxx> wrote:
> >
> > From: Shiju Jose <shiju.jose@xxxxxxxxxx>
> >
> > Add sysfs documentation entries for the set of attributes those are
> > exposed in /sys/class/scrub/ by the scrub driver. These attributes
> > support configuring parameters of a scrub device.
> >
> > Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> > ---
> > .../ABI/testing/sysfs-class-scrub-configure | 82 +++++++++++++++++++
> > 1 file changed, 82 insertions(+)
> > create mode 100644 Documentation/ABI/testing/sysfs-class-scrub-configure
> >
> > diff --git a/Documentation/ABI/testing/sysfs-class-scrub-configure b/Documentation/ABI/testing/sysfs-class-scrub-configure
> > new file mode 100644
> > index 000000000000..347e2167dc62
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-class-scrub-configure
> > @@ -0,0 +1,82 @@
> > +What: /sys/class/scrub/
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + The scrub/ class subdirectory belongs to the
> > + scrubber subsystem.
> > +
> > +What: /sys/class/scrub/scrubX/
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + The /sys/class/scrub/scrub{0,1,2,3,...} directories
>
> This API (sysfs interface) is very specific to the ACPI interface
> defined for hardware patrol scrubber. I wonder can we have some
> interface that is more generic, for a couple of reasons:

Agreed that it makes sense to define a broad interface. We have
some hardware focused drivers we can't share yet (IP rules until
a release date in the near future) where this is a reasonable fit
- but indeed there are others such as mapping this to DDR ECS
where it isn't a great mapping.

I'd love to come up with an interface that has the right blend
of generality and flexibility. That is easiest done before we have
any implementation merged.

>
> 1. I am not aware of any chip/platform hardware that implemented the
> hw ps part defined in ACPI RASF/RAS2 spec. So I am curious what the
> RAS experts from different hardware vendors think about this. For
> example, Tony and Dave from Intel, Jon and Vilas from AMD. Is there
> any hardware platform (if allowed to disclose) that implemented ACPI
> RASF/RAS2? If so, will vendors continue to support the control of
> patrol scrubber using the ACPI spec? If not (as Tony said in [1], will
> the vendor consider starting some future platform?
>
> If we are unlikely to get the vendor support, creating this ACPI
> specific sysfs API (and the driver implementations) in Linux seems to
> have limited meaning.

There is a bit of a chicken and egg problem here. Until there is
reasonable support in kernel (or it looks like there will be),
BIOS teams push back on a requirement to add the tables.
I'd encourage no one to bother with RASF - RAS2 is much less
ambiguous.

>
> > + correspond to each scrub device.
> > +
> > +What: /sys/class/scrub/scrubX/name
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + (RO) name of the memory scrub device
> > +
> > +What: /sys/class/scrub/scrubX/regionY/
>
> 2. I believe the concept of "region" here is probably from
> PATROL_SCRUB defined in “APCI section 5.2.20.5. Parameter Block". It
> is indeed powerful: if a process's physical memory spans over multiple
> memory controllers, OS can in theory scrub chunks of the memory
> belonging to the process. However, from a previous discussion [1],
> "From a h/w perspective it might always be complex". IIUC, the address
> translation from physical address to channel address is hard to
> achieve, and probably that's one of the tech reasons the patrol scrub
> ACPI spec is not in practice implemented?

Next bit is kind of an aside as I mostly agree with your conclusions ;)

This obviously depends on your memory interleave. You want to present
physical address ranges as single controllable regions - probably
most interesting being stuff that maps to NUMA nodes. The short
answer is that any firmware control will end up writing to all the
memory controllers involved in a given PA range - firmware can easily
establish which those are.

A memory controller can support multiple scrub regions
which map from a PA range to a particular set of RAM addresses
- that's implementation specific. The memory controller is getting
the host PA and can carry out appropriate hashing if it wants to.
Many scrub solutions won't do this - in which case it's max one
region per memory controller (mapped by firmware to one region per
interleave set - assuming interleave sets take in whole memory
controllers - which they normally do).

I would expect existing systems (not design with this differentiated
scrub in mind) to only support scrub control by PA range at the
granularity of interleave sets.

Note that this general question of PA based controls also
maps to things like resource control (resctl) where it's only interesting
to partition memory bandwidth such that the partition applies to the
whole interleave set - that's done for ARM systems anyway by having
the userspace interface pretend there is only one control, but
write the settings to all actual controllers involved. Not sure what
x86 does.

Taking a few examples that I know of. All 4 socket server - with
control of these as bios options ;).
Assuming individual memory controllers don't allow scrub to be
configured by PA range.

1. Full system memory interleave (people do this form of crazy)
In that case, there is only one set of firmware controls
that write to the interfaces of every memory controller. Depending
on the interleave design that may still allow multiple regions.

2. Socket wide memory interleave. In that case, firmware controls
need to effect all memory controllers in that socket if the
requested 'region' maps to them. So 4 PA regions.

3. Die wide memory interleave. Finer granularity of control
so perhaps 8 PA rgiones.

4. Finer granularity (there are reasons to do this for above mentioned
bandwidth resource control which you can only do if not interleaving
across multiple controllers).



>
> So my take is, control at the granularity of the memory controller is
> probably a nice compromise.
> Both OS and userspace can get a pretty
> decent amount of flexibility, start/stop/adjust speed of the scrubbing
> on a memory controller; meanwhile it doesn't impose too much pain to
> hardware vendors when they provide these features in hardware. In
> terms of how these controls/features will be implemented, I imagine it
> could be implemented:
> * via hardware registers that directly or indirectly control on memory
> controllers
> * via ACPI if the situation changes in 10 years (and the RASF/RAS2/PCC
> drivers implemented in this patchset can be directly plugged into)
> * a kernel-thread that uses cpu read to detect memory errors, if
> hardware support is unavailable or not good enough
>

I more or less agree, but would tweak a little.

1) Allow for multiple regions per memory controller - that's needed
for RASF etc anyway - because as far as they are concerned there
is only one interface presented.
2) Tie the interface to interleave sets, not memory controllers.
NUMA nodes often being a good stand in for those.
Controlling memory controllers separately for parts of an interleave
isn't something I'd think was very useful. This will probably get
messy in the future though and the complexity could be pushed to
a userspace tool - as long as enough info is available elsewhere
in sysfs. So need those memory controller directories you propose
to include info on the PA ranges that they are involved in backing
(which to keep things horrible, can change at runtime via memory
hotplug and remapping of host physical address ranges on CXL etc)

> Given these possible backends of scrubbing, I think a more generic
> sysfs API that covers and abstracts these backends will be more
> valuable right now. I haven’t thought thoroughly, but how about
> defining the top-level interface as something like
> “/sys/devices/system/memory_controller_scrubX/”, or
> “/sys/class/memory_controllerX/scrub”?

No particular harm in the rename of the directory I guess.
Though some of those 'memory_controllers' would be virtual as they
wouldn't correspond to actual memory controllers but rather to
sets of them.

Jonathan

>
> [1] https://lore.kernel.org/linux-mm/SJ1PR11MB6083BF93E9A88E659CED5EC4FC3F9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#m13516ee35caa05b506080ae805bee14f9f958d43

>
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + The /sys/class/scrub/scrubX/region{0,1,2,3,...}
> > + directories correspond to each scrub region under a scrub device.
> > + Scrub region is a physical address range for which scrub may be
> > + separately controlled. Regions may overlap in which case the
> > + scrubbing rate of the overlapped memory will be at least that
> > + expected due to each overlapping region.
> > +
> > +What: /sys/class/scrub/scrubX/regionY/addr_base
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + (RW) The base of the address range of the memory region
> > + to be patrol scrubbed.
> > + On reading, returns the base of the memory region for
> > + the actual address range(The platform calculates
> > + the nearest patrol scrub boundary address from where
> > + it can start scrubbing).
> > +
> > +What: /sys/class/scrub/scrubX/regionY/addr_size
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + (RW) The size of the address range to be patrol scrubbed.
> > + On reading, returns the size of the memory region for
> > + the actual address range.
> > +
> > +What: /sys/class/scrub/scrubX/regionY/enable
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + (WO) Start/Stop scrubbing the memory region.
> > + 1 - enable the memory scrubbing.
> > + 0 - disable the memory scrubbing..
> > +
> > +What: /sys/class/scrub/scrubX/regionY/speed_available
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + (RO) Supported range for the partol speed(scrub rate)
> > + by the scrubber for a memory region.
> > + The unit of the scrub rate vary depends on the scrubber.
> > +
> > +What: /sys/class/scrub/scrubX/regionY/speed
> > +Date: September 2023
> > +KernelVersion: 6.7
> > +Contact: linux-kernel@xxxxxxxxxxxxxxx
> > +Description:
> > + (RW) The partol speed(scrub rate) on the memory region specified and
> > + it must be with in the supported range by the scrubber.
> > + The unit of the scrub rate vary depends on the scrubber.
> > --
> > 2.34.1
> >
> >
>