Re: [PATCH] arm64: Add the arm64.nolse_atomics command line option

From: Mark Rutland
Date: Tue Jul 11 2023 - 06:34:46 EST


On Tue, Jul 11, 2023 at 06:15:49PM +0800, Aiqun(Maria) Yu wrote:
> On 7/11/2023 4:22 PM, Will Deacon wrote:
> > On Tue, Jul 11, 2023 at 12:02:22PM +0800, Aiqun(Maria) Yu wrote:
> > > On 7/10/2023 5:37 PM, Will Deacon wrote:
> > > > On Mon, Jul 10, 2023 at 01:59:55PM +0800, Maria Yu wrote:
> > > > > In order to be able to disable lse_atomic even if cpu
> > > > > support it, most likely because of memory controller
> > > > > cannot deal with the lse atomic instructions, use a
> > > > > new idreg override to deal with it.
> > > >
> > > > This should not be a problem for cacheable memory though, right?
> > > >
> > > > Given that Linux does not issue atomic operations to non-cacheable mappings,
> > > > I'm struggling to see why there's a problem here.
> > >
> > > The lse atomic operation can be issued on non-cacheable mappings as well.
> > > Even if it is cached data, with different CPUECTLR_EL1 setting, it can also
> > > do far lse atomic operations.
> >
> > Please can you point me to the place in the kernel sources where this
> > happens? The architecture doesn't guarantee that atomics to non-cacheable
> > mappings will work, see "B2.2.6 Possible implementation restrictions on
> > using atomic instructions". Linux, therefore, doesn't issue atomics
> > to non-cacheable memory.
>
> We encounter the issue on third party kernel modules

Which kernel modules?

Those modules are clearly broken; as Will has already said, the architecture
says doing atomics to non-cacheable memory can result in external aborts, and
that's exaclty the behaviour that you're reporting as a problem. This is
working *as designed*.

Note that the same is true for LDXR+STXR; so just hiding LSE doesn't make
sense: if the code falls back to LDXR+STXR it still suffers from the exact same
problem.

Regardless, hiding bugs in out-of-tree code is not a justification for changing
the upstream kernel.

> and third party apps instead of linux kernel itself.

Which apps?

Why are those apps using non-cacheable memory?

Why are those apps trying to perform atomics to non-cacheable memory?

> This is a tradeoff of performance and stability. Per my understanding,
> options can be used to enable the lse_atomic to have the most performance
> cared system, and disable the lse_atomic by stability cared most system.

I think that's a misrepresentation of this patch.

This patch disables a feature to *hide* bugs in out-of-tree kernel modules and
userspace software. It's not about making the system more stable, it's about
making broken code appear to work.

The LSE atomics aren't just about performance. They're significantly fairer
than LDXR+STXR in many practical situations, and contribute to the stability of
the system.

Thanks,
Mark.

> > > > Please can you explain the problem that you are trying to solve?
> > >
> > > In our current case, it is a 100% reproducible issue that happened for
> > > uncached data, the cpu which support LSE atomic, but the system's DDR
> > > subsystem is not support this and caused a NOC error and thus synchronous
> > > external abort happened.
> >
> > So? The Arm ARM allows this behaviour and Linux shouldn't run into it.
> >
> > Will
>
> --
> Thx and BRs,
> Aiqun(Maria) Yu
>