Re: [patch 04/12] clockevent unbind: use smp_call_func_single_fail

From: Marcelo Tosatti
Date: Wed Feb 14 2024 - 14:06:13 EST


On Sun, Feb 11, 2024 at 09:52:35AM +0100, Thomas Gleixner wrote:
> On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote:
> > On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
> >
> > OK, so the problem is the following: due to software complexity, one is
> > often not aware of all operations that might take place.
>
> The problem is that people throw random crap on their systems and avoid
> proper system engineering and then complain that their realtime
> constraints are violated. So you are proliferating bad engineering
> practices and encourage people not to care.

It's more of a practicality and cost concern: one usually does not have
the resources to fully review software before using it.

> > Now think of all possible paths, from userspace, that lead to kernel
> > code that ends up in smp_call_function_* variants (or other functions
> > that cause IPIs to isolated CPUs).
>
> So you need to analyze every possible code path and interface and add
> your magic functions there after figuring out whether that's valid or
> not.

"A magic function", yes.

> > The alternative, from blocking this in the kernel, would be to validate all
> > userspace software involved in your application, to ensure it won't end
> > up in the kernel sending IPIs. Which is impractical, isnt it ?
>
> It's absolutely not impractical. It's part of proper system
> engineering. The wet dream that you can run random docker containers and
> everything works magically is just a wet dream.

Unfortunately that is what people do.

I understand that "full software review" would be the ideal, but in most
situations it does not seem to happen.

> > (or rather, with such option in the kernel, it would be possible to run
> > applications which have not been validated, since the kernel would fail
> > the operation that results in IPI to isolated CPU).
>
> That's a fallacy because you _cannot_ define with a single CPU mask
> which interface is valid in a particular configuration to end up with an
> IPI and which one is not. There are legitimate reasons in realtime or
> latency constraint systems to invoke selective functionality which
> interferes with the overall system constraints.
>
> How do you cover that with your magic CPU mask? You can't.
>
> Aside of that there is a decent chance that you are subtly breaking user
> space that way. Just look at that hwmon/coretemp commit you pointed to:
>
> "Temperature information from the housekeeping cores should be
> sufficient to infer die temperature."
>
> That's just wishful thinking for various reasons:
>
> - The die temperature on larger packages is not evenly distributed and
> you can run into situations where the housekeeping cores are sitting
> "far" enough away from the worker core which creates the heat spot

I know.

> - Some monitoring applications just stop to work when they can't read
> the full data set, which means that they break subtly and you can
> infer exactly nothing.
>
> > So the idea would be an additional "isolation mode", which when enabled,
> > would disallow the IPIs. Its still possible for root user to disable
> > this mode, and retry the operation.
> >
> > So lets say i want to read MSRs on a given CPU, as root.
> >
> > You'd have to:
> >
> > 1) readmsr on given CPU (returns -EPERM or whatever), since the
> > "block interference" mode is enabled for that CPU.
> >
> > 2) Disable that CPU in the block interference cpumask.
> >
> > 3) readmsr on the given CPU (success).
> >
> > 4) Re-enable CPU in block interference cpumask, if desired.
>
> That's just wrong. Why?
>
> Once you enable it just to read the MSR you enable the operation for
> _ALL_ other non-validated crap too. So while the single MSR read might
> be OK under certain circumstances the fact that you open up a window for
> all other interfaces to do far more interfering operations is a red
> flag.
>
> This whole thing is a really badly defined policy mechanism of very
> dubious value.
>
> Thanks,

OK, fair enough. From your comments, it seems that per-callsite
toggling would be ideal, for example:

A /sys/kernel/interference_blocking/ directory containing one
sub-directory per callsite.

Inside each sub-directory, an "enabled" file, controlling a boolean
to enable or disable interference blocking for that particular
callsite.

Also a "cpumask" file on each directory, by default containing the same
cpumask as the nohz_full CPUs, to control to which CPUs to block the
interference for.

How does that sound?