Re: [patch 04/12] clockevent unbind: use smp_call_func_single_fail

From: Thomas Gleixner
Date: Sun Feb 11 2024 - 03:52:57 EST


On Wed, Feb 07 2024 at 09:51, Marcelo Tosatti wrote:
> On Wed, Feb 07, 2024 at 12:55:59PM +0100, Thomas Gleixner wrote:
>
> OK, so the problem is the following: due to software complexity, one is
> often not aware of all operations that might take place.

The problem is that people throw random crap on their systems and avoid
proper system engineering and then complain that their realtime
constraints are violated. So you are proliferating bad engineering
practices and encouraging people not to care.

> Now think of all possible paths, from userspace, that lead to kernel
> code that ends up in smp_call_function_* variants (or other functions
> that cause IPIs to isolated CPUs).

So you need to analyze every possible code path and interface and add
your magic functions there after figuring out whether that's valid or
not.

> The alternative to blocking this in the kernel would be to validate all
> userspace software involved in your application, to ensure it won't end
> up in the kernel sending IPIs. Which is impractical, isn't it?

It's absolutely not impractical. It's part of proper system
engineering. The wet dream that you can run random docker containers and
everything works magically is just a wet dream.

> (or rather, with such option in the kernel, it would be possible to run
> applications which have not been validated, since the kernel would fail
> the operation that results in IPI to isolated CPU).

That's a fallacy because you _cannot_ define with a single CPU mask
which interface is valid in a particular configuration to end up with an
IPI and which one is not. There are legitimate reasons in realtime or
latency constrained systems to invoke selective functionality which
interferes with the overall system constraints.

How do you cover that with your magic CPU mask? You can't.

Aside from that there is a decent chance that you are subtly breaking user
space that way. Just look at that hwmon/coretemp commit you pointed to:

"Temperature information from the housekeeping cores should be
sufficient to infer die temperature."

That's just wishful thinking for various reasons:

- The die temperature on larger packages is not evenly distributed and
you can run into situations where the housekeeping cores are sitting
"far" enough away from the worker core which creates the hot spot.

- Some monitoring applications just stop working when they can't read
the full data set, which means that they break subtly and you can
infer exactly nothing.

> So the idea would be an additional "isolation mode", which when enabled,
> would disallow the IPIs. It's still possible for the root user to disable
> this mode, and retry the operation.
>
> So let's say I want to read MSRs on a given CPU, as root.
>
> You'd have to:
>
> 1) readmsr on given CPU (returns -EPERM or whatever), since the
> "block interference" mode is enabled for that CPU.
>
> 2) Disable that CPU in the block interference cpumask.
>
> 3) readmsr on the given CPU (success).
>
> 4) Re-enable CPU in block interference cpumask, if desired.

That's just wrong. Why?

Once you enable it just to read the MSR you enable the operation for
_ALL_ other non-validated crap too. So while the single MSR read might
be OK under certain circumstances the fact that you open up a window for
all other interfaces to do far more interfering operations is a red
flag.
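
To make the window concrete, here is a minimal userspace sketch of the
four quoted steps. It is a toy model and not kernel code: the mask
variable and the functions model_read_msr(), model_heavy_ipi_op() and
window_demo() are hypothetical names invented for illustration, and the
model assumes CPU numbers below 64. The only point it shows is that a
single mask gates every interface identically, so clearing a bit for one
MSR read also admits any other caller for as long as the bit stays clear.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit N set means "block interference on CPU N". */
static uint64_t block_interference_mask;

static bool cpu_blocked(unsigned int cpu)
{
	return (block_interference_mask & (UINT64_C(1) << cpu)) != 0;
}

/* Stand-in for a cheap operation that might be acceptable (one MSR read). */
static int model_read_msr(unsigned int cpu)
{
	return cpu_blocked(cpu) ? -EPERM : 0;
}

/* Stand-in for any other, far more disruptive interface behind the same mask. */
static int model_heavy_ipi_op(unsigned int cpu)
{
	return cpu_blocked(cpu) ? -EPERM : 0;
}

/*
 * Walk through the four quoted steps for one CPU (cpu < 64) and return
 * what an unrelated, non-validated caller gets while the window is open.
 */
static int window_demo(unsigned int cpu)
{
	int heavy_in_window;

	block_interference_mask |= UINT64_C(1) << cpu;
	assert(model_read_msr(cpu) == -EPERM);             /* step 1: refused */

	block_interference_mask &= ~(UINT64_C(1) << cpu);  /* step 2: open up */
	assert(model_read_msr(cpu) == 0);                  /* step 3: succeeds */

	/* Anything else racing in at this point is let through as well. */
	heavy_in_window = model_heavy_ipi_op(cpu);

	block_interference_mask |= UINT64_C(1) << cpu;     /* step 4: re-enable */
	return heavy_in_window;
}
```

In this model window_demo() returns 0, i.e. the heavy operation succeeds
between steps 2 and 4, because the mask carries no information about
which interface is asking.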

This whole thing is a really badly defined policy mechanism of very
dubious value.

Thanks,

tglx