Re: [PATCH 1/2] torture: use for_each_present() loop in torture_online_all()

From: Paul E. McKenney
Date: Fri Nov 18 2022 - 18:43:25 EST


On Thu, Nov 17, 2022 at 07:06:37AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 17, 2022 at 07:30:32AM +0100, Sven Schnelle wrote:
> > Hi Paul,
> >
> > "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
> >
> > >> > Yes, rcutorture has lower-level checks for CPUs being hotplugged
> > >> > behind its back. Which might be sufficient. But this patch is in
> > >> > response to something bad happening if the CPU is also not present in
> > >> > the cpu_present_mask. Would that same bad thing happen if rcutorture saw
> > >> > the CPU in cpu_online_mask, but by the time it attempted to CPU-hotplug
> > >> > it, that CPU was gone not just from cpu_online_mask, but also from
> > >> > cpu_present_mask?
> > >> >
> > >> > Or are CPUs never removed from cpu_present_mask?
> > >>
> > >> In the current implementation CPUs can only be added to the
> > >> cpu_present_mask, but never removed. This might change in the future
> > >> when we get support from firmware for that, but the current s390 code
> > >> doesn't do that.
> > >
> > > Very good!
> > >
> > > Then could the patch please check that bits are never removed?
> > > That way the code will complain should firmware support be added.
> > >
> > > Thanx, Paul
> >
> > I'm not sure whether I fully understand that. If the CPU could
> > be removed from the system and the cpu_present_mask, that could
> > happen at any time. So I don't see how we should check for that?
>
> Well, that is my question to you. ;-)
>
> Suppose we have the following sequence of events:
>
> o rcutorture sees that CPU 5 is in cpu_present_mask, but offline.
>
> o rcutorture therefore decides to online CPU 5.
>
> o s390 firmware removes CPU 5, and s390 architecture code then
> clears it from the cpu_present_mask.
>
> o rcutorture proceeds with onlining CPU 5.
>
> Don't we then get the same problem that prompted you to change from
> cpu_possible_mask to cpu_present_mask? If not, why can't the rcutorture
> code continue to use cpu_possible_mask?
>
> If it really is bad to try to online or offline a CPU that is in
> cpu_possible_mask but not in cpu_present_mask, and if CPUs can be removed
> from cpu_present_mask, then we need some way to synchronize the removal
> of CPUs from cpu_present_mask. There are of course a lot of possible
> ways to do that synchronization, for example, protecting cpu_present_mask
> with a mutex or similar.
>
> Alternatively, s390 could restrict things. One way to do that would
> be to turn off rcutorture's use of CPU hotplug when running on s390,
> for example, by using the module parameters provided for that purpose.
> Another way to do that would be to refrain from removing CPUs from
> cpu_present_mask while rcutorture is running.
>
> Are there other approaches?

For the near term, why not have rcutorture keep a snapshot of
cpu_present_mask, and splat if a CPU is ever removed from that mask?

That would catch any issues, and defer any synchronization decisions to
a time at which we actually have some chance of knowing what is going on.
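
To make the snapshot idea concrete, here is a rough userspace sketch, not
actual rcutorture code. It models cpu_present_mask as a 64-bit word with
one bit per CPU. The names cpumask_model_t, present_snapshot, and
check_present_mask() are hypothetical and exist only for illustration.
A kernel version would presumably use the cpumask helpers (for example,
cpumask_andnot()) and a WARN-style splat instead of fprintf().

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace model: one bit per CPU, standing in for cpu_present_mask. */
typedef uint64_t cpumask_model_t;

/* Every CPU that has ever been observed as present. */
static cpumask_model_t present_snapshot;

/*
 * Compare the current present mask against the snapshot.
 * Complains about each CPU that was present earlier but is now missing,
 * folds any newly added CPUs into the snapshot, and returns the set of
 * removed CPUs (zero if bits were only added).
 */
static cpumask_model_t check_present_mask(cpumask_model_t cur)
{
	cpumask_model_t removed = present_snapshot & ~cur;

	for (int cpu = 0; cpu < 64; cpu++)
		if (removed & (UINT64_C(1) << cpu))
			fprintf(stderr, "WARNING: CPU %d removed from present mask\n", cpu);

	/* Additions are expected, so remember them without complaint. */
	present_snapshot |= cur;
	return removed;
}
```

The check would be called each time before attempting to online or
offline a CPU. Because the snapshot only ever grows, a removed CPU keeps
triggering the complaint until it reappears, which seems fine for a
debugging aid.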

Thanx, Paul