Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

From: Paul E. McKenney
Date: Thu Jul 06 2023 - 12:39:32 EST


On Thu, Jul 06, 2023 at 01:40:14PM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
> > >> + ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> > >> +
> > >> + preempt_enable();
> > >> + return ret;
> > >> +}
> > > [...]
> > >> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
> > >> */
> > >> static noinstr void ct_kernel_enter_state(int offset)
> > >> {
> > >> + struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> > >> int seq;
> > >> + unsigned int work;
> > >>
> > >> + work = ct_work_fetch(ct);
> > >
> > > So this adds another fully ordered operation on user <-> kernel transition.
> > > How many such IPIs can we expect?
> > >
> >
> > Despite having spent quite a lot of time on that question, I think I still
> > only have a hunch.
> >
> > Poking around RHEL systems, I'd say 99% of the problematic IPIs are
> > instruction patching and TLB flushes.
> >
> > Staring at the code, there are quite a lot of smp_calls for which it's hard
> > to say whether the target CPUs can actually be isolated or not (e.g. the
> > CPU comes from a cpumask shoved into a struct that was built using data
> > from another struct of uncertain origins), but then again some of them
> > don't need to hook into context_tracking.
> >
> > Long story short: I /think/ we can consider that number to be fairly small,
> > but there could be more lurking in the shadows.
>
> I guess there will still be time to reconsider the design if we ever reach
> such a size.
>
> > > If this is just about a dozen, can we stuff them in the state like in the
> > > following? We could potentially add more of them; on 64 bits especially,
> > > we could afford 30 different works. This just shrinks the RCU extended
> > > quiescent state counter space. The worst case that can happen is that RCU
> > > misses 65535 idle/user <-> kernel transitions and delays a grace period...
> > >
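To make the discussion concrete, here is a minimal sketch of that
bit-packing idea. This is purely illustrative and is not the patch that
was actually posted: the names (CT_WORK_MAX_BIT, CT_DYNTICKS_START,
ct_set_work()) are made up, and the 16-bit work / 16-bit counter split is
just one way to match the 65535-transition worst case mentioned above:

	/* Low bits of the state word carry deferred-work flags... */
	#define CT_WORK_MAX_BIT		16
	#define CT_WORK_MASK		(BIT(CT_WORK_MAX_BIT) - 1)
	/* ...and the RCU dynticks counter is shifted up above them. */
	#define CT_DYNTICKS_START	CT_WORK_MAX_BIT
	#define CT_DYNTICKS_IDX		BIT(CT_DYNTICKS_START)

	/* Flag deferred work; fails if the state word changed underneath us. */
	static bool ct_set_work(atomic_t *state, int work)
	{
		int old = atomic_read(state);

		return atomic_try_cmpxchg(state, &old,
					  old | (work & CT_WORK_MASK));
	}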
> >
> > I'm trying to grok how this impacts RCU. IIUC, RCU mostly cares about the
> > even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value,
> > but only to check whether it has changed over time
> > (rcu_dynticks_in_eqs_since() only does a !=).
> >
> > I'm rephrasing here to make sure I get it - is it then that the worst case
> > here is that 2^(dynticks_counter_size) transitions happen between saving
> > the dynticks snapshot and checking it again, so RCU waits some more?
>
> That's my understanding as well, but I have to defer to Paul to make sure
> I'm not overlooking something.

That does look plausible to me.

And yes, RCU really cares about whether its part of this counter has
been a multiple of two during a given interval of time, because this
indicates that the CPU has no pre-existing RCU readers still active.
One way that this can happen is for that value to be a multiple of two
at some point in time. The other way that this can happen is for the
value to have changed. No matter what the start and end values, if they
are different, the counter must necessarily have passed through at least
one multiple of two in the meantime, again guaranteeing that any RCU
readers that were around when the count was first fetched have now
finished.
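In code form, the check looks roughly like the following. This is a
simplified sketch, not the actual RCU implementation: in_eqs() and
eqs_since() are made-up names, and I am assuming the kernel's
RCU_DYNTICKS_IDX as the low bit of RCU's portion of the counter:

	/* CPU is in an extended quiescent state iff RCU's counter bit is clear. */
	static bool in_eqs(int ctr)
	{
		return !(ctr & RCU_DYNTICKS_IDX);
	}

	/* Has the CPU been through an EQS since the snapshot was taken? */
	static bool eqs_since(int snap, int now)
	{
		/*
		 * Either the snapshot itself was taken in an EQS, or the
		 * counter changed, which implies at least one pass through
		 * a multiple of two in the meantime.
		 */
		return in_eqs(snap) || now != snap;
	}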

But we should take the machine's opinions much more seriously than we
take any of our own opinions. Why not adjust RCU_DYNTICKS_IDX so as
to crank RCU's portion of this counter down to (say) two or three bits
and let rcutorture have at it on TREE04 or TREE07, both of which have
nohz_full CPUs?
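If it helps, the usual way to kick that off is something like the
following (duration and CPU count chosen arbitrarily):

	tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
		--configs "TREE04 TREE07" --duration 30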

Maybe also adjust mkinitrd.sh to make the user/kernel transitions more
frequent?

Please note that I do -not- recommend production use of a three-bit
(let alone a two-bit) RCU portion because this has a high probability
of excessively extending grace periods. But it might be good to keep
a tiny counter as a debug option so that we regularly rcutorture it.

Thanx, Paul