Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

From: Ankur Arora
Date: Mon Mar 11 2024 - 16:10:51 EST



Paul E. McKenney <paulmck@xxxxxxxxxx> writes:

> On Sun, Mar 10, 2024 at 09:50:33PM -0700, Ankur Arora wrote:
>>
>> Paul E. McKenney <paulmck@xxxxxxxxxx> writes:
>>
>> > On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>> >>
>> >> Paul E. McKenney <paulmck@xxxxxxxxxx> writes:
>> >>
>> >> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>> >> >>
>> >> >>
>> >> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> >> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> >> >> Hi Ankur,
>> >> >> >>
>> >> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >> >> >>>
>> >> >> >>> Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> writes:
>> >> >> >>>
>> >> >> >> [..]
>> >> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >> >> >>>> 'voluntary' business because
>> >> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >> >> >>>> is not about the old voluntary.
>> >> >> >>>
>> >> >> >>> What do you think about folding the higher scheduling class preemption logic
>> >> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >> >> >>> deadline task needs to be done for correctness.
>> >> >> >>>
>> >> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >> >> >>> there's a separate use for it.)
>> >> >> >>
>> >> >> >> Yes I am all in support for that. It's less confusing for the user as well, and
>> >> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> >> >> that is more aggressive, it could be added in the future.
>> >> >> >
>> >> >> > This would be something that happens only after removing the
>> >> >> > cond_resched() functionality from might_sleep(), correct?
>> >> >>
>> >> >> Firstly, maybe I misunderstood Ankur completely. Re-reading his comments above,
>> >> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> >> >> even for preempt=none mode, without having to wait till the next
>> >> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> >> >> to treat "higher class" any differently than "higher priority" for preempt=none.
>> >> >>
>> >> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> >> >> CONFIG_PREEMPT_NONE=y kernels, so no need for special treatment for higher class any
>> >> >> more than the treatment given to higher priority within same class. Ankur/Juri?
>> >> >>
>> >> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> >> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> >> >> cond_resched()/might_sleep() modifications?
>> >> >
>> >> > Because right now, one large difference between CONFIG_PREEMPT_NONE
>> >> > and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
>> >> > preemption point, but not for the former.
>> >>
>> >> True. But, there is no difference between either of those with
>> >> PREEMPT_AUTO=y (at least right now).
>> >>
>> >> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
>> >> might_sleep() is:
>> >>
>> >> # define might_resched() do { } while (0)
>> >> # define might_sleep() \
>> >> 	do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>> >>
>> >> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
>> >> DEBUG_ATOMIC_SLEEP=y):
>> >>
>> >> static inline int _cond_resched(void)
>> >> {
>> >> 	klp_sched_try_switch();
>> >> 	return 0;
>> >> }
>> >> #define cond_resched() ({			\
>> >> 	__might_resched(__FILE__, __LINE__, 0);	\
>> >> 	_cond_resched();			\
>> >> })
>> >>
>> >> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
>> >
>> > As long as it is easy to restore the prior cond_resched() functionality
>> > for testing in the meantime, I should be OK. For example, it would
>> > be great to have the commit removing the old functionality from
>> > cond_resched() at the end of the series,
>>
>> I would, of course, be happy to make any changes that help testing,
>> but I think I'm missing something that you are saying wrt
>> cond_resched()/might_sleep().
>>
>> There's no commit explicitly removing the core cond_resched()
>> functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and selects
>> out PREEMPTION_{NONE,VOLUNTARY}_BUILD.
>> (That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)
>>
>> For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work
>> and takes the (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION) path:
>>
>> #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
>> /* ... */
>> #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> /* ... */
>> #else /* !CONFIG_PREEMPTION */
>> /* ... */
>> #endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>>
>> #else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
>> static inline int _cond_resched(void)
>> {
>> 	klp_sched_try_switch();
>> 	return 0;
>> }
>> #endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
>>
>> Same for might_sleep() (which really amounts to might_resched()):
>>
>> #ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> /* ... */
>> #else
>> # define might_resched() do { } while (0)
>> #endif /* CONFIG_PREEMPT_* */
>>
>> But, I doubt that I'm telling you anything new. So, what am I missing?
>
> It is really a choice at your end.
>
> Suppose we enable CONFIG_PREEMPT_AUTO on our fleet, and find that there
> was some small set of cond_resched() calls that provided sub-jiffy
> preemption that matter to some of our workloads. At that point, what
> are our options?
>
> 1. Revert CONFIG_PREEMPT_AUTO.
>
> 2. Revert only the part that disables the voluntary preemption
> semantics of cond_resched(). Which, as you point out, ends up
> being the same as #1 above.
>
> 3. Hotwire a voluntary preemption into the required locations.
> Which we would avoid doing due to upstream-acceptance concerns.
>
> So, how easy would you like to make it for us to use as much of
> CONFIG_PREEMPT_AUTO=y under various possible problem scenarios?

Ah, I see your point. Basically, keep the lazy semantics but -- in
addition -- also provide the ability to dynamically toggle
cond_resched(), might_resched() as a feature to help move this along
further.
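
To make the idea concrete, here is a rough userspace model of what such a
toggle could look like. This is only a sketch, not kernel code and not
anything in the series: every model_* name is hypothetical, and the real
thing would presumably hang off the existing PREEMPT_DYNAMIC static
call/key machinery rather than a plain atomic flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names throughout: a userspace model, not kernel code. */
static atomic_bool model_cond_resched_enabled;	/* runtime toggle, off by default */
static atomic_bool model_need_resched;		/* stands in for the need-resched flag */
static int model_schedule_calls;		/* counts simulated reschedules */

/* Stand-in for the scheduler: clear the flag and record the call. */
static void model_schedule(void)
{
	atomic_store(&model_need_resched, false);
	model_schedule_calls++;
}

/* Flip the voluntary preemption points on or off at runtime. */
static void model_set_cond_resched(bool on)
{
	atomic_store(&model_cond_resched_enabled, on);
}

/*
 * With the toggle off this is a no-op, as with the stubbed-out
 * _cond_resched() quoted above. With the toggle on it reschedules
 * only if a reschedule is actually pending.
 */
static int model_cond_resched(void)
{
	if (!atomic_load(&model_cond_resched_enabled))
		return 0;
	if (!atomic_load(&model_need_resched))
		return 0;
	model_schedule();
	return 1;
}
```

In this model the lazy semantics stay the default: with the toggle off,
model_cond_resched() never reschedules, and flipping it on with
model_set_cond_resched(true) restores the voluntary preemption points
without rebuilding anything.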

So, as I mentioned earlier, the callsites are already present, and
removing them needs work (with livepatch and more generally to ensure
PREEMPT_AUTO is good enough for the current PREEMPT_* scenarios so
we can ditch cond_resched()).

I honestly don't see any reason not to do this -- I would prefer
this be a temporary thing to help beat PREEMPT_AUTO into shape. And,
this provides an insurance policy for using PREEMPT_AUTO.

That said, I would like Thomas' opinion on this.

> 3. Hotwire a voluntary preemption into the required locations.
> Which we would avoid doing due to upstream-acceptance concerns.

Apropos of this, how would you determine which are the locations
where we specifically need voluntary preemption?

> Yes, in a perfect world, we would have tested this already, but I
> am still chasing down problems induced by simple rcutorture testing.
> Cowardly of us, isn't it? ;-)

Cowards are us :).

--
ankur