Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

From: Andy Lutomirski
Date: Mon Sep 18 2023 - 23:21:38 EST


On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
> On preempt_model_none() or preempt_model_voluntary() configurations
> rescheduling of kernel threads happens only when they allow it, and
> only at explicit preemption points, via calls to cond_resched() or
> similar.
>
> That leaves out contexts where it is not convenient to periodically
> call cond_resched() -- for instance when executing a potentially long
> running primitive (such as REP; STOSB.)
>

So I said this not too long ago in the context of Xen PV, but maybe it's time to ask it in general:

Why do we support anything other than full preempt? I can think of two reasons, neither of which I think is very good:

1. Once upon a time, tracking preempt state was expensive. But we fixed that.

2. Folklore suggests that there's a latency vs throughput tradeoff, and serious workloads, for some definition of serious, want throughput, so they should run without full preemption.

I think #2 is a bit silly. If you want throughput, and you're busy waiting for a CPU that wants to run you but isn't, because it's running some low-priority non-preemptible thing (because preempt is set to none or voluntary), you're not getting throughput. If you want to keep some I/O resource busy to get throughput, but you have excessive latency getting scheduled, you don't get throughput.

If the actual problem is that there's a workload that performs better when scheduling is delayed (which preempt=none and preempt=voluntary do, essentially at random), then maybe someone should identify that workload and fix the scheduler.

So maybe we should just very strongly encourage everyone to run with full preempt and simplify the kernel?