RE: [Discussion v2] Usecases for the per-task latency-nice attribute

From: David Laight
Date: Wed Oct 02 2019 - 12:11:48 EST


From: Parth Shah
> Sent: 30 September 2019 11:44
...
> 5> Separating AVX512 tasks and latency sensitive tasks on separate cores
> ( -Tim Chen )
> ===========================================================================
> Another usecase we are considering is to segregate those workload that will
> pull down core cpu frequency (e.g. AVX512) from workload that are latency
> sensitive. There are certain tasks that need to provide a fast response
> time (latency sensitive) and they are best scheduled on cpu that has a
> lighter load and not have other tasks running on the sibling cpu that could
> pull down the cpu core frequency.
>
> Some users are running machine learning batch tasks with AVX512, and have
> observed that these tasks affect the tasks needing a fast response. They
> have to rely on manual CPU affinity to separate these tasks. With
> appropriate latency hint on task, the scheduler can be taught to separate them.

Has this been diagnosed properly?
I can't really see how the frequency drop from AVX512 significantly affects latency.
Most tasks that require low latency probably don't do a lot of work.
It is much more likely that the latency issues happen because the AVX512 tasks
are doing very few system calls so can't be pre-empted even by a high priority task.
This 'feature' is hinted by this:
> 2> TurboSched
> ( -Parth Shah )
> ====================
> TurboSched [2] tries to minimize the number of active cores in a socket by
> packing an un-important and low-utilization (named jitter) task on an
> already active core and thus refrains from waking up of a new core if
> possible.

Consider this example of a process that requires low latency (sub 1ms would be good):
- A hardware interrupt (or timer interrupt) wakes up on thread.
- When that thread wakes it wakes up other threads that are sleeping.
- All the threads 'beaver away' for a few ms (processing RTP and other audio).
- They all sleep for the rest of a 10ms period.

The affinities are set so each thread runs on a separate cpu, and all are SCHED_RR.
Now loop all the cpus in userspace (run: while :; do :; done) and see what happens to the latencies.
You really want the SCHED_RR threads to immediately pre-empt the running processes.
But I suspect nothing happens until a timer interrupt to the target cpu.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)