Re: [PATCH] sched/core: expand sched_getaffinity(2) to return number of CPUs

From: Florian Weimer
Date: Fri Apr 05 2019 - 06:16:45 EST


* Peter Zijlstra:

> On Wed, Apr 03, 2019 at 11:08:09PM +0300, Alexey Dobriyan wrote:
>> Currently there is no easy way to get the number of CPUs on the system.

The size of the affinity mask is related to the number of CPUs in the
system only in that the number of CPUs cannot be larger than the number
of bits in the affinity mask.

>> Glibc in particular shipped with 1024 CPUs support maximum at some point
>> which is quite surprising as glibc maintainers should know better.

This dates back to a time when the kernel was never going to support
more than 1024 CPUs.

A lot of distribution kernels still enforce a hard limit, which papers
over firmware bugs that tell the kernel that the system can be
hot-plugged to a ridiculous number of sockets/CPUs.

>> Another group dynamically grow buffer until cpumask fits. This is
>> inefficient as multiple system calls are done.
>>
>> Nobody seems to parse "/sys/devices/system/cpu/possible".
>> Even if someone does, parsing sysfs is much slower than necessary.
>
> True; but I suppose glibc already does lots of that anyway, right? It
> does contain the right information.

If I recall my last investigation correctly,
/sys/devices/system/cpu/possible does not reflect the size of the
affinity mask, either.
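
For reference, reading that file amounts to parsing a short "cpulist"
string such as "0-3,8-11". The following is a minimal sketch only; the
function names cpulist_upper_bound() and read_possible_cpus() are
hypothetical, and the result is merely the bound implied by the listed
CPU indices, which may not match the size of the affinity mask.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a cpulist such as "0-3,8-11" and return the
 * highest listed CPU index plus one, or -1 if the string is malformed.
 */
long cpulist_upper_bound(const char *list)
{
    long max = -1;
    const char *p = list;

    assert(list != NULL);
    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        long hi;

        if (end == p || lo < 0)
            return -1;
        hi = lo;
        p = end;
        if (*p == '-') {            /* a range "lo-hi" */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
            p = end;
        }
        if (hi > max)
            max = hi;
        if (*p == ',')
            p++;
        else if (*p != '\0' && *p != '\n')
            return -1;
    }
    return max < 0 ? -1 : max + 1;
}

/* Hypothetical helper: apply the parser to the sysfs file, -1 on error. */
long read_possible_cpus(void)
{
    char buf[4096];
    FILE *f = fopen("/sys/devices/system/cpu/possible", "r");
    long n = -1;

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        n = cpulist_upper_bound(buf);
    fclose(f);
    return n;
}
```

A caller would use read_possible_cpus() and treat -1 as "unknown", since
the file may be missing or unreadable (for example in a container).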

>> Patch overloads sched_getaffinity(len=0) to simply return "nr_cpu_ids".
>> This will make getting CPU mask require at most 2 system calls
>> and will eliminate unnecessary code.
>>
>> len=0 is chosen so that
>> * passing zeroes is the simplest thing
>>
>> syscall(__NR_sched_getaffinity, 0, 0, NULL)
>>
>> will simply do the right thing,
>>
>> * old kernels returned -EINVAL unconditionally.
>>
>> Note: glibc segfaults upon exiting from system call because it tries to
>> clear the rest of the buffer if return value is positive, so
>> applications will have to use syscall(3).
>> Good news is that it proves no one uses sched_getaffinity(pid, 0, NULL).

Given that old kernels fail with EINVAL, that evidence is fairly
restricted.

I'm not sure if it's a good idea to overload this interface. I expect
that users will want to call sched_getaffinity (the system call wrapper)
with cpusetsize == 0 to query the value, so there will be pressure on
glibc to remove the memset. At that point we have an API that obscurely
fails with old glibc versions, but succeeds with newer ones, which isn't
great.
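
To make the compatibility question concrete, here is a rough sketch of
what a caller might have to write if the proposed len == 0 behaviour
existed. It assumes the semantics described in the patch (a positive
return value equal to nr_cpu_ids), bypasses the glibc wrapper via
syscall(2) as the patch description suggests, and falls back to the
grow-the-buffer loop when the probe fails. The name affinity_mask_bits()
and the 1 MiB cap are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a number of bits large enough to hold the
 * calling thread's affinity mask, or -1 on failure.  On a kernel with the
 * proposed patch this would be nr_cpu_ids; otherwise it is only an upper
 * bound derived from the first buffer size the kernel accepts.
 */
long affinity_mask_bits(void)
{
    unsigned int bytes;
    long ret;

    /* Probe: assumed to return nr_cpu_ids only with the proposed patch. */
    ret = syscall(SYS_sched_getaffinity, 0, 0, NULL);
    if (ret > 0)
        return ret;

    /* Fallback for kernels that reject len == 0: double until it fits. */
    for (bytes = sizeof(unsigned long); bytes <= (1u << 20); bytes *= 2) {
        unsigned long *mask;

        /* The kernel requires len to be a multiple of sizeof(long). */
        assert(bytes % sizeof(unsigned long) == 0);
        mask = malloc(bytes);
        if (mask == NULL)
            return -1;
        /* The raw system call returns the number of bytes it copied. */
        ret = syscall(SYS_sched_getaffinity, 0, bytes, mask);
        free(mask);
        if (ret > 0)
            return ret * CHAR_BIT;
    }
    return -1;
}
```

Note that the two paths return values with different meanings (an exact
CPU count versus a rounded-up mask size), which is part of why
overloading the interface looks awkward.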

Thanks,
Florian