Re: [PATCH 1/2 v2] fdmap(2)

From: Alexey Dobriyan
Date: Tue Sep 26 2017 - 14:43:23 EST


On Sun, Sep 24, 2017 at 02:31:23PM -0700, Andy Lutomirski wrote:
> On Sun, Sep 24, 2017 at 1:06 PM, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
> > From: Aliaksandr Patseyenak <Aliaksandr_Patseyenak1@xxxxxxxx>
> >
> > Implement a system call for bulk retrieval of open descriptors
> > in binary form.
> >
> > Some daemons could use it to reliably close file descriptors
> > before starting. Currently they close everything up to some number,
> > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > (although lsof does so much in /proc that the effect is thoroughly buried).
> >
> > /proc, the only way to learn anything about file descriptors, may not be
> > available. There is also unavoidable overhead associated with instantiating
> > 3 dentries and 3 inodes and converting integers to strings and back.
> >
> > Benchmark:
> >
> > N=1<<22 times
> > 4 opened descriptors (0, 1, 2, 3)
> > opendir+readdir+closedir /proc/self/fd vs fdmap
> >
> > /proc 8.31 ± 0.37%
> > fdmap 0.32 ± 0.72%
>
> This doesn't have the semantic problem that pidmap does, but I still
> wonder why this can't be accomplished by adding a new file in /proc.

It can be done in /proc but the point of the exercise is to skip all the
overhead: in this case dcache, 1 descriptor for readdir, conversion
from binary to string.
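
For reference, a minimal sketch of the /proc route that the benchmark
measures, assuming /proc is mounted. list_open_fds() is a made-up helper
name for illustration, not something from the patch. It shows the extra
descriptor taken by opendir() and the string-to-integer conversion on
every entry.

```c
#include <dirent.h>
#include <stdlib.h>

/* Hypothetical helper: stores up to max descriptor numbers in fds[].
 * Returns the number stored, or -1 if /proc/self/fd cannot be opened.
 * The listing includes the descriptor that opendir() itself occupies. */
static int list_open_fds(int *fds, int max)
{
    DIR *dir = opendir("/proc/self/fd");    /* costs one descriptor */
    struct dirent *de;
    int n = 0;

    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(de->d_name, &end, 10);  /* string back to integer */

        if (end == de->d_name || *end != '\0')
            continue;                       /* skips "." and ".." */
        if (n < max)
            fds[n++] = (int)fd;
    }
    closedir(dir);
    return n;
}
```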

The problem is much deeper, namely, EIATF people force everyone else
to cater to Unix shells so that they can do read() on them because
Unix shells can't do system calls like real programming languages.
The only way to fix this problem is to ignore Unix shells and start
introducing binary system calls so that normal people aren't forced
to make their programs slower than necessary.

Example: lsof(1) does close() from 3 to 1023 inclusive on startup.
I don't know why but it does it. 1 syscall = 1 us, 1000 syscalls = 1 ms
wasted because all of them will return -EBADF normally. With fdmap(2),
lsof would do 2 fdmap() calls (1 real + 1 to confirm no more descriptors
are available) + 0 closes in the normal situation. That's 2 syscalls vs 1021.
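
A rough sketch of the blind close loop described above, to make the
syscall count concrete. This is not lsof's actual code; close_blindly()
is a hypothetical name, and the range is passed in by the caller.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: closes descriptors lo..hi inclusive, one close()
 * call each. Stores the number of calls made in *calls and returns how
 * many of them failed with EBADF, i.e. hit a descriptor that was not open. */
static int close_blindly(int lo, int hi, int *calls)
{
    int wasted = 0;

    *calls = 0;
    for (int fd = lo; fd <= hi; fd++) {
        (*calls)++;
        if (close(fd) == -1 && errno == EBADF)
            wasted++;
    }
    return wasted;
}
```

For the range 3 to 1023 this makes 1021 close() calls, and in a process
with only 0, 1 and 2 open every one of them returns EBADF.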

Obviously, for the binary model to work, fdmap(2) needs to be complemented
by other system calls, all of which will bypass /proc for, say, extracting
/proc/$PID/fd/$i symlink content and fdinfo. Currently, if you use
fdmap(2) you still have to fish in /proc for the rest of the data.
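
To illustrate what "fishing in /proc" means today, here is a small
sketch that reads the symlink target for one descriptor of the current
process. fd_target() is a hypothetical helper, and it assumes /proc is
mounted; note the integer-to-string step needed just to build the path.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copies the target of /proc/self/fd/<fd> into buf
 * as a NUL-terminated string. Returns its length, or -1 on error. */
static ssize_t fd_target(int fd, char *buf, size_t size)
{
    char path[64];
    ssize_t len;

    if (size == 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);  /* integer to string */
    len = readlink(path, buf, size - 1);    /* readlink() does not NUL-terminate */
    if (len < 0)
        return -1;
    buf[len] = '\0';
    return len;
}
```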