Re: [PATCH] fs: use __fput_sync in close(2)

From: Eric W. Biederman
Date: Tue Aug 08 2023 - 14:50:47 EST



Adding a couple more people.

Mateusz Guzik <mjguzik@xxxxxxxxx> writes:

> Making close(2) delegate fput finalization with task_work_add() runs
> into a slowdown (the atomics needed to do it), which is artificially
> worsened in the presence of rseq, which glibc blindly uses if present
> (and it is normally present) -- they added a user memory-touching
> handler into resume_user_mode_work(), where the thread leaving the
> kernel lands after issuing task_work_add() from fput(). Said touching
> requires a SMAP round-trip, which is quite expensive, and it always
> executes when landing in the resume routine.
>
> I'm going to write a separate e-mail about the rseq problem later, but
> even if it gets sorted out there is still perf to gain (or rather,
> overhead to avoid).
>
> Numbers are below in the proposed patch, but tl;dr: without CONFIG_RSEQ
> making things worse for the stock kernel, I see about a 7% increase in
> ops/s with open+close.
>
> Searching mailing lists for discussions explaining why close(2) was not
> already doing this, I found a patch with the easiest way out (calling
> __fput_sync() in filp_close()):
> https://lore.kernel.org/all/20150831120525.GA31015@xxxxxxxxxx/

What you need to search for is probably the opposite: why is
task_work_add() used in close(2)?

Taking a quick look at the history, it appears that fput was always
synchronous until a decade ago, when commit 4a9d4b024a31 ("switch fput
to task_work_add") was merged.

The next two commits, 3ffa3c0e3f6e ("aio: now fput() is OK from interrupt
context; get rid of manual delayed __fput()") and 6120d3dbb122 ("get rid
of ->scm_work_list"), seem to demonstrate why fput was made asynchronous.
They rely on the new fput behavior to break recursive calls and to allow
fput from any context. On top of that, as Al has pointed out, having any
lock held over fput can potentially cause a deadlock.
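
To spell out the last one, the dangerous pattern is roughly this
(made-up example, "subsys" is not any real caller):

	mutex_lock(&subsys->lock);
	/* ... */
	__fput_sync(file);	/* may run f_op->release() right here... */
	mutex_unlock(&subsys->lock);

	/*
	 * ...and if that ->release() (or anything it flushes or waits on)
	 * also wants subsys->lock, the caller deadlocks. With plain fput()
	 * the release is deferred via task_work and runs after the lock
	 * has been dropped, so the problem never materializes.
	 */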

All 3 issues taken together say that a synchronous fput is a loaded
foot-gun that must be used very carefully. That said, close(2) does seem
to be a reliably safe place to be synchronous.

The big question is: does your loop calling open then close going 7%
faster translate into any real-world improvements? How much does it
generalize?

Taking a look at close_fd(), it is used in autofs, cachefiles, and bpf,
among others. I think there is a very good argument that we can not say
filp_close() is always a safe place to call __fput_sync(). There is just
too much going on in some of those places. A particularly
dangerous-looking example is cachefiles_ondemand_daemon_read(), which
calls complete() after close_fd(). If, as Oleg suggested, filp_close()
started always calling __fput_sync(), that call to complete() looks like
a deadlock waiting to happen.
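
Schematically the hazard is this (shape of the problem only, not the
actual cachefiles code, and "req" is made up):

	close_fd(fd);		/* today: ->release() is deferred via task_work */
	complete(&req->done);	/* the waiter can proceed immediately */

	/*
	 * If close_fd()/filp_close() switched to __fput_sync(), ->release()
	 * would run before complete(). Should progress in ->release()
	 * depend, directly or indirectly, on the waiter of req->done
	 * having been woken, this never returns.
	 */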

I took a look at exit_files() to see if it would be a reasonable
candidate for this optimization. I see: make_task_dead() -> do_exit() ->
exit_files(). make_task_dead() is designed to be called from all kinds
of awkward places, so I don't see the sync variant being safe for
exit_files().

Which is a long way of saying that this only looks safe for close(2).


Are there any real-world gains if close(2) is the only place this
optimization can be applied? Is the added maintenance burden worth the
speedup?


I would not export any of your new _sync variants to modules. They are
all loaded foot-guns.

>
> There was no response to it though.
>
> From poking around, there are tons of filp_close() users (including
> calls from close_fd()) and it is unclear to me whether they are going
> to be fine with such a change.
>
> With the assumption that this is not going to work, I wrote my own
> patch which adds close_fd_sync() and filp_close_sync(). They are
> shipped as dedicated function entry points, but perhaps inlines which
> internally pass a flag to the underlying routine would be preferred?
> Also, adding __ in front would be in line with __fput_sync, but having
> __filp_close_sync call __filp_close looks weird to me.
>
> All that said, if the simpler patch by Oleg Nesterov works, then I'm
> happy to drop this one. I just would like to see this sorted out,
> whichever way.
>
> Thoughts?
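
For what it is worth, the inline-plus-flag shape you mention would
presumably look something like this (hypothetical sketch with a made-up
__filp_close() signature, not what your patch does):

	int __filp_close(struct file *filp, fl_owner_t id, bool sync);

	static inline int filp_close(struct file *filp, fl_owner_t id)
	{
		return __filp_close(filp, id, false);
	}

	static inline int filp_close_sync(struct file *filp, fl_owner_t id)
	{
		return __filp_close(filp, id, true);
	}

Naming aside, the same caveats about who is allowed to call the sync
variant would apply.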

Unless you can find some real-world performance gains, this looks like a
bad idea.

Eric