Re: strange interaction between fuse + pidns

From: Vivek Goyal
Date: Thu Jun 23 2022 - 17:55:55 EST


On Thu, Jun 23, 2022 at 11:21:25AM -0600, Tycho Andersen wrote:
> Hi all,
>
> I'm seeing some weird interactions with fuse and the pid namespace. I have a
> small reproducer here: https://github.com/tych0/kernel-utils/tree/master/fuse2
>
> fuse has the concept of "forcing" a request, which means (among other
> things) that it does an unkillable wait in request_wait_answer(). fuse
> flushes files when they are closed with this unkillable wait:
>
> $ sudo cat /proc/1544574/stack
> [<0>] request_wait_answer+0x12f/0x210
> [<0>] fuse_simple_request+0x109/0x2c0
> [<0>] fuse_flush+0x16f/0x1b0
> [<0>] filp_close+0x27/0x70
> [<0>] put_files_struct+0x6b/0xc0
> [<0>] do_exit+0x360/0xb80
> [<0>] do_group_exit+0x3a/0xa0
> [<0>] get_signal+0x140/0x870
> [<0>] arch_do_signal_or_restart+0xae/0x7c0
> [<0>] exit_to_user_mode_prepare+0x10f/0x1c0
> [<0>] syscall_exit_to_user_mode+0x26/0x40
> [<0>] do_syscall_64+0x46/0xb0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> Generally, this is OK, since the fuse_dev_release() -> fuse_abort_conn()
> wakes up this code when a fuse dev goes away (i.e. a fuse daemon is killed
> or unmounted or whatever). However, there's a problem when the fuse daemon
> itself spawns a thread that does a flush:

So in this case single process is client as well as server. IOW, one
thread is fuse server servicing fuse requests and other thread is fuse
client accessing fuse filesystem?

> since the thread has a copy of
> the fd table with an fd pointing to the same fuse device, the reference
> count isn't decremented to zero in fuse_dev_release(), and the task hangs
> forever.

So why did fuse server thread stop responding to fuse messages. Why
did it not complete flush.

Is it something to do with this init process dying in pid namespace
and it killed that fuse server thread. But it could not kill another
thread because it is in unkillable wait.

>
> Tasks can be aborted via fusectl's abort file, so all is not lost. However,
> this does wreak havoc in containers which mounted a fuse filesystem with
> this state. If the init pid exits (or crashes), the kernel tries to clean
> up the pidns:
>
> $ sudo cat /proc/1528591/stack
> [<0>] do_wait+0x156/0x2f0
> [<0>] kernel_wait4+0x8d/0x140
> [<0>] zap_pid_ns_processes+0x104/0x180
> [<0>] do_exit+0xa41/0xb80
> [<0>] do_group_exit+0x3a/0xa0
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] do_syscall_64+0x37/0xb0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> but hangs forever. This unkillable wait seems unfortunate, so I tried the
> obvious thing of changing it to a killable wait:

BTW, unkillable wait happens on ly fc->no_interrupt = 1. And this seems
to be set only if server probably some previous interrupt request
returned -ENOSYS.

fuse_dev_do_write() {
else if (oh.error == -ENOSYS)
fc->no_interrupt = 1;
}

So a simple workaround might be for server to implement support for
interrupting requests.

Having said that, this does sounds like a problem and probably should
be fixed at kernel level.

>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0e537e580dc1..c604dfcaec26 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
> spin_unlock(&fiq->lock);
> }
> WARN_ON(test_bit(FR_PENDING, &req->flags));
> - WARN_ON(test_bit(FR_SENT, &req->flags));
> if (test_bit(FR_BACKGROUND, &req->flags)) {
> spin_lock(&fc->bg_lock);
> clear_bit(FR_BACKGROUND, &req->flags);
> @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
> queue_interrupt(req);
> }
>
> - if (!test_bit(FR_FORCE, &req->flags)) {
> - /* Only fatal signals may interrupt this */
> - err = wait_event_killable(req->waitq,
> - test_bit(FR_FINISHED, &req->flags));
> - if (!err)
> - return;
> + /* Only fatal signals may interrupt this */
> + err = wait_event_killable(req->waitq,
> + test_bit(FR_FINISHED, &req->flags));

Trying to do a fatal signal killable wait sounds reasonable. But I am
not sure about the history.

- Why FORCE requests can't do killable wait.
- Why flush needs to have FORCE flag set.

> + if (!err)
> + return;
>
> - spin_lock(&fiq->lock);
> - /* Request is not yet in userspace, bail out */
> - if (test_bit(FR_PENDING, &req->flags)) {
> - list_del(&req->list);
> - spin_unlock(&fiq->lock);
> - __fuse_put_request(req);
> - req->out.h.error = -EINTR;
> - return;
> - }
> + spin_lock(&fiq->lock);
> + /* Request is not yet in userspace, bail out */
> + if (test_bit(FR_PENDING, &req->flags)) {
> + list_del(&req->list);
> spin_unlock(&fiq->lock);
> + __fuse_put_request(req);
> + req->out.h.error = -EINTR;
> + return;
> }
> + spin_unlock(&fiq->lock);
>
> /*
> - * Either request is already in userspace, or it was forced.
> - * Wait it out.
> + * Womp womp. We sent a request to userspace and now we're getting
> + * killed.
> */
> - wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> + set_bit(FR_INTERRUPTED, &req->flags);
> + /* matches barrier in fuse_dev_do_read() */
> + smp_mb__after_atomic();
> + /* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> + WARN_ON(!test_bit(FR_SENT, &req->flags));
> + queue_interrupt(req);
> }
>
> static void __fuse_request_send(struct fuse_req *req)
>
> avaialble as a full patch here:
> https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
>
> but now things are even weirder. Tasks are stuck at the killable wait, but with
> a SIGKILL pending for the thread group.

That's strange. No idea what's going on.

Thanks
Vivek
>
> root@(none):/# cat /proc/187/stack
> [<0>] fuse_simple_request+0x8d9/0x10f0 [fuse]
> [<0>] fuse_flush+0x42f/0x630 [fuse]
> [<0>] filp_close+0x96/0x120
> [<0>] put_files_struct+0x15c/0x2c0
> [<0>] do_exit+0xa00/0x2450
> [<0>] do_group_exit+0xb2/0x2a0
> [<0>] __x64_sys_exit_group+0x35/0x40
> [<0>] do_syscall_64+0x40/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> root@(none):/# cat /proc/187/status
> Name: main
> Umask: 0022
> State: S (sleeping)
> Tgid: 187
> Ngid: 0
> Pid: 187
> PPid: 185
> TracerPid: 0
> Uid: 0 0 0 0
> Gid: 0 0 0 0
> FDSize: 0
> Groups:
> NStgid: 187 3
> NSpid: 187 3
> NSpgid: 171 0
> NSsid: 160 0
> Threads: 1
> SigQ: 0/6706
> SigPnd: 0000000000000000
> ShdPnd: 0000000000000100
> SigBlk: 0000000000000000
> SigIgn: 0000000180004002
> SigCgt: 0000000000000000
> CapInh: 0000000000000000
> CapPrm: 000001ffffffffff
> CapEff: 000001ffffffffff
> CapBnd: 000001ffffffffff
> CapAmb: 0000000000000000
> NoNewPrivs: 0
> Seccomp: 0
> Seccomp_filters: 0
> Speculation_Store_Bypass: thread vulnerable
> SpeculationIndirectBranch: conditional enabled
> Cpus_allowed: f
> Cpus_allowed_list: 0-3
> Mems_allowed: 00000000,00000001
> Mems_allowed_list: 0
> voluntary_ctxt_switches: 6
> nonvoluntary_ctxt_switches: 1
>
> Any ideas what's going on here? It also seems I'm not the first person to
> wonder about this:
> https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAMp4zn9zTA_A2GJiYo5AD9V5HpeXbzzMP%3DnF0WtwbxRbV3koNA%40mail.gmail.com/#msg36598753
>
> Thanks,
>
> Tycho
>