Re: wait_for_unix_gc can cause CPU overload for well behaved programs

From: Ivan Babrou
Date: Fri Oct 20 2023 - 13:25:44 EST


On Fri, Oct 20, 2023 at 3:47 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> On Thu, 19 Oct 2023 15:35:01 -0700 Ivan Babrou <ivan@xxxxxxxxxxxxxx>
> > Hello,
> >
> > We have observed this issue twice (2019 and 2023): a well behaved
> > service that doesn't pass any file descriptors around starts to spend
> > a ton of CPU time in wait_for_unix_gc.
>
> See if the diff below works for you, which prevents concurrent spinning
> of unix_gc_lock, a variant of spin_trylock().
>
> Hillf
> --- x/net/unix/garbage.c
> +++ y/net/unix/garbage.c
> @@ -211,15 +211,10 @@ void unix_gc(void)
> struct list_head cursor;
> LIST_HEAD(not_cycle_list);
>
> + if (test_and_set_bit(0, &gc_in_progress))
> + return;
> spin_lock(&unix_gc_lock);
>
> - /* Avoid a recursive GC. */
> - if (gc_in_progress)
> - goto out;
> -
> - /* Paired with READ_ONCE() in wait_for_unix_gc(). */
> - WRITE_ONCE(gc_in_progress, true);
> -
> /* First, select candidates for garbage collection. Only
> * in-flight sockets are considered, and from those only ones
> * which don't have any external reference.
> --

This could solve wait_for_unix_gc spinning, but it wouldn't affect
unix_gc itself, from what I understand. There would always be one
socket writer or destroyer punished by running the gc still. My linked
repro code exercises that path rather than the waiting spinlock
(there's a single writer thread), so it's something you can see for
yourself.

Your patch doesn't build, so I wasn't able to try it out:

#26 154.3 /build/linux-source/net/unix/garbage.c: In function 'unix_gc':
#26 154.3 /build/linux-source/net/unix/garbage.c:214:33: error:
passing argument 2 of 'test_and_set_bit' from incompatible pointer
type [-Werror=incompatible-pointer-types]
#26 154.3 214 | if (test_and_set_bit(0, &gc_in_progress))
#26 154.3 | ^~~~~~~~~~~~~~~
#26 154.3 | |
#26 154.3 | bool * {aka _Bool *}
#26 154.3 In file included from
/build/linux-source/include/asm-generic/bitops/atomic.h:68,
#26 154.3 from
/build/linux-source/arch/arm64/include/asm/bitops.h:25,
#26 154.3 from /build/linux-source/include/linux/bitops.h:68,
#26 154.3 from /build/linux-source/include/linux/kernel.h:22,
#26 154.3 from /build/linux-source/net/unix/garbage.c:66:
#26 154.3 /build/linux-source/include/asm-generic/bitops/instrumented-atomic.h:68:79:
note: expected 'volatile long unsigned int *' but argument is of type
'bool *' {aka '_Bool *'}
#26 154.3 68 | static __always_inline bool test_and_set_bit(long
nr, volatile unsigned long *addr)
#26 154.3 |
~~~~~~~~~~~~~~~~~~~~~~~~^~~~
#26 154.3 /build/linux-source/net/unix/garbage.c:328:2: error: label
'out' defined but not used [-Werror=unused-label]
#26 154.3 328 | out:
#26 154.3 | ^~~
#26 154.3 cc1: all warnings being treated as errors