Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work on RT.

From: Marco Elver
Date: Wed Mar 13 2024 - 10:18:15 EST


On Wed, 13 Mar 2024 at 14:47, Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
>
> On Wed, Mar 13, 2024 at 10:28:44AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> > > One part I don't get: did you let it run or did you kill it?
>
> > If I let them run they will finish and exit, no exec_child remains.
>
> > If I instead try to stop the loop that goes on forking the 100 of them,
> > then the exec_child remain spinning.
>
> > > `exec_child' spins until a signal is received or the parent kills it. So
> > > it shouldn't remain there for ever. And my guess, that it is in spinning
> > > in userland and not in kernel.
>
> > Checking that now:
>
> tldr; the tight loop, full details at the end.
>
> 100.00  b6:   mov    signal_count,%eax
>               test   %eax,%eax
>             ↑ je     b6
>
> remove_on_exec.c
>
> /* For exec'd child. */
> static void exec_child(void)
> {
> 	struct sigaction action = {};
> 	const int val = 42;
>
> 	/* Set up sigtrap handler in case we erroneously receive a trap. */
> 	action.sa_flags = SA_SIGINFO | SA_NODEFER;
> 	action.sa_sigaction = sigtrap_handler;
> 	sigemptyset(&action.sa_mask);
> 	if (sigaction(SIGTRAP, &action, NULL))
> 		_exit((perror("sigaction failed"), 1));
>
> 	/* Signal parent that we're starting to spin. */
> 	if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
> 		_exit((perror("write failed"), 1));
>
> 	/* Should hang here until killed. */
> 	while (!signal_count);
> }
>
> So probably just a test needing to be a bit more polished?

Yes, possible.
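
As a rough illustration only (this is not what remove_on_exec.c does
today, and the helper name below is hypothetical), the spin could also
bail out once the original parent is gone, so an interrupted stress
loop would not leave children behind. It assumes the parent's pid is
known to the child before it starts spinning:

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of remove_on_exec.c.
 *
 * Spin until *count becomes non-zero (a signal handler bumped it) or
 * until getppid() no longer matches the pid we were started under,
 * which means the original parent exited and we were reparented.
 *
 * Returns 0 if a signal was counted, 1 if we noticed we were orphaned.
 */
static int spin_until_signal_or_orphaned(const volatile sig_atomic_t *count,
					 pid_t parent)
{
	while (!*count) {
		/* Reparented: the original parent is gone, stop spinning. */
		if (getppid() != parent)
			return 1;
	}
	return 0;
}
```

Note that this puts a syscall into every iteration of the loop, so it
changes what the child is doing while it spins. Whether that is
acceptable for this test is an open question.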

> Seems like it, on a newer machine, faster, I managed to reproduce it on
> a non-RT kernel, with one exec_child remaining:
>
>   1.44  b6:   mov    signal_count,%eax
>               test   %eax,%eax
>  98.56      ↑ je     b6

It's unclear to me why that happens. But I do recall seeing it before,
and my explanation was that with too many concurrent copies of the
test the system ran out of memory (maybe?), because the stress test
itself also spawns 30 parallel copies of the "exec_child" subprocess.
So with the 100 parallel copies we end up with 30 * 100 = 3000
processes. Maybe that's too much?

In any case, if the kernel didn't fall over during that kind of stress
testing, and the test itself passes when run as a single copy, then
I'd conclude all looks good.

This particular feature of perf, along with testing it, has once
before melted Peter's and my brain [1]. I hope your experience didn't
result in complete brain-melt. ;-)

[1] https://lore.kernel.org/all/Y0VofNVMBXPOJJr7@xxxxxxxxxxxxxxxx/

Thanks,
-- Marco