Re: [RESEND][PATCH v2] sched: Add trace_sched_waking() tracepoint to sched_ttwu_pending()

From: Vishal Chourasia
Date: Tue Mar 26 2024 - 02:20:48 EST


On Wed, Mar 20, 2024 at 03:42:31PM -0700, John Stultz wrote:
> On Tue, Mar 19, 2024 at 10:24 PM Vishal Chourasia <vishalc@xxxxxxxxxxxxx> wrote:
> > On Mon, Mar 18, 2024 at 12:28:33PM -0700, John Stultz wrote:
> > > Zimuzo reported seeing occasional cases in perfetto traces where
> > > tasks went from sleeping directly to trace_sched_wakeup()
> > > without always seeing a trace_sched_waking().
> > >
> > > Looking at the code, trace_sched_wakeup() is only called in
> > > ttwu_do_wakeup()
> > >
> > > The call paths that get you to ttwu_do_wakeup() are:
> > > try_to_wake_up() -> ttwu_do_wakeup()
> > > try_to_wake_up() -> ttwu_runnable() -> ttwu_do_wakeup()
> > > try_to_wake_up() -> ttwu_queue() -> ttwu_do_activate() -> ttwu_do_wakeup()
> > > sched_ttwu_pending() -> ttwu_do_activate() -> ttwu_do_wakeup()
> > Notably, sched_ttwu_pending() is invoked for remote wakeups.
> >
> > Given this, I anticipate a scenario similar to the following
> > occurred: When a process (P) is to be awakened on a remote CPU,
> > the scheduler adds P to the remote CPU's wakelist,a per-CPU queue,
> > and sends an IPI to the remote CPU. This action triggers
> > sched_ttwu_pending() on the remote CPU, which then processes the
> > wakelist and wakes up the queued processes.
>
> Hey! Thanks for taking a look here and sharing this feedback.
>
> Indeed, looking closer here I do have a hard time seeing how we can
> get to sched_ttwu_pending() without having gone through the paths in
> try_to_wake_up() that would call trace_sched_wakeup():
> try_to_wake_up()->[ttwu_queue ->]
> ttwu_queue_wakelist()->sched_ttwu_pending()->...
>
> > In this scenario, the "waking trace" of P, signifying the initiation
> > of the wake-up, is recorded on the CPU where try_to_wake_up was executed.
> > Meanwhile, the "wakeup trace," denoting the completion of the wake-up,
> > is observed on the remote CPU where sched_ttwu_pending() is executed.
> >
> > Is there a possibility that something other than the above occurred
> > in your case?
>
> I suspect that the case reported may be that the task went to sleep
> right after the trace_sched_waking(). Which could result in the
> transition from sleeping to trace_sched_wakeup() without the
> trace_sched_waking() inbetween. The added latency from the remote
> wakeup probably increases the chance for this race to occur.

Hi,

# trace-cmd record -e sched:* perf record -a -g -e sched:* ./schbench
This generates perf.data and trace.dat file.

Running a simple check for equality between sched_waking and
sched_wakeup. I see two different results.

# perf script -G | grep -i "sched_waking" | wc -l
35264
# perf script -G | grep -i "sched_wakeup:" | wc -l
35314

perf report suggests missing sched_waking events. Although trace
collected through ftrace (using trace-cmd) suggests equal waking and
wakeup events.

# trace-cmd report | grep -i "sched_waking" | wc -l
35959
# trace-cmd report | grep -i "sched_wakeup:" | wc -l
35959

Below is a stack trace for a sched_wakeup event for which sched_waking
was not recorded in perf.data file,

schbench 307346 [023] 600177.250150:
sched:sched_wakeup: perf:307321 [120] CPU:023
ttwu_do_activate+0x1c0 ([kernel.kallsyms])
print_fmt_xfs_perag_intents_class+0x0 ([xfs].data)
sched_ttwu_pending+0x118 ([kernel.kallsyms])
=> __flush_smp_call_function_queue+0x19c ([kernel.kallsyms])
smp_ipi_demux_relaxed+0xec ([kernel.kallsyms])
xive_muxed_ipi_action+0x20 ([kernel.kallsyms])
__handle_irq_event_percpu+0x8c ([kernel.kallsyms])
handle_irq_event_percpu+0x2c ([kernel.kallsyms])
handle_percpu_irq+0x80 ([kernel.kallsyms])
generic_handle_irq+0x50 ([kernel.kallsyms])
__do_irq+0xb8 ([kernel.kallsyms])
__do_IRQ+0x88 ([kernel.kallsyms])
print_fmt_xfs_perag_intents_class+0x0 ([xfs].data)
do_IRQ+0x50 ([kernel.kallsyms])
hardware_interrupt_common_virt+0x28c ([kernel.kallsyms])
worker_thread+0x268 (/usr/bin/schbench)
worker_thread+0x4cc (/usr/bin/schbench)
start_thread+0x170 (/usr/lib64/glibc-hwcaps/power10/libc.so.6)
__clone+0xa0 (/usr/lib64/glibc-hwcaps/power10/libc.so.6)

Also, it has been observed for cases where sched_waking is missing only
happens when perf task is woken up.

Looking at the code in __flush_smp_call_function_queue,
it invokes all the callbacks currently queue on the given cpu, and if
sched_ttwu_pending is being called, suggesting that csd of type CSD_TYPE_TTWU
was queued. Now, this happens inside try_to_wake_up.

if there are no other way to queue a csd of type CSD_TYPE_TTWU, then
this suggests some events are being missed by perf.

>
> So I'll withdraw this change for now and sync back up to check if my
> suspicions are correct or not.
>
> Thanks so much again for pointing this out.
> -john