[RFC PATCH 0/3] DO NOT MERGE: Breaking down the experimental diff

From: K Prateek Nayak
Date: Thu Aug 31 2023 - 06:46:20 EST


Since the diff is a concoction of a bunch of things that somehow work,
this series tries to clean it up. I've dropped a bunch of things based
on David's suggestions [1], [2] and added some new logic on top that is
covered in Patch 3.

Breakdown is as follows:

- Patch 1 moves struct definition to sched.h

- Patch 2 is the above diff but more palatable with changes based on
David's comments.

- Patch 3 adds a bailout mechanism on top, since I saw the same amount
of regression with Patch 2 alone.

With these changes, the following are the results for tbench 128-clients:

tip : 1.00 (var: 1.00%)
tip + v3 + series till patch 2 : 0.41 (var: 1.15%) (diff: -58.81%)
tip + v3 + full series : 1.01 (var: 0.36%) (diff: +00.92%)

Disclaimer: All the testing was done hyper-focused on the tbench
128-clients case on a dual socket 3rd Generation EPYC system
(2 x 64C/128T). The series should apply cleanly on top of tip at commit
88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in
use") + v3 of the shared_runq series (this series).

The SHARED_RUNQ_SHARD_SZ was set to 16 throughout the testing since
that matches the sd_llc_size on the system.

P.S. I finally got to enabling lockdep and I saw the following splat
early during the boot but nothing after (so I think everything is
alright?):

================================
WARNING: inconsistent lock state
6.5.0-rc2-shared-wq-v3-fix+ #681 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
swapper/0/1 [HC0[0]:SC0[0]:HE1:SE1] takes:
ffff95f6bb24d818 (&rq->__lock){?.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x15/0x30
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0xcc/0x2c0
_raw_spin_lock_nested+0x2e/0x40
scheduler_tick+0x5c/0x350
update_process_times+0x83/0x90
tick_periodic+0x27/0xe0
tick_handle_periodic+0x24/0x70
timer_interrupt+0x18/0x30
__handle_irq_event_percpu+0x8b/0x240
handle_irq_event+0x38/0x80
handle_level_irq+0x90/0x170
__common_interrupt+0x4f/0x110
common_interrupt+0x7f/0xa0
asm_common_interrupt+0x26/0x40
__x86_return_thunk+0x0/0x40
console_flush_all+0x2e3/0x590
console_unlock+0x56/0x100
vprintk_emit+0x153/0x350
_printk+0x5c/0x80
apic_intr_mode_init+0x85/0x110
x86_late_time_init+0x24/0x40
start_kernel+0x5e1/0x7a0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0x92/0xa0
secondary_startup_64_no_verify+0x17e/0x18b
irq event stamp: 65081
hardirqs last enabled at (65081): [<ffffffff857723c1>] _raw_spin_unlock_irqrestore+0x31/0x60
hardirqs last disabled at (65080): [<ffffffff857720d3>] _raw_spin_lock_irqsave+0x63/0x70
softirqs last enabled at (64284): [<ffffffff848ccb7b>] __irq_exit_rcu+0x7b/0xa0
softirqs last disabled at (64269): [<ffffffff848ccb7b>] __irq_exit_rcu+0x7b/0xa0

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(&rq->__lock);
<Interrupt>
lock(&rq->__lock);

*** DEADLOCK ***

1 lock held by swapper/0/1:
#0: ffffffff8627eec8 (sched_domains_mutex){+.+.}-{4:4}, at: sched_init_smp+0x3f/0xd0

stack backtrace:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc2-shared-wq-v3-fix+ #681
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
Call Trace:
<TASK>
dump_stack_lvl+0x5c/0x90
mark_lock.part.0+0x755/0x930
? __lock_acquire+0x3e7/0x21d0
? __lock_acquire+0x2f0/0x21d0
__lock_acquire+0x3ab/0x21d0
? lock_is_held_type+0xaa/0x130
lock_acquire+0xcc/0x2c0
? raw_spin_rq_lock_nested+0x15/0x30
? free_percpu+0x245/0x4a0
_raw_spin_lock_nested+0x2e/0x40
? raw_spin_rq_lock_nested+0x15/0x30
raw_spin_rq_lock_nested+0x15/0x30
update_domains_fair+0xf1/0x220
sched_update_domains+0x32/0x50
sched_init_domains+0xd9/0x100
sched_init_smp+0x4b/0xd0
? stop_machine+0x32/0x40
kernel_init_freeable+0x2d3/0x540
? __pfx_kernel_init+0x10/0x10
kernel_init+0x1a/0x1c0
ret_from_fork+0x34/0x50
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1b/0x30
RIP: 0000:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>

References:

[1] https://lore.kernel.org/all/20230831013435.GB506447@maniforge/
[2] https://lore.kernel.org/all/20230831023254.GC506447@maniforge/

--
Thanks and Regards,
Prateek