[PATCH] x86/apic: Fix circular locking dependency between console and hrtimer locks

From: Leon Romanovsky
Date: Tue Apr 07 2020 - 13:09:41 EST


From: Leon Romanovsky <leonro@xxxxxxxxxxxx>

clockevents_switch_state() calls printk() from under hrtimer_bases.lock.
That causes lock inversion on scheduler locks because printk() can call
into the scheduler. Lockdep puts it as:

[ 728.464312] ====================================================================================================
[ 735.312580] TSC deadline timer enabled
[ 735.324143]
[ 735.324146] ======================================================
[ 735.324148] WARNING: possible circular locking dependency detected
[ 735.324150] 5.6.0-for-upstream-dbg-2020-04-03_10-44-43-70 #1 Not tainted
[ 735.324152] ------------------------------------------------------
[ 735.324154] swapper/3/0 is trying to acquire lock:
[ 735.324155] ffffffff8442c858 ((console_sem).lock){-...}-{2:2}, at: down_trylock+0x13/0x70
[ 735.324162]
[ 735.324164] but task is already holding lock:
[ 735.324165] ffff88842dfb9958 (hrtimer_bases.lock){-.-.}-{2:2}, at: lock_hrtimer_base+0x71/0x120
[ 735.324171]
[ 735.324173] which lock already depends on the new lock.
[ 735.324174]
[ 735.324175]
[ 735.324177] the existing dependency chain (in reverse order) is:
[ 735.324179]
[ 735.324180] -> #4 (hrtimer_bases.lock){-.-.}-{2:2}:
[ 735.324186] _raw_spin_lock_irqsave+0x3c/0x4b
[ 735.324187] lock_hrtimer_base+0x71/0x120
[ 735.324189] hrtimer_start_range_ns+0xc6/0xaa0
[ 735.324191] __enqueue_rt_entity+0xc44/0xf50
[ 735.324192] enqueue_rt_entity+0x79/0xc0
[ 735.324194] enqueue_task_rt+0x5c/0x2e0
[ 735.324195] activate_task+0x15a/0x2c0
[ 735.324197] ttwu_do_activate+0xcf/0x120
[ 735.324199] sched_ttwu_pending+0x160/0x230
[ 735.324200] scheduler_ipi+0x1c0/0x530
[ 735.324202] reschedule_interrupt+0xf/0x20
[ 735.324204] tick_nohz_idle_enter+0x16c/0x250
[ 735.324205] do_idle+0x90/0x530
[ 735.324207] cpu_startup_entry+0x19/0x20
[ 735.324208] start_secondary+0x2ee/0x3e0
[ 735.324210] secondary_startup_64+0xa4/0xb0
[ 735.324211]
[ 735.324212] -> #3 (&rt_b->rt_runtime_lock){-...}-{2:2}:
[ 735.324218] _raw_spin_lock+0x25/0x30
[ 735.324219] rq_online_rt+0x288/0x550
[ 735.324221] set_rq_online+0x11f/0x190
[ 735.324223] sched_cpu_activate+0x1d4/0x390
[ 735.324225] cpuhp_invoke_callback+0x1c5/0x1560
[ 735.324226] cpuhp_thread_fun+0x3f8/0x6f0
[ 735.324228] smpboot_thread_fn+0x305/0x5f0
[ 735.324229] kthread+0x2f8/0x3b0
[ 735.324231] ret_from_fork+0x24/0x30
[ 735.324232]
[ 735.324233] -> #2 (&rq->lock){-.-.}-{2:2}:
[ 735.324238] _raw_spin_lock+0x25/0x30
[ 735.324240] task_fork_fair+0x34/0x430
[ 735.324241] sched_fork+0x48a/0xa60
[ 735.324243] copy_process+0x15df/0x5970
[ 735.324244] _do_fork+0x106/0xcd0
[ 735.324246] kernel_thread+0x9e/0xe0
[ 735.324247] rest_init+0x28/0x330
[ 735.324249] start_kernel+0x6ac/0x6ed
[ 735.324251] secondary_startup_64+0xa4/0xb0
[ 735.324252]
[ 735.324253] -> #1 (&p->pi_lock){-.-.}-{2:2}:
[ 735.324258] _raw_spin_lock_irqsave+0x3c/0x4b
[ 735.324260] try_to_wake_up+0x9a/0x1700
[ 735.324261] up+0x7a/0xb0
[ 735.324263] __up_console_sem+0x3c/0x70
[ 735.324264] console_unlock+0x4f4/0xab0
[ 735.324266] con_font_op+0x907/0x1010
[ 735.324267] vt_ioctl+0x10a6/0x2890
[ 735.324269] tty_ioctl+0x257/0x1240
[ 735.324270] ksys_ioctl+0x3e9/0x1190
[ 735.324272] __x64_sys_ioctl+0x6f/0xb0
[ 735.324273] do_syscall_64+0xe7/0x12c0
[ 735.324275] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 735.324276]
[ 735.324277] -> #0 ((console_sem).lock){-...}-{2:2}:
[ 735.324283] __lock_acquire+0x374a/0x5210
[ 735.324284] lock_acquire+0x1b9/0x920
[ 735.324286] _raw_spin_lock_irqsave+0x3c/0x4b
[ 735.324288] down_trylock+0x13/0x70
[ 735.324289] __down_trylock_console_sem+0x33/0xa0
[ 735.324291] console_trylock+0x13/0x60
[ 735.324292] vprintk_emit+0xec/0x370
[ 735.324294] printk+0x9c/0xc3
[ 735.324296] lapic_timer_set_oneshot+0x4e/0x60
[ 735.324297] clockevents_switch_state+0x1e1/0x360
[ 735.324299] tick_program_event+0xae/0xc0
[ 735.324301] hrtimer_start_range_ns+0x4b6/0xaa0
[ 735.324302] tick_nohz_idle_stop_tick+0x67c/0xa90
[ 735.324304] do_idle+0x326/0x530
[ 735.324305] cpu_startup_entry+0x19/0x20
[ 735.324307] start_secondary+0x2ee/0x3e0
[ 735.324309] secondary_startup_64+0xa4/0xb0
[ 735.324310]
[ 735.324311] other info that might help us debug this:
[ 735.324312]
[ 735.324314] Chain exists of:
[ 735.324315] (console_sem).lock --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock
[ 735.324322]
[ 735.324324] Possible unsafe locking scenario:
[ 735.324325]
[ 735.324327] CPU0 CPU1
[ 735.324328] ---- ----
[ 735.324329] lock(hrtimer_bases.lock);
[ 735.324333] lock(&rt_b->rt_runtime_lock);
[ 735.324337] lock(hrtimer_bases.lock);
[ 735.324341] lock((console_sem).lock);
[ 735.324344]
[ 735.324345] *** DEADLOCK ***
[ 735.324346]
[ 735.324348] 1 lock held by swapper/3/0:
[ 735.324349] #0: ffff88842dfb9958 (hrtimer_bases.lock){-.-.}-{2:2}, at: lock_hrtimer_base+0x71/0x120
[ 735.324356]
[ 735.324357] stack backtrace:
[ 735.324360] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-for-upstream-dbg-2020-04-03_10-44-43-70 #1
[ 735.324363] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 735.324364] Call Trace:
[ 735.324366] dump_stack+0xb7/0x10b
[ 735.324367] check_noncircular+0x37f/0x460
[ 735.324369] ? arch_stack_walk+0x7c/0xd0
[ 735.324370] ? print_circular_bug+0x4e0/0x4e0
[ 735.324372] ? mark_lock+0x1a4/0xb60
[ 735.324373] ? __lock_acquire+0x374a/0x5210
[ 735.324375] __lock_acquire+0x374a/0x5210
[ 735.324376] ? register_lock_class+0x17e0/0x17e0
[ 735.324378] ? register_lock_class+0x17e0/0x17e0
[ 735.324380] lock_acquire+0x1b9/0x920
[ 735.324381] ? down_trylock+0x13/0x70
[ 735.324383] ? check_flags.part.29+0x450/0x450
[ 735.324384] ? lock_downgrade+0x760/0x760
[ 735.324386] ? vprintk_emit+0xec/0x370
[ 735.324387] _raw_spin_lock_irqsave+0x3c/0x4b
[ 735.324389] ? down_trylock+0x13/0x70
[ 735.324390] down_trylock+0x13/0x70
[ 735.324392] __down_trylock_console_sem+0x33/0xa0
[ 735.324393] console_trylock+0x13/0x60
[ 735.324395] vprintk_emit+0xec/0x370
[ 735.324396] printk+0x9c/0xc3
[ 735.324398] ? kmsg_dump_rewind_nolock+0xd9/0xd9
[ 735.324399] lapic_timer_set_oneshot+0x4e/0x60
[ 735.324401] clockevents_switch_state+0x1e1/0x360
[ 735.324402] ? enqueue_hrtimer+0x116/0x310
[ 735.324404] tick_program_event+0xae/0xc0
[ 735.324406] hrtimer_start_range_ns+0x4b6/0xaa0
[ 735.324407] ? hrtimer_run_softirq+0x210/0x210
[ 735.324409] ? rcu_read_lock_sched_held+0xab/0xe0
[ 735.324410] ? rcu_read_lock_bh_held+0xe0/0xe0
[ 735.324412] tick_nohz_idle_stop_tick+0x67c/0xa90
[ 735.324413] ? tsc_verify_tsc_adjust+0x71/0x290
[ 735.324415] do_idle+0x326/0x530
[ 735.324416] ? arch_cpu_idle_exit+0x40/0x40
[ 735.324418] cpu_startup_entry+0x19/0x20
[ 735.324419] start_secondary+0x2ee/0x3e0
[ 735.324421] ? set_cpu_sibling_map+0x2f70/0x2f70
[ 735.324423] secondary_startup_64+0xa4/0xb0
[ 760.028504] ====================================================================================================

Fix by using deferred variant of printk which doesn't call to the scheduler.

Fixes: 279f1461432c ("x86: apic: Use tsc deadline for oneshot when available")
Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
---
It is far away from my main expertise and I'm not sure that the solution
is correct, but it definitely fixed our regression.
---
arch/x86/kernel/apic/apic.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d254cebdd3c3..6706b2cd9aec 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -353,7 +353,7 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
*/
asm volatile("mfence" : : : "memory");

- printk_once(KERN_DEBUG "TSC deadline timer enabled\n");
+ printk_deferred_once(KERN_DEBUG "TSC deadline timer enabled\n");
return;
}

--
2.25.1