Re: [BUG] KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event

From: Mirsad Todorovac
Date: Fri Aug 18 2023 - 10:16:53 EST




On 8/17/23 21:32, Mirsad Todorovac wrote:
Hi,

This is your friendly bug reporter. :-)

The environment is the vanilla torvalds kernel on Ubuntu 22.04 LTS and a Ryzen 7950X assembled box.

[ 7145.500079] ==================================================================
[ 7145.500109] BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event

[ 7145.500139] write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26:
[ 7145.500155]  tick_nohz_idle_stop_tick+0x3b1/0x4a0
[ 7145.500173]  do_idle+0x1e3/0x250
[ 7145.500186]  cpu_startup_entry+0x20/0x30
[ 7145.500199]  start_secondary+0x129/0x160
[ 7145.500216]  secondary_startup_64_no_verify+0x17e/0x18b

[ 7145.500245] read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16:
[ 7145.500261]  tick_nohz_next_event+0xe7/0x1e0
[ 7145.500277]  tick_nohz_get_sleep_length+0xa7/0xe0
[ 7145.500293]  menu_select+0x82/0xb90
[ 7145.500309]  cpuidle_select+0x44/0x60
[ 7145.500324]  do_idle+0x1c2/0x250
[ 7145.500336]  cpu_startup_entry+0x20/0x30
[ 7145.500350]  start_secondary+0x129/0x160
[ 7145.500367]  secondary_startup_64_no_verify+0x17e/0x18b

[ 7145.500392] value changed: 0x0000001a -> 0xffffffff

[ 7145.500411] Reported by Kernel Concurrency Sanitizer on:
[ 7145.500421] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G             L     6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 7145.500439] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 7145.500449] ==================================================================

Please find dmesg log (what was available in the ring buffer) and the lshw output attached.

NOTE: There are no proprietary drivers loaded and Canonical livepatch isn't working for this kernel,
probably previous ACPI KCSAN BUGs turn this taint on.

Thank you very much for your time evaluating this report.

P.S.

According to Mr. Heo, I will add the decoded stacktrace to this bug report, which ought to have been done
in the first place :-/

[ 7145.500079] ==================================================================
[ 7145.500109] BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event

[ 7145.500139] write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26:
[ 7145.500155] tick_nohz_idle_stop_tick (kernel/time/tick-sched.c:904 kernel/time/tick-sched.c:1130)
[ 7145.500173] do_idle (kernel/sched/idle.c:215 kernel/sched/idle.c:282)
[ 7145.500186] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
[ 7145.500199] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
[ 7145.500216] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:441)

[ 7145.500245] read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16:
[ 7145.500261] tick_nohz_next_event (kernel/time/tick-sched.c:868)
[ 7145.500277] tick_nohz_get_sleep_length (kernel/time/tick-sched.c:1250)
[ 7145.500293] menu_select (drivers/cpuidle/governors/menu.c:283)
[ 7145.500309] cpuidle_select (drivers/cpuidle/cpuidle.c:359)
[ 7145.500324] do_idle (kernel/sched/idle.c:208 kernel/sched/idle.c:282)
[ 7145.500336] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
[ 7145.500350] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
[ 7145.500367] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:441)

[ 7145.500392] value changed: 0x0000001a -> 0xffffffff

[ 7145.500411] Reported by Kernel Concurrency Sanitizer on:
[ 7145.500421] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 7145.500439] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 7145.500449] ==================================================================

The code seems to be:


/*
* If this CPU is the one which updates jiffies, then give up
* the assignment and let it be taken by the CPU which runs
* the tick timer next, which might be this CPU as well. If we
* don't drop this here the jiffies might be stale and
* do_timer() never invoked. Keep track of the fact that it
* was the one which had the do_timer() duty last.
*/
if (cpu == tick_do_timer_cpu) {
tick_do_timer_cpu = TICK_DO_TIMER_NONE;
→ ts->do_timer_last = 1;
} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
ts->do_timer_last = 0;

}

but I am not sure why this happens.

Maybe a WRITE_ONCE(ts->do_timer_last, 1) is required?

Or is it another KCSAN false positive ...

Thank you very much for investigating.

Best regards,
Mirsad Todorovac