Re: arm64: allmodconfig: BUG: KCSAN: data-race in p9_client_cb / p9_client_rpc

From: Dominique Martinet
Date: Sat Dec 03 2022 - 10:37:34 EST


(reply out of order)

Naresh Kamboju wrote on Thu, Dec 01, 2022 at 01:13:25PM +0530:
> > (You might need to build with at least CONFIG_DEBUG_INFO_REDUCED (or not
> > reduced), but that is on by default for aarch64)
>
> Thanks for the suggestions.
> The Kconfig is enabled now.
> CONFIG_DEBUG_INFO_REDUCED=y

It looks enabled in your the config file you linked at, I don't
understand this remark?
Did you produce the trace the other day without it and rebuild the
kernel with it?
In this case you also have CONFIG_DEBUG_INFO_SPLIT set, so the vmlinux
file does not contain enough informations to retrieve line numbers or
types, and in particular addr2line cannot be used on the files you
provided.
I've never used split debug infos before, but digging old threads I'm
not too hopeful unless that changed:
https://lkml.iu.edu/hypermail/linux/kernel/1711.1/03393.html
https://sourceware.org/bugzilla/show_bug.cgi?id=22434

(...a test build later, it's still mostly useless...
normal build
$ ./scripts/faddr2line vmlinux __schedule+0x314
__schedule+0x314/0x6c0:
perf_fetch_caller_regs at include/linux/perf_event.h:1286
(inlined by) __perf_sw_event_sched at include/linux/perf_event.h:1307
(inlined by) perf_event_task_sched_out at include/linux/perf_event.h:1347
(inlined by) prepare_task_switch at kernel/sched/core.c:5053
(inlined by) context_switch at kernel/sched/core.c:5195
(inlined by) __schedule at kernel/sched/core.c:6561

split dwarf build
$ ./scripts/faddr2line vmlinux __schedule+0x314
aarch64-linux-gnu-addr2line: DWARF error: could not find abbrev number 860923
__schedule+0x314/0x780:
aarch64-linux-gnu-addr2line: DWARF error: could not find abbrev number 860923
__schedule at core.c:?

I'd tend to agree build time/space savings aren't worth the developer
time.
)

Anyway, address sanitizer used to have a kasan_symbolize.py script but
it looks like it got removed as no longer maintained, and I'm not sure
what's a good tool to just run these logs through nowadays, might want
to ask other test projects folks what they use...

> > If you still have the vmlinux binary from that build (or if you can
> > rebuild with the same options), running this text through addr2line
> > should not take you too long.
>
> Please find build artifacts in this link,
> - config
> - vmlinux
> - System.map
> https://people.linaro.org/~anders.roxell/next-20221130-allmodconfig-arm64-tuxmake-build/

So from the disassembly...

- p9_client_cb+0x84 is right before the wake_up and after the wmb(), so
I assume we're on writing req->status line 441:

---
p9_client_cb(...)
{
...
smp_wmb();
req->status = status;

wake_up(&req->wq);
---

report is about a write from 2 to 3, this makes sense we're going from
REQ_STATUS_SENT (2) to REQ_STATUS_RCVD (3).


- p9_client_rpc+0x1d0 isn't as simple to pin down as I'm having a hard
time making sense of the kcsan instrumentations...
The report is talking about a READ of 4 bytes at the same address, so
I'd expect to see an ccess to req->status (and we're likely spot on
wait_event_killable which checks req->status), but this doesn't seem to
match up with the assembly: here's the excerpt from disass around 0x1d0
= 464 (why doesn't gdb provide hex offsets..)
---
0xffff80000a46e9b8 <+440>: cmn w28, #0x200
0xffff80000a46e9bc <+444>: ccmn w28, #0xe, #0x4, ne // ne = any
0xffff80000a46e9c0 <+448>: b.eq 0xffff80000a46ecfc <p9_client_rpc+1276> // b.none
0xffff80000a46e9c4 <+452>: mov x0, x25
0xffff80000a46e9c8 <+456>: bl 0xffff800008543640 <__tsan_write4>
0xffff80000a46e9cc <+460>: mov w0, #0x2 // #2
0xffff80000a46e9d0 <+464>: str w0, [x21, #88]
0xffff80000a46e9d4 <+468>: b 0xffff80000a46ecfc <p9_client_rpc+1276>
0xffff80000a46e9d8 <+472>: mov w27, #0x1 // #1
0xffff80000a46e9dc <+476>: mov x0, x23
0xffff80000a46e9e0 <+480>: mov w1, #0x2bc // #700
0xffff80000a46e9e4 <+484>: bl 0xffff800008192d80 <__might_sleep>
---

+464 is a write to x21 (client 'c', from looking at how it is passed
into x0 for other function calls) at offset 88 (status field according
to dwarf infos from a rebuild with your config/same sources)

So, err, I'm a bit lost on this side.
But I can't really find a problem with what KCSAN complains about --
we are indeed accessing status from two threads without any locks.
Instead of a lock, we're using a barrier so that:
- recv thread/cb: writes to req stuff || write to req status
- p9_client_rpc: reads req status || reads other fields from req

Which has been working well enough (at least, without the barrier things
blow up quite fast).

So can I'll just consider this a false positive, but if someone knows
how much one can read into this that'd be appreciated.


Thanks,
--
Dominique