Re: [PATCH] locking/osq_lock: fix a data race in osq_wait_next

From: Marco Elver
Date: Tue Jan 28 2020 - 03:18:52 EST


On Tue, 28 Jan 2020 at 04:13, Qian Cai <cai@xxxxxx> wrote:
>
> > On Jan 23, 2020, at 4:36 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Jan 22, 2020 at 11:38:51PM +0100, Marco Elver wrote:
> >
> >> If possible, decode and get the line numbers. I have observed a data
> >> race in osq_lock before, however, this is the only one I have recently
> >> seen in osq_lock:
> >>
> >> read to 0xffff88812c12d3d4 of 4 bytes by task 23304 on cpu 0:
> >> osq_lock+0x170/0x2f0 kernel/locking/osq_lock.c:143
> >>
> >>     while (!READ_ONCE(node->locked)) {
> >>             /*
> >>              * If we need to reschedule bail... so we can block.
> >>              * Use vcpu_is_preempted() to avoid waiting for a preempted
> >>              * lock holder:
> >>              */
> >> -->         if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
> >>                     goto unqueue;
> >>
> >>             cpu_relax();
> >>     }
> >>
> >> where
> >>
> >> static inline int node_cpu(struct optimistic_spin_node *node)
> >> {
> >> -->     return node->cpu - 1;
> >> }
> >>
> >>
> >> write to 0xffff88812c12d3d4 of 4 bytes by task 23334 on cpu 1:
> >> osq_lock+0x89/0x2f0 kernel/locking/osq_lock.c:99
> >>
> >> bool osq_lock(struct optimistic_spin_queue *lock)
> >> {
> >>         struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >>         struct optimistic_spin_node *prev, *next;
> >>         int curr = encode_cpu(smp_processor_id());
> >>         int old;
> >>
> >>         node->locked = 0;
> >>         node->next = NULL;
> >> -->     node->cpu = curr;
> >>
> >
> > Yeah, that's impossible. This store happens before the node is
> > published, so no matter how the load in node_cpu() is shattered, it must
> > observe the right value.
>
> Marco, any thoughts on how to do something about this? The worry is that
> too many false positives like this will render the tool useless as a
> general debug option.

This should be an instance of a same-value store: node->cpu lives in a
per-CPU node, and smp_processor_id() should always be the same for that
node, at least once it's published. I believe I observed the data race
here before KCSAN on syzbot had KCSAN_REPORT_VALUE_CHANGE_ONLY, and it
hasn't been observed since. For the most part, that option should deal
with this case.
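
To make the same-value-store point concrete, here is a minimal userspace
sketch of the idea behind a value-change-only filter. It is not KCSAN's
actual code: struct watch_model, should_report(), encode_cpu_model() and
osq_same_value_case() are hypothetical names made up for illustration, and
the "concurrent access" is simulated sequentially.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one watched 4-byte location; not KCSAN code. */
struct watch_model {
	int before;	/* value sampled when the watch window opens */
	int after;	/* value sampled when the watch window closes */
	bool hit;	/* did another access hit the watched location? */
};

/*
 * Report only if a concurrent access was seen and, when the
 * value-change-only filter is on, the value actually changed.
 */
static bool should_report(const struct watch_model *w, bool value_change_only)
{
	if (!w->hit)
		return false;
	if (value_change_only && w->before == w->after)
		return false;
	return true;
}

/* Models the inverse of node_cpu() quoted above (node->cpu - 1). */
static int encode_cpu_model(int cpu_nr)
{
	return cpu_nr + 1;
}

/*
 * Simulates the reported pair: a reader watches node->cpu of CPU 1's
 * per-CPU node while CPU 1 re-enters the lock path and rewrites the
 * field with the same encoded value.
 */
static bool osq_same_value_case(bool value_change_only)
{
	int node_cpu_field = encode_cpu_model(1);	/* left by an earlier acquisition */
	struct watch_model w = { .before = node_cpu_field, .hit = false };

	node_cpu_field = encode_cpu_model(1);		/* same-value store */
	w.hit = true;
	w.after = node_cpu_field;

	return should_report(&w, value_change_only);
}

/* Small self-check tying the model back to the report in this thread. */
static void run_example(void)
{
	assert(!osq_same_value_case(true));	/* filtered out */
	assert(osq_same_value_case(false));	/* would be reported */
}
```

With the filter on, the rewrite of node->cpu is suppressed because the
value is identical before and after; with it off, the same access pair
would be reported.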

I will reply separately to your other email about the other data race.

Thanks,
-- Marco