Re: [GIT PULL rcu/next] RCU commits for 4.13

From: Paul E. McKenney
Date: Fri Jun 30 2017 - 00:02:57 EST


On Fri, Jun 30, 2017 at 10:51:26AM +0800, Boqun Feng wrote:
> On Thu, Jun 29, 2017 at 11:11:26AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 29, 2017 at 11:59:27AM -0400, Alan Stern wrote:
> > > On Thu, 29 Jun 2017, Will Deacon wrote:
> > >
> > > > [turns out I've not been on cc for this thread, but Jade pointed me to it
> > > > and I see my name came up at some point!]
> > > >
> > > > On Wed, Jun 28, 2017 at 05:05:46PM -0700, Linus Torvalds wrote:
> > > > > On Wed, Jun 28, 2017 at 4:54 PM, Paul E. McKenney
> > > > > <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Linus, are you dead-set against defining spin_unlock_wait() to be
> > > > > > spin_lock + spin_unlock? For example, is the current x86 implementation
> > > > > > of spin_unlock_wait() really a non-negotiable hard requirement? Or
> > > > > > would you be willing to live with the spin_lock + spin_unlock semantics?
> > > > >
> > > > > So I think the "same as spin_lock + spin_unlock" semantics are kind of insane.
> > > > >
> > > > > One of the issues is that the same as "spin_lock + spin_unlock" is
> > > > > basically now architecture-dependent. Is it really the
> > > > > architecture-dependent ordering you want to define this as?
> > > > >
> > > > > So I just think it's a *bad* definition. If somebody wants something
> > > > > that is exactly equivalent to spin_lock+spin_unlock, then dammit, just
> > > > > do *THAT*. It's completely pointless to me to define
> > > > > spin_unlock_wait() in those terms.
> > > > >
> > > > > And if it's not equivalent to the *architecture* behavior of
> > > > > > spin_lock+spin_unlock, then I think it should be described in terms
> > > > > > that aren't about the architecture implementation (so you shouldn't
> > > > > > describe it as "spin_lock+spin_unlock", you should describe it in
> > > > > > terms of memory barrier semantics).
> > > > >
> > > > > > And if we really have to use the spin_lock+spin_unlock semantics for
> > > > > this, then what is the advantage of spin_unlock_wait at all, if it
> > > > > doesn't fundamentally avoid some locking overhead of just taking the
> > > > > spinlock in the first place?
> > > >
> > > > Just on this point -- the arm64 code provides the same ordering semantics
> > > > as you would get from a lock;unlock sequence, but we can optimise that
> > > > when compared to an actual lock;unlock sequence because we don't need to
> > > > wait in turn for our ticket. I suspect something similar could be done
> > > > if/when we move to qspinlocks.
> > > >
> > > > Whether or not this is actually worth optimising is another question, but
> > > > it is worth noting that unlock_wait can be implemented more cheaply than
> > > > lock;unlock, whilst providing the same ordering guarantees (if that's
> > > > really what we want -- see my reply to Paul).
> > > >
> > > > Simplicity tends to be my preference, so ripping this out would suit me
> > > > best ;)
> > >
> > > It would be best to know:
> > >
> > > (1). How spin_unlock_wait() is currently being used.
> > >
> > > (2). What it was originally intended for.
> > >
> > > Paul has done some research into (1). He can correct me if I get this
> > > wrong... Only a few (i.e., around one or two) of the usages don't seem
> > > to require the full spin_lock+spin_unlock semantics. I go along with
> > > Linus; the places which really do want it to behave like
> > > spin_lock+spin_unlock should simply use spin_lock+spin_unlock. There
> > > hasn't been any indication so far that the possible efficiency
> > > improvement Will mentions is at all important.
> > >
> > > According to Paul, most of the other places don't need anything more
> > > than the acquire guarantee (any changes made in earlier critical
> > > sections will be visible to the code following spin_unlock_wait). In
> > > which case, the semantics of spin_unlock_wait could be redefined in
> > > this simpler form.
> > >
> > > Or we could literally replace all the existing definitions with
> > > spin_lock+spin_unlock. Would that be so terrible?
> >
> > And here they are...
> >
> > spin_unlock_wait():
> >
> > o drivers/ata/libata-eh.c ata_scsi_cmd_error_handler()
> > spin_unlock_wait(ap->lock) in else-clause where then-clause has
> > a full critical section for this same lock. This use case could
> > potentially require both acquire and release semantics. (I am
> > following up with the developers/maintainers, suggesting that
> > they convert to spin_lock+spin_unlock if they need release
> > semantics.)
> >
> > This is error-handling code, which should be rare, so
> > spin_lock+spin_unlock should work fine here. Probably shouldn't
> > have bugged the maintainer, but email already sent. :-/
> >
> > o ipc/sem.c exit_sem()
> > This use case appears to need to wait only on prior critical
> > sections, as the only way we get here is if the entry has already
> > been removed from the list. An acquire-only spin_unlock_wait()
> > works here. However, this is sem-exit code, which is not a
> > fastpath, and the race should be rare, so spin_lock+spin_unlock
> > should work fine here.
> >
> > o kernel/sched/completion.c completion_done()
> > This use case appears to need to wait only on prior critical
> > sections, as the only way we get past the "if" is when the lock is
> > held by complete(), and you are only supposed to invoke complete()
> > once on a given completion. An acquire-only spin_unlock_wait()
> > works here, but the race should be rare, so spin_lock+spin_unlock
> > should also work fine here.
> >
> > o net/netfilter/nf_conntrack_core.c nf_conntrack_lock()
> > This instance of spin_unlock_wait() interacts with
> > nf_conntrack_all_lock()'s instance of spin_unlock_wait().
> > Although nf_conntrack_all_lock() has an smp_mb(), which I
> > believe provides release semantics given current implementations,
> > nf_conntrack_lock() just has smp_rmb().
> >
> > I believe that the smp_rmb() needs to be smp_mb(). Am I missing
> > something here that makes the current code safe on x86?
> >
>
> Actually, I think the smp_rmb(), or even the smp_rmb() along with the
> spin_unlock_wait(), in nf_conntrack_lock() is not needed; we could
> implement nf_conntrack_lock() as:
>
>
> void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> {
> 	spin_lock(lock);
> 	while (unlikely(smp_load_acquire(&nf_conntrack_locks_all))) {
> 		spin_unlock(lock);
> 		cpu_relax();
> 		spin_lock(lock);
> 	}
> }
>
> because in nf_conntrack_all_unlock(), we have:
>
> smp_store_release(&nf_conntrack_locks_all, false);
> spin_unlock(&nf_conntrack_locks_all_lock);
>
> So if we exit the loop, which means we observed nf_conntrack_locks_all
> being false, we hold the per-bucket lock and observe everything before
> the smp_store_release(), which is the same as everything in the
> critical section of nf_conntrack_locks_all_lock. Otherwise, we observe
> nf_conntrack_locks_all being true, which means a global-lock critical
> section may be in progress, so we simply drop the per-bucket lock and
> test again a little later whether the global lock has been released.
>
> So I think spin_unlock_wait() in the nf_conntrack_lock() just requires
> acquire semantics, at least.
>
> Maybe I am missing something?

Or perhaps I was being too paranoid.

But does the same analysis work in the case where nf_conntrack_lock()
races with nf_conntrack_all_lock()?
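
To make the race in question concrete, here is a minimal userspace sketch
of both sides, using C11 atomics rather than kernel primitives. It assumes
that nf_conntrack_all_lock() momentarily acquires each per-bucket lock in
place of spin_unlock_wait(), and that nf_conntrack_lock() uses the loop
quoted above. Every model_* name and MODEL_NR_LOCKS is hypothetical, and
C11 seq_cst operations only approximate smp_mb(), so this is an
illustration, not a proof about the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_LOCKS 4	/* arbitrary bucket count for the model */

static atomic_bool model_locks[MODEL_NR_LOCKS];	/* per-bucket locks */
static atomic_bool model_all_lock_word;	/* nf_conntrack_locks_all_lock */
static atomic_bool model_locks_all;	/* nf_conntrack_locks_all */

/* Test-and-set spinlock: spin until the old value was "unlocked". */
static void model_spin_lock(atomic_bool *l)
{
	while (atomic_exchange_explicit(l, true, memory_order_acquire))
		;
}

static void model_spin_unlock(atomic_bool *l)
{
	atomic_store_explicit(l, false, memory_order_release);
}

/* Per-bucket side: retry for as long as a global locker is active. */
static void model_conntrack_lock(atomic_bool *l)
{
	model_spin_lock(l);
	while (atomic_load_explicit(&model_locks_all, memory_order_acquire)) {
		model_spin_unlock(l);
		/* the kernel version would cpu_relax() here */
		model_spin_lock(l);
	}
}

/* Global side: lock+unlock each bucket instead of spin_unlock_wait(). */
static void model_conntrack_all_lock(void)
{
	int i;

	model_spin_lock(&model_all_lock_word);
	/* seq_cst store, roughly standing in for the store plus smp_mb() */
	atomic_store(&model_locks_all, true);

	for (i = 0; i < MODEL_NR_LOCKS; i++) {
		model_spin_lock(&model_locks[i]);
		model_spin_unlock(&model_locks[i]);
	}
}

static void model_conntrack_all_unlock(void)
{
	assert(atomic_load(&model_locks_all));
	atomic_store_explicit(&model_locks_all, false, memory_order_release);
	model_spin_unlock(&model_all_lock_word);
}
```

In this model, at least, a per-bucket critical section either precedes
the global side's pass over that bucket (in which case the global side
waits for it), or follows it (in which case the acquire of the bucket
lock should make the earlier store to model_locks_all visible, and the
loop retries). Whether the real code has the same property is the
question above.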

> > I believe that this code could use spin_lock+spin_unlock without
> > significant performance penalties -- I do not believe that
> > nf_conntrack_locks_all_lock gets significant contention.
> >
> > raw_spin_unlock_wait() (Courtesy of Andrea Parri with added commentary):
> >
> > o kernel/exit.c do_exit()
> > Seems to rely on both acquire and release semantics. The
> > raw_spin_unlock_wait() primitive is preceded by a smp_mb().
> > But this is task exit doing spin_unlock_wait() on the task's
> > lock, so spin_lock+spin_unlock should work fine here.
> >
> > o kernel/sched/core.c do_task_dead()
> > Seems to rely on the acquire semantics only. The
> > raw_spin_unlock_wait() primitive is preceded by an inexplicable
> > smp_mb(). Again, this is task exit doing spin_unlock_wait() on
> > the task's lock, so spin_lock+spin_unlock should work fine here.
> >
> > o kernel/task_work.c task_work_run()
> > Seems to rely on the acquire semantics only. This is to handle
>
> I think this one needs the stronger semantics, the smp_mb() is just
> hidden in the cmpxchg() before the raw_spin_unlock_wait() ;-)
>
> cmpxchg() sets a special value to indicate that the task_work has been
> taken, and raw_spin_unlock_wait() must wait until the next critical
> section of ->pi_lock (in task_work_cancel()) can observe this; otherwise
> we may cancel a task_work while executing it.

But either way, replacing the spin_unlock_wait() with a spin_lock()
immediately followed by a spin_unlock() should work correctly, right?
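
For illustration, here is a rough userspace sketch of that replacement in
the task_work pattern, again using C11 atomics. It is heavily simplified:
the cancel side inspects only the head of the list, an atomic exchange
stands in for the kernel's cmpxchg() loop, and all model_* names are
hypothetical. Treat it as a sketch of the idea under those assumptions,
not as the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_work {
	struct model_work *next;
	void (*func)(struct model_work *);
};

static atomic_bool model_pi_lock;	/* models task->pi_lock */
static _Atomic(struct model_work *) model_works;	/* models task->task_works */
static int model_calls;	/* number of callbacks executed so far */

/* Example callback: just counts invocations. */
static void model_count(struct model_work *w)
{
	(void)w;
	model_calls++;
}

static void model_lock(atomic_bool *l)
{
	while (atomic_exchange_explicit(l, true, memory_order_acquire))
		;	/* spin until the previous value was "unlocked" */
}

static void model_unlock(atomic_bool *l)
{
	atomic_store_explicit(l, false, memory_order_release);
}

/* Push one work item onto the head of the list. */
static void model_work_add(struct model_work *w)
{
	assert(w != NULL);
	w->next = atomic_load(&model_works);
	while (!atomic_compare_exchange_weak(&model_works, &w->next, w))
		;	/* w->next now holds the current head; retry */
}

/* Cancel side: unlink the head, under the lock, if its callback matches. */
static struct model_work *model_work_cancel(void (*func)(struct model_work *))
{
	struct model_work *head, *found = NULL;

	model_lock(&model_pi_lock);
	head = atomic_load(&model_works);
	if (head && head->func == func) {
		struct model_work *expected = head;

		if (atomic_compare_exchange_strong(&model_works, &expected,
						   head->next))
			found = head;
	}
	model_unlock(&model_pi_lock);
	return found;
}

/* Run side: take the whole list, then wait out any in-flight cancel. */
static int model_work_run(void)
{
	struct model_work *w, *next;
	int ran = 0;

	/* seq_cst exchange, approximating the fully ordered cmpxchg() */
	w = atomic_exchange(&model_works, (struct model_work *)NULL);
	if (!w)
		return 0;

	/* In place of raw_spin_unlock_wait(): momentarily take the lock. */
	model_lock(&model_pi_lock);
	model_unlock(&model_pi_lock);

	for (; w; w = next) {
		next = w->next;
		w->func(w);
		ran++;
	}
	return ran;
}
```

With the lock taken momentarily, any cancel critical section either
completes before the run side proceeds or starts afterwards and finds the
list already emptied, which is the property Boqun describes.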

Thanx, Paul

> Regards,
> Boqun
> > a race with task_work_cancel(), which appears to be quite rare.
> > So the spin_lock+spin_unlock should work fine here.
> >
> > spin_lock()/spin_unlock():
> >
> > o ipc/sem.c complexmode_enter()
> > This used to be spin_unlock_wait(), but was changed to a
> > spin_lock()/spin_unlock() pair by 27d7be1801a4 ("ipc/sem.c:
> > avoid using spin_unlock_wait()").
> >
> > Looks to me like we really can drop spin_unlock_wait() in favor of
> > momentarily acquiring the lock. There are so few use cases that I don't
> > see a problem open-coding this. I will put together yet another patch
> > series for my spin_unlock_wait() collection of patch serieses. ;-)
> >
> > > As regards (2), I did a little digging. spin_unlock_wait was
> > > introduced in the 2.1.36 kernel, in mid-April 1997. I wasn't able to
> > > find a specific patch for it in the LKML archives. At the time it
> > > was used in only one place in the entire kernel (in kernel/exit.c):
> > >
> > > void release(struct task_struct * p)
> > > {
> > > 	int i;
> > >
> > > 	if (!p)
> > > 		return;
> > > 	if (p == current) {
> > > 		printk("task releasing itself\n");
> > > 		return;
> > > 	}
> > > 	for (i=1 ; i<NR_TASKS ; i++)
> > > 		if (task[i] == p) {
> > > #ifdef __SMP__
> > > 			/* FIXME! Cheesy, but kills the window... -DaveM */
> > > 			while(p->processor != NO_PROC_ID)
> > > 				barrier();
> > > 			spin_unlock_wait(&scheduler_lock);
> > > #endif
> > > 			nr_tasks--;
> > > 			task[i] = NULL;
> > > 			REMOVE_LINKS(p);
> > > 			release_thread(p);
> > > 			if (STACK_MAGIC != *(unsigned long *)p->kernel_stack_page)
> > > 				printk(KERN_ALERT "release: %s kernel stack corruption. Aiee\n", p->comm);
> > > 			free_kernel_stack(p->kernel_stack_page);
> > > 			current->cmin_flt += p->min_flt + p->cmin_flt;
> > > 			current->cmaj_flt += p->maj_flt + p->cmaj_flt;
> > > 			current->cnswap += p->nswap + p->cnswap;
> > > 			free_task_struct(p);
> > > 			return;
> > > 		}
> > > 	panic("trying to release non-existent task");
> > > }
> > >
> > > I'm not entirely clear on the point of this call. It looks like it
> > > wanted to wait until p was guaranteed not to be running on any
> > > processor ever again. (I don't see why it couldn't have just acquired
> > > the scheduler_lock -- was release() a particularly hot path?)
> > >
> > > Although it doesn't matter now, this would mean that the original
> > > semantics of spin_unlock_wait were different from what we are
> > > discussing. It apparently was meant to provide the release guarantee:
> > > any future critical sections would see the values that were visible
> > > before the call. Ironic.
> >
> > Cute!!! ;-)
> >
> > Thanx, Paul
> >