Re: per-cpu operation madness vs validation

From: Christoph Lameter
Date: Wed Jul 27 2011 - 10:16:23 EST


On Wed, 27 Jul 2011, Peter Zijlstra wrote:

> > On x86 these operations are rt safe by the pure fact that all these
> > operations are the same interrupt safe instruction.
>
> The number of instructions has nothing what so ever to do with things.

Of course it has. If it is one instruction then that instruction is by
design atomic vs. interrupts and preemption. No interrupt disabling or
preemption disabling is necessary to ensure that the read-modify-write
(RMW) operation is safe from interrupts or preemption.
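
For illustration, a minimal sketch (the variable and helper names are
made up, not existing kernel code):

#include <linux/percpu.h>

/* A per-cpu event counter that can be bumped from any context. */
static DEFINE_PER_CPU(unsigned long, my_event_count);

static inline void count_event(void)
{
	/*
	 * On x86 this compiles to a single "incl %gs:..." instruction,
	 * so the read-modify-write cannot be torn by an interrupt or by
	 * preemption; no preempt_disable()/local_irq_save() is needed.
	 */
	this_cpu_inc(my_event_count);
}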

> And while the individual ops might be preempt/irq/nmi safe, that doesn't
> say anything at all about the piece of code they're used in.

Sure, only the individual RMW operation is safe. If there are multiple
operations then the usual issues apply.

> And doesn't solve the larger issue of multiple per-cpu variables forming
> a consistent piece of data.

That is a different issue from single atomic operations, and AFAICT that
was what we were discussing here.

> Suppose there's two per-cpu variables, a and b, and we have an invariant
> that says that a-b := 5. Then the following code:
>
> preempt_disable();
> __this_cpu_inc(a);
> __this_cpu_inc(b);
> preempt_enable();
>
> is only correct when used from task context, an IRQ user can easily
> observe our invariant failing and none of your above validations will
> catch this.

Well, yes, these are multiple operations that have a sort of combined
atomicity requirement. You want both variables to be updated together and
therefore the individual RMW operations won't do the trick. The algorithm
would have to be changed to take advantage of the this_cpu ops, or you
will have to take the penalty of higher synchronization costs.
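
For example (a sketch only, assuming the double-word
this_cpu_cmpxchg_double() is available and using made-up names), the a/b
pair could be kept in two adjacent words and replaced in one shot, so the
invariant is never visible as broken:

#include <linux/percpu.h>

struct foo_pair {
	unsigned long a;
	unsigned long b;
} __aligned(2 * sizeof(unsigned long));	/* required by cmpxchg_double */

static DEFINE_PER_CPU(struct foo_pair, foo_pair);

static void inc_both(void)
{
	unsigned long old_a, old_b;

	do {
		old_a = __this_cpu_read(foo_pair.a);
		old_b = __this_cpu_read(foo_pair.b);
		/*
		 * Commit both counters in one shot. If an interrupt or a
		 * preempting task changed either word in the meantime,
		 * the compare fails and we simply retry.
		 */
	} while (!this_cpu_cmpxchg_double(foo_pair.a, foo_pair.b,
					  old_a, old_b,
					  old_a + 1, old_b + 1));
}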

> Now move to -rt where the above will likely end up looking something
> like:
>
> migrate_disable();
> __this_cpu_inc(a);
> __this_cpu_inc(b);
> migrate_enable();
>
> and everybody can see the invariant failing.

Obviously. The this_cpu ops etc. are not intended to address large
sections that access per-cpu data and want to make sure that others see
the changes in an atomic fashion.

> Now clearly my example is very artificial but not far fetched at all,

Sure, but these issues require a different way of structuring the
algorithm. That is why I have introduced the longer (double-word) cmpxchg
operations.

What you can do is what the slub fastpath does in order to allow updates
without caring about preemption or interrupts:

1. Get a reference to the current per-cpu data.
2. Obtain a transaction id that is cpu specific and operation specific.
3. Speculatively access an object and work with the information of that
   object. Do not change anything.
4. Calculate the next transaction id.
5. Replace the pointer and transaction id with the new object pointer and
   the new transaction id in a single cmpxchg. If there is no match,
   restart at 1.

Otherwise you are done. Any interrupt or preempting section that also
operates on the critical data would increment the transaction id and
cause a retry of the section.

So now you have per-cpu access that is fully immune to interrupt and
preemption issues.
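
In kernel-style code the pattern looks roughly like this (a sketch of the
idea only; the struct, the names and the simple tid + 1 step are made up,
the real thing lives in mm/slub.c):

#include <linux/percpu.h>

struct my_freelist {
	void *object;			/* head of a per-cpu object list */
	unsigned long tid;		/* cpu- and operation-specific id */
} __aligned(2 * sizeof(void *));

static DEFINE_PER_CPU(struct my_freelist, my_freelist);

static void *pop_object(void)
{
	void *object, *next;
	unsigned long tid;

	do {
		/* 1. + 2. Snapshot the per-cpu pointer and transaction id. */
		tid = __this_cpu_read(my_freelist.tid);
		object = __this_cpu_read(my_freelist.object);
		if (!object)
			return NULL;	/* slow path would go here */

		/* 3. Speculative work: read the object, change nothing. */
		next = *(void **)object;

		/*
		 * 4. + 5. Install the new pointer and the next tid in one
		 * cmpxchg. Any interrupt or preempting code that touched
		 * this data bumped the tid, so the compare fails and we
		 * restart from the top.
		 */
	} while (!this_cpu_cmpxchg_double(my_freelist.object, my_freelist.tid,
					  object, tid,
					  next, tid + 1));

	return object;
}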

> Hence my suggestion to do something like:
>
> struct foo {
> percpu_lock_t lock;
> int a;
> int b;
> }
>
> DEFINE_PER_CPU(struct foo, foo);
>
> percpu_lock(&foo.lock);
> __this_cpu_inc(foo.a);
> __this_cpu_inc(foo.b);
> percpu_unlock(&foo.lock);

This is too expensive for my taste. If you adopt the above transaction-id
scheme then you get RT performance on per-cpu sections that is
competitive with a non-RT kernel. The cost is that the per-cpu sections
have to be written differently, but there is a win for both RT and non-RT
kernels.

> Possibly we could reduce all this percpu madness back to one form
> (__this_cpu_*) and require that when used a lock of the percpu_lock_t is
> taken.

Per-cpu sections are very performance-critical components of the kernel.
Taking locks can be avoided with serialization through transaction ids,
for example.
