Re: RCU latency regression in 2.6.16-rc1

From: Paul E. McKenney
Date: Mon Jan 30 2006 - 04:59:38 EST


On Sun, Jan 29, 2006 at 09:52:44PM -0800, David S. Miller wrote:
> From: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> Date: Sun, 29 Jan 2006 21:11:56 -0800
>
> > > If the size is expanded by a 2 factor (or a power of too), can your
> > > proposal works ?
> >
> > Yep!!!
> >
> > Add the following:
>
> This all sounds very exciting and promising. I'll try to find some
> spare cycles tomorrow to cook something up.

Cool! My earlier description did not get rid of all the explicit
memory barriers. The following updated pseudocode does so.

I will of course be happy to review what you come up with!!!
(And I will be interested to see if this still makes sense after
sleeping on it...)

Improvements welcome as always!!!

Thanx, Paul

------------------------------------------------------------------------

Data Structures

o Yes, the names suck, but that is why I usually am happy to
have other people come up with names...

o Having separate fvbupd and fvbrdr removes the need for
(hopefully all) explicit memory barriers.

struct hashtbl {
int nbuckets;
int fvbupd; /* updater's first valid bucket. */
int fvbrdr; /* readers' first valid bucket. */
struct hash_param params;
struct list_head buckets[0];
};

/* Both fvbupd and fvbrdr need to be initialized to -1. */

struct hashtbl *current;
struct hashtbl *not_current;

Switch Pseudocode: assumes that the switcher can block, if not,
then need to make a state machine driven by call_rcu().

o Wait until at least one grace period has elapsed since the
previous switch. Easiest to just put "synchronize_rcu()"
up front, but can take advantage of naturally occurring
grace periods that elapsed since the last switch if desired.

o Initialize the not-current array, including the hash params.
If the new array is to be different in size, allocate the
new one and free the not-current array.

o Use rcu_assign_pointer() to point not_current to (you guessed
it!) the not-current array.

o Set current->fvbupd = current->fvbrdr = 0;

o Wait for a grace period (synchronize_rcu() if blocking OK,
otherwise call_rcu()...). The delay guarantees that readers
are aware that things might be disappearing before we actually
start making them disappear.

o Loop until all buckets of current are emptied:

o current->fvbrdr = current->fvbupd

No memory barriers needed because all buckets
preceding current->fvbupd were emptied at least
one grace period ago, so any readers that were
active when any of those buckets were non-empty
have now completed.

o Remove elements from current->bucket[current->fvbupd]
until the bucket is empty or until we use up our
alloted quota of time and/or elements.

o If we empty the bucket, we of course increment
current->fvbupd, and if there is time/elements
left, could continue on the next bucket.

o Wait for a grace period.

-Really- big arrays on really big SMP machines might want
to have multiple CPUs processing buckets in parallel. In theory,
this is not hard, but the fvbrdr update gets messier.

And yes, there needs to be some sort of update-side locking.
Per-bucket locking seems like it should work -- one would
hold the bucket lock for the current array throughout,
and acquire and release the bucket lock for the non-current
array on a per-element basis. One perhaps better way is to
simply remove the elements from current, and let the readers
insert them into not_current only upon cache miss. Getting
this right requires more insight into the networking code
than I currently have.

o Use rcu_assign_pointer() to set current to not_current.

o Note: leave not_current pointing to the now-current array
to allow slow readers to complete successfully. Alternatively,
one could wait for a grace period, then set non_current to
NULL (but does not seem worth it).

Table lookup (am assuming that this does rcu_read_unlock() before
returning, but that only makes sense if you have some sort of lock
on the element being returned -- if no such lock, the rcu_read_unlock()
primitives must be removed -- the caller must rcu_read_unlock() instead):

o rcu_read_lock();

o c = rcu_dereference(current);

o Compute hash based on c->params

o If hash >= c->fvbrdr, do lookup in c->bucket[hash]
If found, rcu_read_unlock() and return it.

o If c->fvbrdr is not -1, look the item up in the non-current
table:

o c = rcu_dereference(not_current);

o Compute hash based on c->params

o Do lookup in c->bucket[hash]

Note that c->fvbrdr -could- transition to zero here, but
if it does, we are guaranteed that nothing will actually
be removed from the current table until we complete.

And note that we need the non-current lookup even if the
switch is not populating the non-current array, since the
readers would be doing so upon cache miss, right?

o rcu_read_unlock()

o return what is found or complain appropriately if nothing found.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/