Re: [PATCH tip/core/rcu 0/3] rcu: resend of grace-period stall and cleanup patches

From: Paul E. McKenney
Date: Sun Nov 22 2009 - 12:42:07 EST


On Sun, Nov 22, 2009 at 12:05:42PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@xxxxxxxxxxxxxxxxxx) wrote:
> > Hello!
> >
> > This patch series is a resend of the three RCU patches that are candidates
> > for the upcoming 2.6.33 merge window, but that are not yet in -tip.
> > These are:
> >
> > 1. A fix for a grace-period-stall bug that occurs on large
> > machines.
> [...]
>
> Hi Paul,
>
> I was thinking about the last bugs you discovered. One characteristic
> they had in common was that they occur only on large machines (32+ or
> 64+ CPUs). This is caused by the fact that some of your code is only
> covered by tests when the number of CPUs goes over the architecture
> word size (in bits).
>
> I managed to cover this kind of scenario with smaller state-space in the
> LTTng formal models (but it also applies to kernel code) by tweaking the
> code, with bitmasks, to ensure that the number of bits the code uses is,
> e.g., no more than the minimum number of bits required. That way, you
> are sure to run into overflow scenarios either more quickly or, as in
> this case, on decently-sized hardware.
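
To illustrate the idea of shrinking the bit width, here is a minimal
sketch. It is not taken from LTTng or from the kernel: CTR_BITS,
ctr_next() and ctr_before() are hypothetical names chosen for this
example. The counter is masked down to 4 bits, so it wraps after 16
increments instead of 2^32, and a naive "a < b" ordering test goes
wrong almost immediately rather than after hours of running.

```c
#include <assert.h>

/* Hypothetical: the width a test build would shrink the counter to. */
#define CTR_BITS 4u
#define CTR_MASK ((1u << CTR_BITS) - 1u)

/* Advance the counter, wrapping at 2^CTR_BITS rather than at 2^32. */
unsigned int ctr_next(unsigned int c)
{
	assert((c & ~CTR_MASK) == 0);
	return (c + 1u) & CTR_MASK;
}

/*
 * Wrap-safe "a comes before b": true when the masked distance from
 * a to b is nonzero and less than half of the counter range.
 */
int ctr_before(unsigned int a, unsigned int b)
{
	unsigned int d = (b - a) & CTR_MASK;

	assert(((a | b) & ~CTR_MASK) == 0);
	return d != 0 && d < (1u << (CTR_BITS - 1));
}
```

With these definitions, ctr_before(15, 0) holds even though 15 < 0 does
not, which is exactly the kind of wraparound case that the narrow width
forces a test to hit early.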

You mean by setting CONFIG_RCU_FANOUT=2 in order to get three levels
of rcu_node hierarchy on an eight-CPU machine, which would otherwise
require more than 1024 CPUs on a 32-bit system or more than 4096 CPUs on
a 64-bit system? ;-)

http://paulmck.livejournal.com/14969.html
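
The arithmetic behind those numbers can be sketched as follows. This is
an illustration only, not the kernel's implementation: the helper name
rcu_levels_needed() is made up for this example, and it assumes that a
hierarchy with L levels and fanout F covers at most F^L CPUs, with the
default fanout taken to be the word size (32 or 64).

```c
#include <assert.h>

/*
 * Hypothetical helper, not kernel code: smallest number of levels
 * such that fanout^levels >= nr_cpus.  Intended for small inputs;
 * no attempt is made to guard against overflow of "capacity".
 */
unsigned int rcu_levels_needed(unsigned long nr_cpus, unsigned long fanout)
{
	unsigned int levels = 1;
	unsigned long capacity = fanout;

	assert(fanout >= 2 && nr_cpus >= 1);
	while (capacity < nr_cpus) {
		capacity *= fanout;
		levels++;
	}
	return levels;
}
```

Under these assumptions, rcu_levels_needed(8, 2) is 3, while a fanout
of 32 stays at two levels up to 32 * 32 = 1024 CPUs and a fanout of 64
stays at two levels up to 64 * 64 = 4096 CPUs.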

But yes, the largest machine I have access to has "only" 128 CPUs,
and it is often heavily used by others. So I heartily agree with
your point, which is that we should use various techniques to test
code on smaller machines in ways that larger machines will stress it.
Of course, my favorite such technique is differential profiling, which
allows performance results collected on small machines to reveal problems
that would only show up on large machines:

http://www.rdrop.com/users/paulmck/scalability/paper/profiling.2002.06.04.pdf

(This is a revision of a paper that appeared in the 1995 MASCOTS
conference and in the 1999 Software Practice & Experience journal.)
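
A rough sketch of the idea, for readers who have not seen the paper:
take two profiles of the same workload at different load levels and
look at how much each function's share changes, rather than at which
function is biggest in either profile. The struct, the function name
worst_scaling() and the ratio-based ranking below are assumptions made
for this example; the paper's exact formulation may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile record: one function, two runs. */
struct prof_entry {
	const char *func;
	double small;	/* percent of time in the lightly loaded run */
	double large;	/* percent of time in the heavily loaded run */
};

/*
 * Return the index of the entry whose share grew the most between
 * the two runs (largest large/small ratio), or -1 if no entry has a
 * positive "small" value to divide by.
 */
int worst_scaling(const struct prof_entry *p, size_t n)
{
	int worst = -1;
	double worst_ratio = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		double ratio;

		if (p[i].small <= 0.0)
			continue;	/* cannot form a ratio */
		ratio = p[i].large / p[i].small;
		if (worst < 0 || ratio > worst_ratio) {
			worst = (int)i;
			worst_ratio = ratio;
		}
	}
	return worst;
}
```

For example, given invented entries { "memcpy", 40.0, 38.0 },
{ "spin_lock", 2.0, 12.0 } and { "schedule", 10.0, 11.0 }, this picks
spin_lock, even though memcpy dominates both profiles, because
spin_lock is the entry whose share grows with load.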

Thanx, Paul