Re: [PATCH, RFC] v4 scalable classic RCU implementation

From: Andrew Morton
Date: Fri Sep 05 2008 - 19:54:21 EST


On Fri, 5 Sep 2008 16:04:11 -0700 "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> ...
>
> > > +#if (NR_CPUS) <= RCU_FANOUT
> > > +# define NUM_RCU_LVLS 1
> > > +# define NUM_RCU_LVL_0 1
> > > +# define NUM_RCU_LVL_1 (NR_CPUS)
> > > +# define NUM_RCU_LVL_2 0
> > > +# define NUM_RCU_LVL_3 0
> > > +#elif (NR_CPUS) <= RCU_FANOUT_SQ
> > > +# define NUM_RCU_LVLS 2
> > > +# define NUM_RCU_LVL_0 1
> > > +# define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
> > > +# define NUM_RCU_LVL_2 (NR_CPUS)
> > > +# define NUM_RCU_LVL_3 0
> > > +#elif (NR_CPUS) <= RCU_FANOUT_CUBE
> > > +# define NUM_RCU_LVLS 3
> > > +# define NUM_RCU_LVL_0 1
> > > +# define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
> > > +# define NUM_RCU_LVL_2 (((NR_CPUS) + (RCU_FANOUT) - 1) / (RCU_FANOUT))
> > > +# define NUM_RCU_LVL_3 NR_CPUS
> > > +#else
> > > +# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
> > > +#endif /* #if (NR_CPUS) <= RCU_FANOUT */
> >
> > Using NR_CPUS for anything at all is grossly, grossly inaccurate.
> > Vendors can and will ship kernels with NR_CPUS=1024 and their customers
> > can and will run those kernels on 4-cpu machines. Lots of customers.
> >
> > That's a two-and-a-half-order-of-magnitude inaccuracy. It makes all
> > your above work meaningless.
> >
> > To be useful, these decisions should be made at runtime.
>
> I agree in principle, but this case is an exception.
>
> Suppose that we have NR_CPUS=1024 on a 4-CPU 64-bit machine. Since 64^2
> is greater than 1024, we end up with a two-level hierarchy, with one
> rcu_node structure at the root and 16 rcu_node leaf structures, each
> of which takes up a single 128-byte cache line. There will be two such
> hierarchies in the system, one for rcu and one for rcu_bh.
>
> So I do not believe that this will be a problem.
>
> On laptops, this is even less of an issue -- NR_CPUS=8 on my laptop,
> which would reduce to a pair of rcu_node structures, one for rcu, the
> other for rcu_bh.

Is it likely that anyone will ship kernels with NR_CPUS=8? What are
distros presently using, and what will they be using 1-2 years hence?

> Making the decision at runtime would bloat the code by much more than the
> extra data consumed. And introduce yet more races between online/offline
> and everything else. Besides, this way I am being bloat-compatible
> with DEFINE_PER_CPU(). ;-)
>
> > > +#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
> > > +#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
> > > +
> > > +/*
> > > + * Definition for node within the RCU grace-period-detection hierarchy.
> > > + */
> > > +struct rcu_node {
> > > + spinlock_t lock;
> > > + unsigned long qsmask; /* CPUs or groups that need to switch in */
> > > + /* order for current grace period to proceed.*/
> > > + unsigned long qsmaskinit;
> > > + /* Per-GP initialization for qsmask. */
> > > + unsigned long grpmask; /* Mask to apply to parent qsmask. */
> > > + int grplo; /* lowest-numbered CPU or group here. */
> > > + int grphi; /* highest-numbered CPU or group here. */
> > > + u8 grpnum; /* CPU/group number for next level up. */
> > > + u8 level; /* root is at level 0. */
> > > + struct rcu_node *parent;
> > > +} ____cacheline_internodealigned_in_smp;
> >
> > So this is a 4096-byte structure on some setups.
>
> You lost me on this one. On a 64-bit system, I have 8 bytes each for
> lock, qsmask, qsmaskinit, grpmask, and parent, four bytes each for grplo
> and grphi, and another byte each for grpnum and level, for a total of
> 50 bytes for each struct rcu_node, which comes to a single cache line
> for most large systems. Given the default CONFIG_RCU_FANOUT=64 and
> NR_CPUS=4096, we have a two-level hierarchy with one root rcu_node
> structure and 64 leaf rcu_node structures. This gives a total of
> 65 cache lines.

____cacheline_internodealigned_in_smp will expand this structure to
4096 bytes on CONFIG_X86_VSMP=y.

> > It's a pretty big pill to swallow. Nice performance testing results
> > will help it to slide down.
>
> Well, the read side is easy -- exactly the same code sequence as for
> Classic RCU. On the update side, this is more of a bug fix for large
> numbers of CPUs, where unadorned Classic RCU is said to suffer terminal
> lock contention. I will see what I can come up with, but at the end of
> the day, this will need some testing on machines larger than the 128-CPU
> systems that I have access to.
>

OK, thanks. As it's effectively a bugfix, a full description of the
bug would grease some wheels.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/