Re: Which reminds me...

From: Paul E. McKenney
Date: Wed Jan 03 2018 - 12:55:21 EST


On Wed, Jan 03, 2018 at 05:35:07AM +0000, Lihao Liang wrote:
> Hi Paul,
>
> Happy New Year!

And to you and yours as well!

> Do you mind my asking some RCU questions, please?

Not only do I not mind, I will also actually attempt to answer them. ;-)

> 1. I have noticed that __call_rcu_core() checks whether interrupts are
> disabled and if so it returns immediately:
> http://elixir.free-electrons.com/linux/v4.15-rc6/source/kernel/rcu/tree.c#L3039
>
> And it is only called in __call_rcu():
> http://elixir.free-electrons.com/linux/v4.15-rc6/source/kernel/rcu/tree.c#L3153
>
> My understanding is that if __call_rcu() is called with interrupts
> disabled, __call_rcu_core() will return without invoking RCU core. If
> so, why?

You are correct.

The reason is that the scheduler is permitted to invoke call_rcu() while
holding various interrupt-disabled scheduler locks. RCU uses interrupts
being disabled as a proxy for this situation. If RCU were to fail to
include this check, then such calls from the scheduler would result in
self-deadlock if RCU needed to wake up the grace-period kthread.

> 2. It appears you didn't use locks in __call_rcu() to protect the
> per-CPU callback list in rcu_data. If we consider the following
> scenario: call_rcu() is invoked by CPU 0 to add a callback to its
> callback list, at the same time CPU 1 is calling rcu_barrier() to
> register an rcu_barrier_callback in CPU 0's callback list. Would it be
> a race?

No, it is not.

The reason is that _rcu_barrier() invokes smp_call_function_single()
to cause the other CPU to enqueue the callback on its own callback list.

Can you tell me why _rcu_barrier() doesn't check for enqueuing the
callback on the current CPU?

> 3. I have performed some rcuperf experiments on rcu_ops and found the
> results varied a lot between different runs. Would setting dur=30*60
> in kvm.sh be suitable? Shall I make it longer?

Hmmm... In the past, I have gotten repeatable results with short
runs on the order of a few minutes.

But why not instead use the --duration argument to the kvm.sh script?
More convenient and easier to script.

Here are some reasons that your performance might vary:

1. You have varying competing load on the same hardware, perhaps
in some other guest OS.

2. Your system has a NUMA topology and different runs are spread
differently across the NUMA nodes.

3. You are seeing random effects of thermal throttling (which messes
up scalability measurements on my laptop big time). The usual
approach is to set the CPU clock frequency to a sufficiently
low value to avoid thermal throttling.

4. You have different memory or cache layouts on different
runs, so are sometimes seeing false sharing and sometimes not.

5. If you are running on some other RCU implementation, it might
have hysteresis effects. This can happen in a number of ways,
but one way is for the fastpath to sometimes always be taken
and other times almost never be taken. This is usually a tuning
problem.

For that matter, it has been some time since I did careful
rcuperf studies of Linux-kernel RCU, so perhaps I managed to
introduce some hysteresis. If so, please bisect!

Thanx, Paul

> Many thanks,
> Lihao.
>
> On Thu, Dec 21, 2017 at 4:05 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Thu, Dec 21, 2017 at 06:30:23AM +0000, Lihao Liang wrote:
> >> On Mon, Dec 18, 2017 at 5:13 PM, Paul E. McKenney
> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >> > On Fri, Dec 15, 2017 at 02:25:46AM +0000, Lihao Liang wrote:
> >
> > ...
> >
> >> > In practice, making the rcu_head structure larger will not make
> >> > RCU users at all happy.
> >>
> >> You're right...Adding an unsigned long long field to rcu_head
> >> structure would result in the following compilation error:
> >>
> >> call to '__compiletime_assert_255' declared with attribute error:
> >> BUILD_BUG_ON failed: offsetof(struct ds_entry, __refcnt) & 63
> >>
> >> I have to make my own rcu_cblist structure with additional fields...
> >
> > Or group the rcu_head structures by the grace-period number that they
> > are waiting for. ;-)
> >
> >> >> Btw, I have been reading your RCU hot-plug code and noticed the
> >> >> function rcu_cleanup_dead_cpu() which allows RCU to ignore a given
> >> >> CPU:
> >> >>
> >> >> http://elixir.free-electrons.com/linux/v3.10/source/kernel/rcutree.c#L1960
> >> >>
> >> >> Is it only for hot-plug operations? Can we use it to make RCU ignore
> >> >> an overloaded or faulty core in a given CPU?
> >> >
> >> > Not directly.
> >> >
> >> > If you need to ignore a faulty core/CPU, you should instead use a
> >> > CPU-hotplug operation to remove that core/CPU. This will invoke
> >> > rcu_cleanup_dead_cpu() along the way.
> >>
> >> Is it what is documented at
> >> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html?
> >
> > Yes. It is possible to use this at a higher level, as is done in
> > the online/offline code in kernel/torture.c
> >
> >> > Making RCU ignore an overloaded CPU is an excellent way to crash your
> >> > system, so please don't do that. And if a CPU became faulty while
> >> > executing in an RCU read-side critical section, things become very
> >> > difficult.
> >> >
> >> > Note also that RCU automatically ignores idle CPUs and CPUs executing
> >> > in usermode, not just offline CPUs. However, the ignoring of usermode
> >> > CPUs in CONFIG_NO_HZ_FULL CPUs requires each usermode CPU to execute
> >> > some RCU code.
> >>
> >> Just out of curiosity, does the organisation of CPUs in Tree RCU take
> >> into account the cpu topology of a NUMA machine?
> >
> > No.
> >
> > There have been quite a few requests for this, but no one has been willing
> > to actually run tests showing that it helps anything at the system level.
> > (My belief is that it does not help because all the common-case RCU
> > code avoids touching the combining tree, by design. Plus this would
> > require RCU to keep two sets of book because some architectures number
> > their CPUs in very odd ways. Plus I bet that they -did- run the tests,
> > saw no improvement, and didn't bother to tell me.)
> >
> > If you are using an architecture that numbers its CPUs consecutively,
> > so that each node of the NUMA machine contains the same number
> > of consecutively numbered CPUs, you can adjust the RCU_FANOUT and
> > RCU_FANOUT_LEAF Kconfig options to align the rcu_node boundaries to the
> > NUMA boundaries.
> >
> >> Btw, there is a typo in kernel/cpu.c: controll processor -> control
> >> processor (http://elixir.free-electrons.com/linux/v4.15-rc4/source/kernel/cpu.c#L1293)
> >>
> >> I can't find the maintainer of this part of code from
> >> https://www.kernel.org/doc/linux/MAINTAINERS. Shall I just send a
> >> patch to trivial@xxxxxxxxxx against the master branch of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial.git
> >
> > Run scripts/get_maintainer.pl with the patch as standard input, and it
> > will output a list of maintainers and mailing lists.
> >
> > Thanx, Paul
> >
> >> Many thanks,
> >> Lihao.
> >>
> >> >>
> >> >> >> >>
> >> >> >> >> >> > Are there any other Linux kernel test modules that only use
> >> >> >> >> >> > rcu_read_lock()/unlock() and synchronize_rcu() so that I can also play
> >> >> >> >> >> > with?
> >> >> >> >> >
> >> >> >> >> > The only ones intended to test RCU are rcutorture and rcuperf, although
> >> >> >> >> > pretty much any other Linux-kernel test suite will test RCU to at least
> >> >> >> >> > some extent.
> >> >> >> >> >
> >> >> >> >> >> >>> >> > >> Are sync and exp_sync tested very differently in these test cases?
> >> >> >> >> >> >>> >> > >
> >> >> >> >> >> >>> >> > > Not in current kernels, but the handling of exp_sync has changed
> >> >> >> >> >> >>> >> > > over time. But recall my earlier email -- there are other rcutorture
> >> >> >> >> >> >>> >> > > parameters that are set by default that will result in warnings, cbflood
> >> >> >> >> >> >>> >> > > and the like.
> >> >> >> >> >> >>> >> >
> >> >> >> >> >> >>> >> > But even with the same parameters, why sync complains but exp_sync
> >> >> >> >> >> >>> >> > doesn't, providing both use the same function?
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Exactly how did it complain? Something like this?
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> rcu_torture_writer: gp_normal without primitives
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Or something different?
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> As I want to test .sync, so I only set the torture parameter gp_sync
> >> >> >> >> >> >>> to true and provide an implementation to .sync. The NULL pointer
> >> >> >> >> >> >>> exception is caused by the following line of code as gp_normal is
> >> >> >> >> >> >>> false and .exp_sync is NULL.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> http://elixir.free-electrons.com/linux/v4.12.9/source/kernel/rcu/rcutorture.c#L1120
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Shall the above if statement be if (gp_sync) instead of if (gp_normal)?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Wait... If you are setting .sync but not .exp_sync, you would also need
> >> >> >> >> >> >> to set the rcutorture module parameters to prevent rcutorture from ever
> >> >> >> >> >> >> testing expedited grace periods. That should mean setting gp_normal but
> >> >> >> >> >> >> not gp_exp, which would mean that control would always go to line 1121.
> >> >> >> >> >> >> Or you could just make .exp_sync reference the same function as .sync.
> >> >> >> >> >> >>
> >> >> >> >> >> >> You could argue that this is a usability bug in rcutorture, and you would
> >> >> >> >> >> >> be right. If you would like, you could send me a patch that issued a
> >> >> >> >> >> >> warning if the gp_normal and gp_exp parameters didn't make sense given
> >> >> >> >> >> >> the rcu_torture_ops entry selected. This would add to the big "if"
> >> >> >> >> >> >> statement in rcu_torture_writer().
> >> >> >> >> >> >>
> >> >> >> >> >> >> Please note that such patches require a surprising amount of testing,
> >> >> >> >> >> >> given the rather large number of possible combinations of rcu_torture_ops
> >> >> >> >> >> >> fields and rcutorture module-parameter settings.
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> > If there are only three types wait primitives:
> >> >> >> >> >> >
> >> >> >> >> >> > gp_cond -> cond_sync
> >> >> >> >> >> > gp_exp ->exp_sync
> >> >> >> >> >> > gp_sync/gp_normal -> sync
> >> >> >> >> >> >
> >> >> >> >> >> > However, gp_normal is also related to deferred_free as indicated at
> >> >> >> >> >> > http://elixir.free-electrons.com/linux/v4.12.9/source/kernel/rcu/rcutorture.c#L1000
> >> >> >> >> >> >
> >> >> >> >> >> > This might be confusing to a new user of rcutorture (probably only to
> >> >> >> >> >> > me though...). It might be clearer to use gp_sync to test sync and
> >> >> >> >> >> > gp_normal (or a different name such as gp_deferred_free) to test
> >> >> >> >> >> > deferred_free only.
> >> >> >> >> >>
> >> >> >> >> >> So the new dependencies become:
> >> >> >> >> >>
> >> >> >> >> >> gp_cond -> cond_sync
> >> >> >> >> >> gp_exp ->exp_sync
> >> >> >> >> >> gp_sync -> sync
> >> >> >> >> >> gp_normal (or gp_deferred_free) -> deferred_free
> >> >> >> >> >
> >> >> >> >> > I might be willing to consider a patch to this effect, but it took me
> >> >> >> >> > a few rounds to get the current approach working naturally, so please
> >> >> >> >> > take the time to study it. I don't -think- there is any effect on the
> >> >> >> >> > rcutorture scripting, but it would be good to double-check.
> >> >> >> >> >
> >> >> >> >> > Thanx, Paul
> >> >> >> >> >
> >> >> >> >> >> Best,
> >> >> >> >> >> Lihao.
> >> >> >> >> >>
> >> >> >> >> >> > Am I missing something here?
> >> >> >> >> >> >
> >> >> >> >> >> > Many thanks,
> >> >> >> >> >> > Lihao.
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>