Re: [RFC 0/2] srcu: Remove pre-flip memory barrier

From: Mathieu Desnoyers
Date: Tue Dec 20 2022 - 12:00:43 EST


On 2022-12-19 20:04, Joel Fernandes wrote:
On Mon, Dec 19, 2022 at 7:55 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:

On Sun, Dec 18, 2022 at 8:49 PM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
[...]

On 2022-12-18 14:13, Joel Fernandes (Google) wrote:
[...]


I think this is a bit dangerous. Yes, there is the preemption thing you
mentioned above, but that is bounded, since only a fixed number of
tasks can undergo that preemption, and it is quite rare in the sense
that each such reader has to be preempted just after sampling idx but
before incrementing the lock count.

However, if we scan while new readers appear (outside of the above
preemption problem), we can have a counter wrap causing a false match
much more quickly.
The scan loop is:

check_readers(idx) {
	count_all_unlocks(idx);
	smp_mb();
	count_all_locks(idx);
	bool done = (locks == unlocks);
	if (done) {
		// readers are done, end scan for this idx.
	} else {
		// try again later
	}
}

So if check_readers() got preempted just after the smp_mb(), then you
can have lots of tasks enter and exit the read-side critical section
and increment the locks count. Eventually the locks count wraps
around, locks == unlocks holds by accident, and the scan wrongly
concludes the readers are done. Sure, this is also theoretical, but
yeah, that issue can be made "worse" by deliberately scanning active
readers, especially when such readers can also nest arbitrarily.

As a result, flipping between periods 0/1 is just relevant for forward
progress, not for correctness.

Sure, agreed, forward progress.

Adding to the last statement "But also correctness as described above".

Exactly how many entries/exits of the read-side critical section while
the grace period is preempted are needed to trigger this?

It depends on how many readers are active during the preemption of the
scan code. Say the preemption happened after the per-CPU unlock counts
were totalled. Then AFAICS, if there are N active readers that the
grace period needs to wait for, you need (2^sizeof(int) - N)
lock+unlock pairs to happen.

Sorry, here I meant (2^(sizeof(unsigned long) * 8) - N) which is 2^32
on 32-bit systems.

And 2^64 on 64-bit systems.


thanks,

- Joel


On a 64-bit system, where 64-bit counters are used, AFAIU this needs to
be exactly 2^64 read-side critical sections.

Yes, but what about 32-bit systems?

The overflow indeed happens after 2^32 increments, just like seqlock.
The question we need to ask is therefore: if 2^32 is good enough for seqlock, why isn't it good enough for SRCU?


There are other synchronization algorithms such as seqlocks which are
quite happy with much less protection against overflow (using a 32-bit
counter even on 64-bit architectures).

The seqlock is an interesting point.

For practical purposes, I suspect this issue is really just theoretical.

I have to ask, what is the benefit of avoiding a flip and scanning
active readers? Is the issue about grace period delay or performance?
If so, it might be worth prototyping that approach and measuring using
rcutorture/rcuscale. If there is a significant benefit to the current
approach, then IMO it is worth exploring.

The main benefit I expect is improved performance of the grace period implementation in common cases where there are few or no readers present, especially on machines with many cpus.

It allows scanning both periods (0/1) for each cpu within the same pass, therefore loading both periods' unlock counters, which sit in the same cache line, at once (improved locality), and then loading both periods' lock counters, also sitting in the same cache line.

It also allows skipping the period flip entirely if there are no readers present, which is an -arguably- tiny performance improvement as well.

Thanks,

Mathieu


Or am I missing your point?

No, I think you are not. Let me know if I missed something.

Thanks,

- Joel





thanks,

- Joel

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

