Re: tasks-trace RCU: question about grace period forward progress

From: Mathieu Desnoyers
Date: Thu Feb 25 2021 - 10:48:55 EST


----- On Feb 25, 2021, at 10:36 AM, paulmck paulmck@xxxxxxxxxx wrote:

> On Thu, Feb 25, 2021 at 09:22:48AM -0500, Mathieu Desnoyers wrote:
>> Hi Paul,
>>
>> Answering a question from Peter on IRC got me to look at rcu_read_lock_trace(),
>> and I see this:
>>
>> static inline void rcu_read_lock_trace(void)
>> {
>>         struct task_struct *t = current;
>>
>>         WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
>>         barrier();
>>         if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
>>             t->trc_reader_special.b.need_mb)
>>                 smp_mb(); // Pairs with update-side barriers
>>         rcu_lock_acquire(&rcu_trace_lock_map);
>> }
>>
>> static inline void rcu_read_unlock_trace(void)
>> {
>>         int nesting;
>>         struct task_struct *t = current;
>>
>>         rcu_lock_release(&rcu_trace_lock_map);
>>         nesting = READ_ONCE(t->trc_reader_nesting) - 1;
>>         barrier(); // Critical section before disabling.
>>         // Disable IPI-based setting of .need_qs.
>>         WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
>>         if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
>>                 WRITE_ONCE(t->trc_reader_nesting, nesting);
>>                 return; // We assume shallow reader nesting.
>>         }
>>         rcu_read_unlock_trace_special(t, nesting);
>> }
>>
>> AFAIU, each thread keeps track of whether it is nested within an RCU
>> read-side critical section with a counter, and grace periods iterate over
>> all threads to make sure they are not within a read-side critical section
>> before they can complete:
>>
>> # define rcu_tasks_trace_qs(t) \
>>         do { \
>>                 if (!likely(READ_ONCE((t)->trc_reader_checked)) && \
>>                     !unlikely(READ_ONCE((t)->trc_reader_nesting))) { \
>>                         smp_store_release(&(t)->trc_reader_checked, true); \
>>                         smp_mb(); /* Readers partitioned by store. */ \
>>                 } \
>>         } while (0)
>>
>> It reminds me of the liburcu urcu-mb flavor which also deals with per-thread
>> state to track whether threads are nested within a critical section:
>>
>> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90
>> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125
>>
>> static inline void _urcu_mb_read_lock_update(unsigned long tmp)
>> {
>>         if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) {
>>                 _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
>>                                 _CMM_LOAD_SHARED(urcu_mb_gp.ctr));
>>                 cmm_smp_mb();
>>         } else
>>                 _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, tmp + URCU_GP_COUNT);
>> }
>>
>> static inline void _urcu_mb_read_lock(void)
>> {
>>         unsigned long tmp;
>>
>>         urcu_assert(URCU_TLS(urcu_mb_reader).registered);
>>         cmm_barrier();
>>         tmp = URCU_TLS(urcu_mb_reader).ctr;
>>         urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK);
>>         _urcu_mb_read_lock_update(tmp);
>> }
>>
>> The main difference between the two algorithms is that tasks-trace within
>> the kernel lacks the global "urcu_mb_gp.ctr" state snapshot, which is either
>> incremented or flipped between 0 and 1 by the grace period. This snapshot
>> allows RCU readers whose outermost nesting starts after the beginning of the
>> grace period not to prevent progress of the grace period.
>>
>> Without this, a steady flow of incoming tasks-trace-RCU readers can prevent the
>> grace period from ever completing.
>>
>> Or is this handled in a clever way that I am missing here?
>
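
For illustration, the liburcu-style phase snapshot discussed in the quoted
question can be modeled in a few lines of single-threaded C. This is only a
sketch under stated assumptions: the gpm_* names and MODEL_* constants are
hypothetical, memory barriers and thread registration are left out, and the
real urcu-mb grace period does more work than the single flip shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants, loosely modeled on liburcu's URCU_GP_* macros. */
#define MODEL_GP_COUNT     1UL
#define MODEL_GP_CTR_PHASE (1UL << 16)
#define MODEL_NEST_MASK    (MODEL_GP_CTR_PHASE - 1)

/* Global state: a phase bit plus a nesting count of one. */
static unsigned long gpm_gp_ctr = MODEL_GP_COUNT;

/* Outermost lock snapshots the global counter; nested locks just count. */
static void gpm_read_lock(unsigned long *reader_ctr)
{
    if (!(*reader_ctr & MODEL_NEST_MASK))
        *reader_ctr = gpm_gp_ctr;
    else
        *reader_ctr += MODEL_GP_COUNT;
}

static void gpm_read_unlock(unsigned long *reader_ctr)
{
    assert(*reader_ctr & MODEL_NEST_MASK);
    *reader_ctr -= MODEL_GP_COUNT;
}

/* Start of a (modeled) grace period: flip the phase bit. */
static void gpm_gp_flip(void)
{
    gpm_gp_ctr ^= MODEL_GP_CTR_PHASE;
}

/*
 * A reader holds up the grace period only if it is active AND its
 * snapshot carries the phase from before the flip.
 */
static bool gpm_reader_blocks_gp(unsigned long reader_ctr)
{
    return (reader_ctr & MODEL_NEST_MASK) &&
           ((reader_ctr ^ gpm_gp_ctr) & MODEL_GP_CTR_PHASE);
}
```

In this model, a reader that takes its outermost lock after gpm_gp_flip()
snapshots the new phase, so gpm_reader_blocks_gp() is false for it however
long it stays in its critical section.
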
> There are several mechanisms designed to handle this. The following
> paragraphs describe these at a high level.
>
> The trc_wait_for_one_reader() is invoked on each task. It uses the
> try_invoke_on_locked_down_task(), which, if the task is currently not
> running, keeps it that way and invokes trc_inspect_reader(). If the
> locked-down task is in a read-side critical section, the need_qs field
> is set, which will cause the task's next rcu_read_lock_trace() to report
> the quiescent state.

I suspect you meant "rcu_read_unlock_trace()" here.
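
To make the handoff concrete, here is a minimal single-threaded model of it
as I read the description: the updater either finds the task quiescent or
sets need_qs, and the outermost unlock then reports the quiescent state.
The model_* names and the struct are hypothetical stand-ins, not the kernel
implementation, and all barriers and atomics are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-task reader state. */
struct model_task {
    int nesting;      /* models t->trc_reader_nesting */
    bool need_qs;     /* models the need_qs request set by the updater */
    bool qs_reported; /* set once a quiescent state has been reported */
};

static void model_read_lock(struct model_task *t)
{
    t->nesting++;
}

/* Only the outermost unlock acts on a pending need_qs request. */
static void model_read_unlock(struct model_task *t)
{
    assert(t->nesting > 0);
    t->nesting--;
    if (t->nesting == 0 && t->need_qs) {
        t->need_qs = false;
        t->qs_reported = true;
    }
}

/*
 * Updater side, run while the task is locked down.  Returns true if the
 * task was already quiescent, false if it must report on its own later.
 */
static bool model_inspect_reader(struct model_task *t)
{
    if (t->nesting == 0) {
        t->qs_reported = true;
        return true;
    }
    t->need_qs = true;
    return false;
}
```

The point for forward progress is that, in this model, the task is inspected
once: a critical section that starts after the outermost unlock has reported
does not undo qs_reported, so later readers on the same task are not waited
for.
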

>
> If read-side memory barriers have been enabled, trc_inspect_reader()
> is able to check for a reader being active, and if not, reports the
> quiescent state. If there is a reader, trc_inspect_reader() reports
> failure, which is another path to the following paragraph.
>
> If the task could not be locked down due to its currently running,
> then trc_wait_for_one_reader() attempts to send an IPI, which results in
> trc_read_check_handler() rechecking for a read-side critical section
> and either reporting the quiescent state immediately or proceeding in the
> same way that trc_inspect_reader() does. The trc_read_check_handler()
> of course checks to make sure that the target task is still running
> before doing anything. If the attempt to send the IPI fails, then
> the task is rechecked in a later pass.
>
> So what sequence of events did you find that causes these mechanisms
> to fail?

The explanation you provide takes care of my concerns, so I don't have
any remaining problematic scenario in mind.
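
For the record, the per-task flow described in the quoted explanation can be
summarized as a small decision function. This is a rough sketch of my reading
of it, not kernel code: the scan_* names are hypothetical, the read-side
memory-barrier path is deliberately left out, and the inputs are plain
booleans rather than real task state.

```c
#include <assert.h>
#include <stdbool.h>

enum scan_outcome {
    SCAN_QS_REPORTED,     /* no reader found; quiescent state reported now */
    SCAN_WAIT_FOR_UNLOCK, /* reader found; need_qs set, unlock will report */
    SCAN_RETRY_LATER,     /* task running and no IPI sent; recheck later */
};

struct scan_input {
    bool running;   /* task is running, so it cannot be locked down */
    bool in_reader; /* task is inside a read-side critical section */
    bool ipi_sent;  /* only meaningful when running: did the IPI go out? */
};

static enum scan_outcome scan_one_task(struct scan_input in)
{
    if (!in.running) {
        /* Locked down: models the trc_inspect_reader() path. */
        return in.in_reader ? SCAN_WAIT_FOR_UNLOCK : SCAN_QS_REPORTED;
    }
    if (!in.ipi_sent) {
        /* Could not send the IPI: the task is rechecked in a later pass. */
        return SCAN_RETRY_LATER;
    }
    /* IPI delivered: models the trc_read_check_handler() recheck. */
    return in.in_reader ? SCAN_WAIT_FOR_UNLOCK : SCAN_QS_REPORTED;
}
```

None of the three outcomes depends on readers that start after the task has
been scanned, which is what addresses the steady-flow-of-readers scenario I
was worried about.
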

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com