Re: [PATCH v2 4/4] rcu/kvfree: Use a polled API to speedup a reclaim process

From: Paul E. McKenney
Date: Fri Dec 02 2022 - 14:14:31 EST


On Fri, Dec 02, 2022 at 01:54:17PM +0100, Uladzislau Rezki wrote:
> >
> > A couple more questions interspersed below upon further reflection.
> >
> > Thoughts?
> >
> See below my thoughts:
>
> > Thanx, Paul
> >
> > > ---
> > > kernel/rcu/tree.c | 47 +++++++++++++++++++++++++++++++++++++++--------
> > > 1 file changed, 39 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index c94c17194299..44279ca488ef 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2741,11 +2741,13 @@ EXPORT_SYMBOL_GPL(call_rcu);
> > > /**
> > > * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
> > > * @list: List node. All blocks are linked between each other
> > > + * @gp_snap: Snapshot of RCU state for objects placed to this bulk
> > > * @nr_records: Number of active pointers in the array
> > > * @records: Array of the kvfree_rcu() pointers
> > > */
> > > struct kvfree_rcu_bulk_data {
> > > struct list_head list;
> > > + unsigned long gp_snap;
> > > unsigned long nr_records;
> > > void *records[];
> > > };
> > > @@ -2762,13 +2764,15 @@ struct kvfree_rcu_bulk_data {
> > > * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
> > > * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
> > > * @head_free: List of kfree_rcu() objects waiting for a grace period
> > > + * @head_free_gp_snap: Snapshot of RCU state for objects placed to "@head_free"
> > > * @bulk_head_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
> > > * @krcp: Pointer to @kfree_rcu_cpu structure
> > > */
> > >
> > > struct kfree_rcu_cpu_work {
> > > - struct rcu_work rcu_work;
> > > + struct work_struct rcu_work;
> > > struct rcu_head *head_free;
> > > + unsigned long head_free_gp_snap;
> > > struct list_head bulk_head_free[FREE_N_CHANNELS];
> > > struct kfree_rcu_cpu *krcp;
> > > };
> > > @@ -2964,10 +2968,11 @@ static void kfree_rcu_work(struct work_struct *work)
> > > struct rcu_head *head;
> > > struct kfree_rcu_cpu *krcp;
> > > struct kfree_rcu_cpu_work *krwp;
> > > + unsigned long head_free_gp_snap;
> > > int i;
> > >
> > > - krwp = container_of(to_rcu_work(work),
> > > - struct kfree_rcu_cpu_work, rcu_work);
> > > + krwp = container_of(work,
> > > + struct kfree_rcu_cpu_work, rcu_work);
> > > krcp = krwp->krcp;
> > >
> > > raw_spin_lock_irqsave(&krcp->lock, flags);
> > > @@ -2978,12 +2983,29 @@ static void kfree_rcu_work(struct work_struct *work)
> > > // Channel 3.
> > > head = krwp->head_free;
> > > krwp->head_free = NULL;
> > > + head_free_gp_snap = krwp->head_free_gp_snap;
> > > raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > >
> > > // Handle the first two channels.
> > > - for (i = 0; i < FREE_N_CHANNELS; i++)
> > > + for (i = 0; i < FREE_N_CHANNELS; i++) {
> > > + // Start from the tail page, so a GP is likely passed for it.
> > > + list_for_each_entry_safe_reverse(bnode, n, &bulk_head[i], list) {
> > > + // Not yet ready? Bail out since we need one more GP.
> > > + if (!poll_state_synchronize_rcu(bnode->gp_snap))
> > > + break;
> > > +
> > > + list_del_init(&bnode->list);
> > > + kvfree_rcu_bulk(krcp, bnode, i);
> > > + }
> > > +
> > > + // Please note a request for one more extra GP can
> > > + // occur only once for all objects in this batch.
> > > + if (!list_empty(&bulk_head[i]))
> > > + synchronize_rcu();
> >
> > Does directly invoking synchronize_rcu() instead of using queue_rcu_work()
> > provide benefits, for example, reduced memory footprint?
> >
> queue_rcu_work() will delay freeing of all objects in a batch. We can
> make use of it but it should be only for the ones which still require
> a grace period. The memory footprint and the delay depend on when our
> callback is invoked by the RCU-core to queue the reclaim work.
>
> Such time can be long, because it depends on many factors:
>
> - scheduling delays in waking gp;
> - scheduling delays in kicking nocb;
> - delays in waiting in a "cblist":
> - dequeuing and invoking f(rhp);
> - delay in waking our final reclaim work and giving it a CPU time.
>
> This patch combines a possibility to reclaim asap for objects which
> passed a grace period and requesting one more GP for the ones which
> have not passed it yet.

Understood. It would be necessary to split the list in order to
immediately reclaim those whose grace periods have completed.
Then the remaining objects (only those whose grace periods have
not completed) would be passed to queue_rcu_work().
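
For illustration only, here is a minimal userspace sketch of such a split.
It is not the kernel code and makes several assumptions: struct bulk_block,
completed_gp, gp_elapsed() and split_ready() are hypothetical names,
gp_elapsed() merely stands in for poll_state_synchronize_rcu(), and the
plain counter comparison ignores the wraparound handling that real
grace-period snapshots need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the number of completed grace periods. */
static unsigned long completed_gp;

/* Models poll_state_synchronize_rcu(): true once the GP for snap is over. */
static bool gp_elapsed(unsigned long snap)
{
	return completed_gp >= snap;
}

/* Simplified stand-in for struct kvfree_rcu_bulk_data. */
struct bulk_block {
	struct bulk_block *next;
	unsigned long gp_snap;
};

/*
 * Unlink every block whose grace period has elapsed from *pending and
 * return those blocks as a separate list, keeping their relative order.
 * Blocks that still need a grace period stay on *pending.
 */
static struct bulk_block *split_ready(struct bulk_block **pending)
{
	struct bulk_block *ready = NULL, **ready_tail = &ready;
	struct bulk_block **pp = pending;

	assert(pending != NULL);

	while (*pp) {
		struct bulk_block *b = *pp;

		if (gp_elapsed(b->gp_snap)) {
			*pp = b->next;		/* Unlink from the pending list. */
			b->next = NULL;
			*ready_tail = b;	/* Append to the ready list. */
			ready_tail = &b->next;
		} else {
			pp = &b->next;		/* Still waiting, leave in place. */
		}
	}

	return ready;
}
```

In this model the returned list could be reclaimed right away, and
whatever is left on the pending list is what would be handed to
queue_rcu_work().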

> > If not, it would be good to instead use queue_rcu_work() in order
> > to avoid an unnecessary context switch in this workqueue handler.
> >
> I took the easiest way from a code perspective, since I do not
> see problems with the current approach from either a testing or a
> personal point of view.

I am worried about corner cases where memory is low, RCU grace periods
are being delayed, and workqueues are running short of kthreads.

> If we are about to do that i need to add extra logic to split ready
> and not ready pointers for direct reclaim and the rest over the
> queue_rcu_work().

Agreed.

> I can check how it goes.

Please!

> > My concern is that an RCU CPU stall might otherwise end up tying up more
> > workqueue kthreads as well as more memory.
> >
> There is a limit. We have two batches, one work for each. Suppose the
> reclaim kthread is stuck in synchronize_rcu() so it does not make any
> progress. In this case the same work can only be in the pending state and
> nothing more, no matter how many times queue_work() is invoked:
>
> 2 * num_possible_cpus();
>
> If we end up in RCU stall we will not be able to reclaim anyway.

Understood.

The goal is not to make progress because, as you say, we cannot make any
progress until the RCU grace period completes. The goal is instead to
avoid tying up workqueue kthreads while in that sad state.

Thanx, Paul