Re: [Resend PATCH] psi : calc cfs task memstall time more precisely

From: Zhaoyang Huang
Date: Tue Nov 09 2021 - 20:37:25 EST


On Tue, Nov 9, 2021 at 10:56 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > CC peterz as well for rt and timekeeping magic
> >
> > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > >
> > > In an EAS enabled system, there are two scenarios discordant to current design,
> > >
> > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > RT task usually preempts CFS task in little core.
> > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > ignore the preempted time by RT, DL and Irqs.
>
> It ignores preemption full-stop. I don't see why RT/IRQ should be
> special cased here.
As Johannes commented, what we are trying to solve is mainly the time a
CFS task spends preempted by RT/IRQ, NOT the RT/IRQ time itself.
Could you please catch up on Dietmar's recent reply, which may
provide more information?
>
> > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > via the proportion of cfs_rq's utilization on the whole rq.
>
>
> > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > +{
> > > + struct rq *rq = task_rq(current);
> > > + unsigned long growth_fixed = (unsigned long)growth;
> > > +
> > > + if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > + return growth_fixed;
> > > +
> > > + if (current->in_memstall)
> > > + growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > + - rq->avg_irq.util_avg + 1) * growth, 1024);
> > > +
> > > + return growth_fixed;
> > > +}
> > > +
> > > static void init_triggers(struct psi_group *group, u64 now)
> > > {
> > > struct psi_trigger *t;
> > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > > }
> > >
> > > if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > + delta = psi_memtime_fixup(delta);
> >
> > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > period of an active reclaimer, since it's technically not stalled on
> > memory during this time but on CPU.
> >
> > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > are sleeping on refaults swapins, since they are not affected by what
> > is going on on the CPU.
>
> I think that focus on RT/IRQ is mis-guided here, and the implementation
> is horrendous.
>
> So the fundamental question seems to be; and I think Johannes is the one
> to answer that: What time-base do these metrics want to use?
>
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.
Memstall time is counted as exit - enter, which includes both the blocked
and the running state. However, we think the blocked time introduced by
RT/IRQ/DL preemption is irrelevant to memstall (and should be eliminated),
while preemption between CFS tasks could be relevant. Thanks to the
load-tracking mechanism, the implementation can be simple: calculate the
proportion of CFS_UTIL in the whole core's capacity.
>
> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?
It is NOT random scaling; the delta is scaled in each record_times() call
for CFS tasks.
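
To make the per-interval scaling discussed above concrete, here is a minimal
userspace sketch, not the patch itself. The struct util_sample type and the
scale_memstall_delta() name are hypothetical stand-ins for the
rq->avg_rt/avg_dl/avg_irq util_avg signals that the quoted
psi_memtime_fixup() reads. It assumes utilization is expressed on the 0..1024
capacity scale, clamps the sum so the unsigned subtraction cannot wrap, and
leaves out the "+ 1" term of the quoted hunk.

```c
#include <stdint.h>

#define CAPACITY_SCALE 1024UL

/* Hypothetical snapshot of non-CFS utilization on one CPU (0..1024 each). */
struct util_sample {
	unsigned long rt;
	unsigned long dl;
	unsigned long irq;
};

/*
 * Scale one memstall time delta by the share of CPU capacity that was
 * not consumed by RT, DL and IRQ activity during the interval.
 */
static uint32_t scale_memstall_delta(uint32_t delta,
				     const struct util_sample *u)
{
	unsigned long busy = u->rt + u->dl + u->irq;
	uint64_t share;

	/* Assumption: clamp, since the summed signals may exceed capacity. */
	if (busy > CAPACITY_SCALE)
		busy = CAPACITY_SCALE;

	share = CAPACITY_SCALE - busy;

	/* 64-bit intermediate so delta * share cannot overflow. */
	return (uint32_t)(((uint64_t)delta * share) / CAPACITY_SCALE);
}
```

With rt = 256 and no DL or IRQ load, a delta of 1000 is recorded as 750;
with no non-CFS load the delta passes through unchanged.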