Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

From: Rafael J. Wysocki
Date: Thu Aug 04 2016 - 17:14:41 EST


On Wednesday, August 03, 2016 07:24:18 PM Steve Muckle wrote:
> On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> > On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle <steve.muckle@xxxxxxxxxx> wrote:
> > > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> > >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle <steve.muckle@xxxxxxxxxx> wrote:
> > >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > >> > ...
> > >> >> For this purpose, define a new cpufreq_update_util() flag
> > >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> > >> >> cpufreq_update_util() in the in_iowait case. That generally
> > >> >> requires cpufreq_update_util() to be called directly from there,
> > >> >> because update_load_avg() is not likely to be invoked in that
> > >> >> case.
> > >> >
> > >> > I didn't follow why the cpufreq hook won't likely be called if
> > >> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> > >> > and calls update_cfs_rq_load_avg (triggers the hook).
> > >>
> > >> In practice it turns out that in the majority of cases when in_iowait
> > >> is set the second loop will not run.
> > >
> > > My understanding of enqueue_task_fair() is that the first loop walks up
> > > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > > the second loop updates the rest of the hierarchy that was already
> > > enqueued.
> > >
> > > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > > (meaning the whole hierarchy is traversed in the first loop and the
> > > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > > result in the cpufreq hook being called, via enqueue_entity() ->
> > > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
> >
> > But then it's rather difficult to pass the IO flag to this one, isn't it?
> >
> > Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> > p->in_iowait is set.
> >
> > If you can find a clever way to do it without adding an extra call
> > site, that's fine by me, but in any case the extra
> > cpufreq_update_util() invocation should not be too expensive.
>
> I was under the impression that function pointer calls were more
> expensive, and in the shared policy case there is a nontrivial amount of
> code that is run in schedutil (including taking a spinlock) before we'd
> see sugov_should_update_freq() return false and bail.

That's correct in principle, but we only do that if p->in_iowait is set,
which is somewhat special anyway and doesn't happen every time for sure.

So while there is overhead theoretically, I'm not even sure if it is measurable.

> Agreed that getting knowledge of p->in_iowait down to the existing hook
> is not easy. I spent some time fiddling with that. It seemed doable but
> somewhat gross due to the required flag passing and modifications
> to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
> pursuing I can keep working on it and post a draft.

Well, that's a Peter's call. :-)

> But I also wonder if the hooks are in the best location. They are
> currently deep in the PELT code. This may make sense from a theoretical
> standpoint, calling them whenever a root cfs_rq utilization changes, but
> it also makes the hooks difficult to correlate (for policy purposes such
> as this iowait change) with higher level logical events like a task
> wakeup. Or load balance where we probably want to call the hook just
> once after a load balance is complete.

I generally agree. We still need to ensure that the hools will be invoked
frequently enough, though, even if HZ is 100.

> This is also an issue for the remote wakeup case where I currently have
> another invocation of the hook in check_preempt_curr(), so I can know if
> preemption was triggered and skip a remote schedutil update in that case
> to avoid a duplicate IPI.
>
> It seems to me worth evaluating if a higher level set of hook locations
> could be used. One possibility is higher up in CFS:
> - enqueue_task_fair, dequeue_task_fair
> - scheduler_tick
> - active_load_balance_cpu_stop, load_balance

Agreed, that's worth checking.

> Though this wouldn't solve my issue with check_preempt_curr. That would
> probably require going further up the stack to try_to_wake_up() etc. Not
> yet sure what the other hook locations would be at that level.

That's probably too far away from the root cfs_rq utilization changes IMO.

Thanks,
Rafael