Re: [PATCH v7 2/4] sched: Rewrite runnable load and utilization average tracking

From: Yuyang Du
Date: Tue May 19 2015 - 12:04:03 EST


Hi Ben,

Thanks a lot for reviewing.

On Mon, May 18, 2015 at 01:55:21PM -0700, bsegall@xxxxxxxxxx wrote:
> > therefore, by this rewrite, we have an entirely updated cfs_rq at the time
> > we update it:
> >
> > t1: update cfs_rq { e1_new, e2_new, ..., en_new }
> >
> > t2: update cfs_rq { e1_new, e2_new, ..., en_new }
> >
> > ...
>
> This requires a u32*u32->u64 multiply that the old code did not, but
> only on period rollover which already has a u64 div, so that's probably
> fine. The non-rollover one would only need one with
> SCHED_LOAD_RESOLUTION active, and this code should probably
> scale_load_down anyway (the old code did in the equivalent place).

Oh, yes, I haven't scale_load_down the weight, I should do that.

> I need to read the code more thoroughly to be sure everything else is ok

Please read it thoroughly. I would appreciate that.

> This also causes an entity's load to not immediately change when its
> weight does, which is in some ways more sensible, but not how it's ever
> worked.

The entity's load changes immediately since the last update time, which
is the same as the old one. But the load before the last update time won't
change, as the weight is always embeded into the load_sum. This may lead
to the question of how we account for the history of load or what the history
is. I won't debate on this. For tasks, arguably the weight does not change
much, so the impact of either is limited. But for group entity, as the weight
changes a lot, I'd prefer the weight should be taken into the history.

> >
> > 2. cfs_rq's load average is different between top rq->cfs_rq and other task_group's
> > per CPU cfs_rqs in whether or not blocked_load_average contributes to the load.
> >
> > The basic idea behind runnable load average (the same as utilization) is that
> > the blocked state is taken into account as opposed to only accounting for the
> > currently runnable state. Therefore, the average should include both the
> > runnable/running and blocked load averages. This rewrite does that.
> >
> > In addition, we also combine runnable/running and blocked averages of all entities
> > into the cfs_rq's average, and update it together at once. This is based on the
> > fact that:
> >
> > update(runnable) + update(blocked) = update(runnable + blocked)
> >
> > This also significantly reduces the codes as we don't need to separately maintain
> > runnable/running load and blocked load.
>
> This means that races and such around migrate have the opportunity to
> end up reducing the entire average, rather than just the blocked load
> (which was going towards zero anyway). This may not be the end of the
> world, but iirc was part of (all of?) the reason for the separation to
> begin with.

Since "update(runnable) + update(blocked) = update(runnable + blocked)"
is "cheap" and logically right to do, just do it.

IIRC, last time, you said that the separation is due to some benchmark
regression...

> I looked at the diff, but it quickly became unreadable due to git
> finding "common" blank lines, and it doesn't apply to tip/master or the
> other trees I tried.

It is on mainline, v4.1-rc1, v4.1-rc4 is fine too. I just looked at
the tip tree, the proc.c is changed, so I need to rebase the code, I
guess/hope it is not changed much. Will update the patch series later.

> > - /*
> > - * These sums represent an infinite geometric series and so are bound
> > - * above by 1024/(1-y). Thus we only need a u32 to store them for all
> > - * choices of y < 1-2^(-32)*1024.
> > - * running_avg_sum reflects the time that the sched_entity is
> > - * effectively running on the CPU.
> > - * runnable_avg_sum represents the amount of time a sched_entity is on
> > - * a runqueue which includes the running time that is monitored by
> > - * running_avg_sum.
> > - */
> > - u32 runnable_avg_sum, avg_period, running_avg_sum;
> > + u64 last_update_time;
> > + u64 load_sum, util_sum;

> util_sum can still just be u32, yes?

u32 is enough, as it has no weight, and anyway only one task is running.

Thanks,
Yuyang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/