Re: [PATCH 1/4] sched/fair: Be less aggressive in calling cpufreq_update_util()

From: Christian Loehle
Date: Mon Dec 11 2023 - 13:47:39 EST


On 08/12/2023 01:52, Qais Yousef wrote:
> Due to the way code is structured, it makes a lot of sense to trigger
> cpufreq_update_util() from update_load_avg(). But this is too aggressive
> as in most cases we are iterating through entities in a loop to
> update_load_avg() in the hierarchy. So we end up sending too many
> request in an loop as we're updating the hierarchy.

If this is actually less aggressive heavily depends on the workload,
I can argue the patch is more aggressive, as you call cpufreq_update_util
at every enqueue and dequeue, instead of just at enqueue.
For an I/O workload it is definitely more aggressive, see below.

>
> Combine this with the rate limit in schedutil, we could end up
> prematurely send up a wrong frequency update before we have actually
> updated all entities appropriately.
> [SNIP]


> @@ -6704,14 +6677,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> - /*
> - * If in_iowait is set, the code below may not trigger any cpufreq
> - * utilization updates, so do it here explicitly with the IOWAIT flag
> - * passed.
> - */
> - if (p->in_iowait)
> - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
> for_each_sched_entity(se) {
> if (se->on_rq)
> break;
> @@ -6772,6 +6737,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> enqueue_throttle:
> assert_list_leaf_cfs_rq(rq);
>
> + cpufreq_update_util(rq, p->in_iowait ? SCHED_CPUFREQ_IOWAIT : 0);
> +
> hrtick_update(rq);
> }
>
> @@ -6849,6 +6816,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> dequeue_throttle:
> util_est_update(&rq->cfs, p, task_sleep);
> + cpufreq_update_util(rq, 0);

This is quite critical, instead of only calling the update
at enqueue (with SCHED_CPUFREQ_IOWAIT if applicable) it is
now called at every enqueue and dequeue. The only way for
schedutil (intel_pstate too?) to build up a value of
iowait_boost > 128 is a large enough rate_limit_us, as even
for just a in_iowait task the enqueue increases the boost and
its own dequeue could reduce it already. For just a basic
benchmark workload and 2000 rate_limit_us this doesn't seem
to be that critical, anything below 200 rate_limit_us didn't
show any iowait boosting > 128 anymore on my system.
Of course if the workload does more between enqueue and
dequeue (time until task issues next I/O) already larger
values of rate_limit_us will disable any significant
iowait boost benefit.

Just to add some numbers to the story:
fio --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1
fio --time_based --name=fiotest --filename=/dev/mmcblk2 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1

All results are sorted:
With this patch and rate_limit_us=2000:
(Second line is without iowait boosting, results are sorted):
[3883, 3980, 3997, 4018, 4019]
[2732, 2745, 2782, 2837, 2841]
/dev/mmcblk2
[4136, 4144, 4198, 4275, 4329]
[2753, 2975, 2975, 2975, 2976]

Without this patch and rate_limit_us=2000:
[3918, 4021, 4043, 4081, 4085]
[2850, 2859, 2863, 2873, 2887]
/dev/mmcblk2
[4277, 4358, 4380, 4421, 4425]
[2796, 3103, 3128, 3180, 3200]

With this patch and rate_limit_us=200:
/dev/nvme0n1
[2470, 2480, 2481, 2484, 2520]
[2473, 2510, 2517, 2534, 2572]
/dev/mmcblk2
[2286, 2338, 2440, 2504, 2535]
[2360, 2462, 2484, 2503, 2707]

Without this patch and rate_limit_us=200:
/dev/nvme0n1
[3880, 3956, 4010, 4013, 4016]
[2732, 2867, 2937, 2937, 2939]
/dev/mmcblk2
[4783, 4791, 4821, 4855, 4860]
[2653, 3091, 3095, 3166, 3202]

I'm currently working on iowait boosting and seeing where it's
actually needed and how it could be improved, so always interested
in anyone's thoughts.

(The second line here doesn't provide additional
information, I left it in to compare for reproducibility).
All with CONFIG_HZ=100 on an rk3399.

Best Regards,
Christian

> [SNIP]