[BUG] Integer overflow when calculating imbalance in the CFS scheduler

From: Tingjia Cao
Date: Mon Apr 10 2023 - 13:25:28 EST


We have identified an issue in the rebalance algorithm of the CFS
scheduler on kernel versions 6.0 and 6.3-rc5. Specifically, the
calculate_imbalance function in kernel/sched/fair.c may produce
incorrect results due to an unsigned integer overflow.

The algorithm is designed to pull tasks from the busiest group to the
local group. When both groups are, or will become, overloaded, the
algorithm tries not to push the local group above the average load of
the sched domain. However, in some cases the imbalance is calculated
incorrectly, causing meaningless migrations that can even amplify the
imbalance.

The problem arises when the function executes the following lines:

	env->imbalance = min(
		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;

In some cases (e.g. when local->group_type == group_fully_busy and
busiest->group_type == group_overloaded), there is no guarantee that
sds->avg_load is larger than local->avg_load. Since both
sds->avg_load and local->avg_load are unsigned long, the subtraction
(sds->avg_load - local->avg_load) wraps around to a huge value. The
calculated imbalance is then no longer the minimum of (busiest - sds)
and (sds - local). As a result, the rebalance algorithm is likely to
perform a meaningless migration that pushes the local group's load
above the sched domain's average load.
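
The wraparound is easy to demonstrate in isolation. Below is a
minimal userspace sketch (plain C, not the kernel code; the load
values mirror the trace in the next section, and the group capacities
of 1024 are assumptions made for illustration):

	/* Userspace demonstration of the unsigned wraparound. */
	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE 1024UL
	#define min(a, b) ((a) < (b) ? (a) : (b))

	int main(void)
	{
		/* sds->avg_load can be smaller than local->avg_load. */
		unsigned long busiest_avg_load = 236, busiest_capacity = 1024;
		unsigned long local_avg_load = 104, local_capacity = 1024;
		unsigned long sds_avg_load = 103;

		unsigned long imbalance = min(
			(busiest_avg_load - sds_avg_load) * busiest_capacity,
			(sds_avg_load - local_avg_load) * local_capacity
		) / SCHED_CAPACITY_SCALE;

		printf("imbalance = %lu\n", imbalance);
		return 0;
	}

On a 64-bit machine this prints imbalance = 133: (103 - 104) wraps to
ULONG_MAX, so min() selects the other operand instead of the intended
negative result.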


Reproduce
=============
We attach a patch and a kernel module that trace local->avg_load,
sds->avg_load, and busiest->avg_load.

We used the following workloads to test the algorithm:
- First, we create two task groups and set a CPU bandwidth limit on t0:
mkdir /sys/fs/cgroup/t0
mkdir /sys/fs/cgroup/t1
echo 100000 10000 > /sys/fs/cgroup/t0/cpu.max

- Then, we run the following programs in a shell (the C source files
are attached):
./test0 &
./test0 &
./test0 &
./test1 &

In test0, we clone 9 processes into cgroup t0 and make them sleep
occasionally. In test1, we clone 10 processes into cgroup t1 and make
them run continuously. A simplified sketch of test1 is shown below.
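
This is roughly the shape of test1 (a simplified sketch only; the
attached .c files are authoritative, and we use fork() here where the
real program uses clone()):

	/* Sketch of test1: put 10 busy-looping processes into cgroup t1. */
	#include <stdio.h>
	#include <unistd.h>

	#define NPROC 10

	/* Move the calling process into a cgroup (cgroup v2 interface). */
	static void join_cgroup(const char *procs_path)
	{
		FILE *f = fopen(procs_path, "w");
		if (!f)
			return;
		fprintf(f, "%d\n", getpid());
		fclose(f);
	}

	int main(void)
	{
		for (int i = 0; i < NPROC; i++) {
			if (fork() == 0) {
				join_cgroup("/sys/fs/cgroup/t1/cgroup.procs");
				for (;;)
					; /* run continuously */
			}
		}
		pause();
		return 0;
	}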

- Then, we use the trace module provided in the attachment to collect
a trace. Here are two example trace items:
392168005763 nsecs IC 236 103 104 133
392168015305 nsecs LB 8 1 85 133

The first line shows that, in the calculate_imbalance function:
1) busiest->avg_load = 236
2) local->avg_load = 104
3) sds->avg_load = 103
Since sds->avg_load < local->avg_load, the second operand wraps around
to a huge value and min() selects the first operand, yielding the
wrong imbalance 133. The expected value is
min(236 - 103, 103 - 104) = -1, i.e. no task should be pulled.

The second line shows that the wrong imbalance calculation leads to
an unreasonable migration from CPU 8 to CPU 1, even though the local
group's load will far exceed sds->avg_load after the migration. The
migrated task's h_load is 85 and env->sd->nr_balance_failed = 0. With
the correct imbalance of -1, no migration should happen; with the
wrong imbalance of 133, the task is migrated.
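
One possible mitigation, sketched below and untested against all the
code paths in calculate_imbalance (some paths may already be guarded
upstream), is to bail out before the subtraction can wrap:

	/* Sketch: guard immediately before the min() computation. */
	if (local->avg_load >= sds->avg_load) {
		/* Local group is already at or above the domain average;
		 * pulling more load can only make things worse. */
		env->imbalance = 0;
		return;
	}

	env->imbalance = min(
		(busiest->avg_load - sds->avg_load) * busiest->group_capacity,
		(sds->avg_load - local->avg_load) * local->group_capacity
	) / SCHED_CAPACITY_SCALE;

An equivalent alternative would be to perform the subtraction in
signed arithmetic and clamp negative results to zero.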

Kernel Info
===============
Host OS: Ubuntu 20.04
Processor: Two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz
(hyper-threading disabled)
Kernel Version: 6.3-rc5, 6.0

Here is the corresponding Bugzilla entry:
https://bugzilla.kernel.org/show_bug.cgi?id=217316

Best,
Tingjia

Attachment: bug_report.zip
Description: Zip archive