EEVDF virtual time calculation rationale

From: rohan470
Date: Tue Jan 02 2024 - 21:39:28 EST


Hello,

I have a question regarding the calculation of virtual time in the new EEVDF system.
I'm quite new to Linux kernel development, so forgive me if my question sounds naive or if my assumptions regarding the kernel's scheduling mechanisms are incorrect.

I read Stoica et al.'s EEVDF papers (both the 37-page technical report [1] and the 1996 RTSS conference paper [2]). I then spent the past few weeks studying the new EEVDF code.
I noticed a difference in how virtual time is calculated in the paper versus how the kernel does it.

The paper defines virtual time as:
V(t)=\int_{0}^{t}\frac{1}{\sum_{j\in A(\tau)}w_j}d\tau

However, if I understand correctly, the kernel does not calculate virtual time with this formula.
Instead, it takes EEVDF's \sum lag_i = 0 invariant and the definition of lag (lag_i = S_i - s_i = w_i * (V - v_i)), and rearranges them to solve for the virtual time V.
This means virtual time is effectively calculated as a weighted average of the tasks' virtual runtimes.
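
If I'm reading the code right, the rearrangement is just the following algebra, using only the two relations above:

\sum_i lag_i = \sum_i w_i (V - v_i) = 0
\implies V \sum_i w_i = \sum_i w_i v_i
\implies V = \frac{\sum_i w_i v_i}{\sum_i w_i}

which is the weighted average I mean.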

Why does the kernel calculate virtual time this way? Does it produce the exact same value as the integral solution?
Was this inherited from the CFS system?

One issue I thought of with the integral solution is that the definition of time 0 is somewhat ambiguous. Is it when the system boots, or when the runqueue is initialized, or something else?
Could it be set to any arbitrary time constant, as long as it's before the first time we try to calculate virtual time?
Unfortunately, the paper doesn't answer these questions, since it omits the implementation of the get_current_vt function.

Could an integral version be implemented like this? We store the virtual time in a variable, virtual_time. When the scheduler runs periodically from an interrupt, we update it with virtual_time += (1/sum_of_weights)*(t - t'), where t is the current time and t' is the last time we updated the virtual_time variable. When a task leaves/joins the system, we update it with virtual_time = virtual_time +/- lag_i(t)/sum_of_weights, where sum_of_weights is the new sum after the leave/join.
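
To make that concrete, here is a rough sketch in plain C of the bookkeeping I have in mind. None of this is existing kernel code: virtual_time, sum_of_weights, advance_virtual_time, task_join and task_leave are names I made up for illustration, and I've used floating point just to keep the arithmetic readable.

#include <stdint.h>

/* Hypothetical state for the sketch -- not actual kernel symbols. */
static double   virtual_time;     /* V(t) accumulated so far             */
static uint64_t last_update_ns;   /* t': last time virtual_time advanced */
static uint64_t sum_of_weights;   /* sum of weights of runnable tasks    */

/*
 * Periodic update, e.g. from the tick:
 *   V(t) = V(t') + (t - t') / sum_of_weights
 * This is valid because the weight sum is constant on [t', t]; the
 * join/leave paths below call this before touching sum_of_weights.
 */
static void advance_virtual_time(uint64_t now_ns)
{
        if (sum_of_weights)
                virtual_time += (double)(now_ns - last_update_ns) / sum_of_weights;
        last_update_ns = now_ns;
}

/*
 * Join/leave: bring V up to date, adjust the weight sum, then fold the
 * task's lag back into V using the new sum.  The sign is the "+/-" from
 * above -- it depends on join vs. leave and on the sign convention for lag.
 */
static void task_join(uint64_t now_ns, double lag, uint64_t weight)
{
        advance_virtual_time(now_ns);
        sum_of_weights += weight;
        virtual_time -= lag / sum_of_weights;    /* or +=, per the paper */
}

static void task_leave(uint64_t now_ns, double lag, uint64_t weight)
{
        advance_virtual_time(now_ns);
        sum_of_weights -= weight;
        if (sum_of_weights)
                virtual_time += lag / sum_of_weights;    /* or -=, per the paper */
}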

For the periodic update (not the leave/join) of virtual_time, we effectively break the integral into two parts: one over the interval from 0 to t', and another over the interval from t' to t. The integral from 0 to t' has already been accumulated, so we don't have to worry about variations in sum_of_weights during that interval. And since we update virtual_time and sum_of_weights whenever a task leaves/joins the system, t' is the last time the weights could have changed before t, so sum_of_weights stays constant between t' and t.
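
In equation form, the claim for the periodic update is just:

V(t) = \int_{0}^{t'}\frac{d\tau}{\sum_{j\in A(\tau)}w_j} + \int_{t'}^{t}\frac{d\tau}{\sum_{j\in A(\tau)}w_j} = V(t') + \frac{t-t'}{\sum_{j\in A(t')}w_j}

where the second equality holds because A(\tau), and hence the weight sum, does not change on [t', t].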

I'm curious to learn more about why the weighted average method is used, and whether the integral approach would be feasible to use in the kernel.

Thanks,
-Rohan

[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
[2] https://ieeexplore.ieee.org/abstract/document/563725