Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

From: Reinette Chatre
Date: Thu Nov 16 2023 - 14:48:52 EST


Hi Tony,

On 11/15/2023 1:54 PM, Tony Luck wrote:
> On Wed, Nov 15, 2023 at 08:09:13AM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/9/2023 1:27 PM, Luck, Tony wrote:
>>>>> Maybe an additional mount option "mba_MBps_total" so the user can pick
>>>>> total instead of local?
>>>>
>>>> Is this something for which a remount is required? Can it not perhaps be
>>>> changed at runtime?
>>>
>>> In theory, yes. But I've been playing with a patch that adds a writable info/
>>> file to allow runtime switch:
>>>
>>> # ls -l /sys/fs/resctrl/info/MB/mba_MBps_control
>>> -rw-r--r--. 1 root root 0 Nov 9 10:57 /sys/fs/resctrl/info/MB/mba_MBps_control
>>> # cat /sys/fs/resctrl/info/MB/mba_MBps_control
>>> total
>>>
>>> and found that it's a bit tricky to switch out the MBM event from the
>>> state machine driving the feedback loop. I think the problem is in the
>>> code that tries to stop the control loop from switching between two
>>> throttling levels every second:
>>>
>>> if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
>>>         new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
>>> } else if (cur_msr_val < MAX_MBA_BW &&
>>>            (user_bw > (cur_bw + delta_bw))) {
>>>         new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
>>> } else {
>>>         return;
>>> }
>>>
>>> The code drops down one percentage step if current bandwidth is above
>>> the desired target. But stepping back up checks to see if "cur_bw + delta_bw"
>>> is below the target.
>>>
>>> Where does "delta_bw" come from? Code uses the Boolean flag "pmbm_data->delta_comp"
>>> to request that the once-per-second polling compute the change in bandwidth on the
>>> next poll after adjusting throttling MSRs.
>>>
>>> All of these values are in the "struct mbm_state" which is a per-event-id structure.
>>>
>>> Picking an event at boot time works fine. Likely also fine at mount time. But
>>> switching at run-time seems to frequently end up with a very large value in
>>> "delta_bw" (as it compares current & previous for this event ... and it looks
>>> like things changed from zero). Net effect is that throttling is increased when
>>> processes go over their target for the resctrl group, but throttling is never decreased.
>>
>> This is not clear to me. Would the state not also start from zero at boot and mount
>> time? From what I understand the state is also reset to zero on monitor group creation.
>
> Yes. All of boot, mount, mkdir start a group in a well defined state
> with no throttling applied (schemata shows bandwidth limit as 2^32
> MBytes/sec). If the user sets some realistic limit, and the group
> MBM measurement exceeds that limit, then the MBA MSR for the group
> is dropped from 100% to 90% and the delta_comp flag set to record
> the delta_bw on the next 1-second poll.
>
> The value of delta_bw is only used when looking to reduce throttling.
> To be in that state this group must have been in a state where
> throttling was increased ... which would result in delta_bw being
> set up.
>
> Now look at what happens when switching from local to total for the
> first time. delta_bw is zero in the structures recording total bandwidth
> information. But more importantly so is prev_bw. If the code above
> changes throttling value and requests an updated calculation of delta_bw,
> that will be done using a value of prev_bw==0. I.e. delta_bw will be
> set to the current bandwidth. That high value will likely block attempts
> to reduce throttling.

Thank you for the detailed explanation. I think there are ways to make
this transition smoother, for example by not computing delta_bw when
there is no history (no "prev_bw_bytes"). But that would just fix
the existing algorithm without addressing the other issues you raised
with this algorithm.
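
A minimal sketch of the "no history" guard, to make the idea concrete.
It is not a patch against the resctrl code: struct mbm_state_sketch, the
have_history flag and mbm_bw_update_sketch() are hypothetical names, and
the fields only loosely mirror the per-event state discussed above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-event MBM state. */
struct mbm_state_sketch {
	uint32_t prev_bw;	/* bandwidth seen on the previous poll (MBps) */
	uint32_t delta_bw;	/* change in bandwidth after the last MSR update */
	bool delta_comp;	/* request to recompute delta_bw on this poll */
	bool have_history;	/* assumption: true once prev_bw is a real sample */
};

/* Called once per poll with the bandwidth measured for the active event. */
static void mbm_bw_update_sketch(struct mbm_state_sketch *m, uint32_t cur_bw)
{
	if (m->delta_comp) {
		/* Only trust the difference when prev_bw is a real measurement. */
		if (m->have_history)
			m->delta_bw = cur_bw > m->prev_bw ? cur_bw - m->prev_bw
							  : m->prev_bw - cur_bw;
		else
			m->delta_bw = 0;
		m->delta_comp = false;
	}
	m->prev_bw = cur_bw;
	m->have_history = true;
}
```

With this guard, the first poll after switching events leaves delta_bw
at zero instead of setting it to the full current bandwidth.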

>
> Maybe when switching MBM source events the prev_bw value should be
> copied from old source structures to new source structures as a rough
> guide to avoid crazy actions. But that could also be wrong when
> switching from total to local for a group that has poor NUMA
> localization and total bandwidth is far higher than local.
>
>>> The whole heuristic seems a bit fragile. It works well for test processes that have
>>> constant memory bandwidth. But I could see it failing in scenarios like this:
>>>
>>> 1) Process is over MB limit
>>> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
>>> 3) Process blocks on some event and uses no bandwidth in next one second
>>> 4) Next poll. Linux computes delta_bw as abs(cur_bw - m->prev_bw). cur_bw is zero,
>>> so delta_bw is set to full value of bandwidth that process used when over budget
>>> 5) Process resumes running
>>> 6) Linux sees process using less than target, but cur_bw + delta_bw is above target,
>>> so Linux doesn't adjust throttling
>>>
>>> I think the goal was to avoid relaxing throttling and letting a resctrl group go back over
>>> target bandwidth. But that doesn't work either for groups with highly variable bandwidth
>>> requirements.
>>>
>>> 1) Group is over budget
>>> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
>>> 3) Group forks additional processes. New bandwidth from those offsets the reduction due to throttling
>>> 4) Next poll. Linux sees bandwidth is unchanged. Sets delta_bw = 0.
>>> 5) Next poll. Group's aggregate bandwidth is fractionally below target. Because delta_bw=0, Linux
>>> reduces throttling.
>>> 6) Group goes over target.
>>>
>>
>> I'll defer to you for the history about this algorithm. I am not familiar with how
>> broadly this feature is used but I have not heard about issues with it. It does
>> seem as though there is some opportunity for investigation here.
>
> I'm sure I could construct an artificial test case to force this scenario.
> But maybe:
> 1) It never happens in real life
> 2) It happens, but nobody noticed
> 3) People figured out the workaround (set schemata to a really big
> MBytes/sec value for a second, and then back to desired value).
> 4) Few people use this option
>
> I dug again into the lore.kernel.org archives. Thomas complained
> that it wasn't "calibration" (as Vikas had described it in V1) but
> seems to have otherwise been OK with it as a heuristic.
>
> https://lore.kernel.org/all/alpine.DEB.2.21.1804041037090.2056@xxxxxxxxxxxxxxxxxxxxxxx/
>
>
> I coded up and tested the below patch as a possible replacement heuristic.
> But I also wonder whether just letting the feedback loop flip throttling
> up and down between throttling values above/below the target bandwidth
> would really be so bad. It's just one MSR write that can be done from
> the current CPU and would result in average bandwidth closer to the
> user requested target.

The proposed heuristic seems to assume that the bandwidth used has
a linear relationship to the throttling percentage. It seems to set
aside the reasons that motivated this "delta_bw" in the first place:

> - * This is because (1)the increase in bandwidth is not perfectly
> - * linear and only "approximately" linear even when the hardware
> - * says it is linear.(2)Also since MBA is a core specific
> - * mechanism, the delta values vary based on number of cores used
> - * by the rdtgrp.

From the above I understand that reducing throttling by 10% does not
imply that bandwidth consumed will increase by 10%. A new heuristic like
this may thus decide not to relax throttling expecting that doing so would
cause bandwidth to go over limit while the non-linear increase may result
in bandwidth consumed not going over limit when throttling is relaxed.
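
To illustrate the concern, here is a sketch of a decision that relies on
a linear prediction. It is not taken from the patch; the helper names
and the numbers in the note below are made up for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: predict the bandwidth at throttle level new_pct by scaling
 * the bandwidth measured at cur_pct linearly. cur_pct must be non-zero.
 */
static uint32_t predict_bw_linear(uint32_t cur_bw, uint32_t cur_pct,
				  uint32_t new_pct)
{
	return (uint32_t)((uint64_t)cur_bw * new_pct / cur_pct);
}

/* Relax throttling only if the linear prediction stays within the target. */
static bool should_relax_linear(uint32_t cur_bw, uint32_t cur_pct,
				uint32_t new_pct, uint32_t user_bw)
{
	return predict_bw_linear(cur_bw, cur_pct, new_pct) <= user_bw;
}
```

For example, with 900 MBps measured at 50% and a target of 1000 MBps,
the prediction for 60% is 1080 MBps, so the group stays throttled. If
the real increase is less than linear, the group could have run at 60%
without exceeding the target.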

I am also curious if only using target bandwidth would be bad.
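
As a sketch of what "only using target bandwidth" could look like: step
one granule down when over the target, one granule up when under it,
with no delta_bw term. The function name and MAX_MBA_BW_SKETCH (assumed
here to be 100 percent) are hypothetical; the parameters stand in for
the values used in the snippet quoted earlier.

```c
#include <stdint.h>

#define MAX_MBA_BW_SKETCH 100U	/* percent; assumed stand-in for MAX_MBA_BW */

/*
 * Return the next throttle value (percent) given the measured bandwidth
 * cur_bw and the user target user_bw, both in MBps.
 */
static uint32_t next_msr_val_target_only(uint32_t cur_msr_val, uint32_t cur_bw,
					 uint32_t user_bw, uint32_t min_bw,
					 uint32_t bw_gran)
{
	/* Over target: throttle harder unless already at the minimum. */
	if (cur_bw > user_bw && cur_msr_val > min_bw)
		return cur_msr_val - bw_gran;
	/* Under target: relax unless already unthrottled. */
	if (cur_bw < user_bw && cur_msr_val < MAX_MBA_BW_SKETCH)
		return cur_msr_val + bw_gran;
	return cur_msr_val;
}
```

A group hovering around its target would then flip between two adjacent
throttle values from poll to poll, which is exactly the behavior whose
cost is in question.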

I looked through the spec and was not able to find any information
on the cost of adjusting the allocation once per second
(per resource group per domain). The closest I could find was
the discussion of the need for a "fine-grained software controller", where
it is not clear whether "once per second" can be considered "fine grained".

Reinette