Re: [RFC]Add new mdev interface for QoS

From: Gao, Ping A
Date: Tue Aug 08 2017 - 08:49:14 EST



On 2017/8/8 14:42, Kirti Wankhede wrote:
>
> On 8/7/2017 1:11 PM, Gao, Ping A wrote:
>> On 2017/8/4 5:11, Alex Williamson wrote:
>>> On Thu, 3 Aug 2017 20:26:14 +0800
>>> "Gao, Ping A" <ping.a.gao@xxxxxxxxx> wrote:
>>>
>>>> On 2017/8/3 0:58, Alex Williamson wrote:
>>>>> On Wed, 2 Aug 2017 21:16:28 +0530
>>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 8/2/2017 6:29 PM, Gao, Ping A wrote:
>>>>>>> On 2017/8/2 18:19, Kirti Wankhede wrote:
>>>>>>>> On 8/2/2017 3:56 AM, Alex Williamson wrote:
>>>>>>>>> On Tue, 1 Aug 2017 13:54:27 +0800
>>>>>>>>> "Gao, Ping A" <ping.a.gao@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> On 2017/7/28 0:00, Gao, Ping A wrote:
>>>>>>>>>>> On 2017/7/27 0:43, Alex Williamson wrote:
>>>>>>>>>>>> [cc +libvir-list]
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 26 Jul 2017 21:16:59 +0800
>>>>>>>>>>>> "Gao, Ping A" <ping.a.gao@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> vfio-mdev provides the capability to let different guests share the
>>>>>>>>>>>>> same physical device through mediated sharing. As a result, it brings a
>>>>>>>>>>>>> requirement to control how the device is shared: we need a QoS-related
>>>>>>>>>>>>> interface for mdev to manage virtual device resources.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, in practical use, vGPUs assigned to different guests almost
>>>>>>>>>>>>> always have different performance requirements: some guests may need
>>>>>>>>>>>>> higher priority for real-time usage, while others may need a larger
>>>>>>>>>>>>> portion of the GPU resource to get higher 3D performance. Correspondingly,
>>>>>>>>>>>>> we can define interfaces such as weight/cap for overall budget control
>>>>>>>>>>>>> and priority for single-submission control.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I suggest adding some common, vendor-agnostic attributes to the
>>>>>>>>>>>>> mdev core sysfs for QoS purposes.
>>>>>>>>>>>> I think what you're asking for is just some standardization of a QoS
>>>>>>>>>>>> attribute_group which a vendor can optionally include within the
>>>>>>>>>>>> existing mdev_parent_ops.mdev_attr_groups. The mdev core will
>>>>>>>>>>>> transparently enable this, but it really only provides the standard,
>>>>>>>>>>>> all of the support code is left for the vendor. I'm fine with that,
>>>>>>>>>>>> but of course the trouble with any sort of standardization is arriving
>>>>>>>>>>>> at an agreed upon standard. Are there QoS knobs that are generic
>>>>>>>>>>>> across any mdev device type? Are there others that are more specific
>>>>>>>>>>>> to vGPU? Are there existing examples of this whose specification we
>>>>>>>>>>>> can steal?
>>>>>>>>>>> Yes, you are right, standardized QoS knobs are exactly what I wanted.
>>>>>>>>>>> Only when they become part of the mdev framework and libvirt can such a
>>>>>>>>>>> critical feature as QoS be leveraged for cloud usage. HW vendors then
>>>>>>>>>>> only need to focus on implementing the corresponding QoS algorithm in
>>>>>>>>>>> their back-end drivers.
>>>>>>>>>>>
>>>>>>>>>>> The vfio-mdev framework provides the capability to share a device that
>>>>>>>>>>> lacks HW virtualization support with guests. No matter the device type,
>>>>>>>>>>> mediated sharing is actually a time-sharing multiplexing method. From
>>>>>>>>>>> this point of view, QoS can be taken as a generic way to control the
>>>>>>>>>>> time assigned to each virtual mdev device that occupies the HW. As a
>>>>>>>>>>> result, we can define QoS knobs that are generic across any device type.
>>>>>>>>>>> Even if the HW has some kind of built-in QoS support, I think it's not a
>>>>>>>>>>> problem for the back-end driver to convert the standard mdev QoS
>>>>>>>>>>> definition to its own specification to reach the same performance
>>>>>>>>>>> expectation. There seem to be no examples for us to follow, so we need
>>>>>>>>>>> to define it from scratch.
>>>>>>>>>>>
>>>>>>>>>>> I propose universal QoS control interfaces like the ones below:
>>>>>>>>>>>
>>>>>>>>>>> Cap: The cap limits the maximum percentage of time an mdev device can own
>>>>>>>>>>> the physical device. E.g. cap=60 means the mdev device cannot take more
>>>>>>>>>>> than 60% of the total physical resource.
>>>>>>>>>>>
>>>>>>>>>>> Weight: The weight defines proportional control of the mdev device
>>>>>>>>>>> resource between guests; it's orthogonal to Cap and targets load
>>>>>>>>>>> balancing. E.g. if guest 1 should get double the mdev device resource
>>>>>>>>>>> compared with guest 2, set the weight ratio to 2:1.
>>>>>>>>>>>
>>>>>>>>>>> Priority: The guest with the higher priority gets execution first; this
>>>>>>>>>>> targets real-time usage and faster interactive response.
>>>>>>>>>>>
>>>>>>>>>>> The above QoS interfaces cover both overall budget control and
>>>>>>>>>>> single-submission control. I will send out a detailed design later once
>>>>>>>>>>> we are aligned.
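
To illustrate the "higher priority gets execution first" rule for single
submissions, here is a minimal Python sketch. It is not taken from any driver:
the SubmissionQueue class, the convention that a larger number means higher
priority, and FIFO ordering among equal priorities are all assumptions made
for the example.

```python
import heapq
import itertools


class SubmissionQueue:
    """Hypothetical per-device queue: highest priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, priority, job):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), job))

    def next_job(self):
        """Return the next job to execute, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = SubmissionQueue()
queue.submit(1, "batch-render")
queue.submit(5, "interactive-frame")
queue.submit(5, "interactive-frame-2")
queue.submit(1, "batch-render-2")

order = []
while True:
    job = queue.next_job()
    if job is None:
        break
    order.append(job)
```

With these submissions, both priority-5 jobs are dispatched before either
priority-1 job, and jobs of equal priority keep their arrival order.
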
>>>>>>>>>> Hi Alex,
>>>>>>>>>> Any comments about the interface mentioned above?
>>>>>>>>> Not really.
>>>>>>>>>
>>>>>>>>> Kirti, are there any QoS knobs that would be interesting
>>>>>>>>> for NVIDIA devices?
>>>>>>>>>
>>>>>>>> We have different types of vGPU for different QoS factors.
>>>>>>>>
>>>>>>>> When an mdev device is created, its resources are allocated irrespective
>>>>>>>> of which VM/userspace app is going to use that mdev device. Any
>>>>>>>> parameter we add here should be tied to a particular mdev device and not
>>>>>>>> to the guest/app that is going to use it. 'Cap' and 'Priority' are
>>>>>>>> along that line. Not all mdev devices might need/use these parameters,
>>>>>>>> so these can be made optional interfaces.
>>>>>>> We also define some QoS parameters in the Intel vGPU types, but these
>>>>>>> only provide a simple, fixed default. We still need a flexible approach
>>>>>>> that gives users the ability to change QoS parameters freely and
>>>>>>> dynamically according to their requirements, not restricted to the
>>>>>>> current limited and static vGPU types.
>>>>>>>
>>>>>>>> In the above proposal, I'm not sure how 'Weight' would work for mdev
>>>>>>>> devices on same physical device.
>>>>>>>>
>>>>>>>> In the above example, "if guest 1 should take double mdev device
>>>>>>>> resource compare with guest 2" but what if guest 2 never booted, how
>>>>>>>> will you calculate resources?
>>>>>>> Cap limits the maximum physical GPU resource for a vGPU; it's a
>>>>>>> vertical limitation. Weight is a horizontal limitation that defines
>>>>>>> the GPU resource consumption ratio between vGPUs. Cap is easy to
>>>>>>> understand as it's just a percentage. For weight, for example, if we
>>>>>>> define the max weight as 16, vGPU_1 with weight 8 should be assigned
>>>>>>> double the GPU resources of vGPU_2, whose weight is 4. We can
>>>>>>> translate this into the formula: resource_of_vGPU_1 = 8 / (8+4) *
>>>>>>> total_physical_GPU_resource.
>>>>>>>
>>>>>> How will vendor driver provide max weight to userspace
>>>>>> application/libvirt? Max weight will be per physical device, right?
>>>>>>
>>>>>> How would such resource allocation reflect in 'available_instances'?
>>>>>> Suppose in above example, vGPU_1 is of 1G FB with weight 8, vGPU_2 with
>>>>>> 1G FB with weight 4 and vGPU_3 with 1G FB with weight 4. Now you have 1G
>>>>>> FB free but you have reached max weight, so will you make
>>>>>> available_instances = 0 for all types on that physical GPU?
>>>>> No, per the algorithm above, the available scheduling for the remaining
>>>>> mdev device is N / (8 + 4 + 4 + N), where N is 1-16 (or maybe 0-16,
>>>>> we'd need to define or make the range discoverable, 16 seems rather
>>>>> arbitrary). We can always add new scheduling participants. AIUI,
>>>>> Intel uses round-robin scheduling now, where you could consider all
>>>>> mdev devices to have the same weight. Whether we consider that to be a
>>>>> weight of 16 or zero or 8 doesn't really matter.
>>>> QoS controls the device's processing capability, such as GPU
>>>> rendering/computing, which can be time-multiplexed; it is not used to
>>>> control dedicated partitioned resources like FB, so there is no impact on
>>>> 'available_instances'.
>>>>
>>>> If vGPU_1 weight=8 and vGPU_2 weight=4, then:
>>>>   vGPU_1_res = 8 / (8 + 4) * total
>>>>   vGPU_2_res = 4 / (8 + 4) * total
>>>> If vGPU_3 is then created with weight 2:
>>>>   vGPU_1_res = 8 / (8 + 4 + 2) * total
>>>>   vGPU_2_res = 4 / (8 + 4 + 2) * total
>>>>   vGPU_3_res = 2 / (8 + 4 + 2) * total
>>>>
>>>> The resource allocations of vGPU_1 and vGPU_2 change dynamically after
>>>> vGPU_3 is created. That is what weight does, since it defines the
>>>> relationship between all the vGPUs, so the performance degradation is
>>>> expected. The end user should be aware of this behavior.
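
The weight arithmetic above can be checked with a short Python sketch. The
weighted_shares helper is a name made up for this illustration, not part of
any mdev interface; it simply computes weight / sum(weights) for each vGPU
and uses exact fractions so the results are easy to compare.

```python
from fractions import Fraction


def weighted_shares(weights):
    """Return each vGPU's share of the physical device as weight / sum(weights)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one vGPU must have a positive weight")
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# Two vGPUs with weights 8 and 4.
two = weighted_shares({"vGPU_1": 8, "vGPU_2": 4})

# Creating vGPU_3 with weight 2 shrinks the shares of the existing vGPUs.
three = weighted_shares({"vGPU_1": 8, "vGPU_2": 4, "vGPU_3": 2})

for name, share in three.items():
    print(f"{name}: {share} of total ({float(share):.1%})")
```

With two vGPUs the shares are 2/3 and 1/3; after vGPU_3 is created they
become 4/7, 2/7 and 1/7, while the 2:1 ratio between vGPU_1 and vGPU_2 is
preserved.
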
>>>>
>>>> However, the discussion about weight made me reflect: does the end user
>>>> really need weight? Is there an actual application requirement for
>>>> weight? Maybe cap and priority are enough?
>>> What sort of SLAs do you want to be able to offer? For instance if I
>>> want to be able to offer a GPU in 1/4 increments, how does that work?
>>> I might sell customers A & B 1/4 increment each and customer C a 1/2
>>> increment. If weight is removed, can we do better than capping A & B
>>> at 25% each and C at 50%? That has the downside that nobody gets to
>>> use the unused capacity of the other clients. The SLA is some sort of
>>> "up to X% (and no more)" model. With weighting it's as simple as making
>>> sure customer C's vGPU has twice the weight of that given to A or B.
>>> Then you get an "at least X%" SLA model and any customer can use up to
>>> 100% if the others are idle. Combining weight and cap, we can do "at
>>> least X%, but no more than Y%".
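
One way to read the "at least X%, but no more than Y%" model is as a
weighted allocation in which idle clients' time is redistributed and no
client exceeds its cap. The Python sketch below is only an illustration
under those assumptions: the allocate function, the (weight, cap) tuples and
the redistribution loop are hypothetical, since the thread does not define
how unused capacity is handed out.

```python
from fractions import Fraction


def allocate(devices, active):
    """Split device time among active vGPUs by weight, never exceeding a cap.

    devices: name -> (weight, cap), weight a positive int, cap a Fraction in (0, 1].
    active:  names of the vGPUs that currently have work to run.
    """
    for name, (weight, cap) in devices.items():
        if weight <= 0 or not 0 < cap <= 1:
            raise ValueError(f"invalid weight/cap for {name}")

    alloc = {name: Fraction(0) for name in devices}
    pending = set(active)
    remaining = Fraction(1)
    while pending:
        total_weight = sum(devices[n][0] for n in pending)
        # Offer each pending vGPU its weighted part of the remaining time.
        offers = {n: remaining * devices[n][0] / total_weight for n in pending}
        capped = {n for n in pending if offers[n] > devices[n][1]}
        if not capped:
            alloc.update(offers)
            break
        # Pin capped vGPUs at their cap and redistribute what is left over.
        for n in capped:
            alloc[n] = devices[n][1]
            remaining -= devices[n][1]
        pending -= capped
    return alloc


# Customers A and B buy 1/4 each, C buys 1/2: C gets twice the weight.
uncapped = {"A": (1, Fraction(1)), "B": (1, Fraction(1)), "C": (2, Fraction(1))}
all_busy = allocate(uncapped, {"A", "B", "C"})
only_a = allocate(uncapped, {"A"})

# Same weights, but A is additionally capped at 50%.
capped = {"A": (1, Fraction(1, 2)), "B": (1, Fraction(1)), "C": (2, Fraction(1))}
only_a_capped = allocate(capped, {"A"})
```

When everyone is busy, A, B and C get 1/4, 1/4 and 1/2. When only A is busy
it gets the whole device without a cap, but only 1/2 once the 50% cap is
applied.
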
>>>
>>> All of this feels really similar to how cpusets must work since we're
>>> just dealing with QoS relative to scheduling and we should not try to
>>> reinvent scheduling QoS. Thanks,
>>>
>> Yeah, that was also my original thought.
>> Since we are aligned on the basic QoS definition, I'm going to
>> prepare the code on the kernel side. How about the corresponding part in
>> libvirt? Should it be implemented separately after the kernel interface
>> is finalized?
>>
> Ok. These interfaces should be optional since not all mdev vendor
> drivers may support such QoS.
>

Sure, all of them are optional; vendors are free to implement any of them,
or none at all.

Thanks,
Ping