Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

From: Parav Pandit
Date: Thu Sep 10 2015 - 13:46:57 EST


On Thu, Sep 10, 2015 at 10:19 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Parav.
>
> On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
>> This is one old white paper, but most of the reasoning still holds true on RDMA.
>> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf
>
> Just read it. Much appreciated.
>
> ...
>> These resources include QP (queue pair) to transfer data, CQ
>> (completion queue) to indicate completion of data transfer operation,
>> MR (memory region) to represent user application memory as source or
>> destination for data transfer.
>> Common resources are QP, SRQ (shared receive queue), CQ, MR, AH
>> (address handle), FLOW, PD (protection domain), user context etc.
>
> It's kinda bothering that all these are disparate resources.

Actually, they are not. They are linked resources. Every QP needs one
or two associated CQs and one PD.
Every QP will use a few MRs for data transfer.
Here is a good programming guide to the RDMA APIs exposed to user
space applications.

http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
So the first version of the cgroup patch will address the control
operations described in section 3.4.
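
To make the linkage between QP, CQ, PD and MR a bit more concrete, here
is a rough sketch in plain C. It is only a model: the struct and
function names (pd, cq, mr, qp, qp_init, mr_usable_with_qp) are made up
for illustration and are not the libibverbs types or calls.

```c
#include <stddef.h>

/* Hypothetical stand-ins for verbs objects; NOT the libibverbs types. */
struct pd { int id; };
struct cq { int depth; };
struct mr { struct pd *pd; void *addr; size_t len; };
struct qp {
	struct pd *pd;		/* exactly one protection domain */
	struct cq *send_cq;	/* completion queue for sends */
	struct cq *recv_cq;	/* may be the same CQ as send_cq */
};

/*
 * A QP cannot exist without a PD and at least one CQ.
 * Returns 0 on success, -1 if a required dependency is missing.
 */
int qp_init(struct qp *qp, struct pd *pd,
	    struct cq *send_cq, struct cq *recv_cq)
{
	if (!qp || !pd || !send_cq)
		return -1;
	qp->pd = pd;
	qp->send_cq = send_cq;
	/* One CQ can serve both directions, hence "one or two CQs". */
	qp->recv_cq = recv_cq ? recv_cq : send_cq;
	return 0;
}

/* An MR can only be used by a QP that belongs to the same PD. */
int mr_usable_with_qp(const struct mr *mr, const struct qp *qp)
{
	return mr && qp && mr->pd == qp->pd;
}
```

So limiting one of these objects indirectly constrains the others,
which is why I see them as one set of resources.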


> I suppose that each restriction comes from the underlying hardware and
> there's no accepted higher level abstraction for these things?
>
There is a higher level abstraction, currently the verbs layer, which
does expose the hardware resources, but in a vendor agnostic way.
Many vendors support the verbs layer. The ones I know of are Mellanox,
Intel, Chelsio and Avago/Emulex, whose drivers supporting these verbs
are under drivers/infiniband/hw/ in the kernel tree.

There are higher level APIs above the verbs layer, such as MPI,
libfabric, rsocket, rds, pgas and dapl, which use the underlying verbs
layer. They all rely on the hardware resources. All of these higher
level abstractions are accepted and well used by certain classes of
applications. It would be a long discussion to go over them here.


>> >> This patch-set allows limiting rdma resources to set of processes.
>> >> It extend device cgroup controller for limiting rdma device limits.
>> >
>> > I don't think this belongs to devcg. If these make sense as a set of
>> > resources to be controlled via cgroup, the right way prolly would be a
>> > separate controller.
>> >
>>
>> In past there has been similar comment to have dedicated cgroup
>> controller for RDMA instead of merging with device cgroup.
>> I am ok with both the approach, however I prefer to utilize device
>> controller instead of spinning of new controller for new devices
>> category.
>> I anticipate more such need would arise and for new device category,
>> it might not be worth to have new cgroup controller.
>> RapidIO though very less popular and upcoming PCIe are on horizon to
>> offer similar benefits as that of RDMA and in future having one
>> controller for each of them again would not be right approach.
>>
>> I certainly seek your and others inputs in this email thread here whether
>> (a) to continue to extend device cgroup (which support character,
>> block devices white list) and now RDMA devices
>> or
>> (b) to spin of new controller, if so what are the compelling reasons
>> that it can provide compare to extension.
>
> I'm doubtful that these things are gonna be mainstream w/o building up
> higher level abstractions on top and if we ever get there we won't be
> talking about MR or CQ or whatever.

Some of the higher level examples I gave above will adapt to resource
allocation failures. Some already adapt to a few kinds of resource
allocation failure; they do query resources. But it's not completely
there yet. Once this notion of limited resources is in place, the
abstraction layers would adapt to relatively smaller values of such
resources.
These higher level abstractions are mainstream. They are shipped at
least in Red Hat Enterprise Linux.
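
As a rough illustration of what I mean by a layer adapting to a smaller
resource limit, here is a minimal sketch in plain C. The pool and the
function names (res_pool, pool_alloc, alloc_adaptive) are hypothetical;
they only stand in for whatever limit a controller would enforce and
are not taken from any of the libraries above.

```c
/* Hypothetical per-cgroup pool of one resource type (e.g. QPs). */
struct res_pool {
	int limit;	/* maximum objects allowed */
	int used;	/* objects currently allocated */
};

/* Allocate one object; fails with -1 once the limit is reached. */
int pool_alloc(struct res_pool *p)
{
	if (p->used >= p->limit)
		return -1;
	p->used++;
	return 0;
}

/*
 * Try to obtain `want` objects but settle for fewer, as long as at
 * least `min` are available. Returns the number obtained, or -1 after
 * releasing everything if even `min` could not be satisfied.
 */
int alloc_adaptive(struct res_pool *p, int want, int min)
{
	int got = 0;

	while (got < want && pool_alloc(p) == 0)
		got++;
	if (got < min) {
		p->used -= got;	/* give back the partial allocation */
		return -1;
	}
	return got;
}
```

With a limit of 4, a caller asking for 8 but willing to run with 2
would end up with 4 instead of failing outright.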

> Also, whatever next-gen is
> unlikely to have enough commonalities when the proposed resource knobs
> are this low level,

I agree that the resources won't be common across other next-gen
transports whenever they arrive.
But from my background working on some of those transports, they
appear similar in nature and might need similar knobs.

> so let's please keep it separate, so that if/when
> this goes out of fashion for one reason or another, the controller can
> silently wither away too.
>
>> Current scope of the patch is limited to RDMA resources as first
>> patch, but for fact I am sure that there are more functionality in
>> pipe to support via this cgroup by me and others.
>> So keeping atleast these two aspects in mind, I need input on
>> direction of dedicated controller or new one.
>>
>> In future, I anticipate that we might have sub directory to device
>> cgroup for individual device class to control.
>> such as,
>> <sys/fs/cgroup/devices/
>> /char
>> /block
>> /rdma
>> /pcie
>> /child_cgroup..1..N
>> Each controllers cgroup access files would remain within their own
>> scope. We are not there yet from base infrastructure but something to
>> be done as it matures and users start using it.
>
> I don't think that jives with the rest of cgroup and what generic
> block or pcie attributes are directly exposed to applications and need
> to be hierarchically controlled via cgroup?
>
I do agree that cgroup currently has no notion of sub cgroups or of
the hierarchy shown above.
So until then I was considering implementing it under the devices
cgroup, as a generic place, without the hierarchy shown above.
Therefore the current interface is at the device cgroup level.

If you are suggesting having the rdma cgroup as a separate entity for
the near future, it's fine with me.
Later on, when next-gen arrives, we might have scope to make the rdma
cgroup a more generic one. But then it might look like what I
described above.

In the past I have had discussions with Liran Liss from Mellanox on
this topic as well, and we also agreed on having such a cgroup
controller.
He gave a recent presentation at a Linux Foundation event proposing a
cgroup for RDMA.
Below is the link to it.
http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf
Slides 1 to 7 and slide 13 will give you more insight into it.
Liran and I gave a similar presentation, with fewer slides, to an RDMA
audience at the OpenFabrics summit in March 2015.

I am ok with creating a separate cgroup for rdma, if the community
thinks that way.
My preference would still be to use the device cgroup for the above
extensions, unless there are fundamental issues that I am missing.
I will let you make the call.
RDMA and the others are just another type of device with different
characteristics than character or block, so one device cgroup with sub
functionalities can allow setting knobs.
Every device category will have its own set of knobs for resources,
ACL, limits, policy.
And I think cgroup is certainly a better control point than sysfs or
spinning off a new control infrastructure for this.
That said, I would like to hear your and the community's view on how
they would like to see this shape up.
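
For the per-category resource knobs and the hierarchical control
question, the kind of accounting I have in mind could look roughly like
the sketch below. This is not code from the patch set; rdma_cg, the
resource enum and both functions are hypothetical names used only to
show a charge that must fit within the limit of every ancestor.

```c
#include <stddef.h>

/* Hypothetical resource types; a real controller would have more. */
enum rdma_res { RES_QP, RES_CQ, RES_MR, RES_MAX };

/* Hypothetical per-cgroup state: one limit and one counter per type. */
struct rdma_cg {
	struct rdma_cg *parent;	/* NULL for the root cgroup */
	int limit[RES_MAX];
	int usage[RES_MAX];
};

/*
 * Charge one object of type r to cg and all of its ancestors.
 * Returns 0 on success, or -1 with all counters unchanged if any
 * level in the hierarchy is already at its limit.
 */
int rdma_cg_try_charge(struct rdma_cg *cg, enum rdma_res r)
{
	struct rdma_cg *c, *fail = NULL;

	for (c = cg; c; c = c->parent) {
		if (c->usage[r] >= c->limit[r]) {
			fail = c;
			break;
		}
		c->usage[r]++;
	}
	if (!fail)
		return 0;

	/* Roll back the levels charged before hitting the limit. */
	for (c = cg; c != fail; c = c->parent)
		c->usage[r]--;
	return -1;
}

/* Release one object of type r from cg and all of its ancestors. */
void rdma_cg_uncharge(struct rdma_cg *cg, enum rdma_res r)
{
	for (; cg; cg = cg->parent)
		cg->usage[r]--;
}
```

The same walk works whether this lives under the device cgroup or in a
separate controller; only the place where the knobs are exposed
differs.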

> Thanks.
>
> --
> tejun