RFC rdma cgroup

From: Parav Pandit
Date: Wed Oct 28 2015 - 04:29:30 EST


Hi All,

Based on the review comments, feedback, and discussions with Tejun,
Haggai, Doug, Jason, Liran, Sean, and the ORNL team, I have updated
the design as below.

This is a fairly robust and simple design that addresses most of the
points raised and covers the current RDMA use cases.
Feel free to skip the design guidelines section and jump to the design
section below if you find it too verbose. I had to describe the
guidelines to set the context and address comments from our past
discussions.

Design guidelines:
-----------------------
1. There will be a new rdma cgroup for accounting rdma resources
(instead of extending the device cgroup).
Rationale: RDMA tracks different types of resources, and it functions
differently than the device cgroup. Though the device cgroup could
have been extended to be more generic, the community feels it is
better to create an RDMA cgroup, which might grow more features than
just resource limit enforcement in the future.

2. The RDMA cgroup will allow resource accounting and limit
enforcement on a per cgroup, per rdma device basis (instead of
limiting resources across all devices).
Rationale: this gives granular control when multiple devices exist in
the system.

3. Resources are not defined by the RDMA cgroup. Resources are defined
by the RDMA/IB subsystem and optionally by HCA vendor device drivers.
Rationale: this allows the rdma cgroup to remain constant while the
RDMA/IB subsystem evolves, without needing rdma cgroup updates. A new
resource can easily be added by the RDMA/IB subsystem without touching
the rdma cgroup.

4. The RDMA uverbs layer will enforce limits on well-defined RDMA verb
resources without any HCA vendor device driver involvement.
Rationale:
(a) RDMA verbs have been a well-defined set of resource abstractions
in the Linux kernel stack for many years and are used directly by many
applications that work with RDMA resources in varied ways. Instead of
replicating code in every vendor driver, the RDMA uverbs layer will
enforce such resource limits (with the help of the rdma cgroup).
(b) An IB verbs resource is also a vendor-agnostic representation of
an RDMA resource; therefore enforcement is done at the RDMA uverbs
level.

6. The RDMA uverbs layer will not do accounting of hw vendor specific
resources.
Rationale: the RDMA uverbs layer is not aware of which hw resource
maps to which verb resource, or by how much. Therefore hw resource
accounting, charging, and uncharging have to happen in the vendor
driver. This is optional and left to the HCA vendor device driver to
implement, since the HCA driver knows best how to maintain that
mapping.

7. The RDMA cgroup will provide unified APIs through which both RDMA
subsystem-defined and vendor-defined RDMA resources can be charged and
uncharged, by the verb layer and the HCA driver respectively.

8. The initial version of the RDMA cgroup will support only hard
limits, without any kind of reservation of resources or ranges. In the
future it might be extended to be more dynamic.
Rationale: RDMA resources are typically stateful, unlike cpu, and do
not follow a work-conserving model.

9. Resource limit enforcement is hierarchical.

10. Process migration from one cgroup to another with active RDMA
resources is highly discouraged.

11. When a process is migrated with active RDMA resources, the rdma
cgroup continues to charge the original cgroup.
Rationale:
Unlike other POSIX calls, RDMA resources are not defined at the POSIX
level. These resources sit behind a file descriptor.
Multiple forked processes belonging to different thread groups can be
placed in different cgroups while sharing the same rdma resources.
A resource could well be allocated by one thread group and released by
another thread group from a different cgroup.
The resource usage hierarchy can easily get complex, even though that
is not the primary use case.
Typically all processes that want to use RDMA resources will be part
of one leaf cgroup throughout their life cycle.
Therefore it is not worth complicating the design around process
migration.

Design:
---------
1. The new RDMA cgroup defines a resource pool object that connects
the cgroup subsystem to the RDMA subsystem.
2. A resource pool object is a per cgroup, per device entity that is
managed, controlled, and configured by the administrator via the
cgroup interface.
3. There can be at most 64 resources per resource pool (such as MR,
QP, AH, PD, and other hardware resources). Managing resources beyond
64 will require an RDMA cgroup subsystem update; this will be done in
the future if it is needed at all.

4. The RDMA cgroup defines two classes of resources.
(a) verb resources - track RDMA verb layer resources
(b) hw resources - track HCA HW specific resources
5. The verb resource template is defined by the RDMA uverbs layer.
6. The hw resource template is defined by the HCA vendor driver. This
is optional and should be done by those drivers that don't have a
one-to-one mapping between verb resources and hw resources.

7. Processes in a cgroup without any configured limit (in other words,
without resource pools) have the maximum limit for every resource. If
a limit is configured for one particular resource, only that resource
is enforced; the rest can still be used up to their maximum limit.

8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
each cgroup will have 0 to 4 verb resource pools and optionally 0 to 4
hw resource pools, one per such device.
(Nothing prevents having more devices and pools, but the design is
built around this use case.)

9. A resource pool object is created in the following situations.
(a) An administrative operation sets a limit and no resource pool
previously exists for the device of interest in that cgroup.
(b) No resource limits were configured, but the IB/RDMA subsystem
tries to charge a resource. This way, when applications run without
limits and limits are enforced later on, uncharging still works
correctly; otherwise the usage count would drop below zero.
This is done using a default resource pool.
The default pool keeps the design simple, instead of implementing any
sort of time markers.
(c) When a process migrates from one cgroup to another, resources
continue to be owned by the creator cgroup (rather, its css).
After process migration, whenever a new resource is created in the new
cgroup, it is owned by the new cgroup.

10. A resource pool is destroyed if it is of the default type (not
created by an administrative operation) and its last resource is being
deallocated. A resource pool created by an administrative operation is
not deleted, as it is expected to be used in the near future.

11. If an administrative command tries to delete all the resource
limits for a device that still has active resources, the RDMA cgroup
just marks the pool as a default pool with maximum limits.
----------------------------------------------------------------

Examples:
#configure resource limit:
echo mlx4_0 mr=100 qp=10 ah=2 cq=10 >
/sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
echo ocrdma1 mr=120 qp=20 ah=2 cq=10 >
/sys/fs/cgroup/rdma/2/rdma.resource.verb.limit

#query resource limit:
cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
#output:
ocrdma1 mr=120 qp=20 ah=2 cq=10

#delete resource limit:
echo mlx4_0 del > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit

#query resource list:
cat /sys/fs/cgroup/rdma/1/rdma.resource.verb.list
mlx4_0 mr qp ah pd cq

cat /sys/fs/cgroup/rdma/1/rdma.resource.hw.list
vendor1 hw_qp hw_cq hw_timer

#configure hw specific resource limit
echo vendor1 hw_qp=56 > /sys/fs/cgroup/rdma/2/rdma.resource.hw.limit

-------------------------------------------------------------------------

I have completed initial development of the above design and am
currently testing it.
I will post the patch soon, once I am done validating it.

Let me know if there are any design comments.

Regards,
Parav Pandit
--