[RFC] hwloc: Add support for exporting latency, bandwidth topology through calibration

From: Chengchang Tang
Date: Wed Dec 01 2021 - 04:45:23 EST


Currently, hwloc exports hardware and network locality so that applications can query and set their affinity. However, in many scenarios the information provided by the topology is not enough; for example, it cannot reflect the actual memory latency and bandwidth between different scheduling domains. We hope to provide more detailed and precise information about HW capabilities in hwloc by adding several new calibration tools, so that applications can make finer-grained placement decisions, achieve higher performance, and fully exploit the capabilities of the HW.

We mainly focus on exposing memory/bus bandwidth, cache coherence/bus communication latency, etc. to users. This topology information has neither a standard ACPI nor a dts interface through which it can be exported, but it can be beneficial to user applications. Some examples:
1. the memory bandwidth while we spread tasks between multiple clusters vs. gather them in one cluster
2. the memory bandwidth while we spread tasks between multiple NUMA nodes vs. gather them in one NUMA node
3. the cache synchronization latency while we spread tasks between multiple clusters vs. gather them in one cluster
4. the cache synchronization latency while we spread tasks between multiple NUMA nodes vs. gather them in one NUMA node
5. bus bandwidth and congestion in a complex topology, for example, the topology below:
   node1 - node0 - node2 - node3
   The bus between node0 and node2 might become a bottleneck, as communication between node1 and node3 also depends on it.
   NUMA distance cannot describe this kind of complex bus topology at all.
6. I/O bandwidth and latency when accessing I/O devices such as accelerators, network devices, and storage from the NUMA node the devices belong to vs. from a different NUMA node.
...
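
To make example 5 concrete, here is a minimal sketch (not part of hwloc) that counts how many node pairs route their traffic over each link of the chain node1 - node0 - node2 - node3. The adjacency matrix, the helper names bfs_parents() and pairs_crossing_link(), and the assumption that traffic follows the shortest path are all illustrative assumptions, not something the proposal defines.

```c
#include <assert.h>

#define NNODES 4

/* Adjacency matrix for the chain node1 - node0 - node2 - node3
 * (illustrative only; indices are the node numbers). */
static const int adj[NNODES][NNODES] = {
    /* node0 */ {0, 1, 1, 0},
    /* node1 */ {1, 0, 0, 0},
    /* node2 */ {1, 0, 0, 1},
    /* node3 */ {0, 0, 1, 0},
};

/* Breadth-first search from src. parent[v] is the predecessor of v on a
 * shortest path from src, or -1 for src itself and unreachable nodes. */
static void bfs_parents(int src, int parent[NNODES])
{
    int queue[NNODES], head = 0, tail = 0;
    int seen[NNODES] = {0};

    for (int i = 0; i < NNODES; i++)
        parent[i] = -1;
    seen[src] = 1;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < NNODES; v++) {
            if (adj[u][v] && !seen[v]) {
                seen[v] = 1;
                parent[v] = u;
                queue[tail++] = v;
            }
        }
    }
}

/* Count the unordered node pairs whose shortest route uses link a-b. */
static int pairs_crossing_link(int a, int b)
{
    int count = 0;

    assert(adj[a][b]); /* a-b must be a direct link */
    for (int src = 0; src < NNODES; src++) {
        int parent[NNODES];

        bfs_parents(src, parent);
        for (int dst = src + 1; dst < NNODES; dst++) {
            /* Walk the route from dst back towards src. */
            for (int v = dst; parent[v] != -1; v = parent[v]) {
                int u = parent[v];
                if ((u == a && v == b) || (u == b && v == a)) {
                    count++;
                    break;
                }
            }
        }
    }
    return count;
}
```

Under these assumptions, pairs_crossing_link(0, 2) counts 4 of the 6 node pairs on the middle link, while each outer link carries 3. A per-pair distance value does not show that those routes share one bus.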

If possible, we can also export more, such as IPC bandwidth and latency (for example, for pipes), spinlock/mutex latency, etc. The calibration tools will provide this data for different entities at certain topology levels, so that applications can choose how to spread or gather threads according to it.

The design of the calibration tool will be similar to netloc. Three steps are required to use the calibration tool.

The first step is to collect data about system bandwidth, latency, etc. by running benchmark tests, since standard operating systems do not provide this information. The raw data will be saved in files. This step may need to be performed by a privileged user.
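
As a rough illustration of the kind of measurement the first step refers to, the sketch below times repeated memcpy() calls and derives a copy bandwidth. It is not the proposed benchmark: the function name measure_copy_bandwidth() is hypothetical, it uses clock() (CPU time, which only approximates elapsed time for a single-threaded loop), and it does no thread or memory binding, which a real calibration run between two topology objects would presumably need.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy `bytes` bytes `iterations` times and return
 * the approximate copy bandwidth in bytes per second, or -1.0 on
 * allocation failure or if the run was too short to be timed. */
static double measure_copy_bandwidth(size_t bytes, int iterations)
{
    volatile char sink = 0; /* keeps the copies observable */
    clock_t start, end;
    double elapsed;
    char *src, *dst;

    assert(bytes > 0 && iterations > 0);

    src = malloc(bytes);
    dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so the pages are allocated before timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    start = clock();
    for (int i = 0; i < iterations; i++) {
        src[0] = (char)i;        /* make every copy differ */
        memcpy(dst, src, bytes);
        sink = dst[bytes - 1];   /* read back part of the result */
    }
    end = clock();
    (void)sink;

    free(src);
    free(dst);

    elapsed = (double)(end - start) / CLOCKS_PER_SEC;
    if (elapsed <= 0.0)
        return -1.0;
    return (double)bytes * (double)iterations / elapsed;
}
```

For example, measure_copy_bandwidth(16u << 20, 8) copies 16 MiB eight times. The buffer should be much larger than the last-level cache if the goal is memory rather than cache bandwidth.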

In the second step, the calibration tool converts the raw files generated in the previous step into a file in a readable format. No privileges are required for this step.

In the third step, applications obtain the calibration information of the system through C APIs exposed by the calibration tool; hwloc commands can also be extended to show this new information. The source of the calibration data is the readable file generated in the second step. E.g. hwloc_get_mem_bandwidth(hwloc_topology_t topology, unsigned idx1, unsigned idx2) could be used to get the memory bandwidth between the objects idx1 and idx2 of some topology type.
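
A possible shape for such a lookup, as a self-contained sketch only: struct calib_table and calib_get_mem_bandwidth() are hypothetical stand-ins for the proposed hwloc_get_mem_bandwidth() and do not exist in hwloc, and the table values are made-up placeholders rather than measurements. The sketch assumes the readable file from the second step has already been loaded into a square matrix indexed by object index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of the calibration data for one topology
 * level (not a hwloc type). */
struct calib_table {
    unsigned nr;        /* number of objects at this level */
    const double *mbps; /* nr * nr row-major matrix, in MB/s */
};

/* Return the calibrated bandwidth between objects idx1 and idx2 in MB/s,
 * or -1.0 if the table is unusable or an index is out of range. */
static double calib_get_mem_bandwidth(const struct calib_table *table,
                                      unsigned idx1, unsigned idx2)
{
    assert(table != NULL); /* passing no table is a programming error */

    if (!table->mbps || idx1 >= table->nr || idx2 >= table->nr)
        return -1.0;
    return table->mbps[(size_t)idx1 * table->nr + idx2];
}

/* Placeholder values for two objects, invented purely for illustration. */
static const double example_mbps[4] = {
    100.0,  60.0,
     60.0, 100.0,
};
static const struct calib_table example_table = { 2, example_mbps };
```

With this placeholder table, calib_get_mem_bandwidth(&example_table, 0, 1) returns 60.0 and an out-of-range index returns -1.0. An application could compare such values across candidate placements before choosing whether to spread or gather its threads.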