Re: [RFC PATCH 00/13] x86 User Interrupts support

From: Sohil Mehta
Date: Thu Jan 06 2022 - 21:09:13 EST


Hi Chrisma,

On 12/22/2021 8:17 AM, Chrisma Pakha wrote:

The following is our understanding of the proposed User Interrupt.

Thank you for giving this some thought.

We have been exploring how user-level interrupts (UIs) can be used to
improve performance and programmability in several different areas:
e.g., parallel programming, memory management, I/O, and floating-point
libraries.

Can you please share more details on this? It would really help improve the API design.

# Current Use Cases

The Current RFC is focused on sending an interrupt from one user-space
thread (UST) to another user-space thread (UST2UST). These threads
could be in different processes, as long as the sender has access to
the receiver's User Interrupt File Descriptor (uifd). Based on our
understanding, UIs are currently targeted as a low overhead
alternative for the current IPC mechanisms.

That's correct.

# Preparing for future use cases
> If someone could point out an example for Kernel to
user-space thread (K2UST) UI, we would appreciate it.

The idea here is to improve the kernel-to-user event notification latency. Theoretically, this can be useful when the kernel sees an event completion on one CPU but wants to signal (notify) a thread actively running on some other CPU. The receiver thread can save some cycles by avoiding ring transitions to receive the event.

IO_URING is one example of kernel-to-user event notification. We are evaluating whether a UINTR-based completion mechanism offers a benefit over eventfd-based completions. The benefits in practice are yet to be measured and proven.

In our work, we have also been exploring precise UIs from the
currently running thread. We call these CPU to UST (CPU2UST) UIs.
For example, a SIGSEGV generated by writing to a read-only page, a
SIGFPE generated by dividing a number by zero.

It is definitely possible in the future to deliver CPU events as User Interrupts. The hardware architecture for this is still being worked on internally.

However, our focus isn't on exceptions being delivered as User Interrupts. Do you have details on what type of benefit is expected?

- QUESTION: Is there a rough draft/plan that we can refer to that describes the
current thinking on these three cases?

- QUESTION: Are there use cases for K2UST, or is K2UST the same as CPU2UST?

No, K2UST isn't the same as CPU2UST. We would expect limited benefits from K2UST, whereas CPU2UST can provide a significant speedup since it avoids the kernel completely.

Unfortunately, due to the large scope of the feature, the hardware architecture development is happening in stages. I don't have detailed plans for each of the sources of User Interrupts.

Here is our rough plan:

1. Provide a common infrastructure to receive User Interrupts. This is independent of the source of the interrupt. The intention here is to keep the software APIs generic and extendable so that future sources can be added without causing much disturbance to the older APIs.

2. Introduce various sources of User Interrupts in stages:

UST2UST - This RFC. Available in the upcoming Sapphire Rapids processor.

K2UST - Also available in upcoming Sapphire Rapids. Working towards proving the value before sending something out.

D2UST - Future processor. Hardware architecture being worked on internally. Not much to share right now.

CPU2UST - Future processor. Hardware architecture being worked on internally. Not much to share right now.

# Basic Understanding

The overall description you have mentioned below looks good to me. I have added some minor comments for clarification.

Also, the abbreviations that you have used are somewhat different from the ones I have used in the patches.

First, we would like to make sure that our understanding of the terminology and the data structures is correct.

- User Interrupt Vector (UIV): The identity of the user interrupt.
- User Interrupt Target Table (UITT):
This allows the sender to locate the "address" of the receiver through the uifd.

The UITT refers to the 'UPID' address which is different from the uifd that you mention below.

Below outlines our understanding of the current API for UIs.

All of the statements below seem accurate.

However, some of the restrictions below are due to hardware design and some are mainly due to the software implementation. The software design and APIs might change significantly as this patch series evolves.

Please feel free to provide input wherever you think the APIs can be improved.

- Each thread that can receive UIs has exactly one handler
registered with `uintr_register_handler` (a syscall).
- Each thread that registers a handler calls `uintr_create_fd` for
every user-level interrupt vector (UIV) that they expect to receive.
- The only information delivered to the handler is the UIV.
- There are 64 UIVs that can be used per thread.

Though only one generic handler is registered with the hardware, an application can choose to implement 64 unique sub-handlers in user space based on each unique UIV.
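As a rough illustration of the sub-handler idea, here is a minimal sketch of a dispatch table keyed by UIV. The names `set_subhandler`, `dispatch_uintr` and `record_vector` are made up for this example and are not part of the patches. In a real program the dispatch would be called from the single handler registered with `uintr_register_handler`; it is a plain function here so that the sketch builds without UINTR support.

```c
#include <stddef.h>

#define UINTR_MAX_VECTORS 64   /* 64 UIVs per thread, as discussed above */

typedef void (*uintr_subhandler_t)(unsigned long long vector);

/* Hypothetical table of sub-handlers; NULL means none installed.
 * A real implementation would likely keep one table per thread. */
static uintr_subhandler_t subhandlers[UINTR_MAX_VECTORS];
static unsigned long long unhandled_count;

/* Example sub-handler for illustration: records the vector it saw. */
static long long last_vector = -1;
static void record_vector(unsigned long long vector)
{
    last_vector = (long long)vector;
}

/* Install a sub-handler for one vector. Returns 0, or -1 if out of range. */
static int set_subhandler(unsigned long long vector, uintr_subhandler_t fn)
{
    if (vector >= UINTR_MAX_VECTORS)
        return -1;
    subhandlers[vector] = fn;
    return 0;
}

/* Body of the single generic handler: route on the UIV, which is the
 * only information delivered with the interrupt. */
static void dispatch_uintr(unsigned long long vector)
{
    if (vector < UINTR_MAX_VECTORS && subhandlers[vector] != NULL)
        subhandlers[vector](vector);
    else
        unhandled_count++;   /* no sub-handler: just count it */
}
```

For instance, after `set_subhandler(3, record_vector)`, a call to `dispatch_uintr(3)` runs `record_vector`, while `dispatch_uintr(5)` only bumps `unhandled_count`.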

- A thread that wants to send a UI must register the receiver's uifd with `uintr_register_sender` (a syscall).
This returns an index the sender uses to locate the receiver.
- `_senduipi(index)` sends a user interrupt to a particular destination.
The sender's UITT and index determine the destination.
- A thread uses `_stui` (and `_clui`) to enable (and disable) the reception of UIs.
- As of now, there is no mechanism to mask a particular UIV.
- A UI is delivered to the receiver immediately only if it is currently running.
- If a thread executes `uintr_wait()`, it will be scheduled only after receiving a UI.
There is no guarantee on the delay between the processor receiving the UI and when the thread is scheduled.
- If a thread is the target of a UI and another thread is running, or the target thread is blocked in the kernel,
then the target thread will handle the UI when it is next scheduled.
- Ordinary interrupts (interrupts delivered with CPL=0) have a higher priority than user interrupts.
- The UI handler saves only general-purpose registers (e.g., it does not save floating-point registers).

The saving and restoring of the registers is done by gcc when the -muintr flag is used along with the 'interrupt' attribute. Applications can choose to save floating-point registers as part of the interrupt handler as well.

To make it easier for applications, we are working on a thin library that can help with common functionality such as saving floating-point registers or redirecting to 64 sub-handlers.

- User Interrupts with higher UIV are given a higher priority than those with smaller UIV.
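The priority rule can be pictured with a small model, assuming pending vectors are tracked as a 64-bit bitmap where bit N stands for UIV N. The function name `take_highest_pending` is invented for this sketch; it only illustrates the ordering and is not how software would interact with the hardware.

```c
#include <stdint.h>

/* Model only: bit N set in *pending means UIV N is waiting for delivery.
 * Returns the highest pending vector and clears it, or -1 if none. */
static int take_highest_pending(uint64_t *pending)
{
    for (int v = 63; v >= 0; v--) {
        uint64_t bit = UINT64_C(1) << v;
        if (*pending & bit) {
            *pending &= ~bit;   /* this vector is now being serviced */
            return v;
        }
    }
    return -1;
}
```

With vectors 2, 7 and 40 pending, repeated calls return 40, then 7, then 2, so an application that cares about ordering would assign its most urgent events the highest vectors.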

## Private UITT

The Current RFC focuses on a private UITT where each thread has its own
UITT. Thus, different threads executing `_senduipi(index1)` with the
same `index1` may cause different receiver threads to be interrupted.

That's right.

In many cases, the receiver of an interrupt needs to know which thread
sent the interrupt. If we understand the proposal correctly, there are
only 64 user-level interrupt vectors (UIVs), and the UIV is the only
information transmitted to the receiver. The UIV itself allows the
receiver to distinguish different senders through careful management
of the receiver's UIV.

That's correct. User Interrupts mainly provide a doorbell mechanism, with the actual data expected to be shared through some existing mechanism.

If multiple senders want to share the same interrupt vector then they would have to rely on some sort of shared memory (or similar) mechanism to relay the relevant information to the receiver. This would likely come with some latency cost.
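One possible shape for such a side channel, purely as a sketch: a bitmask in shared memory where each sender sets its own bit before ringing the doorbell, and the receiver drains the mask from its handler. The `uintr_mailbox` type, the `mailbox_*` helpers and the sender numbering are assumptions made for this example; nothing like this is defined by the patches.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mailbox placed in memory shared by the receiver and all
 * senders. Bit N set means "sender N has posted something". */
struct uintr_mailbox {
    _Atomic uint64_t posted;
};

/* Sender side: publish our identity, then ring the doorbell.
 * Returns 0 on success, -1 if sender_id does not fit in the mask. */
static int mailbox_post(struct uintr_mailbox *mb, unsigned sender_id)
{
    if (sender_id >= 64)
        return -1;
    atomic_fetch_or_explicit(&mb->posted, UINT64_C(1) << sender_id,
                             memory_order_release);
    /* A real sender would now send the user interrupt,
     * e.g. _senduipi(index), with the shared vector. */
    return 0;
}

/* Receiver side (called from the handler): fetch and clear the set of
 * senders that posted since the last drain. */
static uint64_t mailbox_drain(struct uintr_mailbox *mb)
{
    return atomic_exchange_explicit(&mb->posted, 0, memory_order_acquire);
}
```

If senders 1 and 5 both post before the handler runs, a single `mailbox_drain()` returns a mask with bits 1 and 5 set, which also covers the case where several interrupts on the same vector are coalesced. The extra atomic operations are part of the latency cost mentioned above.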

- QUESTION: Given the code below where the same UIV is registered twice:
```c
uintr_fd1 = uintr_create_fd(vector1, flags);
uintr_fd2 = uintr_create_fd(vector1, flags);
```
Would `uintr_fd1` be the same as `uintr_fd2`, or would it be registered with a different index in the UITT table?

In the current design, if the same thread tries to register the same vector again, the second uintr_create_fd() would fail with an EBUSY error code.
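The behaviour can be modelled in a few lines, assuming the receiver tracks its vectors in a 64-bit mask. `receiver_model` and `model_create_fd` are stand-ins invented for this sketch, not the kernel code. The model returns a negative errno directly, and the -EINVAL for an out-of-range vector is an assumption of the model.

```c
#include <errno.h>
#include <stdint.h>

/* Model of one receiver thread's vector bookkeeping. */
struct receiver_model {
    uint64_t vectors_in_use;   /* bit N set: UIV N already has an fd */
    int next_fd;               /* next fake descriptor number to hand out */
};

/* Model of uintr_create_fd(): returns a fake fd, or a negative errno. */
static int model_create_fd(struct receiver_model *r, unsigned vector)
{
    if (vector >= 64)
        return -EINVAL;        /* assumption: out-of-range vector rejected */

    uint64_t bit = UINT64_C(1) << vector;
    if (r->vectors_in_use & bit)
        return -EBUSY;         /* same thread, same vector, second attempt */

    r->vectors_in_use |= bit;
    return r->next_fd++;
}
```

So in the quoted snippet the first call would yield a valid descriptor and the second would fail, rather than returning the same fd or a second UITT entry.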

- QUESTION: If it is registered in a different index, would the
receiver be able to distinguish the sender if `uintr_fd1` and
`uintr_fd2` are used from two different threads?

- QUESTION: What is the intended future use of the `flags` argument?

In the uintr_create_fd() call, flags would be used to provide options such as O_CLOEXEC. In general, I added a flags argument to all the system calls to keep them extendable when new boolean options need to be added.

## Shared UITT

In the case of the shared UITT model, all the threads share the same
UITT and thus, if two different threads execute `_senduipi(index)`
with the same index, they would both cause an interrupt in the
same destination/receiver.

- QUESTION: Since both threads use the same entry (same
destination/receiver), does this mean that the receiver will not be
able to distinguish the sender of the interrupt?

Yes. However, this is true even in the case of a private UITT. It isn't because the senders used the same UITT index; rather, it is the result of the senders generating the same UIV.

For example, suppose a receiver created 2 FDs with 2 unique vectors:

uintr_fd1 = uintr_create_fd(vector1, flags)
uintr_fd2 = uintr_create_fd(vector2, flags)

In the case of a private UITT, both sender threads can register themselves with uintr_fd1. They might get different UITT indexes returned to them, but when they generate a User Interrupt using their respective index, the end result would be the same: the receiver will see the same vector1 being generated. There is no way for the receiver to distinguish the sender without some additional information being shared somewhere.
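A toy model of the private-UITT case may make this concrete. Everything here (`uifd_model`, `model_register_sender`, `model_senduipi`, the table size of 8) is invented for illustration; in the real design a UITT entry refers to the receiver's UPID and a vector, which the model represents with a plain pointer and an int.

```c
#include <stddef.h>

#define UITT_SIZE 8   /* arbitrary table size for this sketch */

struct receiver { int last_vector; };               /* what the handler saw */
struct uifd_model { struct receiver *target; int vector; };  /* stand-in for a uifd */
struct uitt_entry { struct receiver *target; int vector; int valid; };
struct sender { struct uitt_entry uitt[UITT_SIZE]; };        /* private UITT */

/* Model of uintr_register_sender(): copy the fd's (target, vector) into
 * the first free slot of this sender's own table and return its index. */
static int model_register_sender(struct sender *s, const struct uifd_model *fd)
{
    for (int i = 0; i < UITT_SIZE; i++) {
        if (!s->uitt[i].valid) {
            s->uitt[i] = (struct uitt_entry){ fd->target, fd->vector, 1 };
            return i;
        }
    }
    return -1;   /* table full */
}

/* Model of _senduipi(index): the receiver only ever learns the vector. */
static int model_senduipi(struct sender *s, int index)
{
    if (index < 0 || index >= UITT_SIZE || !s->uitt[index].valid)
        return -1;
    s->uitt[index].target->last_vector = s->uitt[index].vector;
    return 0;
}
```

If sender A registers uintr_fd1 and gets index 0, while sender B registers uintr_fd2 and then uintr_fd1 and gets indexes 0 and 1, then A sending on index 0 and B sending on index 1 both leave the receiver with vector1 and nothing else. B sending on its own index 0 delivers vector2, which shows that an index only has meaning within one sender's table.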

# Multi-threaded parallel programming example

One of the uses for UIs that we have been exploring is combining the
message-passing and shared memory models for parallel programming. In
our approach, message-passing is used for synchronization and shared
memory for data sharing. The message passing part of the programming
pattern is based loosely on Active Messages (See ISCA92), where a
particular thread can turn off/on interrupts to ignore incoming
messages so they can execute critical sections without having to
notify any other threads in the system.

This looks like a good fit for the User IPI (UST2UST) implementation in this RFC. Have you had a chance to evaluate the current API design for this usage?

Also, is any of the above work publicly available?

- QUESTION: Is there any data on the performance impact of `_stui` and `_clui`?

_stui and _clui are expected to have very minimal overhead since they only modify a local flag. I'll try to measure this the next time I do performance measurements.
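To show what "a local flag" means for a critical section, here is a simplified software model. The `thread_model` struct and the `model_*` functions are made up for this sketch and do not execute the real instructions. The model also ignores details such as the flag being cleared while a handler runs.

```c
#include <stdint.h>

/* Model of one receiver thread's user-interrupt state. */
struct thread_model {
    int uif;              /* local enable flag: 1 = delivery allowed */
    uint64_t pending;     /* vectors posted but not yet delivered */
    uint64_t delivered;   /* vectors whose handler has run (for inspection) */
};

/* Deliver everything pending, but only while the flag is set. */
static void deliver_pending(struct thread_model *t)
{
    if (t->uif) {
        t->delivered |= t->pending;
        t->pending = 0;
    }
}

/* Model of _clui(): block delivery by clearing the local flag. */
static void model_clui(struct thread_model *t) { t->uif = 0; }

/* Model of _stui(): re-enable delivery; anything held back goes out now. */
static void model_stui(struct thread_model *t)
{
    t->uif = 1;
    deliver_pending(t);
}

/* A sender posts a vector; it is delivered at once only if enabled. */
static void model_post(struct thread_model *t, unsigned vector)
{
    t->pending |= UINT64_C(1) << (vector & 63);
    deliver_pending(t);
}
```

A critical section then looks like `model_clui(&t); /* ... */ model_stui(&t);`. A vector posted in between stays in `pending` and shows up in `delivered` only after `model_stui()`, and the thread never has to notify any sender about entering or leaving the section.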

Thanks,
Sohil
