Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH

From: Akihiko Odaki
Date: Sat Nov 18 2023 - 05:41:50 EST


On 2023/10/18 4:19, Akihiko Odaki wrote:
On 2023/10/18 4:03, Alexei Starovoitov wrote:
On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:

On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:

On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2023/10/16 1:07, Alexei Starovoitov wrote:
On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0448700890f7..298634556fab 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -988,6 +988,7 @@ enum bpf_prog_type {
          BPF_PROG_TYPE_SK_LOOKUP,
          BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
          BPF_PROG_TYPE_NETFILTER,
+       BPF_PROG_TYPE_VNET_HASH,

Sorry, we do not add new stable program types anymore.

@@ -6111,6 +6112,10 @@ struct __sk_buff {
          __u8  tstamp_type;
          __u32 :24;              /* Padding, future use. */
          __u64 hwtstamp;
+
+       __u32 vnet_hash_value;
+       __u16 vnet_hash_report;
+       __u16 vnet_rss_queue;
   };

we also do not add anything to uapi __sk_buff.

+const struct bpf_verifier_ops vnet_hash_verifier_ops = {
+       .get_func_proto         = sk_filter_func_proto,
+       .is_valid_access        = sk_filter_is_valid_access,
+       .convert_ctx_access     = bpf_convert_ctx_access,
+       .gen_ld_abs             = bpf_gen_ld_abs,
+};

and we don't do ctx rewrites like this either.

Please see how hid-bpf and cgroup rstat are hooking up bpf
in _unstable_ way.

Can you describe what "stable" and "unstable" mean here? I'm new to BPF
and I'm worried if it may mean the interface stability.

Let me describe the context. QEMU bundles an eBPF program that is used
for the "eBPF steering program" feature of tun. Now I'm proposing to
extend the feature to allow to return some values to the userspace and
vhost_net. As such, the extension needs to be done in a way that ensures
interface stability.

bpf is not an option then.
we do not add stable bpf program types or hooks any more.

Does this mean eBPF could not be used for any new use cases other than
the existing ones?

It means that any new use of bpf has to be unstable for the time being.

Can you elaborate more about making new use unstable "for the time being?" Is it a temporary situation? What is the rationale for that? Such information will help devise a solution that is best for both of the BPF and network subsystems.

I would also appreciate if you have some documentation or link to relevant discussions on the mailing list. That will avoid having same discussion you may already have done in the past.

Hi,

The discussion has been stuck for a month, but I'd still like to keep working out the best way, for the kernel as a whole, to implement this feature. Below I summarize the current situation and the question that needs to be answered before pushing this forward:

The goal of this RFC is to allow reporting hash values calculated by the eBPF steering program. Essentially, it just reports 4 bytes from the kernel to userspace.

Unfortunately, this is not acceptable to the BPF subsystem because the "stable" BPF interface is completely frozen these days. The "unstable/kfunc" BPF is an alternative, but the eBPF program will be shipped with a portable userspace program (QEMU)[1], so the lack of interface stability is not tolerable.

Another option is to hardcode in the kernel the algorithm that has conventionally been implemented by the eBPF steering program[2]. This is possible because the algorithm strictly follows the virtio-net specification[3]. However, there are proposals to add different algorithms to the specification[4], and hardcoding the algorithm in the kernel would require adding more UAPIs and code each time such a specification change happens, which is not good for tuntap.
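
To make concrete what "the algorithm" means here: as far as I understand it, the hash the virtio-net specification[3] describes is a Toeplitz hash over selected packet fields. Below is a minimal userspace sketch of that technique, not the code from either patch series. The function name toeplitz_hash(), its signature, and the 8-byte example_key are hypothetical and exist only for illustration; a real key would be longer and configured by the guest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative key only (hypothetical); real keys are longer. */
static const uint8_t example_key[8] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2
};

/*
 * Toeplitz hash sketch: for every set bit of the input, taken MSB first,
 * XOR in the 32-bit window of the key that starts at the same bit offset.
 *
 * Returns 0 on success and stores the hash in *out.
 * Returns -1 if the key is too short for the input (key_len < data_len + 4).
 */
static int toeplitz_hash(const uint8_t *key, size_t key_len,
			 const uint8_t *data, size_t data_len,
			 uint32_t *out)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t next_key_bit = 32;	/* next key bit to shift into the window */
	size_t i;
	int bit;

	if (key_len < 4 || key_len - 4 < data_len)
		return -1;

	/* The initial window is the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (i = 0; i < data_len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;

			/* Slide the window one bit further along the key. */
			window <<= 1;
			if (key[next_key_bit / 8] & (0x80u >> (next_key_bit % 8)))
				window |= 1;
			next_key_bit++;
		}
	}

	*out = hash;
	return 0;
}
```

The result is exactly the 4 bytes that this RFC wants to report. The computation itself is small; the cost of hardcoding it lies in the surrounding UAPI for keys, hash types, and any future algorithm, which is the concern above.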

In short, the proposed feature requires making one of two compromises:

1. Compromise on the BPF side: Relax the "stable" BPF feature freeze once and allow the eBPF steering program to report 4 more bytes to the kernel.

2. Compromise on the tuntap side: Implement the algorithm in the kernel, and abandon the capability to update the algorithm without changing the kernel.

IMHO, it's better to make the compromise on the BPF side (option 1). We should minimize the total UAPI changes across the whole kernel, and option 1 is far better in that sense.

Yet I have to note that such a compromise on the BPF side risks making the "stable" BPF feature freeze fragile and inviting complaints like "you allowed a change to stable BPF for this, so why do you reject [some other request to change stable BPF]?" That is bad for the BPF maintainers. (I can imagine that introducing and maintaining widely different BPF interfaces is too much of a burden.) And, of course, this requires approval from the BPF maintainers.

So I'd like to ask which of these compromises you consider worse. Please also tell me if you have another idea.

Regards,
Akihiko Odaki

[1] https://qemu.readthedocs.io/en/v8.1.0/devel/ebpf_rss.html
[2] https://lore.kernel.org/all/20231008052101.144422-1-akihiko.odaki@xxxxxxxxxx/
[3] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2400003
[4] https://lore.kernel.org/all/CACGkMEuBbGKssxNv5AfpaPpWQfk2BHR83rM5AHXN-YVMf2NvpQ@xxxxxxxxxxxxxx/