Re: [RFC net-next 0/9] net: bridge: vlan: Multiple Spanning Trees

From: Nikolay Aleksandrov
Date: Wed Feb 16 2022 - 11:13:06 EST


On 16/02/2022 17:56, Tobias Waldekranz wrote:
> On Wed, Feb 16, 2022 at 17:28, Nikolay Aleksandrov <nikolay@xxxxxxxxxx> wrote:
>> On 16/02/2022 15:29, Tobias Waldekranz wrote:
>>> The bridge has had per-VLAN STP support for a while now, since:
>>>
>>> https://lore.kernel.org/netdev/20200124114022.10883-1-nikolay@xxxxxxxxxxxxxxxxxxx/
>>>
>>> The current implementation has some problems:
>>>
>>> - The mapping from VLAN to STP state is fixed as 1:1, i.e. each VLAN
>>> is managed independently. This is awkward from an MSTP (802.1Q-2018,
>>> Clause 13.5) point of view, where the model is that multiple VLANs
>>> are grouped into MST instances.
>>>
>>> Because of the way that the standard is written, presumably, this is
>>> also reflected in hardware implementations. It is not uncommon for a
>>> switch to support the full 4k range of VIDs, but that the pool of
>>> MST instances is much smaller. Some examples:
>>>
>>> Marvell LinkStreet (mv88e6xxx): 4k VLANs, but only 64 MSTIs
>>> Marvell Prestera: 4k VLANs, but only 128 MSTIs
>>> Microchip SparX-5i: 4k VLANs, but only 128 MSTIs
>>>
>>> - By default, the feature is enabled, and there is no way to disable
>>> it. This makes it hard to add offloading in a backwards compatible
>>> way, since any underlying switchdevs have no way to refuse the
>>> function if the hardware does not support it
>>>
>>> - The port-global STP state has precedence over per-VLAN states. In
>>> MSTP, as far as I understand it, all VLANs will use the common
>>> spanning tree (CST) by default - through traffic engineering you can
>>> then optimize your network to group subsets of VLANs to use
>>> different trees (MSTI). To my understanding, the way this is
>>> typically managed in silicon is roughly:
>>>
>>> Incoming packet:
>>> .----.----.--------------.----.-------------
>>> | DA | SA | 802.1Q VID=X | ET | Payload ...
>>> '----'----'--------------'----'-------------
>>> |
>>> '->|\ .----------------------------.
>>> | +--> | VID | Members | ... | MSTI |
>>> PVID -->|/ |-----|---------|-----|------|
>>> | 1 | 0001001 | ... | 0 |
>>> | 2 | 0001010 | ... | 10 |
>>> | 3 | 0001100 | ... | 10 |
>>> '----------------------------'
>>> |
>>> .-----------------------------'
>>> | .------------------------.
>>> '->| MSTI | Fwding | Lrning |
>>> |------|--------|--------|
>>> | 0 | 111110 | 111110 |
>>> | 10 | 110111 | 110111 |
>>> '------------------------'
>>>
>>> What this is trying to show is that the STP state (whether MSTP is
>>> used, or ye olde STP) is always accessed via the VLAN table. If STP
>>> is running, all MSTI pointers in that table will reference the same
>>> index in the STP stable - if MSTP is running, some VLANs may point
>>> to other trees (like in this example).
>>>
>>> The fact that in the Linux bridge, the global state (think: index 0
>>> in most hardware implementations) is supposed to override the
>>> per-VLAN state, is very awkward to offload. In effect, this means
>>> that when the global state changes to blocking, drivers will have to
>>> iterate over all MSTIs in use, and alter them all to match. This
>>> also means that you have to cache whether the hardware state is
>>> currently tracking the global state or the per-VLAN state. In the
>>> first case, you also have to cache the per-VLAN state so that you
>>> can restore it if the global state transitions back to forwarding.
>>>
>>> This series adds support for an arbitrary M:N mapping of VIDs to
>>> MSTIs, proposing one solution to the first issue. An example of an
>>> offload implementation for mv88e6xxx is also provided. Offloading is
>>> done on a best-effort basis, i.e. notifications of the relevant events
>>> are generated, but there is no way for the user to see whether the
>>> per-VLAN state has been offloaded or not. There is also no handling of
>>> the relationship between the port-global state the the per-VLAN ditto.
>>>
>>> If I was king of net/bridge/*, I would make the following additional
>>> changes:
>>>
>>> - By default, when a VLAN is created, assign it to MSTID 0, which
>>> would mean that no per-VLAN state is used and that packets belonging
>>> to this VLAN should be filtered according to the port-global state.
>>>
>>> This way, when a VLAN is configured to use a separate tree (setting
>>> a non-zero MSTID), an underlying switchdev could oppose it if it is
>>> not supported.
>>>
>>> Obviously, this adds an extra step for existing users of per-VLAN
>>> STP states and would thus not be backwards compatible. Maybe this
>>> means that that is impossible to do, maybe not.
>>>
>>> - Swap the precedence of the port-global and the per-VLAN state,
>>> i.e. the port-global state only applies to packets belonging to
>>> VLANs that does not make use of a per-VLAN state (MSTID != 0).
>>>
>>> This would make the offloading much more natural, as you avoid all
>>> of the caching stuff described above.
>>>
>>> Again, this changes the behavior of the kernel so it is not
>>> backwards compatible. I suspect that this is less of an issue
>>> though, since my guess is that very few people rely on the old
>>> behavior.
>>>
>>> Thoughts?
>>>
>>
>> Interesting! Would adding a new (e.g. vlan_mst_enable) option which changes the behaviour
>> as described help? It can require that there are no vlans present to change
>> similar to the per-port vlan stats option.
>
> Great idea, I did not know that that's how vlan stats worked. I will
> definitely look into it, thanks!
>
>> Also based on that option you can alter
>> how the state checks are performed. For example, you can skip the initial port state
>> check, then in br_vlan_allowed_ingress() you can use the port state if vlan filtering
>> is disabled and mst enabled and you can avoid checking it altogether if filter && mst
>> are enabled then always use the vlan mst state. Similar changes would have to happen
>> for the egress path. Since we are talking about multiple tests the new MST logic can
>> be hidden behind a static key for both br_handle_frame() and later stages.
>
> Makes sense.
>
> So should we keep the current per-VLAN state as-is then? And bolt the
> MST on to the side? I.e. should `struct net_bridge_vlan` both have `u8
> state` for the current implementation _and_ a `struct br_vlan_mst *`
> that is populated for VLANs tied to a non-zero MSTI?
>

Good question. The u8 we should keep for the quick access/cache of state, the
ptr we might escape by keeping just the mst id and fetching it, i.e. it could
be just 2 bytes instead of 8, but that's not really a problem if it will be
used just for bookkeeping in slow paths. It can always be pushed in the end of
the struct and if a ptr makes things simpler and easier it's ok.
So in short yes, I think we can keep both.

>> This set needs to read a new cache line to fetch mst ptr for all packets in the vlan fast-path,
>> that is definitely undesirable. Please either cache that state in the vlan and update it when
>> something changes, or think of some way which avoids that cache line in fast-path.
>> Alternative would be to make that cache line dependent on the new option, so it's needed
>> only when mst feature is enabled.
>
> If we go with the approach I suggested above, then the current `u8
> state` on `struct net_bridge_vlan` could be that cache, right?
>

Yep, that's the idea.

> With the current implementation, it is set directly - in the new MST
> mode all grouped VLANs would have their states updated when updating the
> MSTI's state.

Right

Cheers,
Nik