Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children

From: Jamal Hadi Salim
Date: Tue Jun 06 2023 - 11:39:59 EST


Hi Vladimir,

On Mon, Jun 5, 2023 at 12:53 PM Vladimir Oltean <vladimir.oltean@xxxxxxx> wrote:
>
> Hi Jamal,
>
> On Mon, Jun 05, 2023 at 11:44:17AM -0400, Jamal Hadi Salim wrote:
> > I havent been following - but if you show me sample intended tc
> > configs for both s/w and hardware offloads i can comment.
>
> There is not much difference in usage between the 2 modes. IMO the software
> data path logic is only a simulation for demonstrative purposes of what the
> shaper is intended to do. If hardware offload is available, it is always
> preferable. Otherwise, I'm not sure if anyone uses the pure software
> scheduling mode (also without txtime assist) for a real life use case.
>

Thanks for the sample. Understood on the software twin being useful
for simulation (our standard rule is you need to have a software twin)
- it is great you conform to that.

I took a cursory glance at the existing kernel code in order to get
better context (I will make more time later). Essentially this qdisc
is classful and is capable of hardware offload. Initial comments are
unrelated to the patchset (all this Klingon DCB thingy configuration
like a flag 0x2 is still magic to me).

1)Just some details become confusing in regards to offload vs not; F.e
class grafting (taprio_graft()) is de/activating the device but that
seems only needed for offload. Would it not be better to have those
separate and call graft_offload vs graft_software, etc? We really need
to create a generic document on how someone would write code for
qdiscs for consistency (I started working on one but never completed
it - if there is a volunteer i would be happy to work with one to
complete it).
2) It seems like in mqprio this qdisc can only be root qdisc (like
mqprio) and you dont want to replace the children with other types of
qdiscs i.e the children are always pfifo? i.e is it possible or
intended for example to replace 8001:x with bfifo etc? or even change
the pfifo queue size, etc?
3) Offload intention seems really to be bypassing enqueue() and going
straigth to the driver xmit() for a specific DMA ring that the skb is
mapped to. Except for the case where the driver says it's busy and
refuses to stash the skb in ring in which case you have to requeue to
the appropriate child qdisc/class. I am not sure how that would work
here - mqprio gets away with it by not defining any of the
en/de/requeue() callbacks - but likely it will be the lack of requeue
that makes it work.

I will read the other thread you pointed to when i get a moment.

cheers,
jamal

> I was working with something like this for testing the code paths affected
> by these changes:
>
> #!/bin/bash
>
> add_taprio()
> {
> local offload=$1
> local extra_flags
>
> case $offload in
> true)
> extra_flags="flags 0x2"
> ;;
> false)
> extra_flags="clockid CLOCK_TAI"
> ;;
> esac
>
> tc qdisc replace dev eno0 handle 8001: parent root stab overhead 24 taprio \
> num_tc 8 \
> map 0 1 2 3 4 5 6 7 \
> queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> max-sdu 0 0 0 0 0 200 0 0 \
> base-time 200 \
> sched-entry S 80 20000 \
> sched-entry S a0 20000 \
> sched-entry S 5f 60000 \
> $extra_flags
> }
>
> add_cbs()
> {
> local offload=$1
> local extra_flags
>
> case $offload in
> true)
> extra_flags="offload 1"
> ;;
> false)
> extra_flags=""
> ;;
> esac
>
> max_frame_size=1500
> data_rate_kbps=20000
> port_transmit_rate_kbps=1000000
> idleslope=$data_rate_kbps
> sendslope=$(($idleslope - $port_transmit_rate_kbps))
> locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
> hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
> tc qdisc replace dev eno0 parent 8001:8 cbs \
> idleslope $idleslope \
> sendslope $sendslope \
> hicredit $hicredit \
> locredit $locredit \
> $extra_flags
> }
>
> # this should always fail
> add_second_taprio()
> {
> tc qdisc replace dev eno0 parent 8001:7 taprio \
> num_tc 8 \
> map 0 1 2 3 4 5 6 7 \
> queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> max-sdu 0 0 0 0 0 200 0 0 \
> base-time 200 \
> sched-entry S 80 20000 \
> sched-entry S a0 20000 \
> sched-entry S 5f 60000 \
> clockid CLOCK_TAI
> }
>
> ip link set eno0 up
>
> echo "Offload:"
> add_taprio true
> add_cbs true
> add_second_taprio
> mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60
> sleep 5
> tc -s class show dev eno0
> tc qdisc del dev eno0 root
>
> echo "Software:"
> add_taprio false
> add_cbs false
> add_second_taprio
> mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60
> sleep 5
> tc -s class show dev eno0
> tc qdisc del dev eno0 root
>
> > In my cursory look i assumed you wanted to go along the path of mqprio
> > where nothing much happens in the s/w datapath other than requeues
> > when the tx hardware path is busy (notice it is missing an
> > enqueue/deque ops). In that case the hardware selection is essentially
> > of a DMA ring based on skb tags. It seems you took it up a notch by
> > infact having a choice of whether to have pure s/w or offload path.
>
> Yes. Actually the original taprio design always had the enqueue()/dequeue()
> ops involved in the data path, then commit 13511704f8d7 ("net: taprio
> offload: enforce qdisc to netdev queue mapping") retrofitted the mqprio
> model when using the "flags 0x2" argument.


> If you have time to read, the discussion behind that redesign was here:
> https://lore.kernel.org/netdev/20210511171829.17181-1-yannick.vignon@xxxxxxxxxxx/