Re: [PATCH RFC net-next v1 1/6] ethtool: add interface to read Tx hardware timestamping statistics

From: Rahul Rameshbabu
Date: Fri Mar 08 2024 - 00:30:06 EST

Next message: Joel Fernandes: "Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO"
Previous message: Vishal Chourasia: "[RFC 0/1] sched: Deprecate DOUBLE_TICK feature"
Next in thread: Jacob Keller: "Re: [PATCH RFC net-next v1 1/6] ethtool: add interface to read Tx hardware timestamping statistics"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 07 Mar, 2024 19:29:08 -0800 Jacob Keller <jacob.e.keller@xxxxxxxxx> wrote:
> On 3/7/2024 10:47 AM, Rahul Rameshbabu wrote:
>> Hi Jacob,
>>
>> On Mon, 26 Feb, 2024 11:54:49 -0800 Jacob Keller <jacob.e.keller@xxxxxxxxx> wrote:
>>> On 2/23/2024 3:43 PM, Rahul Rameshbabu wrote:
>>>>
>>>> On Fri, 23 Feb, 2024 14:48:51 -0800 Jacob Keller <jacob.e.keller@xxxxxxxxx>
>>>> wrote:
>>>>> On 2/23/2024 2:21 PM, Rahul Rameshbabu wrote:
>>>>>> Do you have any example of a case of skipping timestamp information that
>>>>>> is not related to lack of delivery over time? I am wondering if this
>>>>>> case is more like a hardware error or not. Or is it more like something
>>>>>> along the lines of being busy/would impact line rate of timestamp
>>>>>> information must be recorded?
>>>>>>
>>>>>
>>>>> The main example for skipped is the event where all our slots are full
>>>>> at point of timestamp request.
>>>>
>>>> This is what I was guessing as the main (if not only reason). For this
>>>> specific reason, I think a general "busy" stats counter makes sense.
>>>> mlx5 does not need this counter, but I can see a lot of other hw
>>>> implementations needing this. (The skipped counter name obviously should
>>>> be left only in the ice driver. Just felt "busy" was easy to understand
>>>> for generalized counters.)
>>>
>>> Yea, I don't expect this would be required for all hardware but it seems
>>> like a common approach if you have limited slots for Tx timestamps
>>> available.
>>>
>> Sorry to bump this thread once more, but I had a question regarding the
>> Intel driver in regards to this. Instead of having a busy case when all
>> the slots are full, would it make sense to stop the netdev queues in
>> this case, we actually do this in mlx5 (though keep in mind that we have
>> a dedicated queue just for port/phy timestamping that we start/stop).
>>
>> Maybe in your case, you can have a mix of HW timestamping and non-HW
>> timestamping in the same queue, which is why you have a busy case?
>>
>
> We don't use a dedicated queue. The issue isn't queue capacity so much
> as it is the number of slots in the PHY for where it can save the
> timestamp data.

In mlx5, we use a dedicated queue just for the purpose of HW
timestamping because we actually do have a similar slot mechanism. We
call it metadata. We have a limit of 256 entries. We steer PTP traffic
specifically (though we will be changing this to any HW timestamped
traffic with the work Kory is doing) to this queue by matching against
the protocol and port. All other traffic goes to the normal queues that
cannot consume the timestamping slots. When all the slots are occupied,
we stop the timestamping queue rather than throwing some busy error.

>
> In practice the most common application (ptp4l) synchronously waits for
> timestamps, and only has one outstanding at a time. Likely due to
> limitations with original hardware that only supported one outstanding
> Tx timestamp.
>
>> Wanted to inquire about this before sending out a RFC v2.
>
> That's actually an interesting approach to change to a dedicated queue
> which we could lock and start/stop it when the indexes are full. How
> does that interact with the stack UDP and Ethernet stacks? Presumably
> when you go to transmit, you'd need to pick a queue and if its stopped
> you'd have to drop or tell the stack?

Let me share a pointer in mlx5 for how we do the queue selection. Like I
mentioned, we steer ptp traffic specifically, but we can change this to
just steer any skb that indicates hw timestamping.

* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.h?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n71

Then, here is how we manage stopping and waking the queue (we tell the
core stack about this so we do not have to drop traffic due to some kind
of busy state because our metadata/slots are all consumed).

* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n775
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n257
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c?id=3aaa8ce7a3350d95b241046ae2401103a4384ba2#n397

>
> I think I remember someone experimenting with returning NETDEV_TX_BUSY
> when the slots were full, but in practice this caused a lot of issues.
> None of the other devices we have with only a single slot (one set of
> registers, ixgbe, i40e, igb, e1000) did that either.

So we experimented that even with a single slot (we had reasons for
testing this), the dedicated queue for timestamping worked out nicely. I
really would suggest investigating this model since I think it might
play out nicely for the Intel family.

>
> If this queue model behaves in a sane way (or if we can communicate
> something similar by reporting back up the stack without needing a
> dedicated queue?) that could be better than the current situation.

I personally really like the dedicated queue in the device drivers, but
if we want to instead model this slot management work in the core netdev
stack, I do not think that is a bad endeavor either (when slots are
full, hw timestamping traffic is held back till they become available).
I do think the netif_tx_wake_queue/netif_tx_stop_queue + dedicated HW
timestamping queue does work out nicely.

Let me know your thoughts on this. If you think it's an interesting idea
to explore, lets not add the busy counter now in this series. I already
dropped the late counter. We can add the busy counter later on if you
feel this model I have shared is not viable for Intel. I wanted to avoid
introducing too many counters pre-emptively that might not actually be
consumed widely. I had a thought that what you presented with slots is
very similar to what we have with metadata in mlx5, so I thought that
maybe handling the management of these slots in a different way with
something like a dedicated queue for HW timestamping could make the
design cleaner.

--
Thanks,

Rahul Rameshbabu

Next message: Joel Fernandes: "Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO"
Previous message: Vishal Chourasia: "[RFC 0/1] sched: Deprecate DOUBLE_TICK feature"
Next in thread: Jacob Keller: "Re: [PATCH RFC net-next v1 1/6] ethtool: add interface to read Tx hardware timestamping statistics"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]