[PATCH] net: dsa: ksz9477_ptp: fix race condition between IRQ thread and deferred xmit kthread

From: Vladimir Oltean
Date: Tue Oct 19 2021 - 17:39:37 EST


Two-step PTP TX timestamping for the ksz9477 driver works as follows:

1. ksz9477_port_deferred_xmit() initializes a completion structure and
queues the PTP skb to the DSA master

2. DSA master sends the packet to the switch, which forwards it to the
egress port, where the TX timestamp is taken.

3. Switch raises its PTP IRQ and the ksz9477_ptp_port_interrupt()
handler is run.

4. The PTP timestamp is read, and the completion structure is signaled.

5. PTP interrupts are rearmed for the next timestampable skb.

6. The deferred xmit kthread is woken up by the completion. It collects
the TX timestamp from the IRQ kthread, annotates the skb clone with
that timestamp, delivers it to the socket error queue, and exits.

7. The deferred xmit kthread gets rescheduled with the next
timestampable PTP packet and the steps from 1 are executed again,
identically.

There is an issue in the fact that steps 5 and 6 might not actually run
in this exact order. Step 6, the deferred xmit kthread getting woken up
by the completion, might happen as soon as the completion is signaled at
step 4. In that case, the deferred xmit kthread might run to completion
and we might reach step 7 while step 5 (the write-1-to-clear to the IRQ
status register, which rearms the interrupt) has _not_ yet run.

If the deferred xmit kthread makes enough progress with the _next_ PTP
skb, such that it actually manages to enqueue it to the DSA master, and
that makes it all the way to the hardware, which takes another TX
timestamp, we have a problem if the IRQ kthread has not cleared the PTP
TX timestamp status yet.

If the IRQ kthread clears the PTP status register now, it has
effectively eaten a TX timestamp.

The implication is that the completion for this second PTP skb will time
out, but otherwise the system will keep chugging on; it will not be
stuck forever. The IRQ kthread does not run again because it has no
reason to (the PTP IRQ is cleared), and the deferred xmit kthread will
free the skb for the completion that timed out, and carry on with its
life. The next skb can go through the cycle 1-6 just fine.

The problem which makes the above scenario possible is that we clear the
interrupt status after we signal the completion. Do it before, and the
interrupt handler is free to do whatever it wishes until it returns.
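
To make the ordering argument concrete, here is a small single-threaded
userspace model of the interleaving described above. It is only a sketch:
every model_* name and MODEL_TS_INT are invented for illustration and are
not driver symbols, the "hardware" is reduced to one latched W1C status
bit plus one timestamp register, and model_complete() assumes the worst
case where the deferred xmit kthread sends the next skb as soon as the
completion is signaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TS_INT 0x1u	/* hypothetical "TX timestamp ready" status bit */

/* Hypothetical model of one port; not the driver's data structures. */
struct model_port {
	uint16_t status;	/* latched IRQ status, write-1-to-clear */
	uint32_t tstamp;	/* most recently latched TX timestamp */
	uint32_t next_ts;	/* value the hardware latches next */
	int pending_skbs;	/* PTP skbs still queued for deferred xmit */
	int delivered;		/* timestamps handed back to the xmit side */
};

/* Hardware: transmit one PTP frame, latch a timestamp, raise the bit. */
static void model_hw_transmit(struct model_port *p)
{
	p->tstamp = p->next_ts++;
	p->status |= MODEL_TS_INT;
}

/*
 * Stand-in for complete(). Worst case: the xmit kthread is scheduled at
 * once and the next skb reaches the hardware before the IRQ handler
 * executes its next statement.
 */
static void model_complete(struct model_port *p)
{
	p->delivered++;
	if (p->pending_skbs > 0) {
		p->pending_skbs--;
		model_hw_transmit(p);
	}
}

/* One pass of the interrupt handler, with a selectable ACK position. */
static void model_irq_handler(struct model_port *p, bool ack_first)
{
	uint16_t data = p->status;		/* read IRQ status */

	if (ack_first)
		p->status &= (uint16_t)~data;	/* W1C before signaling */

	if (data & MODEL_TS_INT) {
		(void)p->tstamp;		/* timestamp is read here */
		model_complete(p);
	}

	if (!ack_first)
		p->status &= (uint16_t)~data;	/* W1C after: may eat the new event */
}

/* Push n PTP skbs through the model; return the timestamps delivered. */
static int model_run(int n, bool ack_first)
{
	struct model_port p = { .pending_skbs = n, .next_ts = 100 };

	assert(n > 0);
	while (p.pending_skbs > 0) {
		/* First skb, or the next one after a completion timed out. */
		p.pending_skbs--;
		model_hw_transmit(&p);
		while (p.status)	/* IRQ stays asserted while status is set */
			model_irq_handler(&p, ack_first);
	}
	return p.delivered;
}
```

Under these assumptions, model_run(2, false) delivers one timestamp for
two skbs (the second one is eaten by the late W1C), while
model_run(2, true) delivers both. With three skbs and the late ACK, the
third one still gets its timestamp, which matches the "keeps chugging on"
behavior described above.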

Signed-off-by: Vladimir Oltean <vladimir.oltean@xxxxxxx>
---
drivers/net/dsa/microchip/ksz9477_ptp.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz9477_ptp.c b/drivers/net/dsa/microchip/ksz9477_ptp.c
index c646689cb71e..bc3f0283859a 100644
--- a/drivers/net/dsa/microchip/ksz9477_ptp.c
+++ b/drivers/net/dsa/microchip/ksz9477_ptp.c
@@ -1106,6 +1106,11 @@ irqreturn_t ksz9477_ptp_port_interrupt(struct ksz_device *dev, int port)
 	if (ret)
 		return IRQ_NONE;

+	/* Clear interrupt(s) (W1C) */
+	ret = ksz_write16(dev, addr, data);
+	if (ret)
+		return IRQ_NONE;
+
 	if (data & PTP_PORT_XDELAY_REQ_INT) {
 		/* Timestamp for Pdelay_Req / Delay_Req */
 		struct ksz_device_ptp_shared *ptp_shared = &dev->ptp_shared;
@@ -1128,11 +1133,6 @@ irqreturn_t ksz9477_ptp_port_interrupt(struct ksz_device *dev, int port)
 		complete(&prt->tstamp_completion);
 	}

-	/* Clear interrupt(s) (W1C) */
-	ret = ksz_write16(dev, addr, data);
-	if (ret)
-		return IRQ_NONE;
-
 	return IRQ_HANDLED;
 }


About the only difference seems to be that ACK-ing the interrupt is done
at the end of ksz_ptp_irq_thread_fn(), while complete(&port->tstamp_msg_comp)
is called from ksz_ptp_msg_thread_fn() - which is called by handle_nested_irq()
IIUC.

>
> +/* Time stamp tag is only inserted if PTP is enabled in hardware. */
> +static void ksz_xmit_timestamp(struct dsa_switch *ds, struct sk_buff *skb,
> +			       unsigned int port)
> +{
> +	struct sk_buff *clone = KSZ_SKB_CB(skb)->clone;
> +	struct ksz_tagger_data *tagger_data;
> +	struct ptp_header *ptp_hdr;
> +	unsigned int ptp_type;
> +	u32 tstamp_raw = 0;
> +	u8 ptp_msg_type;
> +	s64 correction;
> +
> +	if (!clone)
> +		goto out_put_tag;
> +
> +	/* Use cached PTP type from ksz_ptp_port_txtstamp(). */
> +	ptp_type = KSZ_SKB_CB(clone)->ptp_type;
> +	if (ptp_type == PTP_CLASS_NONE)
> +		goto out_put_tag;
> +
> +	ptp_hdr = ptp_parse_header(skb, ptp_type);
> +	if (!ptp_hdr)
> +		goto out_put_tag;
> +
> +	tagger_data = ksz_tagger_data(ds);
> +	if (!tagger_data->is_ptp_twostep)
> +		goto out_put_tag;
> +
> +	if (tagger_data->is_ptp_twostep(ds, port))
> +		goto out_put_tag;
> +
> +	ptp_msg_type = KSZ_SKB_CB(clone)->ptp_msg_type;
> +	if (ptp_msg_type != PTP_MSGTYPE_PDELAY_RESP)
> +		goto out_put_tag;
> +
> +	correction = (s64)get_unaligned_be64(&ptp_hdr->correction);
> +
> +	/* For PDelay_Resp messages we will likely have a negative value in the
> +	 * correction field (see ksz9477_rcv()). The switch hardware cannot
> +	 * correctly update such values (produces an off by one error in the UDP
> +	 * checksum), so it must be moved to the time stamp field in the tail
> +	 * tag.
> +	 */
> +	if (correction < 0) {
> +		struct timespec64 ts;
> +
> +		/* Move ingress time stamp from PTP header's correction field to
> +		 * tail tag. Format of the correction field is 48 bit ns + 16
> +		 * bit fractional ns.
> +		 */
> +		ts = ns_to_timespec64(-correction >> 16);
> +		tstamp_raw = ((ts.tv_sec & 3) << 30) | ts.tv_nsec;
> +
> +		/* Set correction field to 0 and update UDP checksum. */
> +		ptp_header_update_correction(skb, ptp_type, ptp_hdr, 0);
> +	}
> +
> +	/* For PDelay_Resp messages, the clone is not required in
> +	 * skb_complete_tx_timestamp() and should be freed here.
> +	 */
> +	kfree_skb(clone);
> +	KSZ_SKB_CB(skb)->clone = NULL;
> +
> +out_put_tag:
> +	put_unaligned_be32(tstamp_raw, skb_put(skb, KSZ9477_PTP_TAG_LEN));
> +}
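
For reference, the correction-to-tail-tag arithmetic in the quoted hunk
can be checked in userspace with the sketch below. It only mirrors the
two quoted expressions; model_tail_tag_tstamp() and MODEL_NSEC_PER_SEC
are made-up names, and the layout (2 bits of seconds above 30 bits of
nanoseconds) is inferred from the "(ts.tv_sec & 3) << 30" expression
rather than taken from a datasheet.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

/*
 * Convert a negative PTP correction value (48 bit ns + 16 bit fractional
 * ns, holding a negated ingress time stamp) into the 32-bit value that
 * the quoted code writes to the tail tag.
 */
static uint32_t model_tail_tag_tstamp(int64_t correction)
{
	int64_t ns, sec, nsec;

	/* Only negative values are moved to the tail tag. */
	assert(correction < 0 && correction > INT64_MIN);

	ns = -correction >> 16;			/* drop fractional ns */
	sec = ns / MODEL_NSEC_PER_SEC;		/* like ns_to_timespec64() */
	nsec = ns % MODEL_NSEC_PER_SEC;		/* < 10^9, fits in 30 bits */

	return (uint32_t)(((sec & 3) << 30) | nsec);
}
```

For an ingress time stamp of 5 s + 123 ns, stored as
-(5000000123 << 16) in the correction field, this returns 0x4000007b:
the seconds are truncated to 5 & 3 = 1, and the low 30 bits carry the
123 ns.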