Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors

From: Leon Romanovsky
Date: Wed Jan 04 2023 - 01:48:42 EST


On Wed, Jan 04, 2023 at 10:27:33AM +0530, Rajat Khandelwal wrote:
> Hi Bjorn,
>
> Thanks for the acknowledgement.
>
> On 1/4/2023 12:44 AM, Bjorn Helgaas wrote:
> > [+cc Paul, Sasha, Leon, Frederick]
> >
> > (Please cc folks who have commented on previous versions of your
> > patch.)
> >
> > On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> > > There are many instances where correctable errors tend to inundate
> > > the message buffer. We observe such instances during thunderbolt PCIe
> > > tunneling.

<...>

> > > [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
> > > [54982.838817] igc 0000:2b:00.0: [12] Timeout
> > Please remove the timestamps; they don't contribute to understanding
> > the problem.
>
> --> Sure.

Please don't add "-->" or any other marker to your replies. It breaks the
quote coloring in mail clients.

>
> >
> > > This gets repeated continuously, thus inundating the buffer.
> > Did you verify that we actually clear the Correctable Error Status
> > register?
>
> --> This patch targets only rate limiting the correctable errors since they are
> non-fatal, and they kind of inundate the CPU logs, particularly during thunderbolt
> connections. It doesn't have an impact anywhere else.
> As per your suggestion in the igc patch, I found rate limiting to be a workable
> option for now. I have removed all masking of the bits.

You didn't answer the question that was asked: "Did you verify that we
actually clear the Correctable Error Status register?"

Thanks
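
For reference, the "error status/mask=00001000/00002000" line quoted above
can be decoded by hand. Below is a minimal sketch, assuming the two hex
values are the AER Correctable Error Status and Correctable Error Mask
registers and that the short bit names match what the Linux AER driver
prints (the log's own "[12] Timeout" line agrees for bit 12). The helper
names decode_bits and decode_line are hypothetical, made up for this
illustration. It only interprets a log line; it does not show whether the
status register is actually cleared.

```python
import re

# Correctable Error Status bit positions in the PCIe AER capability.
# Short names are assumed to match the Linux AER driver's output.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of an AER log line.
STATUS_MASK_RE = re.compile(
    r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})"
)


def decode_bits(value):
    """Return the names of all bits set in a 32-bit register value."""
    return [
        CORRECTABLE_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if value & (1 << bit)
    ]


def decode_line(line):
    """Decode status and mask from one log line, or return None."""
    match = STATUS_MASK_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    return {
        "status": decode_bits(status),
        "mask": decode_bits(mask),
        # Bits that are set in status and not masked off.
        "unmasked": decode_bits(status & ~mask),
    }


if __name__ == "__main__":
    sample = ("igc 0000:2b:00.0: device [8086:5502] "
              "error status/mask=00001000/00002000")
    print(decode_line(sample))
```

For the quoted line this reports bit 12 (Timeout) in the status value and
bit 13 (NonFatalErr) in the mask value, so the timeout is not masked.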