Re: pci error recovery procedure

From: Linas Vepstas
Date: Tue Sep 05 2006 - 15:14:42 EST


On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
> >
> > Again, consider the multi-function cards. On pSeries, I can only enable
> > DMA on a per-slot basis, not a per-function basis. So if one driver
> > enables DMA before some other driver has reset appropriately, everything
> > breaks.
> Does here 'reset' mean hardware slot reset?

I should have said: If one driver of a multi-function card enables DMA before
another driver has stabilized its harware, then everything breaks.

> Then, if the slot is always reset, there will be no the problem.

But that assumes that a hardware #RST will always be done. The API
was designed to get away from this requirement.

> If mmio_enabled is not used currently, I think we could delete it firstly. Later on,
> if a platform really need it, we could add it, so we could keep the simplied codes.

It would be very difficult to add it later. And it would be especially
silly, given that someone would find this discussion in the mailing list
archives.

> Thanks. Now I understand why you specified mmio_enabled and slot_reset. They are just
> to map to pSeries platform hardware operation steps. I know little about pSeries hardware,

The hardware was designed that way because the hardware engineers
thought that this is what the device driver writers would need.
Thay are there to map to actual recovery steps that actual device
drivers might want to do.

> but is it possible to merge such hardware steps from software point of view?

The previous email explained why this would be a bad idea.

> > The platform. By "electrical reset", I mean "dropping the #RST pin low
> > for 200mS". Only the platform can do this.
> Thanks for your explanation. I assume after the electrical reset, all device
> functions of the device slot will go back to the initial status before
> attaching their drivers.

Maybe. Depends on what yur BIOS does. On pSeries, I also need to
set up the adress BARs

> I found a problem of e1000 driver when testing its error handlers. After the NIC is resumed,
> its RX/TX packets numbers are crazy.

Hmm. There is a patch to prevent this from happening. I thought
it was applied a long time ago. e1000_update_stats() should include the
lines:

if (pdev->error_state && pdev->error_state != pci_channel_io_normal)
return;

which is enough to prevent crazy stats on my machine.

--linas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/