Re: [Intel-wired-lan] [PATCH] igb: add parameter to ignore nvm checksum validation

From: Florian Fainelli
Date: Fri May 17 2019 - 12:49:39 EST


On 5/17/19 9:36 AM, Daniel Walker wrote:
> On Fri, May 17, 2019 at 08:16:34AM -0700, Alexander Duyck wrote:
>> On Thu, May 16, 2019 at 6:48 PM Florian Fainelli <f.fainelli@xxxxxxxxx> wrote:
>>>
>>>
>>>
>>> On 5/16/2019 6:03 PM, Daniel Walker wrote:
>>>> On Thu, May 16, 2019 at 03:02:18PM -0700, Florian Fainelli wrote:
>>>>> On 5/16/19 12:55 PM, Nikunj Kela (nkela) wrote:
>>>>>>
>>>>>>
>>>>>> On 5/16/19, 12:35 PM, "Jeff Kirsher" <jeffrey.t.kirsher@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Wed, 2019-05-08 at 23:14 +0000, Nikunj Kela wrote:
>>>>>> >> Some of the broken NICs don't have EEPROM programmed correctly. It
>>>>>> >> results
>>>>>> >> in probe to fail. This change adds a module parameter that can be
>>>>>> >> used to
>>>>>> >> ignore nvm checksum validation.
>>>>>> >>
>>>>>> >> Cc: xe-linux-external@xxxxxxxxx
>>>>>> >> Signed-off-by: Nikunj Kela <nkela@xxxxxxxxx>
>>>>>> >> ---
>>>>>> >> drivers/net/ethernet/intel/igb/igb_main.c | 28
>>>>>> >> ++++++++++++++++++++++------
>>>>>> >> 1 file changed, 22 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> >NAK for two reasons. First, module parameters are not desirable
>>>>>> >because their individual to one driver and a global solution should be
>>>>>> >found so that all networking device drivers can use the solution. This
>>>>>> >will keep the interface to change/setup/modify networking drivers
>>>>>> >consistent for all drivers.
>>>>>>
>>>>>>
>>>>>> >Second and more importantly, if your NIC is broken, fix it. Do not try
>>>>>> >and create a software workaround so that you can continue to use a
>>>>>> >broken NIC. There are methods/tools available to properly reprogram
>>>>>> >the EEPROM on a NIC, which is the right solution for your issue.
>>>>>>
>>>>>> I am proposing this as a debug parameter. Obviously, we need to fix EEPROM but this helps us continuing the development while manufacturing fixes NIC.
>>>>>
>>>>> Then why even bother with sending this upstream?
>>>>
>>>> It seems rather drastic to disable the entire driver because the checksum
>>>> doesn't match. It really should be a warning, even a big warning, to let people
>>>> know something is wrong, but disabling the whole driver doesn't make sense.
>>>
>>> You could generate a random Ethernet MAC address if you don't have a
>>> valid one, a lot of drivers do that, and that's a fairly reasonable
>>> behavior. At some point in your product development someone will
>>> certainly verify that the provisioned MAC address matches the network
>>> interface's MAC address.
>>> --
>>> Florian
>>
>> The thing is the EEPROM contains much more than just the MAC address.
>> There ends up being configuration for some of the PCIe interface in
>> the hardware as well as PHY configuration. If that is somehow mangled
>> we shouldn't be bringing up the part because there are one or more
>> pieces of the device configuration that are likely wrong.
>>
>> The checksum is being used to make sure the EEPROM is valid, without
>> that we would need to go through and validate each individual section
>> of the EEPROM before enabling the the portions of the device related
>> to it. The concern is that this will become a slippery slope where we
>> eventually have to code all the configuration of the EEPROM into the
>> driver itself.
>
>
> I don't think you can say because the checksum is valid that all data contained
> inside is also valid. You can have a valid checksum , and someone screwed up the
> data prior to the checksum getting computed.
>
>
>> We need to make the checksum a hard stop. If the part is broken then
>> it needs to be addressed. Workarounds just end up being used and
>> forgotten, which makes it that much harder to support the product.
>> Better to mark the part as being broken, and get it fixed now, than to
>> have parts start shipping that require workarounds in order to
>> function.o
>
> I don't think it's realistic to define the development process for large
> corporations like Cisco, or like what your doing , to define the development
> process for all corporations and products which may use intel parts. It's better
> to be flexible.

Nikunj indicated that "manufacturing fixes NIC" so that sounds like a
workaround for an issue that would not affect a final product, in which
case, keeping downstream changes for development boards/intermediate
revisions of a product and focusing on relevant upstreaming changes for
the actual product would make a lot more sense, no?
--
Florian