Re: [PATCH RESEND] iwlwifi, Do not implement thermal zone unless ucode is loaded

From: Kalle Valo
Date: Thu Jul 14 2016 - 04:02:04 EST


Prarit Bhargava <prarit@xxxxxxxxxx> writes:

> On 07/13/2016 03:24 AM, Luca Coelho wrote:
>
>> I totally agree with Emmanuel and Kalle. We should not change this.
>> It is a design decision to return an error when the interface is
>> down, this is very common with other subsystems as well.
>
> Please show me another subsystem or driver that does this. I've looked around
> the kernel but cannot find one that updates the firmware and implements new
> features on the fly like this. I have come across several drivers that allow
> for an update, but they do not implement new features based on the
> firmware.
>
> Additionally, what happens when someone back revs firmware versions (which
> happens far more than you and I would expect)? Does that mean I now go from a
> functional system to a non-functional system wrt to userspace?

I'm not following, what do you mean exactly? Why are you talking
updating the firmware?

So when we talk about "loading firmware" we mean that the driver pushes
the firmware image to to the chipset. And then the interface is down the
chipset is powered down and the RAM on it will be erased. That's the
general idea anyway, I haven't checked how iwlwifi exactly works in this
case but Luca or Emmanuel can correct me.

>> The userspace should be able to handle errors and report something
>> like "unavailable" when this kind of error is returned.
>
> I myself have made the same arguments wrt to cpufreq code & bad userspace
> choices. I just went through this a few months back with what went from a
> simple patch and turned out to be a hideous patch in cpufreq. You cannot break
> userspace like this.

Don't get me wrong, I'm a strong supporter of stable user space
interfaces and I always try to adher to that. But there's a limit for
everything. If I'm understanding correctly, what you mean is that the
kernel should never return an error because an application doesn't
handle errors gracefully. Sorry, but that doesn't make sense to me.

> See commit 51443fbf3d2c ("cpufreq: intel_pstate: Fix intel_pstate powersave
> min_perf_pct value"). What should have been a trivial change resulted in a
> massive change because of broken userspace.

In that cpufreq case I understand, it was about a combination of
configuration values which broke the user space. But here we are just
dealing with a simple error value, nothing fancy.

>> I'm not sure EIO is the best we can have, but for me that's exactly
>> what it is. The thermal zone *is* there, but cannot be accessed
>> because the firmware is not available. I'm okay to change it to EBUSY,
>> if that would help userspace, but I think that's a bit misleading. The
>> device is not busy, on the contrary, it's not even running at all.
>>
>
> I understand that, but by returning -EIO we end up with an error.
>
>> Furthermore, I don't think this is "breaking userspace" in the sense of
>> being a regression.
>
> I run (let's say 4.5 kernel). sensors works. I update to 4.7. sensors doesn't
> work. How is that not a regression? That's _exactly_ what it should be
> reported as.

Sure, it's a regression in a way. But that's how the user space app you
are using is implemented, the same problem would happen with any driver
returning errors.

>> The userspace API has always been implemented with
>> the possibility of returning errors. It's not a good design if a
>> single device returning an error causes all the other devices to also
>> fail.
>>
>
> If that were the case we would never have to worry about "breaking userspace"?
> For any kernel change I could just say that the userspace design was bad and be
> done with it. Why fix anything then?

Because we are talking about a simple error value.

> I don't see any harm in waiting to register the sysfs files for hwmon until the
> firmware has been validated.

I'm against of that because it's bad software design. It's standard
practise in Linux that drivers register their capabilities during driver
probe time so that user space can query them whenever needed. I assume a
properly behaving user space app would want to know about all the
available sensors once the driver is initialised and your suggestion
would break that.

> IIUC, the up/down'ing of the device doesn't happen that often (during
> initial boot, and suspend/resume, switching wifi connections,
> shutdown?).

Basically it can happen anytime, this is fully controlled by user space.
There's no point of trying to make any assumptions as they won't hold
anyway.

> This would make the iwlwifi community happy (IMO) and
> sensors would still work. At the same time I could write a patch for
> lm-sensors to fix this issue if it comes up in future versions.
> [Aside: I'm going to have the reproducing system available today and
> will test this out. It looks like just moving some code around.]

Another option, but still a bad one I don't like, is that you change the
kernel interface to ignore all errors from drivers (like iwlwifi). This
way drivers don't need to make ugly workarounds.

> The bottom line is that lm-sensors is currently broken with this change in
> iwlwifi. AFAICT, no other thermal device returns an error this way, and IMO
> that means the iwlwifi driver is doing something new and unexpected wrt to
> userspace.

I haven't checked but I suspect ath10k has a similar problem when
interface is down.

--
Kalle Valo