Re: [PATCH] error-injection: Add prompt for function error injection

From: Chris Mason
Date: Tue Nov 22 2022 - 12:43:31 EST


On 11/22/22 5:39 AM, Borislav Petkov wrote:
> On Mon, Nov 21, 2022 at 03:36:08PM -0800, Alexei Starovoitov wrote:
>> The commit log is bogus and the lack of understanding what
>
> You mean that:
>
> Documentation/fault-injection/fault-injection.rst
>
> ?
>
> I don't want any of that possible in production setups. And until you
> give me a sane argument why it is good to have in production setups
> generically, this is end of story.
>

I think there are a few different sides to this:

- it makes total sense that we all have wildly different ideas about
which tools should be available in prod. Making this decision more
fine-grained seems reasonable.

- fault injection for testing: we have a stage of qualification that
does error injection against the prod kernel. It helps to have this
against the debug kernel too, but that misses some races etc. I always
just assumed distros and partners did some fault injection tests against
the prod kernel builds?

- fault injection for debugging: it doesn't happen often but at some
point we run out of ideas and start making different functions fail in
prod to figure out why we're not prodding.

- overriding return values for security fixes: also not a common thing,
but it's a tool we've used. There are usually better long-term fixes,
but it happens.

Stepping back to the big picture of debugging systems with BPF in use, I
love hearing (and telling) stories of debugging difficult problems. As
far as I know, BPF telling lies hasn't really been a problem for us, so
even though it's a huge tangent, if you have specific examples of
problems you've seen, I'm really interested in hearing more.

When I talk about production, both overall stability and validating new
kernels, if I compare the BPF subsystem with MM, filesystems, cgroups,
the scheduler, networking, and all things Jens, the systems BPF
developers put in place are working really well for me.

If I expand the discussion to the BPF programs themselves, there have
been rare issues. Still completely on par with the rest of the kernel
subsystems and within the noise in comparison with hardware failures.

In other words, I really do care about the concerns you're expressing
here, and I'm usually first in line to complain when random people make
my job harder. I'm just not seeing these issues with BPF, and I see
the BPF developers actively trying to increase safety over time.

-chris