Re: [PATCH V4 18/20] rv/monitor: Add safe watchdog monitor

From: Steven Rostedt
Date: Tue Jun 28 2022 - 15:42:02 EST


On Thu, 16 Jun 2022 16:53:54 -0700
Guenter Roeck <linux@xxxxxxxxxxxx> wrote:

> > >> +
> > >> +struct automaton_safe_wtd automaton_safe_wtd = {
> > >> +    .state_names = {
> > >> +        "init",
> > >> +        "closed_running",
> > >> +        "closed_running_nwo",
> > >> +        "nwo",
> > >> +        "opened",
> > >> +        "opened_nwo",
> > >> +        "reopened",
> > >> +        "safe",
> > >> +        "safe_nwo",
> > >> +        "set",
> > >> +        "set_nwo",
> > >> +        "started",
> > >> +        "started_nwo",
> > >> +        "stoped"
> > >> +    },
> > >> +    .event_names = {
> > >> +        "close",
> > >> +        "nowayout",
> > >> +        "open",
> > >> +        "other_threads",
> > >> +        "ping",
> > >> +        "set_safe_timeout",
> > >> +        "start",
> > >> +        "stop"
> > >> +    },
> > >> +    .function = {

I think it could become much more readable if you added comments here that
show what each column is. That way, we do not need to try to remember it.

And maybe even a diagram in the comments. If we can automate the generating
of an ASCII DFA diagram then that would be awesome too :-)

> > >> +        {                          -1,                nwo_safe_wtd,             opened_safe_wtd,               init_safe_wtd,                          -1,                          -1,                         -1,                          -1 },
> > >> +        {                          -1, closed_running_nwo_safe_wtd,           reopened_safe_wtd,     closed_running_safe_wtd,                          -1,                          -1,                         -1,                          -1 },
> > >> +        {                          -1, closed_running_nwo_safe_wtd,        started_nwo_safe_wtd, closed_running_nwo_safe_wtd,                          -1,                          -1,                         -1,                          -1 },
> > >> +        {                          -1,                nwo_safe_wtd,         opened_nwo_safe_wtd,                nwo_safe_wtd,                          -1,                          -1,                         -1,                          -1 },
> > >> +        {               init_safe_wtd,                          -1,                          -1,                          -1,                          -1,                          -1,           started_safe_wtd,                          -1 },
> > >> +        {                nwo_safe_wtd,                          -1,                          -1,                          -1,                          -1,                          -1,       started_nwo_safe_wtd,                          -1 },
> > >> +        {     closed_running_safe_wtd,                          -1,                          -1,                          -1,                          -1,                set_safe_wtd,                          1,             opened_safe_wtd },
> > >> +        {     closed_running_safe_wtd,                          -1,                          -1,                          -1,               safe_safe_wtd,                          -1,                          1,             stoped_safe_wtd },
> > >> +        { closed_running_nwo_safe_wtd,                          -1,                          -1,                          -1,           safe_nwo_safe_wtd,                          -1,                         -1,                          -1 },
> > >> +        {                          -1,                          -1,                          -1,                          -1,               safe_safe_wtd,                          -1,                         -1,                          -1 },
> > >> +        {                          -1,                          -1,                          -1,                          -1,           safe_nwo_safe_wtd,                          -1,                         -1,                          -1 },
> > >> +        {     closed_running_safe_wtd,                          -1,                          -1,                          -1,                          -1,                set_safe_wtd,                         -1,             stoped_safe_wtd },
> > >> +        { closed_running_nwo_safe_wtd,                          -1,                          -1,                          -1,                          -1,            set_nwo_safe_wtd,                         -1,                          -1 },
> > >> +        {               init_safe_wtd,                          -1,                          -1,                          -1,                          -1,                          -1,                         -1,                          -1 },
> > >> +    },
> > >> +    .initial_state = init_safe_wtd,
> > >> +    .final_states = { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
> > >
> > > I find this event table all but impossible to verify.
> >
> > It is a matrix. Lines are states, and columns are events.
> >
> > On a given state/line, receiving a given event/column, the data is the next
> > state/row.
> >
>
> I am aware of that, and I did program state machines before.
>
> > For instance, let's say "init" (row 0), event "nwo" (column 1), and the next
> > state is the "nwo" (row 3).
> >
> > -1 means invalid/blocked state (yeah, maybe it is better to have an #define
> > INVALID_STATE -1).
> >
> > This is the C representation of an automaton, following the formal definition of
> > a deterministic automaton. I've added an explanation of this representation in
> > the documentation (patch 15, file da_monitor_synthesis.rst).
> >
> > A deeper look into this subject is here (peer-reviewed conference paper at
> > Software Engineer and Formal Methods 2019):
> > https://bristot.me/wp-content/uploads/2019/09/paper.pdf <https://bristot.me/wp-content/uploads/2019/09/paper.pdf>
> >
> > One could translate it back to the automaton's graphical format... to a format
> > of by a tool used to analyze automaton properties... that is the good point of
> > using a well-established formalism. (The bad part is that they are often
> > boring... c'est la vie :-)).
> >
>
> If the above state machine fails, no one but the authors will be able to even
> remotely figure out what happened, and if the watchdog driver is at fault or
> its monitor. It is a state machine making assumptions about state transitions,
> sure, but who knows if those asssumptions are even remotely correct or match
> reality. For example, I have no idea if the lack of a 'ping' function is handled
> correctly, if the lack of a 'stop' function is handled correctly, or what
> happens if any of the driver functions returns an error.
>
> I already found three assumptions which do not or not necessarily match
> reality:
>
> - The function to read the remaining timeout is optional and must not be
> used unconditionally, and its lack is not an error.
> - The requested timeout (and pretimeout) do not have to match the actually
> configured timeout, and userspace must not rely on the assumption that
> the values match.
> - The code assumes that the process opening the watchdog and the process
> accessing it are the same. While that is in general the case, it might
> well be that some application opens the watchdog and then handles it
> from a child process.
>
> And that is just after briefly browsing through the code.
>
> I am open to suggestions from others, but at this point I have serious doubts
> that this code is maintainable in the kernel.
>

Let me give some background on this. Various safety critical users
(automotive, power plants, etc) are looking at using Linux in the field.
Obviously, Linux itself is too big to verify properly. In fact, even small
OSs that say they are verifiable seldom are. Part of the solution is to add
a way to check that Linux is running as expected, and be able to detect
when it is not. When that happens, you fall into a "fail-safe" mode (for
example, if you have an autonomous driving vehicle, it could sound an alarm
and inform the driver to go into manual mode, or it is going to simply
"pull over". Note, this is just a hypothetical example, not something I've
seen in reality).

But to implement this, you need to verify the verifier. In this case, the
watchdog timer is what verifies that Linux is running properly. It is
possible to verify parts of the kernel when you limit the inputs of the
kernel. This means we are not trying to verify all use cases of the
watchdog timer. That would be pretty much impossible, or take decades to
complete. We only need to verify the use cases used in safety critical
systems.

Do not confuse this with static analyzers or other general purpose tooling
to find bugs. This is not the purpose of the run time verify. It is just to
prove that certain use cases will perform as expected, given a limited
input.

Years ago there was an interview with Linus where he was asked if he would
trust Linux in real-time safety critical projects. And Linus's response was
something to the effect that he would, as those projects have a very
limited use case, and the code that actually gets run is not as big as you
would think. And that code can be verified.

That's exactly what we are doing here. There are very strict standards on
what is considered safe for safety critical projects and by limiting the
use cases of the code we are verifying, we can achieve those objectives.

But your complaint about not being able to read or understand the table is
legitimate, and that needs to be addressed. And what the limiting input is
must also be commented as well, that way people will not get confused when
they see a legitimate use case not covered by the DFA.

Action Items:
1. Add column title to the matrix.
2. Ideally, also add an ASCII DFA diagram.
3. And finally, comment what the range of inputs are (what use cases are
being covered).

-- Steve