Re: [RFC+Patch] Formal models as source of truth for Software Architects.

From: Jonas Oberhauser
Date: Sun Nov 05 2023 - 16:06:48 EST


Hi Mathew,


On 11/5/2023 at 8:20 AM, Mathew, Cherry G. wrote:
Hi Jonas,

First of all, sorry for my late reply - I note with thanks that you
spent some time and attention on your email, so I wanted to spend some
time composing a thoughtful reply. I'm replying in context below:


Jonas Oberhauser <jonas.oberhauser@xxxxxxxxxxxxxxx> writes:
[...]


> A few years ago we also built systems to extract models from C
> code (for TLA+ back then), but those systems are now totally
> unused (and unmaintained) due to poor scalability of exploratory
> state-based model checking.

> And then there's the issue that many engineers don't know how to
> write reasonable temporal assertions, especially not ones that can
> give sufficient guarantees to be worth the extra effort and
> maintenance.

So my arrival at "D3" and Spin was a progression of looking at what's
available (in the open domain), at the current consensus around development
methodology, and at progress in hardware capacity.

I've used TLA+ for protocol verification, and while it's great for that,
I didn't see it translate to and from implementation code.


I meant that we implemented an internal tool to transpile from C to PlusCal.

It sounded like a great idea at the time. But then it quickly fell out of use.


> If we use a simplified model of what the system should be doing to
> specify, then usually we are not looking for equivalence but only
> for uni-directional simulation: the spec usually needs to have
> more degrees of freedom than the implementation, which is just a
> single implementation.

So Spin has two approaches - one uses inline "C" to glue the spec to
implementation code as a sort of "test driver", if required. The other
uses a separate source-level model extractor to rebuild the implicit
model in the implementation C code. I am leaning towards the latter,
because it seems to help with streamlining the software development
process as an iteration - which helps especially in the early, nascent
stages of design & development. What I'm beginning to see is that once
the design "settles down", the iterations become less disruptive. But
initially, the Architect/Engineer hats need very good interaction.


> What has worked well for us is anything that works

> - directly on the code

> - doesn't require learning any kind of math to find bugs with

> - is scalable enough to find bugs on real code

> - doesn't require a lot of extra effort from the engineers
> (compared to the extra assurance you get)

> - doesn't need to be adapted whenever the code changes

Have you seen Spoq?

https://www.usenix.org/conference/osdi23/presentation/li-xupeng

They use this approach - one of the reasons I'm not fully convinced by
it is that I believe keeping the source of truth in implementation code
limits flexibility, e.g. for reuse. If you look at my original patch,
the model is a reference for both the NetBSD and Linux drivers - nothing
stops it from covering further OSs (as long as the codeflow logic
doesn't vary too much - and even then, the model could itself be made
modular to accommodate different FSMs).


I have seen it, but it's not exactly what I have in mind. I wouldn't say that it doesn't require learning any kind of math to find bugs with.
You're still writing Coq proofs, just with more automation. Nevertheless it's going in the direction I'm thinking of.

For reuse, I think the main issue is that implementation code is always a source of truth - the truth of what's really going to be executed.
If we want to have a second source of truth, it should be a different truth, such as "assumptions of the other parts of the system".

Since you already have this source of truth, if you make a different implementation in another kernel, you can compare what the original driver was doing with what your new implementation is doing.
There's no need to have yet another copy of what the driver might be doing.
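
Concretely, I imagine that comparison as something like a differential
check. This is purely a hypothetical sketch - the function names are
invented, and in practice the shared "codeflow logic" would have to be
factored out into pure functions first:

#include <assert.h>
#include <stdint.h>

/* the interrupt-handling logic of each driver, factored into pure
   functions (hypothetical names) */
uint32_t netbsd_handle_status(uint32_t status);
uint32_t linux_handle_status(uint32_t status);

/* compare the two implementations over a bounded input space */
void differential_check(void)
{
    for (uint32_t status = 0; status <= 0xffff; status++)
        assert(netbsd_handle_status(status) == linux_handle_status(status));
}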



> I find in software, usually blueprint and skyscraper are in some
> sense the same thing. Or in some sense, the skyscraper is
> extracted automatically by a compiler/interpreter and installation
> system - we never actually dig anything ourselves, but have the
> ability to experiment directly on the blueprint as we build it.

> That's because languages already try to provide us reasonable
> abstraction mechanisms by which we can simplify the code and make
> it more "model-like".

Can you give me an example of how this works, or a pre-existing tool that
helps with this simplification? I'm actually currently looking at
re-writing modex to do precisely this (but with the translation
end-point being something close to the original model).


I think any higher-level language, including C, goes in this direction. Some are just a lot better at building abstractions and describing the code in a model-like rather than a tiniest-implementation-detail-like way.

(and sometimes that's not what you want).
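
As a toy illustration (an invented example, not from any real driver):
the same FSM reads much more model-like when the transition structure is
explicit in the code, rather than scattered across flag checks:

enum state { IDLE, CMD_SENT, DONE };
enum event { SUBMIT, IRQ };

/* the transition table *is* the model - compare with the same logic
   hidden in nested ifs over status flags */
static const enum state next[][2] = {
    /*             SUBMIT    IRQ  */
    [IDLE]     = { CMD_SENT, IDLE },
    [CMD_SENT] = { CMD_SENT, DONE },
    [DONE]     = { CMD_SENT, DONE },
};

static enum state step(enum state s, enum event e)
{
    return next[s][e];
}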


> So writing the blueprint twice - once in some modelling language
> and once in a language meant for execution - doesn't really
> provide a lot of benefit. But it requires doing changes at two
> levels whenever the software design changes. It just violates DRY.

IMHO, this would be true if:

a) The implementation language were expressive enough to cleanly
encapsulate the model.

b) The toolchain around it were able to do model verification (e.g. by
exploring LTL bounds).


We are building internal tools for such b)-type checks. They are not quite exhaustive formal verification tools, but tools that can express and check more complex properties at source level without false positives.

(They may have false negatives, but that's not really a showstopper. False positives would be.)


I've heard good things about Rust w.r.t. a), but not b) - I would be
keen to know more.

> What makes sense in our experience is identifying properties you
> would like the system to have and devising a specific checking
> mechanism for that. For example, you might specify the protocol
> of how the device communicates with the system and what
> assumptions it has (like "don't start a second command while the
> first one hasn't finished"), and then check that the driver code
> obeys this policy. Then you can change the software code without
> touching the protocol specification, and just re-verify.

> In the easiest case, the protocol can be a mock device process
> that is inserted via a form of dependency injection to talk to the
> real driver code during verification, and if the code isn't too
> complex you can exhaustively check several test cases.


Ok, from what I understand, this is a "system model" - when I talk about
D3, I'm purely talking about the software model. This model may be part
of a larger scope "system model".


I'm not using the word "system" in any "full system" kind of way. I just mean any component that interacts with some outside world, such as an algorithm, a driver, or the whole kernel.
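
To make the dependency-injection idea from the quote above concrete,
here's a minimal sketch (all names invented, not from any real driver):
the driver talks to the device only through a function-pointer table, so
verification can swap in a mock that also checks the device's protocol
assumptions.

#include <assert.h>
#include <stdbool.h>

struct device_ops {
    void (*submit_cmd)(int cmd);
    bool (*cmd_done)(void);
};

/* mock device used only during verification: it encodes the protocol
   assumption "don't start a second command while the first one hasn't
   finished" as an assertion */
static bool pending;

static void mock_submit_cmd(int cmd)
{
    (void)cmd;
    assert(!pending); /* protocol violation detected here */
    pending = true;
}

static bool mock_cmd_done(void)
{
    bool done = pending; /* a model checker could inject nondeterminism here */
    pending = false;
    return done;
}

static const struct device_ops mock_dev = {
    .submit_cmd = mock_submit_cmd,
    .cmd_done   = mock_cmd_done,
};

/* the driver code takes a struct device_ops * and never knows whether
   it is talking to real hardware or to the mock */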

Usually we are looking for two types of bugs:

 1) the code for some subtle reason isn't doing what one would expect from reading the code, e.g., UB, some wrong assumption of atomicity, etc.

 2) the code is doing what it seems to be doing, but that's not the right thing

The former can usually be specified in a semi-generic way without knowing anything about the application, and without modeling too much of the external system.
In the easiest cases, things like -fsanitize=undefined may already catch most of these issues, especially when used in combination with some other tools (e.g. fuzzers).
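
For instance, a classic type-1 bug like signed overflow is caught at
runtime with no specification effort at all (a minimal sketch):

/* ubsan-demo.c */
#include <limits.h>

int main(void)
{
    int x = INT_MAX;
    return x + 1; /* UB: signed integer overflow, flagged by UBSan at runtime */
}

/* build & run:  cc -fsanitize=undefined ubsan-demo.c && ./a.out */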

But for the latter, we need some way to specify at least what the wrong things are, even if we can't specify what the right thing is.
So there it makes sense to specify some of the assumptions of the other systems that interact with our system.

E.g., I might not be able to specify a video game, but at least I can check that its sequence of OpenGL calls is well-formed (no half-defined objects, for example).

But for that I need to explain what it means for a sequence of OpenGL calls to be well-formed.
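
As a sketch of what that explanation could look like (a simplified,
invented API for illustration, not the real OpenGL entry points): a
checking shim that tracks object state and asserts on misuse.

#include <assert.h>
#include <stdbool.h>

#define MAX_OBJECTS 1024
static bool created[MAX_OBJECTS];

/* well-formedness rule: an object must be created before it is used */
unsigned checked_create_object(void)
{
    static unsigned next_id = 1;
    assert(next_id < MAX_OBJECTS);
    created[next_id] = true;
    return next_id++;
}

void checked_bind_object(unsigned id)
{
    assert(id < MAX_OBJECTS && created[id]); /* catches half-defined objects */
    /* ... forward to the real call ... */
}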


Event-B
http://www.event-b.org/index.html
https://web-archive.southampton.ac.uk/deploy-eprints.ecs.soton.ac.uk/111/1/sld.ch1.intro.pdf

is an interesting (closed) industry take on what you seem to be
describing - and they seem to have taken a more "D3"-like approach.
I don't have much insight into their experiences, although I'm
speaking to one of their practitioners.


Thanks. I haven't had an opportunity to look into Event-B more closely yet. Maybe at some point in the future.


> Only if the code is highly unreadable for specific reasons like
> performance or talking to legacy ABIs, it can make sense to have
> an "executable comment" that implements the same functional
> behavior in a very concise, readable way (with crud
> performance/mock APIs replacing the actual behavior), and check
> that the real code and the "executable comment" do "the same
> thing".

I'm a bit skeptical about the "doxygen" approach (I've seen this in some
versions of Java too, iirc) - basically you're bypassing all the
goodness that the language parser provides by hiding from it - I'm not
sure how this is useful other than to shimmy in a specific band-aid
fix. Which is fine if you're fighting fires, but for a more
structured approach, I'm not seeing what you're seeing, I'm afraid. Open
to being signposted, though.


By executable comment I mean a comment that has a formal semantics that can be executed.

Think pre & post conditions + ghost code. E.g., for a tricky sort algorithm like timsort, the comment might be something like:

/* assuming <assert.h> and <string.h> are included, int input_array[ARRAY_LEN]
   is sorted in place by the timsort code, and insertion_sort(int *, size_t)
   exists */
int __ghost_array[ARRAY_LEN];
memcpy(__ghost_array, input_array, sizeof(__ghost_array));

/* ... complicated timsort code here ... */

insertion_sort(__ghost_array, ARRAY_LEN); /* timsort should give the same output
                                             as insertion sort; note that both
                                             are stable */

for (size_t i = 0; i < ARRAY_LEN; i++)
    assert(input_array[i] == __ghost_array[i]);


This is probably not going to help you find the well-known timsort bug, but it might be enough to find more trivial mistakes when rolling your own timsort.
Anyway, this is what I mean by executable comment - a more readable, maintainable implementation of the code that tells you what the code ought to be doing, plus some checks to say that they're really doing the same thing.

As I understand it, Doxygen is just a way to link human-readable documentation with code, but maybe I'm wrong about this.


> The main issues I see when we look at it through the skyscraper
> analogy is that usually people in other engineering disciplines
> have CAD tools - backed by very powerful computers - to really
> systematically and thoroughly check the design. What we usually
> have is stress testing and some test suites built by hand that
> miss many cases.

Interestingly, I was surprised at the amount of compute available "off
the shelf" - the other day I sent a model out for help, saying with a bit of
trepidation that it might take about 25G of RAM and thrash the
machine, so please be careful - the volunteer came back to me and said
he has 128G of RAM available, and 1TB of RAM in a pinch! What takes an hour
to run on my i5 + 16GB laptop took him 3 minutes!

So in some sense, I believe we're nearing the "Deep Neural Nets" moment,
when GPU computation took inference to the next level.


Yeah, it's pretty impressive. Actually, even the power of laptops is impressive these days.
Things like GenMC or Dat3M can do bounded verification of (slightly massaged) C algorithms from the Linux kernel by more or less exploring the full execution space (of a bounded client) within a couple of minutes on a laptop.
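
To give a flavor of what "bounded client" means here (a generic sketch,
not taken from either tool's test suite): a small program, every
execution of which such a tool can enumerate.

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_int lock;
int counter;

void lock_acquire(void) { while (atomic_exchange(&lock, 1)) ; }
void lock_release(void) { atomic_store(&lock, 0); }

void *thread_fn(void *arg)
{
    (void)arg;
    lock_acquire();
    counter++;            /* protected by the lock under test */
    lock_release();
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_fn, NULL);
    pthread_create(&t2, NULL, thread_fn, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    assert(counter == 2); /* holds in every execution iff the lock works */
    return 0;
}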

But for many scenarios, we are still a factor of at least trillions away from using something where exhaustive exploration works - not something that we'll fix by tying a bunch of supercomputers together...


best wishes,

jonas