Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

From: Peter Zijlstra
Date: Thu Nov 16 2017 - 06:28:16 EST


On Thu, Nov 16, 2017 at 10:18:00AM +0100, Martijn Coenen wrote:
> Thanks Peter for looking at this, more inline.

So I think I'm having a very hard time understanding things because I've
no idea how android works. I've made a number of assumptions below;
please bear with me.

> On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >> + * 1) binder supports a "minimum node priority", meaning that all transactions
> >> + * into a node must run at this priority at a minimum. This means that the
> >> + * desired priority for handling a transaction is not necessarily equal to
> >> + * the priority of the caller.
> >
> > Isn't that the same as running everything in that node (whatever that
> > is) at that given prio?
>
> Not for synchronous transactions; in that case it's max(min_node_prio,
> caller_prio).

But that's exactly what you get!? How would it ever get below
min_node_prio? PI only ever (temporarily) raises prio, it never lowers
it.

> One reason min_node_prio exists is that for certain
> critical nodes, we don't want calls into it to run at a very low
> priority, because the implementation of said calls may take locks that
> are contended. I realize one possible solution to that is to use PI
> mutexes in userspace, but Android doesn't support that today.

Fix that?

> >> + * 2) binder supports asynchronous transactions, where the caller is not blocked
> >> + * on transaction completion; so, it also can't be blocked on an rt_mutex.
> >
> > This is not a theoretically sound thing to do... For a system to be
> > real-time it not only needs to guarantee forward progress but also have
> > bounded runtime.
> >
> > Elevating another task's priority to your own and then not blocking
> > effectively injects time. At the very least this time should be taking
> > into consideration when doing runtime analysis of the system.
>
> In case of asynchronous transactions, we never inherit the priority of
> the caller. We run at max(default_prio, min_node_prio). For the
> default node configuration, this means that we don't change priority
> at all for async transactions (they all run at nice 0). But for nodes
> that have a minimum priority set, we want calls into the node to run
> at that priority instead.

I'm confused -- again. Why not run those threads at the min prio to
begin with?

Also, this doesn't seem related to PI.

> > Also, since its async, why do we care? The original task is still making
> > progress. Only at the point where we start to wait for completion does
> > it make sense to boost. Otherwise you end up having to charge one task
> > double time.
>
> Yeah, so this is actually not about the caller; it's about certain
> critical nodes that need to process events at a high priority. For
> example, consider a node that receives sensor events using async
> transactions.

I'm not understanding; are you saying that some app does an async
request for sensor data, then later this sensor triggers and we need to
route the data back to the app?

But above you said that async requests do not boost the priority; so
then I would expect this sensor's data to land on regular min-prio
thread for dispatch.

I'm unclear of how sensors/devices are hooked up. But I'm assuming you
have a thread that poll()s the device, and this thread is part of a
system service process and that service in turn is a node on the RPC
network.

I would expect the service to have set the thread(s) at the relevant
priority for the device(s) it handles (aka. min prio ?).

> It's imperative that the threads handling these events
> run at real-time priority from the moment they wake up.

Sure, see above, ensure the thread(s) that poll()s and handles the
relevant device(s) are of appropriate priority a priory. Then you don't
need to fix anything in a hurry.

> Binder's thread model has a generic threadpool for the entire process;
> threads aren't dedicated to specific nodes. So to avoid such a thread
> being preempted, we need to raise its priority even before waking it
> up, and the only place to do it is from the caller.

*blink* what? Binder has its own threads? Its not just an RPC/ORB?

> >> + * 3) similarly, there may not necessarily be a thread waiting for
> >> + * transactions at the time the call is made, so we don't know who to proxy-
> >> + * lock the lock for.
> >
> > Non-issue; you have the exact same problem now. Your interface takes a
> > @task argument, so you have to have a @task either way around.
>
> But in my current implementation once a thread does call in for more
> work and finds the work pending, it will raise its own priority.

That smells like fail; if you need to raise your prio it means you are
low(er) prio and could at that point be subject to inversion.

Assuming no idle threads in the pool, and all 'busy' at lower priority.
A medium prio something is running, keeping these pool thingies from
asking for more work. A high prio request comes in.. nobody around to
service. You loose, game over.

> I don't see how this is possible with rt_mutex, because the thread is
> not yet available at the moment the caller goes to sleep.

The fact that you need queueing and do not have a task ready to accept
means you've already lost the game.

> >> + * 4) binder supports nested transactions, where A can call into B, and B can
> >> + * call back into A before returning a reply to the original transaction.
> >> + * This means that if A is blocked on an rt_mutex B holds, B must first wake
> >> + * up A to handle a new transaction, and only then can it proxy-lock and try
> >> + * to acquire the new rt_mutex. This leaves a race condition where B
> >> + * temporarily runs at its original priority.
> >
> > That doesn't sound like something a well formed (RT) program should do
> > in the first place.
>
> What do you mean specifically - nested transactions?

Yeah, that sounds really dodgy. I've been a _long_ time since I read the
RT CORBA stuff, so I'm not sure if there's a sensible solution to the
problem, but it sounds really really dodgy.

Regular RPCs are synchronous; if A asks B to do something, A blocks. B
can then not turn around and ask A something. That is called a deadlock.

The only thing B can do at that point is complete the RPC. You can of
course design your RPC such that it can 'fail' the call and then A
re-issues the call with additional information or somesuch.

Alternatively, if you consider A and B to be services/nodes and you get
service A to call into B and then have B call back into A, you need to
guarantee there are sufficient tasks/threads to complete the entire
transaction _ahead_of_time_. That is, when task 1 of A does the request
into B and blocks, it needs to have reserved another thread (2) of A to
be idle and waiting for B's request.

Without the advance reservation you'll run into deadlocks when the
thread-pool runs dry.

IOW, seriously dodgy stuff best avoided.

> I don't think we're trying to make this fit perfectly with WFQ - we're
> just trying to improve the latency on Android devices where it counts,
> by telling the scheduler which threads are important. Without these
> patches, we would often see binder threads receiving sensor events get
> preempted in the kernel, and the delay would cause issues with things
> like video stabilization, etc.

So I really don't understand how all this works.. how can 'important'
device threads not already be of the right priority. All this fixing up
after the fact just reeks of fail.

The device thread then stuffs its data into whatever RPC channels are
waiting for it and goes back to waiting. Each of those channels wake up
the threads waiting for it (at their own respective priorities) and life
goes on.

If the thread blocks while writing into one of those channels, you again
loose.

> > What you're proposing is a bunch of make-work-ish hacks on a system that
> > isn't designed for RT.
>
> What parts of the system aren't designed for RT?

None of it as far I can tell. The shared thread pools, the quick fix
priority, the nested transaction stuff, those all reek of utter fail and
confusion.

> I'm trying to understand if it's something in the way binder currently
> works - eg the use of asynchronous transactions and generic
> threadpools - or something in this patchset in particular. How should
> we deal with high-priority sensor events that go over binder IPC in
> your opinion?

Have tasks at appropriate priorities a priory wherever and whenever
possible. PI is a (least bad) fix for when two threads interact on a
shared resource. The best fix is taking out the shared resource.

You don't have a shared resource in most of the scenarios described
here.

Use dedicated IPC channels, and avoid mixing priority msgs on a channel.
If you have to, make sure the channels are priority ordered.

The primary rule for systems is simplicity, RT systems more so than
others. What you're describing is far too complicated and riddled with
holes.

Also, have a look at RT CORBA, there's a really nice open-source one
here:

http://www.cs.wustl.edu/~schmidt/TAO.html

That group also has a whole bunch of research papers on the subject. As
said, its been many years since I used any of that so I'm bound to not
know all the new shiny stuff ;-)

(As a bonus, its written in a semi sane language, as opposed to most of
Android :-)