Re: [PATCH] printk: avoid livelock if another CPU printks continuously

From: Steven Rostedt
Date: Tue Feb 09 2016 - 10:50:36 EST


On Tue, 09 Feb 2016 16:24:29 +0100
Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:


> > We released all locks, why can't it be true? What prevents another task
> > on another CPU from coming into this section and updating everything?
>
> If we see that happening, it means another CPU started serving printk
> backlog. In which case, we (this CPU) no longer have the obligation
> to service it.

I'm saying is that we can repeat when not needed to.

> > cpu_relax() is a compiler barrier, not a memory barrier. Yes, the
> > compiler wont cache the variables, but nothing stops the CPU from doing
> > so. All archs use this code.
>
> We expect caches to be coherent. If another CPU modifies these variables,
> our CPU will (eventually) see modified values.

Perhaps, but is that guaranteed on all archs? Maybe not. I have worked
on archs with non coherent caches. They are a pain.

>
> My code does not depend on always catching "console_seq != log_next_seq"
> ==> "console_seq == log_next_seq" transition. It's okay to miss it.
>
> > spin_locks() are memory barriers, and will sync everything.
>
> They are also much heavier than memory reads.

Because they are barriers ;-) But don't worry, a printk is never a fast
path. The overhead of taking the locks is not going to be noticed by
anyone.

>
> >>>> + /* Good, other CPU entered "for(;;)" loop */
> >>>> + goto out;
> >>>> + }
> >>>> + }
> >>>> + /* No one seems to be willing to take it... */
> >>>> + if (console_trylock())
> >>>> + goto again; /* we took it */
> >>>
> >>> Perhaps add a few loops to the taking of the console sem.
> >>
> >> Why?
> >>
> >> If we fail to take the lock, another CPU took it.
> >> There is no need to try harder, as soon as we know that any
> >> other CPU took that lock, we can safely exit this function.
> >
> > Because if this CPU is the one spamming the other CPU, it will widen
> > the window to be the one that takes the lock.
>
> If we reached this code, we aren't the spamming CPU. We are the CPU
> which is being spammed (we are in the loop which services the backlog).

No, I mentioned the taking of console sem. The spamming task will be
trying that a bit, failing and then letting this CPU continue doing its
bidding. The above hopes that the spamming task takes the CPU between
the time this CPU lets go of the lock and does the for loop. But is
that for loop slow enough to catch the spamming CPU writing out more.
Of course, if we add the necessary memory barriers, we don't need to do
that.

>
> > I still think this is very hacky.
>
> I'm not insisting on my patch. Propose a different solution.

I haven't thought about it enough. Perhaps someone else can come up
with something. One thing is to find the spamming code and fix that.

-- Steve