Re: [PATCH] dma-debug: fix debug_dma_assert_idle(), use rcu_read_lock()

From: Dan Williams
Date: Thu Aug 13 2020 - 19:39:47 EST


On Thu, Aug 13, 2020 at 12:03 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Aug 12, 2020 at 8:17 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > Since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> > improved unlock_page(), it has become more noticeable how cow_user_page()
> > in a kernel with CONFIG_DMA_API_DEBUG=y can create and suffer from heavy
> > contention on DMA debug's radix_lock in debug_dma_assert_idle().
>
> Ooh.
>
> Yeah, that's ridiculously expensive, and serializes things for no good reason.
>
> Your patch looks obviously correct to me (Christoph?), but it also
> makes me go "why are we doing this in the first place"?
>
> Because it looks to me like
> (a) the debug check is wrong
> (b) this is left-over from early debugging
>
> In particular, I don't see why we couldn't do a COW on a page that is
> under writeback at the same time. We're not changing the page that is
> doing DMA.
>
> In fact, the whole "COW with DMA" makes me feel like the real bug may
> have been due that whole "ambiguous COW" thing, which was fixed in
> 17839856fd58 ("gup: document and work around "COW can break either
> way" issue")
>
> That debug thing goes back almost 7 years, and I don't think it has
> caught anything in those seven years, but I could be wrong.
>
> The commit that adds it does talk about a bug, but that code was
> removed entirely eventually. And google shows no hits for
> debug_dma_assert_idle() since - until your email.
>
> So my gut feel is that we should remove the check entirely, although
> your patch does seem like a big improvement.
>
> Christoph?
>
> (And Dan too, of course, in case he happens to be relaxing in front of
> the computer away from a newborn baby ;)
>

I can at least confirm that it has not caught anything in a long while
except a false positive that needed a fix up.

https://lore.kernel.org/lkml/CAPcyv4hy_nNe8G0o8sMrz9A8HcdRzAuKgXmvdjKusAAA3Fow4g@xxxxxxxxxxxxxx/

Part of me says it's not doing anything worthwhile upstream, but I
wonder if it is keeping some people from submitting patches that play
these page reference shenanigans? I know they're out there. The land
of gup and truncate is where questionable kernel changes go to die.

Outside of that, Hugh's patch looks like a definite improvement so I'd
be inclined to run with that, but rip the whole facility out at the
next sign of a false positive.