Re: [BUG] kmemleak on __radix_tree_preload

From: Paul E. McKenney
Date: Thu May 08 2014 - 11:00:48 EST


On Thu, May 08, 2014 at 11:24:36AM +0100, Catalin Marinas wrote:
> On Thu, May 08, 2014 at 10:37:40AM +0100, Jaegeuk Kim wrote:
> > 2014-05-08 (Thu), 10:26 +0100, Catalin Marinas:
> > > On Thu, May 08, 2014 at 06:16:51PM +0900, Jaegeuk Kim wrote:
> > > > 2014-05-07 (Wed), 12:39 +0100, Catalin Marinas:
> > > > > On Wed, May 07, 2014 at 03:58:08AM +0100, Jaegeuk Kim wrote:
> > > > > > unreferenced object 0xffff880004226da0 (size 576):
> > > > > > comm "fsstress", pid 14590, jiffies 4295191259 (age 706.308s)
> > > > > > hex dump (first 32 bytes):
> > > > > > 01 00 00 00 81 ff ff ff 00 00 00 00 00 00 00 00 ................
> > > > > > 50 89 34 81 ff ff ff ff b8 6d 22 04 00 88 ff ff P.4......m".....
> > > > > > backtrace:
> > > > > > [<ffffffff816c02e8>] kmemleak_update_trace+0x58/0x80
> > > > > > [<ffffffff81349517>] radix_tree_node_alloc+0x77/0xa0
> > > > > > [<ffffffff81349718>] __radix_tree_create+0x1d8/0x230
> > > > > > [<ffffffff8113286c>] __add_to_page_cache_locked+0x9c/0x1b0
> > > > > > [<ffffffff811329a8>] add_to_page_cache_lru+0x28/0x80
> > > > > > [<ffffffff81132f58>] grab_cache_page_write_begin+0x98/0xf0
> > > > > > [<ffffffffa02e4bf4>] f2fs_write_begin+0xb4/0x3c0 [f2fs]
> > > > > > [<ffffffff81131b77>] generic_perform_write+0xc7/0x1c0
> > > > > > [<ffffffff81133b7d>] __generic_file_aio_write+0x1cd/0x3f0
> > > > > > [<ffffffff81133dfe>] generic_file_aio_write+0x5e/0xe0
> > > > > > [<ffffffff81195c5a>] do_sync_write+0x5a/0x90
> > > > > > [<ffffffff811968d2>] vfs_write+0xc2/0x1d0
> > > > > > [<ffffffff81196daf>] SyS_write+0x4f/0xb0
> > > > > > [<ffffffff816dead2>] system_call_fastpath+0x16/0x1b
> > > > > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > > >
> > > > > OK, it shows that the allocation happens via add_to_page_cache_locked()
> > > > > and I guess it's page_cache_tree_insert() which calls
> > > > > __radix_tree_create() (the latter reusing the preloaded node). I'm not
> > > > > familiar enough with this code (radix-tree.c and filemap.c) to tell where
> > > > > the node should have been freed, or who keeps track of it.
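(For reference, the generic radix-tree preload pattern looks roughly like the
sketch below. This is illustrative only, not verbatim filemap.c code; the
"example_insert" and "example_lock" names are made up for the sketch.)

	/*
	 * Sketch of the generic preload/insert pattern.
	 * radix_tree_preload() fills a per-CPU node pool and, on success,
	 * leaves preemption disabled; radix_tree_node_alloc() can later
	 * take nodes from that pool while the tree lock is held.
	 */
	static int example_insert(struct radix_tree_root *root,
				  unsigned long index, void *item)
	{
		int error;

		error = radix_tree_preload(GFP_KERNEL);
		if (error)
			return error;

		spin_lock(&example_lock);
		error = radix_tree_insert(root, index, item);
		spin_unlock(&example_lock);

		radix_tree_preload_end();	/* re-enables preemption */
		return error;
	}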
> > > > >
> > > > > At a quick look at the hex dump (assuming that the above leak is struct
> > > > > radix_tree_node):
> > > > >
> > > > >   .path = 1
> > > > >   .count = -0x7f (or 0xffffff81 as unsigned int)
> > > > >   union {
> > > > >     {
> > > > >       .parent = NULL
> > > > >       .private_data = 0xffffffff81348950
> > > > >     }
> > > > >     {
> > > > >       .rcu_head.next = NULL
> > > > >       .rcu_head.func = 0xffffffff81348950
> > > > >     }
> > > > >   }
> > > > >
> > > > > The count is a bit suspicious.
> > > > >
> > > > > From the union, this looks most likely like rcu_head information. Is the
> > > > > radix_tree_node_rcu_free() function at the above rcu_head.func?
> > >
> > > Thanks for the config. Could you please confirm that the 0xffffffff81348950
> > > address corresponds to the radix_tree_node_rcu_free() function in your
> > > System.map (or something else)?
> >
> > Yep, the address matches radix_tree_node_rcu_free().
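(For reference, the decode above assumes the struct radix_tree_node layout
from around that kernel version, paraphrased below; treat the exact field
details as approximate:)

	struct radix_tree_node {
		unsigned int	path;	/* offset in parent & height from the bottom */
		unsigned int	count;
		union {
			struct {
				struct radix_tree_node *parent;	/* used when ascending tree */
				void *private_data;		/* for tree user */
			};
			struct rcu_head	rcu_head;		/* used when freeing node */
		};
		struct list_head private_list;
		void __rcu	*slots[RADIX_TREE_MAP_SIZE];
		unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
	};

On 64-bit that puts .path and .count in the first 8 bytes and the
parent/private_data (alias rcu_head.next/rcu_head.func) union in the next 16,
which is what the hex dump was decoded against.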
>
> Cc'ing Paul as well, not that I blame RCU ;), but maybe he could shed
> some light on why kmemleak can't track this object.

Do we have any information on how long it has been since that data
structure was handed to call_rcu()? If that time is short, then it
is quite possible that its grace period simply has not yet completed.

It might also be that one of the CPUs is stuck (e.g., spinning with
interrupts disabled), which would prevent the grace period from
completing, in turn preventing any memory waiting for that grace period
from being freed.
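For reference, the node free path in lib/radix-tree.c of that era looks
roughly like the sketch below (paraphrased, details approximate), which is
why a node waiting out its grace period still has rcu_head.func pointing at
radix_tree_node_rcu_free():

	static void radix_tree_node_rcu_free(struct rcu_head *head)
	{
		struct radix_tree_node *node =
				container_of(head, struct radix_tree_node, rcu_head);

		/* Nodes must go back into the slab zeroed. */
		node->slots[0] = NULL;
		node->count = 0;

		kmem_cache_free(radix_tree_node_cachep, node);
	}

	static inline void radix_tree_node_free(struct radix_tree_node *node)
	{
		/* Actual freeing is deferred until after an RCU grace period. */
		call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
	}

So an object with a populated rcu_head is consistent with a node that has
been queued via call_rcu() but whose callback has not yet run (the callback
is also what finally zeroes ->count).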

> My summary so far:
>
> - radix_tree_node reported by kmemleak as it cannot find any trace of it
> when scanning the memory
> - at allocation time, radix_tree_node is memzero'ed by
> radix_tree_node_ctor(). Given that node->rcu_head.func ==
> radix_tree_node_rcu_free, my guess is that radix_tree_node_free() has
> been called
> - some time later, kmemleak still hasn't received any callback for
> kmem_cache_free(node). Possibly radix_tree_node_rcu_free() hasn't been
> called either, since node->count is non-zero.
>
> For RCU queued objects, kmemleak should still track references to them
> via rcu_sched_state and rcu_head members. But even if this went wrong, I
> would expect the object to be freed eventually and kmemleak notified (so
> just a temporary leak report which doesn't seem to be the case here).

OK, so you are saying that this memory has been in this state for quite
some time?

If the system is responsive during this time, I recommend building with
CONFIG_RCU_TRACE=y, then polling the debugfs rcu/*/rcugp files. The value
of "*" will be "rcu_sched" for kernels built with CONFIG_PREEMPT=n and
"rcu_preempt" for kernels built with CONFIG_PREEMPT=y.

If the number printed does not advance, then the RCU grace period is
stalled, which will prevent memory waiting for that grace period from
ever being freed.
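A minimal userspace sketch of that polling loop (assuming CONFIG_RCU_TRACE=y
and debugfs mounted at /sys/kernel/debug; substitute rcu_preempt for
rcu_sched on CONFIG_PREEMPT=y kernels):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path = "/sys/kernel/debug/rcu/rcu_sched/rcugp";
		char buf[128];
		int i;

		for (i = 0; i < 10; i++) {
			FILE *f = fopen(path, "r");

			if (!f) {
				perror(path);
				return 1;
			}
			if (fgets(buf, sizeof(buf), f))
				printf("%s", buf);	/* grace-period counters */
			fclose(f);
			sleep(5);
		}
		return 0;
	}

If successive reads show the same grace-period number, the grace period is
stalled as described above.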

Of course, if the value of node->count is preventing call_rcu() from
being invoked in the first place, then the needed grace period won't
start, much less finish. ;-)

Thanx, Paul

> I still cannot explain the node->count value above and how it can get
> there (too many node->count--?). Maybe Johannes could shed some light.
>
> Thanks.
>
> --
> Catalin
>
