Re: [4/4] rhashtable: improve rhashtable_walk stability when stop/start used.

From: Paolo Abeni
Date: Thu Jul 05 2018 - 07:20:54 EST


On Tue, 2018-04-24 at 08:29 +1000, NeilBrown wrote:
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 81edf1ab38ab..9427b5766134 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -727,6 +727,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
> __acquires(RCU)
> {
> struct rhashtable *ht = iter->ht;
> + bool rhlist = ht->rhlist;
>
> rcu_read_lock();
>
> @@ -735,13 +736,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
> list_del(&iter->walker.list);
> spin_unlock(&ht->lock);
>
> - if (!iter->walker.tbl && !iter->end_of_table) {
> + if (iter->end_of_table)
> + return 0;
> + if (!iter->walker.tbl) {
> iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
> iter->slot = 0;
> iter->skip = 0;
> return -EAGAIN;
> }
>
> + if (iter->p && !rhlist) {
> + /*
> + * We need to validate that 'p' is still in the table, and
> + * if so, update 'skip'
> + */
> + struct rhash_head *p;
> + int skip = 0;
> + rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
> + skip++;
> + if (p == iter->p) {
> + iter->skip = skip;
> + goto found;
> + }
> + }
> + iter->p = NULL;
> + } else if (iter->p && rhlist) {
> + /* Need to validate that 'list' is still in the table, and
> + * if so, update 'skip' and 'p'.
> + */
> + struct rhash_head *p;
> + struct rhlist_head *list;
> + int skip = 0;
> + rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
> + for (list = container_of(p, struct rhlist_head, rhead);
> + list;
> + list = rcu_dereference(list->next)) {
> + skip++;
> + if (list == iter->list) {
> + iter->p = p;
> + skip = skip;
> + goto found;
> + }
> + }
> + }
> + iter->p = NULL;
> + }
> +found:
> return 0;
> }
> EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);

While testing new code that uses the rhashtable walker, I'm obeserving
an use after free, that is apparently caused by the above:

[ 146.834815] ==================================================================
[ 146.842933] BUG: KASAN: use-after-free in inet_frag_worker+0x9f/0x210
[ 146.850120] Read of size 4 at addr ffff881b6b9342d8 by task kworker/13:1/177
[ 146.857984]
[ 146.859645] CPU: 13 PID: 177 Comm: kworker/13:1 Not tainted 4.18.0-rc3.mirror_unclone_6_frag_dbg+ #1974
[ 146.870128] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
[ 146.878478] Workqueue: events inet_frag_worker
[ 146.883433] Call Trace:
[ 146.886162] dump_stack+0x90/0xe3
[ 146.889861] print_address_description+0x6a/0x2a0
[ 146.895109] kasan_report+0x176/0x2d0
[ 146.899193] ? inet_frag_worker+0x9f/0x210
[ 146.903762] inet_frag_worker+0x9f/0x210
[ 146.908142] process_one_work+0x24f/0x6e0
[ 146.912614] ? process_one_work+0x1a6/0x6e0
[ 146.917285] worker_thread+0x4e/0x3d0
[ 146.921373] kthread+0x106/0x140
[ 146.924970] ? process_one_work+0x6e0/0x6e0
[ 146.929637] ? kthread_bind+0x10/0x10
[ 146.933723] ret_from_fork+0x3a/0x50
[ 146.937717]
[ 146.939377] Allocated by task 177:
[ 146.943170] kasan_kmalloc+0x86/0xb0
[ 146.947158] __kmalloc_node+0x181/0x430
[ 146.951438] kvmalloc_node+0x4f/0x70
[ 146.955427] alloc_bucket_spinlocks+0x34/0xa0
[ 146.960286] bucket_table_alloc.isra.13+0x61/0x180
[ 146.965630] rhashtable_rehash_alloc+0x26/0x80
[ 146.970585] rht_deferred_worker+0x394/0x810
[ 146.975348] process_one_work+0x24f/0x6e0
[ 146.979819] worker_thread+0x4e/0x3d0
[ 146.983902] kthread+0x106/0x140
[ 146.987502] ret_from_fork+0x3a/0x50
[ 146.991487]
[ 146.993146] Freed by task 90:
[ 146.996455] __kasan_slab_free+0x11d/0x180
[ 147.001025] kfree+0x113/0x350
[ 147.004431] bucket_table_free+0x1c/0x70
[ 147.008806] rcu_process_callbacks+0x6c8/0x880
[ 147.013762] __do_softirq+0xd5/0x505
[ 147.017747]
[ 147.019407] The buggy address belongs to the object at ffff881b6b934200
[ 147.019407] which belongs to the cache kmalloc-8192 of size 8192
[ 147.033574] The buggy address is located 216 bytes inside of
[ 147.033574] 8192-byte region [ffff881b6b934200, ffff881b6b936200)
[ 147.046773] The buggy address belongs to the page:
[ 147.052116] page:ffffea006dae4c00 count:1 mapcount:0 mapping:ffff880107c0e400 index:0x0 compound_mapcount: 0
[ 147.063086] flags: 0x17ffffc0008100(slab|head)
[ 147.068043] raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff880107c0e400
[ 147.076684] raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000
[ 147.085324] page dumped because: kasan: bad access detected
[ 147.091540]
[ 147.093210] Memory state around the buggy address:
[ 147.098553] ffff881b6b934180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 147.106613] ffff881b6b934200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 147.114670] >ffff881b6b934280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 147.122730] ^
[ 147.129527] ffff881b6b934300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 147.137587] ffff881b6b934380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 147.145646] ==================================================================

Reverting the above change avoid the issue.

I *think* that reusing iter->p after a
rcu_read_lock()/rcu_read_unlock() is unsafe:
if the table has been reashed we can still and reuse 'p', but the
related grace period could be already expired.

I can't think of any better fix than a revert. Other opinions welcome!

Paolo