Re: [PATCH] dax: fix radix tree insertion race

From: Jan Kara
Date: Tue Apr 11 2017 - 04:30:12 EST


On Mon 10-04-17 14:34:29, Ross Zwisler wrote:
> On Mon, Apr 10, 2017 at 03:41:11PM +0200, Jan Kara wrote:
> > On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> > > While running generic/340 in my test setup I hit the following race. It can
> > > happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
> > >
> > > Thread 1                                  Thread 2
> > > --------                                  --------
> > > dax_iomap_pmd_fault()
> > >   grab_mapping_entry()
> > >     spin_lock_irq()
> > >     get_unlocked_mapping_entry()
> > >     'entry' is NULL, can't call lock_slot()
> > >     spin_unlock_irq()
> > >     radix_tree_preload()
> > >                                           dax_iomap_pmd_fault()
> > >                                             grab_mapping_entry()
> > >                                               spin_lock_irq()
> > >                                               get_unlocked_mapping_entry()
> > >                                               ...
> > >                                               lock_slot()
> > >                                               spin_unlock_irq()
> > >                                             dax_pmd_insert_mapping()
> > >                                               <inserts a PMD mapping>
> > >     spin_lock_irq()
> > >     __radix_tree_insert() fails with -EEXIST
> > >     <fall back to 4k fault, and die horribly
> > >      when inserting a 4k entry where a PMD exists>
> > >
> > > The issue is that we have to drop mapping->tree_lock while calling
> > > radix_tree_preload(), but since we didn't have a radix tree entry to lock
> > > (unlike in the pmd_downgrade case) we have no protection against Thread 2
> > > coming along and inserting a PMD at the same index. For 4k entries we
> > > handled this with a special-case response to -EEXIST coming from the
> > > __radix_tree_insert(), but this doesn't save us for PMDs because the
> > > -EEXIST case can also mean that we collided with a 4k entry in the radix
> > > tree at a different index, but one that is covered by our PMD range.
> > >
> > > So, correctly handle both the 4k and 2M collision cases by explicitly
> > > re-checking the radix tree for an entry at our index once we reacquire
> > > mapping->tree_lock.
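> > >
> > > In outline, the re-check in grab_mapping_entry() looks roughly like
> > > this (a sketch against the v4.11-era fs/dax.c structure, not the
> > > exact diff; error handling abbreviated):
> > >
> > >     /* tree_lock must be dropped across radix_tree_preload() */
> > >     err = radix_tree_preload(mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> > >     if (err)
> > >         return ERR_PTR(err);
> > >     spin_lock_irq(&mapping->tree_lock);
> > >
> > >     if (!entry) {
> > >         /*
> > >          * We had no entry to lock while the lock was dropped, so
> > >          * another thread may have inserted one (4k or PMD) at our
> > >          * index in the meantime.  Look it up again and restart if so.
> > >          */
> > >         entry = __radix_tree_lookup(&mapping->page_tree, index,
> > >                                     NULL, &slot);
> > >         if (entry) {
> > >             radix_tree_preload_end();
> > >             spin_unlock_irq(&mapping->tree_lock);
> > >             goto restart;
> > >         }
> > >     }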
> > >
> > > This patch has made it through a clean xfstests run with the current
> > > v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
> > > loop. It used to fail within the first 10 iterations.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> > > Cc: <stable@xxxxxxxxxxxxxxx> [4.10+]
> >
> > The patch looks good to me (and I can see Andrew already sent it to Linus),
> > I'm just wondering where things actually went wrong. I'd expect we would
> > return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and then do a PTE fault
> > for the address, which should just work out fine...
>
> Yep, that's what I thought as well, and I think it does work for processes
> that have separate page tables. The second process will do a 4k fault (just
> as it would have if it had a VMA smaller than 2MiB, for example), map the
> 4k page into its page table, and just dirty the 2MiB DAX entry in the radix
> tree. I've tested this case manually in the past.
>
> I think the error case that I was seeing was for threads that share page
> tables. In that case the 2nd thread falls back to PTEs, but there is already
> a PMD in the page table from the first fault. When we try to insert a PTE
> over the PMD we get the following BUG:

Ouch, right. We have to be really careful to keep the radix tree entries
consistent with what we have in the page tables, so that our locking works...
Thanks for the explanation.

Honza

>
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: do_raw_spin_trylock+0x5/0x40
> PGD 8d6ee0067
> PUD 8db6e8067
> PMD 0
>
> Oops: 0000 [#1] PREEMPT SMP
> Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm [last unloaded: scsi_debug]
> CPU: 2 PID: 25323 Comm: holetest Not tainted 4.11.0-rc4 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> task: ffff880095492a00 task.stack: ffffc90014048000
> RIP: 0010:do_raw_spin_trylock+0x5/0x40
> RSP: 0000:ffffc9001404bb60 EFLAGS: 00010296
> RAX: ffff880095492a00 RBX: 0000000000000018 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffffc9001404bb80 R08: 0000000000000001 R09: 0000000000000000
> R10: ffff880095492a00 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff8808d5fe4220 R14: ffff88004c3e3c80 R15: 8000000000000025
> FS: 00007f7ed7dff700(0000) GS:ffff8808de400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000008d86f6000 CR4: 00000000001406e0
> Call Trace:
> ? _raw_spin_lock+0x49/0x80
> ? __get_locked_pte+0x16b/0x1d0
> __get_locked_pte+0x16b/0x1d0
> insert_pfn.isra.68+0x3a/0x100
> vm_insert_mixed+0x64/0x90
> dax_iomap_fault+0xa41/0x1680
> ext4_dax_huge_fault+0xa9/0xd0
> ext4_dax_fault+0x10/0x20
> __do_fault+0x20/0x130
> __handle_mm_fault+0x9b3/0x1190
> handle_mm_fault+0x169/0x370
> ? handle_mm_fault+0x47/0x370
> __do_page_fault+0x28f/0x590
> trace_do_page_fault+0x58/0x2c0
> do_async_page_fault+0x2c/0x90
> async_page_fault+0x28/0x30
> RIP: 0033:0x4014b2
> RSP: 002b:00007f7ed7dfef20 EFLAGS: 00010216
> RAX: 00007f7ec6c00400 RBX: 0000000000010000 RCX: 0000000001c00000
> RDX: 0000000000001c01 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 00007f7ed7dff700 R08: 00007f7ed7dff700 R09: 00007f7ed7dff700
> R10: 00007f7ed7dff9d0 R11: 0000000000000202 R12: 00007f7ec6c00000
> R13: 00007ffe3ffb5b60 R14: 0000000000000400 R15: 00007f7ed7dff700
> Code: 30 84 ee 81 48 89 df e8 4a fe ff ff eb 89 89 c6 48 89 df e8 7e e7 ff ff eb 8c 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 55 48 89 e5 85 c0 75 2b ba 01 00 00 00 f0 0f b1 17 85 c0
> RIP: do_raw_spin_trylock+0x5/0x40 RSP: ffffc9001404bb60
> CR2: 0000000000000000
> ---[ end trace 75d38250d89b67cd ]---
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR