Re: [PATCH v8 6/9] dax: add support for fsync/msync

From: Ross Zwisler
Date: Wed Jan 13 2016 - 13:58:43 EST


On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > And secondly: You must write-protect all mappings of the flushed range so
> > > that you get fault when the sector gets written-to again. We spoke about
> > > this in the past already but somehow it got lost and I forgot about it as
> > > well. You need something like rmap_walk_file()...
> >
> > The code that write protected mappings and then cleaned the radix tree entries
> > did get written, and was part of v2:
> >
> > https://lkml.org/lkml/2015/11/13/759
> >
> > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > The reason behind this was that there was a race that I couldn't figure out
> > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > tree entries.
> >
> > The race goes like this:
> >
> > Thread 1 (write) Thread 2 (fsync)
> > ================ ================
> > wp_pfn_shared()
> > pfn_mkwrite()
> > dax_radix_entry()
> > radix_tree_tag_set(DIRTY)
> > dax_writeback_mapping_range()
> > dax_writeback_one()
> > radix_tag_clear(DIRTY)
> > pgoff_mkclean()
> > ... return up to wp_pfn_shared()
> > wp_page_reuse()
> > pte_mkdirty()
> >
> > After this sequence we end up with a dirty PTE that is writeable, but with a
> > clean radix tree entry. This means that users can write to the page, but that
> > a follow-up fsync or msync won't flush this dirty data to media.
> >
> > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > the DAX code has control over when the radix tree entry is dirtied but not
> > when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
> > This means that we can't easily add locking, etc. to protect ourselves.
> >
> > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > easy solutions presented themselves in the absence of a page lock. I do have
> > one idea, but I think it's pretty invasive and will need to wait for another
> > kernel cycle.
> >
> > The current code that leaves the radix tree entry will give us correct
> > behavior - it'll just be less efficient because we will have an ever-growing
> > dirty set to flush.
>
> Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> tags but it does not (I should have known, I have written that functions
> few years ago ;). Makes sense. Thanks for clarification.
>
> > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > > * dax_pfn_mkwrite - handle first write to DAX page
> > > > * @vma: The virtual memory area where the fault occurred
> > > > * @vmf: The description of the fault
> > > > - *
> > > > */
> > > > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > {
> > > > - struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > + struct file *file = vma->vm_file;
> > > >
> > > > - sb_start_pagefault(sb);
> > > > - file_update_time(vma->vm_file);
> > > > - sb_end_pagefault(sb);
> > > > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > >
> > > Why is NO_SECTOR argument correct here?
> >
> > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > tree - just make that entry dirty". This works because pfn_mkwrite() always
> > follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
> > will insert the radix tree entry, regardless of whether the fault was for a
> > read or a write. If the fault was for a write, the radix tree entry will also
> > be made dirty.
> >
> > For reads the radix tree entry will be inserted but left clean. When the
> > first write happens we will get a pfn_mkwrite() call, which will call
> > dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
> > tree entry & set the dirty tag.
>
> So the explanation of this should be somewhere so that everyone knows that
> we must have radix tree entries even for clean mapped blocks. Because upto
> know that was not clear to me. Also __dax_pmd_fault() seems to insert
> entries only for write fault so the assumption doesn't seem to hold there?

Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
4k pages. You are right about our handling of 2MiB pages - for a read
followed by a write we will just call into the normal __dax_pmd_fault() code
again, which will do the get_block() call and insert a dirty radix tree entry.
Because we have to go all the way through the fault handler again at write
time there isn't a benefit to inserting a clean radix tree entry on read, so
we just skip it.

> I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
> with hole punching. But I guess I can live with that.
>
> Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR