Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

From: Dave Chinner
Date: Tue Jun 20 2017 - 06:12:01 EST


On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> >> Second: syncing extents. Here's a straw man. Forget the mmap() flag.
> >> Instead add a new msync() operation:
> >>
> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
> >
> > How's this any different from the fallocate command I proposed to do
> > this (apart from the fact that msync() is not intended to be abused
> > as a file preallocation/zeroing interface)?
>
> I must have missed that suggestion.
>
> But it's different in a major way. fallocate() takes an fd parameter,
> which means that, if some flag gets set, it's set on the struct file.

DAX is a property of the inode, not the VMA or struct file as it
needs to be consistent across all VMAs and struct files that
reference that inode. Also, fallocate() manipulates state and
metadata hidden behind the struct inode, not the struct file, so it
seems to me like the right API to use.

And, as mmap() requires a fd to set up the mapping and fallocate()
would have to be run *before* mmap() is used to access the data
directly, I don't see why using fallocate would be a problem here...

> >> If this operation succeeds, it guarantees that all future writes
> >> through this mapping on this range will hit actual storage and that
> >> all the metadata operations needed to make this write persistent will
> >> hit storage such that they are ordered before the user's writes.
> >> As an implementation detail, this will flush out the extents if
> >> needed. In addition, if the FS has any mechanism that would cause
> >> problems asyncronously later on (dedupe? deallocated extents full
> >> of zeros? defrag?),
> >
> > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
> > other background filesystem maintenance operations, etc can all
> > change the extent layout asynchronously if there's no mechanism to
> > tell the fs not to modify the inode extent layout.
>
> But that's my whole point. The kernel doesn't really need to prevent
> all these background maintenance operations -- it just needs to block
> .page_mkwrite until they are synced. I think that whatever new
> mechanism we add for this should be sticky, but I see no reason why
> the filesystem should have to block reflink on a DAX file entirely.

I understand the problem quite well, thank you very much. Yes,
COW operations (and other things) can be handled by invalidating DAX
mappings and blocking new page faults. I see little difference
between this and running the sync path after page-mkwrite has
triggered filesystem metadata changes (e.g. block allocation). i.e.
If MAP_SYNC is going to be used, then all the things you are talking
about comes along for the ride via invalidations.

The MAP_SYNC proposal is effectively "run the metadata side of
fdatasync() on every page fault". If the inode is not metadata
dirty, then it will do nothing, otherwise it will do what it needs
to stabilise the inode for userspace to be able to sync the data and
it will block until it is done.

Prediction for the MAP_SYNC future: frequent bug reports about huge,
unpredictable page fault latencies on DAX files because every so
often a page fault is required to sync tens of thousands of
unrelated dirty objects because of filesystem journal ordering
constraints....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx