Re: [PATCH] dax, pmem: add support for msync

From: Dave Chinner
Date: Wed Sep 02 2015 - 01:17:31 EST


On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote:
> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
> > Which means applications that should "just work" without
> > modification on DAX are now subtly broken and don't actually
> > guarantee data is safe after a crash. That's a pretty nasty
> > landmine, and goes against *everything* we've claimed about using
> > DAX with existing applications.
> >
> > That's wrong, and needs fixing.
>
> I agree that we need to fix fsync as well, and that the fsync solution could
> be used to implement msync if we choose to go that route. I think we might
> want to consider keeping the msync and fsync implementations separate, though,
> for two reasons.
>
> 1) The current msync implementation is much more efficient than what will be
> needed for fsync. Fsync will need to call into the filesystem, traverse all
> the blocks, get kernel virtual addresses from those and then call
> wb_cache_pmem() on those kernel addresses. I think this is a necessary evil
> for fsync since you don't have a VMA, but for msync we do and we can just
> flush using the user addresses without any fs lookups.

Yet you're ignoring the fact that flushing the entire range of the
relevant VMAs may not be very efficient. It may be a very
large mapping with only a few pages that need flushing from the
cache, but you still iterate the mappings flushing GB ranges from
the cache at a time.

We don't need struct pages to track page dirty state. We already
have a method for doing this that does not rely on having a struct
page and can be used for efficient lookup of exact dirty ranges. i.e
the per-page dirty tag that is kept in the mapping radix tree. It's
fine grained, and extremely efficient in terms of lookups to find
dirty pages.

Indeed, the mapping tree tags were specifically designed to avoid
this "fsync doesn't know what range to flush" problem for normal
files. That same problem still exists here for msync - these patches
are just hitting it with a goddamn massive hammer "because it is
easy" rather than attempting to do the flushing efficiently.

> 2) I believe that the near-term fsync code will rely on struct pages for
> PMEM, which I believe are possible but optional as of Dan's last patch set:
>
> https://lkml.org/lkml/2015/8/25/841
>
> I believe that this means that if we don't have struct pages for PMEM (becuase
> ZONE_DEVICE et al. are turned off) fsync won't work. I'd be nice not to lose
> msync as well.

I don't think this is an either-or situation. If we use struct pages
for PMEM, then fsync will work without modification as DAX will need
to use struct pages and hence we can insert them in the mapping
radix tree directly and use the normal set/clear_page_dirty() calls
to track dirty state. It will give us fine grained flush capability
and we won't want msync() to be using the big hammer if we can avoid
it.

If we make the existing pfn-based DAX code track dirty pages via
mapping radix tree tags right now, then we allow fsync to work by
reusing most of the infrastructure we already have. That means DAX
and fsync will work exactly the same regardless of how we
index/reference PMEM in future and we won't have to come back and
fix it all up again.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/