Re: [PATCH] dax, pmem: add support for msync

From: Ross Zwisler
Date: Wed Sep 02 2015 - 13:47:59 EST


On Tue, Sep 01, 2015 at 04:12:42PM +0300, Boaz Harrosh wrote:
> On 08/31/2015 09:59 PM, Ross Zwisler wrote:
> > @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> > return dax_zero_page_range(inode, from, length, get_block);
> > }
> > EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +void dax_sync_range(unsigned long addr, size_t len)
> > +{
> > + while (len) {
> > + size_t chunk_len = min_t(size_t, SZ_1G, len);
> > +
>
> Where does the SZ_1G come from is it because you want to do cond_resched()
> every 1G bytes so not to get stuck for a long time?
>
> It took me a while to catch, At first I thought it might be do to wb_cache_pmem()
> limitations. Would you put a comment in the next iteration?

Yep, the SZ_1G is just to make sure we cond_reshced() every once in a while.
Is there a documented guideline somewhere as to how long a kernel thread is
allowed to spin before calling cond_resched()? So far I haven' been able to
find anything solid on this - it seems like each developer has their own
preferences, and that those preferences vary pretty widely.

In any case, assuming we continue to separate the msync() and fsync()
implementations for DAX (which right now I'm doubting, to be honest), I'll add
in a comment to explain this logic.

> > diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> > index 85f810b3..aa29ebb 100644
> > --- a/include/linux/pmem.h
> > +++ b/include/linux/pmem.h
> > @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
> > {
> > BUG();
>
> See below
>
> > }
> > +
> > +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
> > +{
> > + BUG();
>
> There is a clflush_cache_range() defined for generic use. On ADR systems (even without pcommit)
> this works perfectly and is persistent. why not use that in the generic case?

Nope, we really do need to use wb_cache_pmem() because clflush_cache_range()
isn't an architecture neutral API. wb_cache_pmem() also has the advantage
that on x86 it will take advantage of the new CLWB instruction if it is
available on the platform, and it doesn't introduce any unnecessary memory
fencing. This works on both PCOMMIT-aware systems and on ADR boxes without
PCOMMIT.

> One usage of pmem is overlooked by all this API. The use of DRAM as pmem, across a VM
> or cross reboot. you have a piece of memory exposed as pmem to the subsytem which survives
> past the boot of that system. The CPU cache still needs flushing in this case.
> (People are already using this for logs and crash dumps)

I'm confused about this "DRAM as pmem" use case - are the requirements
essentially the same as the ADR case? You need to make sure that pre-reboot
the dirty cache lines have been flushed from the processor cache, but if they
are in platform buffers (the "safe zone" for ADR) you're fine?

If so, we're good to go, I think. Dan's most recent patch series made it so
we correctly handle systems that have the PMEM API but not PCOMMIT:

https://lists.01.org/pipermail/linux-nvdimm/2015-August/002005.html

If the "DRAM as pmem across reboots" case isn't okay with your dirty data
being in the ADR safe zone, I think you're toast. Without PCOMMIT the kernel
cannot guarantee that the data has ever made it durably to the DIMMs,
regardless what clflush/clflushopt/clwb magic you do.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/