NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

From: Mikulas Patocka
Date: Mon Sep 21 2020 - 12:20:53 EST




On Wed, 16 Sep 2020, Mikulas Patocka wrote:

>
>
> On Wed, 16 Sep 2020, Dan Williams wrote:
>
> > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > >
> > > > My first question about nvfs is how it compares to a daxfs with
> > > > executables and other binaries configured to use page cache with the
> > > > new per-file dax facility?
> > >
> > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > because it doesn't have the overhead of the buffer cache and bios. See
> > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> >
> > ...and that metadata problem is intractable upstream? Christoph poked
> > at bypassing the block layer for xfs metadata operations [1], I just
> > have not had time to carry that further.
> >
> > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > he's dropped that branch from his xfs.git
>
> XFS is very big. I wanted to create something small.

Another difference is that XFS metadata is optimized for disks and
SSDs.

On disks and SSDs, reading one byte is as costly as reading a full block,
so we must put as much information into a block as possible. XFS uses
b+trees for file block mapping and for directories - it is a reasonable
decision because b+trees minimize the number of disk accesses.

On persistent memory, each access has its own cost, so NVFS uses metadata
structures that minimize the number of cache lines accessed (rather than
the number of blocks accessed). For block mapping, NVFS uses the classic
unix direct/indirect blocks - if a file block is mapped by a 3rd-level
indirect block, we do just three memory accesses and we are done. If we
used b+trees, the number of accesses would be much larger than 3 (we would
have to do a binary search in each b+tree node on the path).
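
To illustrate the direct/indirect scheme, here is a minimal user-space
sketch of how a file block number decomposes into one index per
indirection level, so a lookup costs exactly one pointer load per level.
It is not taken from the NVFS source: N_DIRECT, PTRS_PER_BLOCK and
block_path() are hypothetical names and values chosen for the example.

```c
#include <stdint.h>

#define N_DIRECT       16u   /* assumed number of direct pointers in the inode */
#define PTRS_PER_BLOCK 512u  /* assumed: 4096-byte block / 8-byte pointer */

/*
 * Return the indirection depth needed to reach file block 'blk'
 * (0 = direct pointer, 1..3 = single/double/triple indirect) and fill
 * idx[] with the slot to follow at each level, outermost first.
 * Returns -1 if the block lies beyond the triple-indirect range.
 */
static int block_path(uint64_t blk, unsigned idx[3])
{
    uint64_t span = PTRS_PER_BLOCK;  /* blocks covered at the current depth */
    int depth, i;

    if (blk < N_DIRECT) {
        idx[0] = (unsigned)blk;
        return 0;
    }
    blk -= N_DIRECT;

    for (depth = 1; depth <= 3; depth++) {
        if (blk < span) {
            /* split blk into 'depth' base-PTRS_PER_BLOCK digits */
            for (i = depth - 1; i >= 0; i--) {
                idx[i] = (unsigned)(blk % PTRS_PER_BLOCK);
                blk /= PTRS_PER_BLOCK;
            }
            return depth;
        }
        blk -= span;
        span *= PTRS_PER_BLOCK;
    }
    return -1;
}
```

The returned depth is the number of indirect blocks that have to be
read, so it never exceeds three no matter how large the file is. The
indices are computed with arithmetic only, without comparing keys in
memory.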

The same goes for directories - NVFS hashes the file name and uses a
radix tree to locate the directory page where the directory entry is
located. XFS b+trees would result in many more accesses than the radix
tree.
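
The directory lookup can be sketched in the same way: hash the name, then
use successive bit fields of the hash as slot indices in a fixed-depth
radix tree whose leaf slots point to directory pages. This is only an
illustration under assumptions. The FNV-1a hash, the fan-out of 16, the
depth of 2 and all the names below are hypothetical and are not the NVFS
on-media format.

```c
#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS   4u                  /* assumed fan-out of 16 per level */
#define RADIX_SLOTS  (1u << RADIX_BITS)
#define RADIX_LEVELS 2u                  /* assumed fixed tree depth */

struct radix_node {
    void *slot[RADIX_SLOTS];  /* child node, or directory page at the leaf */
};

/* FNV-1a, used here only as a stand-in for the real name hash */
static uint32_t name_hash(const char *name)
{
    uint32_t h = 2166136261u;

    while (*name) {
        h ^= (unsigned char)*name++;
        h *= 16777619u;
    }
    return h;
}

/* Pick the bit field of the hash that selects the slot at 'level' */
static unsigned slot_index(uint32_t hash, unsigned level)
{
    return (hash >> (level * RADIX_BITS)) & (RADIX_SLOTS - 1);
}

/* One pointer load per level, no key comparisons inside the nodes */
static void *dir_lookup_page(struct radix_node *root, const char *name)
{
    uint32_t h = name_hash(name);
    struct radix_node *n = root;
    unsigned l;

    for (l = 0; l + 1 < RADIX_LEVELS && n; l++)
        n = n->slot[slot_index(h, l)];
    return n ? n->slot[slot_index(h, RADIX_LEVELS - 1)] : NULL;
}

/* Attach 'page' as the directory page for the hash bucket of 'name' */
static int dir_set_page(struct radix_node *root, const char *name, void *page)
{
    uint32_t h = name_hash(name);
    struct radix_node *n = root;
    unsigned l, i;

    for (l = 0; l + 1 < RADIX_LEVELS; l++) {
        i = slot_index(h, l);
        if (!n->slot[i]) {
            n->slot[i] = calloc(1, sizeof(struct radix_node));
            if (!n->slot[i])
                return -1;
        }
        n = n->slot[i];
    }
    n->slot[slot_index(h, RADIX_LEVELS - 1)] = page;
    return 0;
}
```

Several names can hash to the same leaf slot, so the page found this way
still has to be searched for the matching entry. The walk down to that
page touches one slot per level.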

Regarding journaling - NVFS doesn't do it because persistent memory is so
fast that we can just check the filesystem after a crash. NVFS has a
multithreaded fsck that can check 3 million inodes per second. XFS does
journaling (it was a reasonable decision for disks, where fsck took hours)
and this adds overhead to all filesystem operations.

Mikulas