Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

From: Darrick J. Wong
Date: Mon Oct 03 2022 - 20:12:51 EST


On Fri, Sep 30, 2022 at 09:56:41AM +0900, Gotou, Yasunori/五島 康文 wrote:
> Hello everyone,
>
> On 2022/09/20 11:38, Yang, Xiao/杨 晓 wrote:
> > Hi Darrick, Brian and Christoph
> >
> > Ping. I hope to get your feedback.
> >
> > 1) I have confirmed that the following patch set did not change the test
> > result of generic/470 with thin-volume. Besides, I didn't see any
> > failure when running generic/470 based on normal PMEM device instaed of
> > thin-volume.
> > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@xxxxxx/
> >
> > 2) I can reproduce the failure of generic/482 without thin-volume.
> >
> > 3) Is it necessary to make thin-volume support DAX. Is there any use
> > case for the requirement?
>
>
> Though I asked other place(*), I really want to know the usecase of
> dm-thin-volume with DAX and reflink.
>
>
> In my understanding, dm-thin-volume seems to provide similar feature like
> reflink of xfs. Both feature provide COW update to reduce usage of
> its region, and snapshot feature, right?
>
> I found that docker seems to select one of them (or other feature which
> supports COW). Then user don't need to use thin-volume and reflink at same
> time.
>
> Database which uses FS-DAX may want to use snapshot for its data of FS-DAX,
> its user seems to be satisfied with reflink or thin-volume.
>
> So I could not find on what use-case user would like to use dm-thin-volume
> and reflink at same time.
>
> The only possibility is that the user has mistakenly configured dm-thinpool
> and reflink to be used at the same time, but if that is the case, it seems
> to be better for the user to disable one or the other.
>
> I really wander why dm-thin-volume must be used with reflik and FS-DAX.

There isn't a hard requirement between fsdax and dm-thinp. The /test/
needs dm-logwrites to check that write page faults on a MAP_SYNC
mmapping are persisted directly to disk. dm-logwrites requires a fast
way to zero an entire device for correct operation of the replay step,
and thinp is the only way to guarantee that.

--D

> If my understanding is something wrong, please correct me.
>
> (*)https://lore.kernel.org/all/TYWPR01MB1008258F474CA2295B4CD3D9B90549@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Thanks,
> ---
> Yasunori Goto
>
>
> >
> > Best Regards,
> > Xiao Yang
> >
> > On 2022/9/16 10:04, Yang, Xiao/杨 晓 wrote:
> > > On 2022/9/15 18:14, Yang, Xiao/杨 晓 wrote:
> > > > On 2022/9/15 0:28, Darrick J. Wong wrote:
> > > > > On Wed, Sep 14, 2022 at 08:34:26AM -0400, Brian Foster wrote:
> > > > > > On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> > > > > > > On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > > > > > > > On 2022/9/9 21:01, Brian Foster wrote:
> > > > > > > > > Yes.. I don't recall all the internals of
> > > > > > > > > the tools and test, but IIRC
> > > > > > > > > it relied on discard to perform zeroing
> > > > > > > > > between checkpoints or some such
> > > > > > > > > and avoid spurious failures. The purpose of running on dm-thin was
> > > > > > > > > merely to provide reliable discard zeroing
> > > > > > > > > behavior on the target device
> > > > > > > > > and thus to allow the test to run reliably.
> > > > > > > > Hi Brian,
> > > > > > > >
> > > > > > > > As far as I know, generic/470 was original designed to verify
> > > > > > > > mmap(MAP_SYNC) on the dm-log-writes device enabling DAX. Due to the
> > > > > > > > reason, we need to ensure that all underlying devices under
> > > > > > > > dm-log-writes device support DAX. However
> > > > > > > > dm-thin device never supports
> > > > > > > > DAX so
> > > > > > > > running generic/470 with dm-thin device always returns "not run".
> > > > > > > >
> > > > > > > > Please see the difference between old and new logic:
> > > > > > > >
> > > > > > > >             old logic                          new logic
> > > > > > > > ---------------------------------------------------------------
> > > > > > > > log-writes device(DAX)                 log-writes device(DAX)
> > > > > > > >               |                                       |
> > > > > > > > PMEM0(DAX) + PMEM1(DAX)       Thin device(non-DAX) + PMEM1(DAX)
> > > > > > > >                                             |
> > > > > > > >                                           PMEM0(DAX)
> > > > > > > > ---------------------------------------------------------------
> > > > > > > >
> > > > > > > > We think dm-thin device is not a good solution
> > > > > > > > for generic/470, is there
> > > > > > > > any other solution to support both discard zero and DAX?
> > > > > > >
> > > > > > > Hi Brian,
> > > > > > >
> > > > > > > I have sent a patch[1] to revert your fix because I
> > > > > > > think it's not good for
> > > > > > > generic/470 to use thin volume as my revert patch[1] describes:
> > > > > > > [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx.jy@xxxxxxxxxxx/T/#u
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > I think the history here is that generic/482 was changed over first in
> > > > > > commit 65cc9a235919 ("generic/482: use thin volume as
> > > > > > data device"), and
> > > > > > then sometime later we realized generic/455,457,470 had
> > > > > > the same general
> > > > > > flaw and were switched over. The dm/dax compatibility
> > > > > > thing was probably
> > > > > > just an oversight, but I am a little curious about that
> > > > > > because it should
> > > > >
> > > > > It's not an oversight -- it used to work (albeit with EXPERIMENTAL
> > > > > tags), and now we've broken it on fsdax as the pmem/blockdev divorce
> > > > > progresses.
> > > > Hi
> > > >
> > > > Do you mean that the following patch set changed the test result
> > > > of generic/470 with thin-volume? (pass => not run/failure)
> > > > https://lore.kernel.org/linux-xfs/20211129102203.2243509-1-hch@xxxxxx/
> > > >
> > > > >
> > > > > > have been obvious that the change caused the test to no
> > > > > > longer run. Did
> > > > > > something change after that to trigger that change in behavior?
> > > > > >
> > > > > > > With the revert, generic/470 can always run
> > > > > > > successfully on my environment
> > > > > > > so I wonder how to reproduce the out-of-order replay issue on XFS v5
> > > > > > > filesystem?
> > > > > > >
> > > > > >
> > > > > > I don't quite recall the characteristics of the failures
> > > > > > beyond that we
> > > > > > were seeing spurious test failures with generic/482 that were due to
> > > > > > essentially putting the fs/log back in time in a way that wasn't quite
> > > > > > accurate due to the clearing by the logwrites tool not
> > > > > > taking place. If
> > > > > > you wanted to reproduce in order to revisit that, perhaps start with
> > > > > > generic/482 and let it run in a loop for a while and see if it
> > > > > > eventually triggers a failure/corruption..?
> > > > > >
> > > > > > > PS: I want to reproduce the issue and try to find a
> > > > > > > better solution to fix
> > > > > > > it.
> > > > > > >
> > > > > >
> > > > > > It's been a while since I looked at any of this tooling
> > > > > > to semi-grok how
> > > > > > it works.
> > > > >
> > > > > I /think/ this was the crux of the problem, back in 2019?
> > > > > https://lore.kernel.org/fstests/20190227061529.GF16436@dastard/
> > > >
> > > > Agreed.
> > > >
> > > > >
> > > > > > Perhaps it could learn to rely on something more explicit like
> > > > > > zero range (instead of discard?) or fall back to manual zeroing?
> > > > >
> > > > > AFAICT src/log-writes/ actually /can/ do zeroing, but (a) it probably
> > > > > ought to be adapted to call BLKZEROOUT and (b) in the worst case it
> > > > > writes zeroes to the entire device, which is/can be slow.
> > > > >
> > > > > For a (crass) example, one of my cloudy test VMs uses 34GB partitions,
> > > > > and for cost optimization purposes we're only "paying" for the cheapest
> > > > > tier.  Weirdly that maps to an upper limit of 6500 write iops and
> > > > > 48MB/s(!) but that would take about 20 minutes to zero the entire
> > > > > device if the dm-thin hack wasn't in place.  Frustratingly, it doesn't
> > > > > support discard or write-zeroes.
> > > >
> > > > Do you mean that discard zero(BLKDISCARD) is faster than both
> > > > fill zero(BLKZEROOUT) and write zero on user space?
> > >
> > > Hi Darrick, Brian and Christoph
> > >
> > > According to the discussion about generic/470. I wonder if it is
> > > necessary to make thin-pool support DAX. Is there any use case for
> > > the requirement?
> > >
> > > Best Regards,
> > > Xiao Yang
> > > >
> > > > Best Regards,
> > > > Xiao Yang
> > > > >
> > > > > > If the
> > > > > > eventual solution is simple and low enough overhead, it
> > > > > > might make some
> > > > > > sense to replace the dmthin hack across the set of tests mentioned
> > > > > > above.
> > > > >
> > > > > That said, for a *pmem* test you'd expect it to be faster than that...
> > > > >
> > > > > --D
> > > > >
> > > > > > Brian
> > > > > >
> > > > > > > Best Regards,
> > > > > > > Xiao Yang
> > > > > > >
> > > > > > > >
> > > > > > > > BTW, only log-writes, stripe and linear support DAX for now.
> > > > > > >
> > > > > >
>