Re: [PATCH] md/raid0: drop discard support past the first zone for the original layout (stable fix)

From: Song Liu
Date: Wed Jun 28 2023 - 18:56:25 EST


On Wed, Jun 28, 2023 at 3:43 PM Song Liu <song@xxxxxxxxxx> wrote:
>
> Hi Jason,
>
> Thanks for the fix!
>
> On Wed, Jun 28, 2023 at 7:32 AM Jason Baron <jbaron@xxxxxxxxxx> wrote:
> >
> > We've found that using raid0 with the 'original' layout and discard
> > enabled with different disk sizes (such that at least two zones are
> > created) can result in data corruption. This is due to the fact that
> > the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
> > layout. We've seen this corruption using ext4 but other filesystems are
> > likely susceptible as well.
> >
> > More specifically, while multiple zones are necessary to create the
> > corruption, the corruption may not occur with multiple zones if they
> > layout in such a way the layout matches what the 'alternate' layout
> > would have produced. Thus, not all raid0 devices with the 'original'
> > layout, different size disks and discard enabled will encounter this
> > corruption.
> >
> > The 3.14 kernel inadvertently changed the raid0 disk layout for different
> > size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
> > same raid0 array could corrupt data. This lead to the creation of the
> > 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
> > (to match the post 3.14 layout) in the 5.4 kernel time frame and an option
> > to tell the kernel which layout to use (since it couldn't be autodetected).
> > However, when the 'original' layout was added back to 5.4 discard support
> > for the 'original' layout was not added leading this issue.
> >
> > I've been able to reliably reproduce the corruption with the following
> > test case:
> >
> > 1. create raid0 array with different size disks using original layout
> > 2. mkfs
> > 3. mount -o discard
> > 4. create lots of files
> > 5. remove 1/2 the files
> > 6. fstrim -a (or just the mount point for the raid0 array)
> > 7. umount
> > 8. fsck -fn /dev/md0 (spews all sorts of corruptions)
> >
> > The fix here is a minimal fix intended for stable trees which doesn't do
> > discard if we have the original layout and we are not in first zone or
> > if the i/o crosses zones (we either do the entire discard or none of it).
> > The proper fix to actually perform discard to all zones for the original
> > layout will land in upstream versions. We have implemented the minimal
> > fix here for stable branches to reduce risk.
> >
> > I've verified the change using the reproducer mentioned above. Typically,
> > the corruption is seen after less than 3 iterations, while the patch has
> > run 500+ iterations.
> >
> > Cc: NeilBrown <neilb@xxxxxxx>
> > Cc: Song Liu <song@xxxxxxxxxx>
> > Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
> > Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx>
> > ---
> > drivers/md/raid0.c | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> > index f8ee9a95e25d..2713a4acb44f 100644
> > --- a/drivers/md/raid0.c
> > +++ b/drivers/md/raid0.c
> > @@ -444,10 +444,25 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
> > sector_t end_disk_offset;
> > unsigned int end_disk_index;
> > unsigned int disk;
> > + bool bio_zone_overlap = false;
> >
> > zone = find_zone(conf, &start);
> > + if (bio_end_sector(bio) > zone->zone_end)
> > + bio_zone_overlap = true;
> > +
> > + /* The discard code below doesn't properly support the original
> > + * layout for the zones above the first one. We are adding
> > + * proper support in later kernel versions but have decided
> > + * that dropping discard support here is the lower risk option
> > + * to avoid data corruption for stable versions.
> > + */
> > + if ((conf->layout == RAID0_ORIG_LAYOUT) &&
> > + ((zone != conf->strip_zone) || (bio_zone_overlap))) {
> > + bio_endio(bio);
> > + return;
> > + }
>
> For bio_zone_overlap case, I think we can still do the split below and
> send discard to the first zone?

Actually, on a second (third?) thought, I think we can just ship the other
version (20230623180523.1901230-1-jbaron@xxxxxxxxxx) to stable
kernel. It is very safe. So we don't need more work on this version.
Sorry for the confusion.

Thanks again for the fix!
Song