Re: [PATCH v2] btrfs: fix rw device counting in __btrfs_free_extra_devids

From: David Sterba
Date: Fri Aug 13 2021 - 06:33:34 EST


On Fri, Aug 13, 2021 at 05:57:26PM +0800, Desmond Cheong Zhi Xi wrote:
> On 13/8/21 4:51 pm, David Sterba wrote:
> > On Fri, Aug 13, 2021 at 01:31:25AM +0800, Desmond Cheong Zhi Xi wrote:
> >> On 12/8/21 11:50 pm, David Sterba wrote:
> >>> On Thu, Aug 12, 2021 at 11:43:16PM +0800, Desmond Cheong Zhi Xi wrote:
> >>>> On 12/8/21 6:38 pm, David Sterba wrote:
> >>>>> On Tue, Jul 27, 2021 at 03:13:03PM +0800, Desmond Cheong Zhi Xi wrote:
> >>>>>> --- a/fs/btrfs/volumes.c
> >>>>>> +++ b/fs/btrfs/volumes.c
> >>>>>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
> >>>>>> if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
> >>>>>> list_del_init(&device->dev_alloc_list);
> >>>>>> clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> >>>>>> + fs_devices->rw_devices--;
> >>>>>> }
> >>>>>> list_del_init(&device->dev_list);
> >>>>>> fs_devices->num_devices--;
> >>>>>
> >>>>> I've hit a crash on master branch with stacktrace very similar to one
> >>>>> this bug was supposed to fix. It's a failed assertion on device close.
> >>>>> This patch was the last one to touch it and it matches some of the
> >>>>> keywords, namely the BTRFS_DEV_STATE_REPLACE_TGT bit that used to be in
> >>>>> the original patch but was not reinstated in your fix.
> >>>>>
> >>>>> I'm not sure how reproducible it is, right now I have only one instance
> >>>>> and am hunting another strange problem. They could be related.
> >>>>>
> >>>>> assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
> >>>>>
> >>>>> https://susepaste.org/view/raw/18223056 full log with other stacktraces,
> >>>>> possibly relatedg
> >>>>>
> >>>>
> >>>> Looking at the logs, it seems that a dev_replace was started, then
> >>>> suspended. But it wasn't canceled or resumed before the fs devices were
> >>>> closed.
> >>>>
> >>>> I'll investigate further, just throwing some observations out there.
> >>>
> >>> Thanks. I'm testing the patch revert, no crash after first loop, I'll
> >>> run a few more to be sure as it's not entirely reliable.
> >>>
> >>> Sending the revert is option of last resort as we're approaching end of
> >>> 5.14 dev cycle and the crash prevents testing (unlike the fuzzer
> >>> warning).
> >>>
> >>
> >> I might be missing something, so any thoughts would be appreciated. But
> >> I don't think the assertion in btrfs_close_one_device is correct.
> >>
> >> From what I see, this crash happens when close_ctree is called while a
> >> dev_replace hasn't completed. In close_ctree, we suspend the
> >> dev_replace, but keep the replace target around so that we can resume
> >> the dev_replace procedure when we mount the root again. This is the call
> >> trace:
> >>
> >> close_ctree():
> >> btrfs_dev_replace_suspend_for_unmount();
> >> btrfs_close_devices():
> >> btrfs_close_fs_devices():
> >> btrfs_close_one_device():
> >> ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> >> &device->dev_state));
> >>
> >> However, since the replace target sticks around, there is a device with
> >> BTRFS_DEV_STATE_REPLACE_TGT set, and we fail the assertion in
> >> btrfs_close_one_device.
> >>
> >> Two options I can think of:
> >>
> >> - We could remove the assertion.
> >>
> >> - Or we could clear the BTRFS_DEV_STATE_REPLACE_TGT bit in
> >> btrfs_dev_replace_suspend_for_unmount. This is fine since the bit is set
> >> again in btrfs_init_dev_replace if the dev_replace->replace_state is
> >> BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED. But this approach strikes me as
> >> a little odd because the device is still the replace target when
> >> mounting in the future.
> >
> > The option #2 does not sound safe because the TGT bit is checked in
> > several places where device list is queried for various reasons, even
> > without a mounted filesystem.
> >
> > Removing the assertion makes more sense but I'm still not convinced that
> > the this is expected/allowed state of a closed device.
> >
>
> Would it be better if we cleared the REPLACE_TGT bit only when closing
> the device where device->devid == BTRFS_DEV_REPLACE_DEVID?
>
> The first conditional in btrfs_close_one_device assumes that we can come
> across such a device. If we come across it, we should properly reset it.
>
> If other devices has this bit set, the ASSERT will still catch it and
> let us know something is wrong.

That sounds great.

> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 70f94b75f25a..a5afebb78ecf 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1130,6 +1130,9 @@ static void btrfs_close_one_device(struct btrfs_device *device)
> fs_devices->rw_devices--;
> }
>
> + if (device->devid == BTRFS_DEV_REPLACE_DEVID)
> + clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
> +
> if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
> fs_devices->missing_devices--;

I'll do a few test rounds, thanks.