Re: [PATCH v2] btrfs: fix rw device counting in __btrfs_free_extra_devids

From: Desmond Cheong Zhi Xi
Date: Thu Aug 12 2021 - 13:31:34 EST


On 12/8/21 11:50 pm, David Sterba wrote:
On Thu, Aug 12, 2021 at 11:43:16PM +0800, Desmond Cheong Zhi Xi wrote:
On 12/8/21 6:38 pm, David Sterba wrote:
On Tue, Jul 27, 2021 at 03:13:03PM +0800, Desmond Cheong Zhi Xi wrote:
When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.

This error was caught by Syzbot which reported a warning in
close_fs_devices because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:

btrfs_mount_root():
  btrfs_scan_one_device():
    device_list_add();                <---------------- device added

  btrfs_open_devices():
    open_fs_devices():
      btrfs_open_one_device();        <-------- writable device opened,
                                                rw device count ++

  btrfs_fill_super():
    open_ctree():
      btrfs_free_extra_devids():
        __btrfs_free_extra_devids();  <--- writable device removed,
                                           rw device count not decremented

      fail_tree_roots:
        btrfs_close_devices():
          close_fs_devices();         <------- rw device count off by 1
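
For reference, the warning syzbot hit comes from the sanity check at the end of
close_fs_devices(), which expects the rw device count to have dropped to zero
once all devices are closed; paraphrased as a sketch rather than quoted verbatim:

	/* sketch: close_fs_devices() complains if the rw count did not reach 0 */
	WARN_ON(fs_devices->rw_devices);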

As a note, prior to commit cf89af146b7e ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.

In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.
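
Paraphrasing the relevant loop in __btrfs_free_extra_devids (a sketch of the
flow described above, not a verbatim copy of volumes.c):

	list_for_each_entry_safe(device, next, &fs_devices->devices, dev_list) {
		...
		/*
		 * The replace target is the only device with devid
		 * BTRFS_DEV_REPLACE_DEVID, so it is skipped here and no
		 * device further down can have BTRFS_DEV_STATE_REPLACE_TGT
		 * set.
		 */
		if (device->devid == BTRFS_DEV_REPLACE_DEVID)
			continue;

		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
			list_del_init(&device->dev_alloc_list);
			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
			fs_devices->rw_devices--;	/* the fix below */
		}
		...
	}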

Additionally, following commit 82372bc816d7 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.
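
The add side pairs the increment with putting the device on the alloc list,
roughly (sketch, not verbatim, e.g. in btrfs_open_one_device or
btrfs_dev_replace_finishing):

	list_add(&device->dev_alloc_list, &fs_devices->alloc_list);
	fs_devices->rw_devices++;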

Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
Reported-by: syzbot+a70e2ad0879f160b9217@xxxxxxxxxxxxxxxxxxxxxxxxx
Tested-by: syzbot+a70e2ad0879f160b9217@xxxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@xxxxxxxxx>
Reviewed-by: Anand Jain <anand.jain@xxxxxxxxxx>
---
fs/btrfs/volumes.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 807502cd6510..916c25371658 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
 		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 			list_del_init(&device->dev_alloc_list);
 			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+			fs_devices->rw_devices--;
 		}
 		list_del_init(&device->dev_list);
 		fs_devices->num_devices--;

I've hit a crash on the master branch with a stacktrace very similar to the
one this bug was supposed to fix. It's a failed assertion on device close.
This patch was the last one to touch it and it matches some of the
keywords, namely the BTRFS_DEV_STATE_REPLACE_TGT bit that used to be in
the original patch but was not reinstated in your fix.

I'm not sure how reproducible it is, right now I have only one instance
and am hunting another strange problem. They could be related.

assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150

https://susepaste.org/view/raw/18223056 full log with other stacktraces,
possibly related


Looking at the logs, it seems that a dev_replace was started, then
suspended. But it wasn't canceled or resumed before the fs devices were
closed.

I'll investigate further, just throwing some observations out there.

Thanks. I'm testing the patch revert; no crash after the first loop. I'll
run a few more to be sure as it's not entirely reliable.

Sending the revert is an option of last resort as we're approaching the end of
the 5.14 dev cycle and the crash prevents testing (unlike the fuzzer
warning).


I might be missing something, so any thoughts would be appreciated. But I don't think the assertion in btrfs_close_one_device is correct.

From what I see, this crash happens when close_ctree is called while a dev_replace hasn't completed. In close_ctree, we suspend the dev_replace, but keep the replace target around so that we can resume the dev_replace procedure when we mount the root again. This is the call trace:

close_ctree():
  btrfs_dev_replace_suspend_for_unmount();
  btrfs_close_devices():
    btrfs_close_fs_devices():
      btrfs_close_one_device():
        ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state));

However, since the replace target sticks around, there is a device with BTRFS_DEV_STATE_REPLACE_TGT set, and we fail the assertion in btrfs_close_one_device.

Two options I can think of:

- We could remove the assertion.

- Or we could clear the BTRFS_DEV_STATE_REPLACE_TGT bit in btrfs_dev_replace_suspend_for_unmount. This is fine since the bit is set again in btrfs_init_dev_replace if the dev_replace->replace_state is BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED. But this approach strikes me as a little odd because the device is still the replace target when mounting in the future.
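
A rough, untested sketch of the second option, paraphrasing only the
interesting part of btrfs_dev_replace_suspend_for_unmount (locking and the
other replace states omitted):

	/*
	 * Hypothetical: when suspending the replace for unmount, drop the
	 * target bit so btrfs_close_one_device() doesn't trip the ASSERT.
	 * btrfs_init_dev_replace() would set the bit again on the next mount
	 * when it sees BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED.
	 */
	if (dev_replace->tgtdev)
		clear_bit(BTRFS_DEV_STATE_REPLACE_TGT,
			  &dev_replace->tgtdev->dev_state);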

Thoughts?