Re: [syzbot] kernel BUG in assertfail

From: David Sterba
Date: Tue Jun 01 2021 - 13:30:38 EST


On Mon, May 31, 2021 at 12:27:05PM +0300, Nikolay Borisov wrote:
> On 31.05.21 г. 12:09, Dmitry Vyukov wrote:
> > On Mon, May 31, 2021 at 10:57 AM Nikolay Borisov <nborisov@xxxxxxxx> wrote:
> >> On 31.05.21 г. 11:55, Dmitry Vyukov wrote:
> >>> On Mon, May 31, 2021 at 10:44 AM 'Nikolay Borisov' via syzkaller-bugs
> >>> <syzkaller-bugs@xxxxxxxxxxxxxxxx> wrote:
> >>>> On 31.05.21 г. 10:53, syzbot wrote:
> >>>>> Hello,
> >>>>>
> >>>>> syzbot found the following issue on:
> >>>>>
> >>>>> HEAD commit: 1434a312 Merge branch 'for-5.13-fixes' of git://git.kernel..
> >>>>> git tree: upstream
> >>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=162843f3d00000
> >>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=9f3da44a01882e99
> >>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=a6bf271c02e4fe66b4e4
> >>>>>
> >>>>> Unfortunately, I don't have any reproducer for this issue yet.
> >>>>>
> >>>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> >>>>> Reported-by: syzbot+a6bf271c02e4fe66b4e4@xxxxxxxxxxxxxxxxxxxxxxxxx
> >>>>>
> >>>>> assertion failed: !memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE), in fs/btrfs/disk-io.c:3282
> >>>>
> >>>> This means a device contains a btrfs filesystem which has a different
> >>>> FSID in its superblock than the fsid which all devices part of the same
> >>>> fs_devices should have. This can happen in 2 ways - memory corruption
> >>>> where either of the ->fsid member are corrupted or if there was a crash
> >>>> while a filesystem's fsid was being changed. We need more context about
> >>>> what the test did?
> >>>
> >>> Hi Nikolay,
> >>>
> >>> From a semantic point of view we can consider that it just mounts /dev/random.
> >>> If syzbot comes up with a reproducer it will post it, but you seem to
> >>> already figure out what happened, so I assume you can write a unit
> >>> test for this.
> >>
> >> Well no, under normal circumstances this shouldn't trigger. So if syzbot
> >> is doing something stupid as mounting /dev/random then I don't see a
> >> problem here. The assert is there to catch inconsistencies during normal
> >> operation which doesn't seem to be the case here.
> >
> > Does this mean that CONFIG_BTRFS_ASSERT needs to be disabled in any testing?
> > What is it intended for? Or it can only be enabled when mounting known
> > good images? But then I assume even btrfs unit tests mount some
> > invalid images, so it would mean it can't be used even during unit
> > testing?
> >
> > Looking at the output of "grep ASSERT fs/btrfs/*.c" it looks like most
> > of these actually check for something that "must never happen". E.g.
> > some lists/pointers are empty/non-empty in particular states. And
> > "must never happen" checks are for testing scenarios...
> >
> > Taking this particular FSID mismatch assert, should such corrupted
> > images be mounted for end users? Should users be notified? Currently
> > they are mounted and users are not notified, what is the purpose of
> > this assertion?
> >
> > Perhaps CONFIG_BTRFS_ASSERT needs to be split into "must never happen"
> > checks that are enabled during testing and normal if's with pr_err for
> > user notifications?
>
> After going through the code you've convinced me. I just sent a patch
> turning the 2 debugging asserts into full-fledged checks in
> validate_super. So now the correct behavior is to prevent mounting of
> such images. How can I force syzbot to retest with the given patch applied?

Let me answer the points from the discussions:

- mounting /dev/random should never lead to a crash, this is effectively
the same as crafting data that would get past the sanity checks

- the behaviour regarding assertions should not affect testing, same
result with or without it

- input validation - for a filesystem everything that comes from the
disk is input, so what we want to do with the data is unconditional
runtime validation

- the 'must never happen' is two fold, depending what's the answer, but
we namely use it to verify invariants and catch developer errors, eg.
adding an object to list and then later assserting that it's there;
excluding the cosmic rays type of bugs, the remaining reason is that
the assert should be turned into a runtime check