Re: [PATCH 2/2] ext4: add inode table check in __ext4_get_inode_loc to avoid possible infinite loop

From: Jan Kara
Date: Fri Aug 19 2022 - 04:44:19 EST


On Fri 19-08-22 04:45:41, Ritesh Harjani (IBM) wrote:
> On 22/08/18 07:23PM, Jan Kara wrote:
> > On Thu 18-08-22 20:13:53, Ritesh Harjani (IBM) wrote:
> > > On 22/08/17 04:31PM, Jan Kara wrote:
> > > > On Wed 17-08-22 21:27:01, Baokun Li wrote:
> > > > > In do_writepages, if ext4_writepages returns -ENOMEM and
> > > > > wbc->sync_mode == WB_SYNC_ALL, the call is retried until that condition
> > > > > no longer holds.
> > > > >
> > > > > In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
> > > > > the function returns -ENOMEM.
> > > > >
> > > > > In __getblk_slow, if the return value of grow_buffers is less than 0,
> > > > > the function returns NULL.
> > > > >
> > > > > When these three behaviours are chained together, as in the following
> > > > > call stack, an infinite loop may occur:
> > > > >
> > > > > do_writepages <--- keep retrying
> > > > > ext4_writepages
> > > > > mpage_map_and_submit_extent
> > > > > mpage_map_one_extent
> > > > > ext4_map_blocks
> > > > > ext4_ext_map_blocks
> > > > > ext4_ext_handle_unwritten_extents
> > > > > ext4_ext_convert_to_initialized
> > > > > ext4_split_extent
> > > > > ext4_split_extent_at
> > > > > __ext4_ext_dirty
> > > > > __ext4_mark_inode_dirty
> > > > > ext4_reserve_inode_write
> > > > > ext4_get_inode_loc
> > > > > __ext4_get_inode_loc <--- return -ENOMEM
> > > > > sb_getblk
> > > > > __getblk_gfp
> > > > > __getblk_slow <--- return NULL
> > > > > grow_buffers
> > > > > grow_dev_page <--- return -ENXIO
> > > > > ret = (block < end_block) ? 1 : -ENXIO;
> > > > >
> > > > > In this case, bg_inode_table_hi has been overwritten with an incorrect
> > > > > value. As a result, `block < end_block` is never satisfied in
> > > > > grow_dev_page, so __ext4_get_inode_loc always returns -ENOMEM and
> > > > > do_writepages keeps retrying. The writeback process therefore sits in
> > > > > the D state, stuck in an infinite loop.
> > > > >
> > > > > Add a check on the inode table block in __ext4_get_inode_loc, modelled
> > > > > on ext4_read_inode_bitmap, to avoid this infinite loop.
> > > > >
> > > > > Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
> > > >
> > > > Thanks for the fixes. Normally, we check that the inode table is fine in
> > > > ext4_check_descriptors() (and those checks are much stricter), so it seems
> > > > unnecessary to check it again here. I understand that in your case it was
> > > > resize that corrupted the group descriptor after the filesystem was mounted,
> > > > which is nasty, but there is much more metadata that can be corrupted like
> > > > this and it is infeasible to check each metadata block before we use it.
> > > >
> > > > IMHO a proper fix to this class of issues would be for sb_getblk() to
> > > > return proper error so that we can distinguish ENOMEM from other errors.
> > > > But that will be a larger undertaking...
> > > >
> > >
> > > Hi Jan,
> > >
> > > How about adding a wrapper around sb_getblk() which does the basic block
> > > bound checks for ext4? Then we can carefully convert all the callers of
> > > sb_getblk() in ext4 to call ext4_sb_getblk().
> > >
> > > ext4_sb_getblk() will then return one of the following -
> > > 1. ERR_PTR(-EFSCORRUPTED)
> > > 2. NULL
> > > 3. struct buffer_head*
> > >
> > > Its callers can then implement the proper error handling.
> > >
> > > Folding a small patch to implement the simple bound check. Is this the right
> > > approach?
> >
> > Yep, looks sensible to me. Maybe I'd just make ext4_sb_getblk() return either
> > a bh or an ERR_PTR, i.e. something like ERR_PTR(-EFSCORRUPTED),
> > ERR_PTR(-ENXIO), or a valid bh pointer.
>
> Sure, thanks Jan. Will do that once I clear up some confusion w.r.t.
> "start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)"
>
> In some places this is checked as "<= s_first_data_block",
> e.g. fs/ext4/ialloc.c, ext4_sb_block_valid(),
>
> while in other places I see it checked as "< s_first_data_block",
> e.g. fs/ext4/mballoc.c, fs/ext4/mmp.c

Well, the superblock is stored at offset s_first_data_block. So strictly
speaking the check should be < s_first_data_block, because that block is a
valid filesystem block. OTOH in most places you are not supposed to look at
the block holding the superblock, so the stricter <= s_first_data_block is fine.
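
To make this concrete, here is a rough and completely untested sketch of what
I'd have in mind for the wrapper with the stricter bound (the name
ext4_sb_getblk() is just the one proposed above, and the NULL handling is
deliberately left open):

static struct buffer_head *ext4_sb_getblk(struct super_block *sb,
					   sector_t block)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	/*
	 * Metadata read through this helper never lives in the block
	 * holding the superblock, so the stricter <= check is fine here.
	 */
	if (block <= le32_to_cpu(sbi->s_es->s_first_data_block) ||
	    block >= ext4_blocks_count(sbi->s_es))
		return ERR_PTR(-EFSCORRUPTED);

	/*
	 * Whether a NULL bh from sb_getblk() should then be reported as
	 * ENOMEM or ENXIO is exactly the part the callers still need to
	 * sort out.
	 */
	return sb_getblk(sb, block);
}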

> Will spend some time understanding why the difference exists and whether there
> is anything I might be missing here w.r.t. an off-by-one in the check.
>
> Adding more to the confusion is the difference between blocksize = 1024 and
> other blocksizes. Based on the blocksize value, I guess s_first_data_block can
> be different (0/1??). Or can bigalloc change this...
> ...Will look more into this.

That's what we discussed on our ext4 call yesterday. Normally,
s_first_data_block is 1 for a 1k blocksize and 0 for blocksize > 1k. So for a
1k blocksize the first group begins with block 1 and not block 0.
Effectively the whole filesystem is shifted by 1 block with a 1k blocksize.
When bigalloc comes into play and the blocksize is 1k, things are even more
interesting, because there the first group starts at block 0 while
s_first_data_block is still 1, since the superblock is stored in block 1 of
cluster 0.
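
To summarize the cases above (where group 0 starts vs. s_first_data_block):

	blocksize	bigalloc	s_first_data_block	group 0 starts at
	1k		no		1			block 1
	> 1k		no		0			block 0
	1k		yes		1			block 0 (sb in block 1 of cluster 0)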

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR