Re: [PATCH] ext4_sb_breadahead_unmovable: Change to be non-blocking

From: Hui Zhu
Date: Fri Aug 11 2023 - 03:22:30 EST


Oops, this is version 2 of the patch.

Best,
Hui

Hui Zhu <teawaterz@xxxxxxxxxxxxxxxxx> wrote on Fri, Aug 11, 2023 at 15:15:
>
> From: Hui Zhu <teawater@xxxxxxxxxxxx>
>
> This version fixes the gfp flags in the callers instead of threading a
> new "bool" flag through the buffer head layers, per the comments from
> Matthew Wilcox.
>
> We encountered an issue where a large number of filesystem reads and
> writes suddenly occurred within a container. At the same time, other
> tasks on the same host that were performing filesystem reads and writes
> became blocked. Many of the blocked tasks were waiting on the ext4
> journal lock. For example:
> PID: 171453 TASK: ffff926566c9440 CPU: 54 COMMAND: "Thread"
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] schedule at xxxxxxxxxxxxxxx
> 2 [] wait_transaction_locked at xxxxxxxxxxxxxxx
> 3 [] add_transaction_credits at xxxxxxxxxxxxxxx
> 4 [] start_this_handle at xxxxxxxxxxxxxxx
> 5 [] jbd2__journal_start at xxxxxxxxxxxxxxx
> 6 [] ext4_journal_start_sb at xxxxxxxxxxxxxxx
> 7 [] ext4_dirty_inode at xxxxxxxxxxxxxxx
> 8 [] mark_inode_dirty at xxxxxxxxxxxxxxx
> 9 [] generic_update_time at xxxxxxxxxxxxxxx
>
> Meanwhile, it was observed that the task holding the ext4 journal lock
> was blocked for an extended period in "shrink_page_list", reached from
> "ext4_sb_breadahead_unmovable":
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] _cond_resched at xxxxxxxxxxxxxxx
> 2 [] shrink_page_list at xxxxxxxxxxxxxxx
> 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
> 4 [] shrink_lruvec at xxxxxxxxxxxxxxx
> 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
> 6 [] shrink_node at xxxxxxxxxxxxxxx
> 7 [] shrink_zones at xxxxxxxxxxxxxxx
> 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
> 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
> 10 [] try_charge at xxxxxxxxxxxxxxx
> 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
> 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
> 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
> 14 [] pagecache_get_page at xxxxxxxxxxxxxxx
> 15 [] grow_dev_page at xxxxxxxxxxxxxxx
> 16 [] __getblk_slow at xxxxxxxxxxxxxxx
> 17 [] ext4_sb_breadahead_unmovable at xxxxxxxxxxxxxxx
> 18 [] __ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 19 [] ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 20 [] ext4_reserve_inode_write at xxxxxxxxxxxxxxx
> 21 [] __ext4_mark_inode_dirty at xxxxxxxxxxxxxxx
> 22 [] add_dirent_to_buf at xxxxxxxxxxxxxxx
> 23 [] ext4_add_entry at xxxxxxxxxxxxxxx
> 24 [] ext4_add_nondir at xxxxxxxxxxxxxxx
> 25 [] ext4_create at xxxxxxxxxxxxxxx
> 26 [] vfs_create at xxxxxxxxxxxxxxx
>
> The function "grow_dev_page" adds "__GFP_NOFAIL" to the gfp mask, which
> can cause long blocking times:
> /*
> * XXX: __getblk_slow() can not really deal with failure and
> * will endlessly loop on improvised global reclaim. Prefer
> * looping in the allocator rather than here, at least that
> * code knows what it's doing.
> */
> gfp_mask |= __GFP_NOFAIL;
> However, "ext4_sb_breadahead_unmovable" is a prefetch function and
> failures are acceptable.
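>
> As an illustration only (not part of this patch), a minimal sketch of
> the idea: clearing __GFP_DIRECT_RECLAIM from a gfp mask lets the page
> allocation fail fast instead of entering reclaim, whereas __GFP_NOFAIL
> makes it retry until it succeeds. The helper name
> "example_readahead_gfp" below is hypothetical.
>
> /* Hypothetical helper: derive a non-blocking gfp mask for readahead. */
> static inline gfp_t example_readahead_gfp(gfp_t base)
> {
> 	/*
> 	 * Without __GFP_DIRECT_RECLAIM the allocator will not block in
> 	 * reclaim, and __GFP_NOWARN suppresses the allocation-failure
> 	 * warning, since a failed prefetch is harmless.
> 	 */
> 	return (base | __GFP_NOWARN) & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> }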
>
> Therefore, this commit changes "ext4_sb_breadahead_unmovable" to be
> non-blocking:
> Pass ~__GFP_DIRECT_RECLAIM as the gfp argument when
> ext4_sb_breadahead_unmovable calls sb_getblk_gfp.
> Modify grow_dev_page so that it does not block on the folio allocation
> when gfp is ~__GFP_DIRECT_RECLAIM.
>
> Signed-off-by: Hui Zhu <teawater@xxxxxxxxxxxx>
> ---
> fs/buffer.c | 27 +++++++++++++++++++--------
> fs/ext4/super.c | 3 ++-
> 2 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index bd091329026c..330cf19c77b1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1038,6 +1038,7 @@ static sector_t folio_init_buffers(struct folio *folio,
> * Create the page-cache page that contains the requested block.
> *
> * This is used purely for blockdev mappings.
> + * Will not block on the folio allocation if gfp is ~__GFP_DIRECT_RECLAIM.
> */
> static int
> grow_dev_page(struct block_device *bdev, sector_t block,
> @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block,
> int ret = 0;
> gfp_t gfp_mask;
>
> - gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
> + gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
> + if (gfp == ~__GFP_DIRECT_RECLAIM)
> + gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> + else {
> + gfp_mask |= gfp;
>
> - /*
> - * XXX: __getblk_slow() can not really deal with failure and
> - * will endlessly loop on improvised global reclaim. Prefer
> - * looping in the allocator rather than here, at least that
> - * code knows what it's doing.
> - */
> - gfp_mask |= __GFP_NOFAIL;
> + /*
> + * XXX: __getblk_slow() can not really deal with failure and
> + * will endlessly loop on improvised global reclaim. Prefer
> + * looping in the allocator rather than here, at least that
> + * code knows what it's doing.
> + */
> + gfp_mask |= __GFP_NOFAIL;
> + }
>
> folio = __filemap_get_folio(inode->i_mapping, index,
> FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
> + if (IS_ERR(folio)) {
> + ret = PTR_ERR(folio);
> + goto out;
> + }
>
> bh = folio_buffers(folio);
> if (bh) {
> @@ -1091,6 +1101,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
> failed:
> folio_unlock(folio);
> folio_put(folio);
> +out:
> return ret;
> }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c94ebf704616..6a529509b83b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -254,7 +254,8 @@ struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
>
> void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
> {
> - struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
> + struct buffer_head *bh = sb_getblk_gfp(sb, block,
> + ~__GFP_DIRECT_RECLAIM);
>
> if (likely(bh)) {
> if (trylock_buffer(bh))
> --
> 2.19.1.6.gb485710b
>