Re: [PATCH 1/1] block: Check the queue limit before bio submitting

From: Ed Tsai (蔡宗軒)
Date: Sun Nov 05 2023 - 20:40:25 EST


On Mon, 2023-11-06 at 09:33 +0800, Ed Tsai wrote:
> On Sat, 2023-11-04 at 11:43 +0800, Ming Lei wrote:
> > On Sat, Nov 04, 2023 at 01:11:02AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > On Sat, 2023-11-04 at 00:20 +0800, Ming Lei wrote:
> > > > On Wed, Nov 01, 2023 at 02:23:26AM +0000, Ed Tsai (蔡宗軒) wrote:
> > > > > On Wed, 2023-10-25 at 17:22 +0800, ed.tsai@xxxxxxxxxxxx
> > > > > wrote:
> > > > > > From: Ed Tsai <ed.tsai@xxxxxxxxxxxx>
> > > > > >
> > > > > > Referring to commit 07173c3ec276 ("block: enable multipage
> > > >
> > > > bvecs"),
> > > > > > each bio_vec now holds more than one page, potentially
> >
> > exceeding
> > > > > > 1MB in size and causing alignment issues with the queue
> >
> > limit.
> > > > > >
> > > > > > In a sequential read/write scenario, the file system
> >
> > maximizes
> > > > the
> > > > > > bio's capacity before submitting. However, misalignment
> > > > > > with
> >
> > the
> > > > > > queue limit can result in the bio being split into smaller
> >
> > I/O
> > > > > > operations.
> > > > > >
> > > > > > For instance, assuming the maximum I/O size is set to 512KB
> >
> > and
> > > > the
> > > > > > memory is highly fragmented, resulting in each bio
> > > > > > containing
> > > >
> > > > only
> > > > > > one 2-pages bio_vec (i.e., bi_size = 1028KB). This would
> >
> > cause
> > > > the
> > > > > > bio to be split into two 512KB portions and one 4KB
> > > > > > portion.
> >
> > As a
> > > > > > result, the originally expected continuous large I/O
> >
> > operations
> > > > are
> > > > > > interspersed with many small I/O operations.
> > > > > >
> > > > > > To address this issue, this patch adds a check for the
> > > >
> > > > max_sectors
> > > > > > before submitting the bio. This allows the upper layers to
> > > > > > proactively
> > > > > > detect and handle alignment issues.
> > > > > >
> > > > > > I performed the Antutu V10 Storage Test on a UFS 4.0
> > > > > > device,
> > > >
> > > > which
> > > > > > resulted in a significant improvement in the Sequential
> > > > > > test:
> > > > > >
> > > > > > Sequential Read (average of 5 rounds):
> > > > > > Original: 3033.7 MB/sec
> > > > > > Patched: 3520.9 MB/sec
> > > > > >
> > > > > > Sequential Write (average of 5 rounds):
> > > > > > Original: 2225.4 MB/sec
> > > > > > Patched: 2800.3 MB/sec
> > > > > >
> > > > > > Signed-off-by: Ed Tsai <ed.tsai@xxxxxxxxxxxx>
> > > > > > ---
> > > > > > block/bio.c | 6 ++++++
> > > > > > 1 file changed, 6 insertions(+)
> > > > > >
> > > > > > diff --git a/block/bio.c b/block/bio.c
> > > > > > index 816d412c06e9..a4a1f775b9ea 100644
> > > > > > --- a/block/bio.c
> > > > > > +++ b/block/bio.c
> > > > > > @@ -1227,6 +1227,7 @@ static int
> >
> > __bio_iov_iter_get_pages(struct
> > > > bio
> > > > > > *bio, struct iov_iter *iter)
> > > > > > iov_iter_extraction_t extraction_flags = 0;
> > > > > > unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> > > > > > unsigned short entries_left = bio->bi_max_vecs - bio-
> > >
> > > bi_vcnt;
> > > > > > +struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)-
> > > > > > > limits;
> > > > > >
> > > > > > struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> > > > > > struct page **pages = (struct page **)bv;
> > > > > > ssize_t size, left;
> > > > > > @@ -1275,6 +1276,11 @@ static int
> >
> > __bio_iov_iter_get_pages(struct
> > > > bio
> > > > > > *bio, struct iov_iter *iter)
> > > > > > struct page *page = pages[i];
> > > > > >
> > > > > > len = min_t(size_t, PAGE_SIZE - offset, left);
> > > > > > +if (bio->bi_iter.bi_size + len >
> > > > > > + lim->max_sectors << SECTOR_SHIFT) {
> > > > > > +ret = left;
> > > > > > +break;
> > > > > > +}
> > > > > > if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> > > > > > ret = bio_iov_add_zone_append_page(bio, page,
> > > > > > len,
> > > > > > offset);
> > > > > > --
> > > > > > 2.18.0
> > > > > >
> > > > >
> > > > >
> > > > > Hi Jens,
> > > > >
> > > > > Just to clarify any potential confusion, I would like to
> >
> > provide
> > > > > further details based on the assumed scenario mentioned
> > > > > above.
> > > > >
> > > > > When the upper layer continuously sends 1028KB full-sized
> > > > > bios
> >
> > for
> > > > > sequential reads, the Block Layer sees the following
> > > > > sequence:
> > > > > submit bio: size = 1028KB, start LBA = n
> > > > > submit bio: size = 1028KB, start LBA = n + 1028KB
> > > > > submit bio: size = 1028KB, start LBA = n + 2056KB
> > > > > ...
> > > > >
> > > > > However, due to the queue limit restricting the I/O size to a
> > > >
> > > > maximum
> > > > > of 512KB, the Block Layer splits into the following sequence:
> > > > > submit bio: size = 512KB, start LBA = n
> > > > > submit bio: size = 512KB, start LBA = n + 512KB
> > > > > submit bio: size = 4KB, start LBA = n + 1024KB
> > > > > submit bio: size = 512KB, start LBA = n + 1028KB
> > > > > submit bio: size = 512KB, start LBA = n + 1540KB
> > > > > submit bio: size = 4KB, start LBA = n + 2052KB
> > > > > submit bio: size = 512KB, start LBA = n + 2056KB
> > > > > submit bio: size = 512KB, start LBA = n + 2568KB
> > > > > submit bio: size = 4KB, start LBA = n + 3080KB
> > > > > ...
> > > > >
> > > > > The original expectation was for the storage to receive
> > > > > large,
> > > > > contiguous requests. However, due to non-alignment, many
> > > > > small
> >
> > I/O
> > > > > requests are generated. This problem is easily visible
> > > > > because
> >
> > the
> > > > > user pages passed in are often allocated by the buddy system
> > > > > as
> > > >
> > > > order 0
> > > > > pages during page faults, resulting in highly non-contiguous
> > > >
> > > > memory.
> > > >
> > > > If order 0 page is added to bio, the multipage bvec becomes nop
> > > > basically(256bvec holds 256 pages), then how can it make a
> >
> > difference
> > > > for you?
> > >
> > >
> > >
> > > >
> > > > >
> > > > > As observed in the Antutu Sequential Read test below, it is
> >
> > similar
> > > > to
> > > > > the description above where the splitting caused by the queue
> >
> > limit
> > > > > leaves small requests sandwiched in between:
> > > > >
> > > > > block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
> > > > > block_split: 8,32 R 86925864 / 86926888 [Thread-51]
> > > > > block_split: 8,32 R 86926888 / 86927912 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
> > > > > block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
> > > > > block_split: 8,32 R 86928008 / 86929032 [Thread-51]
> > > > > block_split: 8,32 R 86929032 / 86930056 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
> > > > > block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
> > > > > block_split: 8,32 R 86930152 / 86931176 [Thread-51]
> > > > > block_split: 8,32 R 86931176 / 86932200 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
> > > > > block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
> > > > > block_split: 8,32 R 86932264 / 86933288 [Thread-51]
> > > > > block_split: 8,32 R 86933288 / 86934312 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
> > > > > block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
> > > > > block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
> > > > >
> > > > > I simply prevents non-aligned situations in
> >
> > bio_iov_iter_get_pages.
> > > >
> > > > But there is still 4KB IO left if you limit max bio size is
> >
> > 512KB,
> > > > then how does this 4KB IO finally go in case of 1028KB IO?
> > > >
> > > > > Besides making the upper layer application aware of the queue
> > > >
> > > > limit, I
> > > > > would appreciate any other directions or suggestions you may
> >
> > have.
> > > >
> > > > The problem is related with IO size from application.
> > > >
> > > > If you send unaligned IO, you can't avoid the last IO with
> > > > small
> > > > size, no
> > > > matter if block layer bio split is involved or not. Your patch
> >
> > just
> > > > lets
> > > > __bio_iov_iter_get_pages split the bio, and you still have 4KB
> >
> > left
> > > > finally when application submits 1028KB, right?
> > > >
> > > > Then I don't understand why your patch improves sequential IO
> > > > performance.
> > > >
> > > > Thanks,
> > > > Ming
> > > >
> > >
> > > The application performs I/O with a sufficitenly large I/O size,
> > > causing it to constantly fill up and submit full bios. However,
> > > in
> >
> > the
> > > iomap direct I/O scenario, pages are added to the bio one by one
> >
> > from
> > > the user buffer. This typically triggers page faults, resulting
> > > in
> >
> > the
> > > allocation of order 0 pages from the buddy system.
> > >
> > > The remaining amount of each order in the buddy system varies
> > > over
> > > time. If there are not enough pages available in a particular
> >
> > order,
> > > pages are split from higher orders. When pages are obtained from
> >
> > the
> > > higher order, the user buffer may contain some small consecutive
> > > patterns.
> > >
> > > In summary, the physical layout of the user buffer is
> >
> > unpredictable,
> > > and when it contains some small consecutive patterns, the size of
> >
> > the
> > > bio becomes randomly unaligned during filling.
> >
> > Yes, multipage bvec depends on physical layout of user buffer, but
> > it
> > doesn't affect bio size, which is decided by userspace, and the
> > bvec
> > page layout doesn't affect IO pattern.
> >
>
> This will result in variable sizes of full bios when filling them, as
> the size of the bio_vec depends on the physical layout of the user
> buffer.
>
> > If userspace submits 1028K IO, the IO is split into 512K, 512K and
> > 4K,
> > no matter if each bvec includes how many pages.
> >
>
> Sorry for the confusion caused by my description of 1028KB. It was
> meant to illustrate the scenario of submitting an unaligned bio. In
> my
> opinion, the least ideal size for a full bio due to physical layout
> would be 1028KB, as it would result in an I/O with only 4KB.
>
>
> > If userspace submits very large IO, such as 512M, which will be
> > split
> > into 1K bios with 512K size.
> >
>
> No. The block layer cannot receive a 512MB bio to split into 512K
> size.
> It will encounter full bios of various size, because the size of each
> bio_vec is based on the physical layout.
>
>
> > You patch doesn't makes any difference actually from block layer
> > viewpoint, such as:
> >
> > 1) dd if=/dev/ublkb0 of=/dev/null bs=1028k count=1 iflag=direct
> >
> > 2) observe IO pattern by the following bpftrace:
> >
> > kprobe:blk_mq_start_request
> > {
> > $rq = (struct request *)arg0;
> >
> > printf("%lu %16s %d %d: %s %s bio %lx-%lu\n",
> > nsecs / 1000, comm, pid, cpu, ksym(reg("ip")),
> > $rq->q->disk->disk_name, $rq->__sector, $rq->__data_len);
> > }
> >
> > 3) no matter if your patch is applied or not, the following pattern
> > is always observed:
> >
> > 117828811 dd 1475 0: blk_mq_start_request ublkb0 bio
> > 0-
> > 524288
> > 117828823 dd 1475 0: blk_mq_start_request ublkb0 bio
> > 400-524288
> > 117828825 dd 1475 0: blk_mq_start_request ublkb0 bio
> > 800-4096
> >
> > Similar pattern can be observed when running dd inside FS(xfs).
> >
> > >
> > > This patch limits the bio to be filled up to the max_sectors. The
> > > submission is an async operation, so once the bio is queued, it
> >
> > will
> > > immediately return and continue filled and submit the next bio.
> >
> > bio split doesn't change this behavior too, the difference is just
> > how
> > many bios the upper layer(iomap) observed.
> >
> > Your patch just moves the split into upper layer, and iomap can see
> > 3 bios with your patch when `dd bs=1028K`, and in vanilla tree,
> > iomap
> > just submits single bio with 1028K, block layer splits it into
> > 512k, 512k, and 4k. So finally UFS driver observes same IO pattern.
> >
> > In short, please root cause why your patch improves performance, or
> > please share your workloads, so we can observe the IO pattern and
> > related mm/fs behavior and try to root cause it.
> >
> >
> > Thanks,
> > Ming
> >
>
> Antutu Sequential Read performs 72 reads of 64MB using aio. I
> simulated
> a 64MB read using dd and observed that the actual bio queue sizes
> were
> slightly larger than 1MB. Not a single bio with 64MB, but rather
> multiple various size of bios.
>
> dd-5970[001] ..... 758.485446: block_bio_queue: 254,52 R 2933336 +
> 2136
> dd-5970[001] ..... 758.487145: block_bio_queue: 254,52 R 2935472 +
> 2152
> dd-5970[001] ..... 758.488822: block_bio_queue: 254,52 R 2937624 +
> 2128
> dd-5970[001] ..... 758.490478: block_bio_queue: 254,52 R 2939752 +
> 2160
> dd-5970[001] ..... 758.492154: block_bio_queue: 254,52 R 2941912 +
> 2096
> dd-5970[001] ..... 758.493874: block_bio_queue: 254,52 R 2944008 +
> 2160
>

Sorry for missing out on my dd command. Here it is:
dd if=/data/test_file of=/dev/null bs=64m count=1 iflag=direct

Best,
Ed