Re: Small writes being split with fdatasync based on non-aligned partition ending

From: Sitsofe Wheeler
Date: Wed Feb 10 2016 - 22:48:52 EST


Trying to cc the GNU parted and linux-block mailing lists.

On 9 February 2016 at 13:02, Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
> While trying to reproduce some performance issues I have been seeing
> with Ceph, I have come across a strange behaviour which is seemingly
> affected only by the end point (and thereby the size) of a partition
> being an odd number of sectors. Since all documentation about
> alignment only refers to the starting point of the partition, this was
> pretty surprising and I would like to know whether this is expected
> behaviour or maybe a kernel issue.
>
> The command I am using is pretty simple:
>
> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k --filename=/dev/sdb2 --runtime=10 --name=test
>
> The difference shows itself when the partition is created either by
> sgdisk or by parted:
>
> sgdisk --new=2:6000M: /dev/sdb
>
> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>
> The difference in the partition table looks like this:
>
> < 2 6291456000B 1600320962559B 1594029506560B osd-device-1-block
> ---
> > 2 6291456000B 1600321297919B 1594029841920B osd-device-1-block

Looks like parted took you at your word when you asked for your
partition to end at 100%. Just out of curiosity, if you make the same
partition interactively with parted, do you get any warnings after
making it and after running align-check?
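
As a sanity check on the two table entries quoted above, here is a
minimal Python sketch (my own, not something parted or sgdisk runs) that
works out each partition's size in sectors. It assumes 512-byte logical
sectors, which matches the 512b writes in the trace; the describe()
helper and its dictionary keys are made up for illustration.

```python
SECTOR = 512   # assumed logical sector size in bytes
BLOCK = 4096   # fio block size from --bs=4k

# (first byte, last byte) of partition 2, copied from the two table entries
layouts = {
    "ends at 1600320962559B": (6291456000, 1600320962559),
    "ends at 1600321297919B": (6291456000, 1600321297919),
}


def describe(start, last):
    """Return size and alignment facts for a partition given as byte offsets."""
    size = last - start + 1          # the end offset is inclusive
    sectors = size // SECTOR
    return {
        "size_bytes": size,
        "sectors": sectors,
        "odd_sector_count": sectors % 2 == 1,
        "start_4k_aligned": start % BLOCK == 0,
        "size_4k_multiple": size % BLOCK == 0,
    }


for name, (start, last) in layouts.items():
    print(name, describe(start, last))
```

Both layouts start on a 4k boundary. Only the one ending at
1600321297919B comes out with an odd sector count (3113339535) and a
size that is not a multiple of 4k.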

> So this is really only the end of the partition that is different.
> However, in the first case, the 4k writes all get broken up into 512b
> writes somewhere in the kernel, as can be seen with btrace:
>
> 8,16 3 36 0.000102666 8184 A WS 12353985 + 1 <- (8,18) 65985
> 8,16 3 37 0.000102739 8184 Q WS 12353985 + 1 [fio]
> 8,16 3 38 0.000102875 8184 M WS 12353985 + 1 [fio]
> 8,16 3 39 0.000103038 8184 A WS 12353986 + 1 <- (8,18) 65986
> 8,16 3 40 0.000103109 8184 Q WS 12353986 + 1 [fio]
> 8,16 3 41 0.000103196 8184 M WS 12353986 + 1 [fio]
> 8,16 3 42 0.000103335 8184 A WS 12353987 + 1 <- (8,18) 65987
> 8,16 3 43 0.000103403 8184 Q WS 12353987 + 1 [fio]
> 8,16 3 44 0.000103489 8184 M WS 12353987 + 1 [fio]
> 8,16 3 45 0.000103609 8184 A WS 12353988 + 1 <- (8,18) 65988
> 8,16 3 46 0.000103678 8184 Q WS 12353988 + 1 [fio]
> 8,16 3 47 0.000103767 8184 M WS 12353988 + 1 [fio]
> 8,16 3 48 0.000103879 8184 A WS 12353989 + 1 <- (8,18) 65989
> 8,16 3 49 0.000103947 8184 Q WS 12353989 + 1 [fio]
> 8,16 3 50 0.000104035 8184 M WS 12353989 + 1 [fio]
> 8,16 3 51 0.000104150 8184 A WS 12353990 + 1 <- (8,18) 65990
> 8,16 3 52 0.000104219 8184 Q WS 12353990 + 1 [fio]
> 8,16 3 53 0.000104307 8184 M WS 12353990 + 1 [fio]
> 8,16 3 54 0.000104452 8184 A WS 12353991 + 1 <- (8,18) 65991
> 8,16 3 55 0.000104520 8184 Q WS 12353991 + 1 [fio]
> 8,16 3 56 0.000104609 8184 M WS 12353991 + 1 [fio]
> 8,16 3 57 0.000104885 8184 I WS 12353984 + 8 [fio]
>
> whereas in the second case, I'm getting the expected 4k writes:
>
> 8,16 6 42 1266874889.659842036 8409 A WS 12340232 + 8 <- (8,18) 52232
> 8,16 6 43 1266874889.659842167 8409 Q WS 12340232 + 8 [fio]
> 8,16 6 44 1266874889.659842393 8409 G WS 12340232 + 8 [fio]

This is weird because --size=1G should mean that fio is "seeing" an
aligned end. Does direct=1 with a sequential job of iodepth=1 show the
problem too?
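
For anyone following along with the btrace excerpts quoted above, this
is a rough Python sketch of how I read them. It assumes the usual
blkparse column order (device, CPU, sequence, time, PID, action, RWBS,
sector, "+", sector count) and 512-byte sectors; parse() and
sizes_for() are hypothetical helper names, and only a few of the quoted
lines are copied in.

```python
SECTOR = 512  # assumed logical sector size in bytes

split_case = [
    "8,16 3 37 0.000102739 8184 Q WS 12353985 + 1 [fio]",
    "8,16 3 40 0.000103109 8184 Q WS 12353986 + 1 [fio]",
    "8,16 3 57 0.000104885 8184 I WS 12353984 + 8 [fio]",
]
normal_case = [
    "8,16 6 43 1266874889.659842167 8409 Q WS 12340232 + 8 [fio]",
]


def parse(line):
    """Split one blkparse line into the fields used below."""
    f = line.split()
    return {"action": f[5], "rwbs": f[6], "sector": int(f[7]), "count": int(f[9])}


def sizes_for(lines, action):
    """Request sizes in bytes for every event with the given action code."""
    return [e["count"] * SECTOR for e in map(parse, lines) if e["action"] == action]


print("split case, queued (Q):  ", sizes_for(split_case, "Q"))
print("split case, inserted (I):", sizes_for(split_case, "I"))
print("normal case, queued (Q): ", sizes_for(normal_case, "Q"))
```

In the first excerpt each queued (Q) request is 512 bytes while the
inserted (I) request is 4096 bytes, so the one-sector pieces were merged
back into a single request before being handed on. In the second excerpt
the request is already 4096 bytes when it is queued.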

> The above examples are from running with an SSD, where the small
> writes get merged together again before hitting the block device,
> which is still pretty o.k. performance wise. But when I run the same
> test on some NVMe device, the writes do not get merged, instead the
> performance drops to less than 10% of what I get in the second case.

Perhaps the I/O scheduler doesn't get the opportunity to merge them on
the NVMe device...

> If this is indeed expected behaviour from the kernel pov, it might
> need some better documentation and probably sgdisk should also be
> enhanced to align the end of the partition as well. FWIW, this happens
> on a stock 4.4.0 kernel as well as recent Ubuntu and CentOS kernels.

Do you mean parted?

--
Sitsofe | http://sucs.org/~sits/