Re: Slow file transfer speeds with CFQ IO scheduler in some cases

From: Vladislav Bolkhovitin
Date: Tue Apr 21 2009 - 14:19:57 EST


Wu Fengguang, on 03/23/2009 04:42 AM wrote:
Here are the conclusions from tests:

1. Making all IO threads work in the same IO context with CFQ (vanilla RA and default RA size) brings nearly 100% link utilization on single-stream reads (100MB/s), while deadline achieves about 50% (50MB/s), i.e. CFQ gives a 100% improvement over deadline. With 2 read streams CFQ has an even bigger advantage: >400% (23MB/s vs 5MB/s).
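
(For anyone wanting to reproduce this: the IO scheduler can be switched per backing device on the server via sysfs; the device name below is just an example:

# cat /sys/block/sdb/queue/scheduler
# echo cfq > /sys/block/sdb/queue/scheduler

and likewise "echo deadline" for the deadline runs.)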

The ideal 2-stream throughput should be >60MB/s, so I guess there is
still room for improvement over CFQ's 23MB/s?

Yes, plenty. But I think the room is not in CFQ, but in readahead. With a 4096K RA we were able to get ~40MB/s, see the previous e-mail and below.
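
(For reference, a 4096K RA can be set per device; e.g., assuming the server's backing device is sdb:

# blockdev --setra 8192 /dev/sdb

The value is in 512-byte sectors, so 8192 sectors = 4096K; the same setting is visible in KB in /sys/block/sdb/queue/read_ahead_kb.)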

The one fact I cannot understand is that SCST seems to be breaking up the
client side 64K reads into server side 4K reads (above the readahead layer).
But I remember you told me that SCST doesn't do NFS rsize style split-ups.
Is this a bug? The 4K read size is too small to be CPU/network friendly...
Where are the split-up and re-assembly done? On the client side or
internal to the server?

This is on the client's side. See the target's log in the attachment. Here is a summary of the command data sizes that came to the server for "dd if=/dev/sdb of=/dev/null bs=64K count=200" run on the client:

4K 11
8K 0
16K 0
32K 0
64K 0
128K 81
256K 8
512K 0
1024K 0
2048K 0
4096K 0

There are way too many 4K requests. Apparently, the request submission path isn't optimal.

Actually, this is another question I wanted to raise from the very beginning.

6. Unexpected result. In the case when all IO threads work in the same IO context with CFQ, increasing the RA size *decreases* throughput. I think this is because RA requests are performed as single big READ requests, while requests coming from remote clients are much smaller in size (up to 256K), so while the data read ahead is being transferred to the remote client at 100MB/s, the backstorage media rotates a bit further, and the next read request must wait out the rotational latency (~0.1ms on 7200RPM). This conforms well with (3) above, where context RA has a 40% advantage over vanilla RA at the default RA size, but a much smaller one at higher RA sizes.

Maybe. But the readahead IOs (as shown by the trace) are _async_ ones...

That doesn't matter, because a new request from the client won't come until all data for the previous one has been transferred to it. And that transfer is done at a very *finite* speed.

Bottom line IMHO conclusions:

1. Context RA should be considered, after additional examination, as a replacement for the current RA algorithm in the kernel

That's my plan to push context RA to mainline. And thank you very much
for providing and testing out a real world application for it!

You're welcome!

2. It would be better to increase the default RA size to 1024K

It's a long-standing wish to increase the default RA size. However, I have a
vague feeling that it would be better to first make the lower layers
smarter about max_sectors_kb-granularity request splitting and batching.

Can you elaborate more on that, please?

*AND* one of the following:

3.1. All RA requests should be split into smaller requests of up to 256K in size, which should not be merged with any other request

Are you referring to max_sectors_kb?

Yes

What are your max_sectors_kb and nr_requests? Something like

grep -r . /sys/block/sda/queue/

Defaults: 512 and 128, respectively.

OR

3.2. New RA requests should be sent before the previous one has completed, so that the storage device doesn't rotate so far that a full rotation is needed to serve the next request.

Linus has a mmap readahead cleanup patch that can do this. It
basically replaces a {find_lock_page(); readahead();} sequence with
{find_get_page(); readahead(); lock_page();}.

I'll try to push that patch into mainline.

Good!

I like suggestion 3.1 a lot more, since it should be simple to implement and would have the following two positive side effects:

1. It would minimize the negative effect of a higher RA size on I/O latency by allowing CFQ to switch to requests that have been waiting too long, when necessary.

2. It would allow better request pipelining, which is very important to minimize uplink latency for synchronous requests (i.e. with only one IO request at a time, the next request is issued only when the previous one has completed). You can see at http://www.3ware.com/kb/article.aspx?id=11050 that for maximum performance 3ware recommends setting max_sectors_kb as low as *64K* with 16MB RA. This maximizes command pipelining. And this suggestion really works, improving throughput by 50-100%!
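
(In sysfs terms that 3ware recommendation translates to something like the following, assuming the array is exported as sdX:

# echo 64 > /sys/block/sdX/queue/max_sectors_kb
# blockdev --setra 32768 /dev/sdX

i.e. 64K requests with 32768 sectors = 16MB of RA.)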

It seems I should elaborate more on this. The case when the client is remote is fundamentally different from the case when the client is local, for which Linux is currently optimized. When the client is local, data is delivered to it from the page cache at a virtually infinite speed. But when the client is remote, data is delivered to it from the server's cache at a *finite* speed. In our case this speed is about the same as the speed of reading data into the cache from the storage. This has the following consequences:

1. Data for any READ request is first transferred from the storage to the cache, and then from the cache to the client. If those transfers are done purely sequentially without overlapping, i.e. without any readahead, the resulting throughput T can be found from the equation 1/T = 1/Tlocal + 1/Tremote, where Tlocal and Tremote are the throughputs of the local (i.e. from the storage) and remote links. In the case when Tlocal ~= Tremote, T ~= Tremote/2. Quite an unexpected result, right? ;)
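To put numbers on it: with Tlocal = Tremote = 100MB/s, 1/T = 1/100 + 1/100 = 1/50, so T = 50MB/s, which matches the ~51MB/s single-stream baseline below.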

2. If the data transfers on the local and remote links aren't coordinated, it is possible that only one link is transferring data at any given time. From (1) above you can see that the percentage of this "idle" time is the percentage of lost throughput, i.e. to get the maximum throughput both links should transfer data as simultaneously as possible. In our case, when Tlocal ~= Tremote, both links should be busy all the time. Moreover, it is possible that the local transfer has finished, but during the remote transfer the storage media rotated too far, so the next request will have to wait for a full rotation to complete (i.e. several ms of lost bandwidth).

Thus, to get the maximum possible throughput, we need to maximize the simultaneous load on both the local and remote links. This can be done using the well-known pipelining technique. For that, the client should still read the same amount of data at once, but the reads should be split into smaller chunks, like 64K at a time. This approach seems to go against the "conventional wisdom" that a bigger request means bigger throughput, but in fact it doesn't, because the same (big) amount of data is still read at a time. A bigger count of smaller requests puts a more simultaneous load on both links participating in the data transfers. In fact, even if the client is local, in most cases there is a second data transfer link: it is inside the storage. This is especially true for RAID controllers. Guess why 3ware recommends setting max_sectors_kb to 64K and increasing RA in the above link? ;)

Of course, max_sectors_kb should be decreased only for smart devices that allow >1 outstanding request at a time, i.e. for all modern SCSI/SAS/SATA/iSCSI/FC/etc. drives.
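
(Whether a particular device accepts more than one outstanding command can be checked via sysfs, e.g. for a SCSI disk sdb:

# cat /sys/block/sdb/device/queue_depth

which reports its current queue depth.)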

There is an objection against having too many outstanding requests at a time: latency. But since the overall size of all requests remains unchanged, this objection isn't relevant to this proposal. There is the same latency-related objection against increasing RA, but with many small individual RA requests it isn't relevant either.

We did some measurements to support this proposal. They were done with context RA and only with the deadline scheduler, to make the picture clearer. The tests were the same as before.

--- Baseline, all default:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 51,1 MB/s
b) 51,4 MB/s
c) 51,1 MB/s

Run at the same time:
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4,7 MB/s
b) 4,6 MB/s
c) 4,8 MB/s

--- Client all default, on the server max_sectors_kb set to 64K:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 100 MB/s
b) 100 MB/s
c) 102 MB/s

# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 5,2 MB/s
b) 5,3 MB/s
c) 4,2 MB/s

That is a 100% and an 8% improvement compared to the baseline.

From the previous e-mail you can see that with a 4096K RA:

# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39,9 MB/s
b) 39,5 MB/s
c) 38,4 MB/s

I.e. there is a 760% improvement over the baseline.

Thus, I believe that for all devices supporting queue depths >1, max_sectors_kb should be set by default to 64K (or maybe to 128K, but not more), and the default RA increased to at least 1M, better 2-4M.
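
(Until the defaults change, this can be approximated per device at run time; e.g., assuming the device is sdb:

# echo 64 > /sys/block/sdb/queue/max_sectors_kb
# blockdev --setra 4096 /dev/sdb

where 4096 sectors = 2M of RA.)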

(Can I wish for CONFIG_PRINTK_TIME=y next time? :-)

Sure

Thanks,
Vlad

Attachment: req_split.log.bz2
Description: application/bzip