Re: Linux I/O subsystem performance (was: linuxcon 2010...)

From: Chris Worley
Date: Thu Sep 16 2010 - 11:06:09 EST


On Tue, Aug 24, 2010 at 2:31 PM, Chris Worley <worleys@xxxxxxxxx> wrote:
> On Tue, Aug 24, 2010 at 11:43 AM, Vladislav Bolkhovitin <vst@xxxxxxxx> wrote:
>> Pasi Kärkkäinen, on 08/24/2010 11:25 AM wrote:
>>>
>>> On Mon, Aug 23, 2010 at 02:03:26PM -0400, Chetan Loke wrote:
>>>>
>>>> I actually received 3+ off-post emails asking whether I was talking
>>>> about initiator or target in the 100K IOPS case below and what did I
>>>> mean by the ACKs.
>>>> I was referring to the 'Initiator' side.
>>>> ACKs == when scsi-ML down-calls the LLD via queuecommand, the LLD
>>>> processes the SGLs and then triggers the scsi_done up-call path.
>>>>
>>>
>>> Uhm, Intel and Microsoft demonstrated over 1 million IOPS
>>> using software iSCSI and a single 10 Gbit Ethernet NIC (Intel 82599).
>>>
>>> How come there is such a huge difference? What are we lacking in Linux?
>>
>> I also have an impression that Linux I/O subsystem has some performance
>> problems. For instance, in one recent SCST performance test only 8 Linux
>> initiators with fio as a load generator were able to saturate a single SCST
>> target with dual IB cards (SRP) on 4K AIO direct accesses over an SSD
>> backend. This roughly means that each initiator took several times (8x?)
>> more processing time than the target.
>
> While I can't tell you where the bottlenecks are, I can share some
> performance numbers...

I've been asked to share more details of the single SRP initiator
case, comparing Windows to Linux...

The configurations tested are denoted by four numbers separated by dashes:

- The number of initiators used in the test (always one in this case).
- The number of target ports used.
- The number of initiator ports used.
- The number of drives used.
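(For reference, the quoted SCST test above describes fio driving 4K AIO direct
accesses; a minimal fio job file of that shape might look like the following.
The device path, queue depth, runtime, and job count are my assumptions, not
the exact settings used in any of these tests.)

```ini
; Sketch of a 4K random-I/O fio job (libaio, O_DIRECT), per the
; workload described in the quoted SCST test.  Adjust filename,
; iodepth, and numjobs for your setup -- these values are guesses.
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
runtime=60
time_based=1
group_reporting=1

[rand-4k]
rw=randrw
rwmixread=70
filename=/dev/sdX
numjobs=8
```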

SRP Upstream Initiator (IOPS)

               1-1-1-1  1-1-1-2  1-2-2-2  1-1-1-4  1-2-2-4  1-1-1-8  1-2-2-8
Random Write    122880   141568   206592   144384   163840   141824   165376
30/70 R/W mix    72113   123136   144640   143616   163072   145920   163584
70/30 R/W mix    55938    91392   114176   135680   156160   145920   162304
Random Read      50688    78336   107008   121600   149760   143872   161536

SRP Windows Initiator (IOPS)

               1-?-1-1  1-?-1-2  1-?-2-2  1-?-1-4  1-?-2-4  1-?-1-8  1-?-2-8
Random Write     57774   116738   114464   146972   202891        -   221819
30/70 R/W mix    49719    95697    97831   154328   181221        -   227786
70/30 R/W mix    45242    90694    89559   167341   176178        -   244661
Random Read      48016    94867    92984   178227   183631        -   257449

Note that the question marks mark the cases where I'm not sure how Windows
used the second target port: in Linux you select the target port from the
initiator, but there is no such option in Windows, so the second target port
may or may not have been in use in those cases. The 1-1-1-8 case is where I
tried to force Windows to use just one target port (by disabling the other
target port), and Windows wouldn't do any I/O at all, which is why that
configuration has no numbers.
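(As a measurement aside: one way to confirm that reported IOPS really reflect
the advertised block size -- the merging/coalescing concern raised in my
earlier message quoted below -- is to compare completed I/Os against sectors
transferred in /proc/diskstats. A rough sketch; field positions follow the
kernel's Documentation/iostats.txt, and the synthetic usage below is made up:)

```python
# Sketch: check average completed-I/O size from /proc/diskstats deltas.
# Per Documentation/iostats.txt, the fields after major, minor, name are:
#   reads completed, reads merged, sectors read, ms reading,
#   writes completed, writes merged, sectors written, ...

def read_diskstats(dev, text):
    """Parse one /proc/diskstats snapshot for device `dev`."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 11 and f[2] == dev:
            return {
                "reads": int(f[3]), "reads_merged": int(f[4]),
                "sectors_read": int(f[5]),
                "writes": int(f[7]), "writes_merged": int(f[8]),
                "sectors_written": int(f[9]),
            }
    raise ValueError(f"device {dev} not found")

def avg_io_size(before, after):
    """Average completed-I/O size in bytes, plus merge count, between two
    snapshots.  Nonzero merges mean the block layer coalesced requests, so
    the benchmark's 'IOPS' no longer match the submitted block size."""
    ios = ((after["reads"] - before["reads"])
           + (after["writes"] - before["writes"]))
    sectors = ((after["sectors_read"] - before["sectors_read"])
               + (after["sectors_written"] - before["sectors_written"]))
    merges = ((after["reads_merged"] - before["reads_merged"])
              + (after["writes_merged"] - before["writes_merged"]))
    return (sectors * 512 / ios if ios else 0.0), merges
```

If the benchmark submits 4K I/Os but the average comes back well above 4096
bytes (or merges is large), the device is seeing bigger requests than the
benchmark claims; writing 2 to /sys/block/<dev>/queue/nomerges disables
merging entirely for such a test.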

Chris
>
> 4 initiators can get >600K random 4KB IOPS off a single target...
> which is ~150% of what the Emulex/Intel/Microsoft results show using 8
> targets at 4KB (their 1M IOPS was at 512 byte blocks, which is not a
> realistic test point) here:
>
> http://itbrandpulse.com/Documents/Test2010001%20-%20The%20Sun%20Rises%20on%20CNAs%20Test%20Report.pdf
>
> The blog referenced earlier used 10 targets... and I'm not sure how
> many 10G ports per target.
>
> In general, my target delivers about 65% of the local small-block
> random write performance over IB, and about 85% of the local
> small-block random read performance.  For large-block I/O, ~95%
> efficiency is easily achievable, read or write (i.e. 5.6GB/s over the
> fabric, where 6GB/s is achievable on the drives locally at 1MB random
> blocks).
> These small-block efficiencies are achievable only when tested with
> multiple initiators.
>
> The single initiator is only capable of <150K 4KB IOPS... but gets
> full bandwidth w/ larger blocks.
>
> If I had to choose my problem, target or initiator bottleneck, I'd
> certainly rather have an initiator bottleneck than Microsoft's target
> bottleneck.
>
>> Hardware used for that target and
>> initiators was the same. I can't see on this load why the initiators would
>> need to do something more than the target. Well, I know we in SCST did
>> excellent work to maximize performance, but such a difference looks too much
>> ;)
>>
>> Also it looks very suspicious that nobody has even tried to match that
>> Microsoft/Intel record, not even Intel itself, which works closely with the
>> Linux community in the storage area and could do it using the same hardware.
>
> The numbers are suspicious for other reasons.  "Random" is often used
> loosely (and the blog referenced earlier doesn't even claim "random").
>  If there is any merging/coalescing going on, then the "IOPS" are
> going to look vastly better.  If I allow coalescing, I can easily get
> 4M 4KB IOPS, but can't honestly call those 4KB IOPS (even if the
> benchmark thinks it's doing 4KB I/O).  They need to show that their
> advertised block size is maintained end-to-end.
>
> Chris
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/