Re: [dm-devel] dm-crypt performance

From: Milan Broz
Date: Tue Mar 26 2013 - 16:06:34 EST


On 26.3.2013 13:27, Alasdair G Kergon wrote:
[Adding dm-crypt + linux-kernel]

Thanks.


On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
I performed some dm-crypt performance tests as Mike suggested.

It turns out that unbound workqueue performance has improved somewhere
between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
patches for hand-built dispatch are no longer needed.

For RAID-0 composed of two disks with total throughput 260MB/s, the
unbound workqueue performs as well as the hand-built dispatch (both
sustain the 260MB/s transfer rate).

For ramdisk, unbound workqueue performs better than hand-built dispatch
(620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested
(git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves
performance slighlty on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).

I found that ramdisk tests are usualy quite misleading for dmcrypt.
Better use some fast SSD, ideally in RAID0 (so you get >500MB or so).
Also be sure you compare recent machines which uses AES-NI.
For reference, null cipher (no crypt, data copy only) works as well,
but this is not real-world scenario.

After introducing Andi's patches, we created performance regression
for people which created "RAID over several dmcrypt devices".
(All IOs were processed by one core.) Rerely use case but several people
complained.
But most of people reported that current approach works much better
(even with stupid dd test - I think it is because page cache sumbits
requests from different CPUs so it in fact run in parallel).

But using dd with direct-io is trivial way how to simulate the "problem".
(I guess we all like using dd for performance testing... :-])

However, there is still the problem with request ordering. Milan found out
that under some circumstances parallel dm-crypt has worse performance than
the previous dm-crypt code. I found out that this is not caused by
deficiencies in the code that distributes work to individual processors.
Performance drop is caused by the fact that distributing write bios to
multiple processors causes the encryption to finish out of order and the
I/O scheduler is unable to merge these out-of-order bios.

If the IO scheduler is unable to merge these request because of out of order
bios, please try to FIX IO scheduler and do not invent workarounds in dmcrypt.
(With recent accelerated crypto this should not happen so often btw.)

I know it is not easy but I really do not like that in "little-walled
device-mapper garden" is something what should be done on different
layer (again).

The deadline and noop schedulers perform better (only 50% slowdown
compared to old dm-crypt), CFQ performs very badly (8 times slowdown).


If I sort the requests in dm-crypt to come out in the same order as they
were received, there is no longer any slowdown, the new crypt performs as
well as the old crypt, but the last time I submitted the patches, people
objected to sorting requests in dm-crypt, saying that the I/O scheduler
should sort them. But it doesn't. This problem still persists in the
current kernels.

I have probable no vote here anymore but for the record: I am strictly
against any sorting of requests in dmcrypt. My reasons are:

- dmcrypt should be simple transparent layer (doing one thing - encryption),
sorting of requests was always primarily IO scheduler domain
(which has well-known knobs to control it already)

- Are we sure we are not inroducing some another side channel in disc
encryption? (Unprivileged user can measure timing here).
(Perhaps stupid reason but please do not prefer performance to security
in encryption. Enough we have timing attacks for AES implementations...)

- In my testing (several months ago) output was very unstable - in some
situations it helps, in some it was worse. I have no longer hard
data but some test output was sent to Alasdair.

For best performance we could use the unbound workqueue implementation
with request sorting, if people don't object to the request sorting being
done in dm-crypt.

So again:

- why IO scheduler is not working properly here? Do it need some extensions?
If fixed, it can help even is some other non-dmcrypt IO patterns.
(I mean dmcrypt can set some special parameter for underlying device queue
automagically to fine-tune sorting parameters.)

- can we have some cpu-bound workqueue which automatically switch to unbound
(relocates work to another cpu) if it detects some saturation watermark etc?
(Again, this can be used in other code.
http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
(Yes, I see skepticism there :-)

On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
FYI, XFS also does it's own request ordering for the metadata buffers,
because it knows the needed ordering and has a bigger view than than
than especially CFQ. You at least have precedence in a widely used
subsystem for this code.

Nice. But XFS is much more complex system.
Isn't it enough that multipath uses own IO queue (so we have one IO scheduler
on top of another, and now we have metadata io sorting in XFS on top of it
and planning one more in dmcrypt? Is it really good approach?)

Milan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/