Re: Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage) with check/repair/sync

From: Matthias Dahl
Date: Fri Jul 15 2016 - 03:11:12 EST


Hello...

I am rather persistent (stubborn?) when it comes to tracking down bugs,
if somehow possible... and it seems it paid off... somewhat. ;-)

So I did quite a lot more further tests and came up with something very
interesting: As long as the RAID is in sync (as-in: sync_action=idle),
I can not for the life of me trigger this issue -- the used memory
still explodes to most of the RAM but it oscillates back and forth.

I did very stupid things to stress the machine while dd was running as
usual on the dm-crypt device. I opened a second dd instance with the
same parameters on the dm-crypt device. I wrote a simple program that
allocated random amounts of memory (up to 10 GiB), memset them and after
a random amount of time released it again -- in a continuous loop. I
put heavy network stress on the machine... whatever I could think of.

No matter what, the issue did not trigger. And I repeated said tests
quite a few times over extended time periods (usually an hour or so).
Everything worked beautifully with nice speeds and no noticeable system
slow-downs/lag.

As soon as I issued a "check" to sync_action of the RAID device, it was
just a matter of a second until the OOM killer kicked in and all hell
broke loose again. And basically all of my tests where done while the
RAID device was syncing -- due to a very unfortunate series of events.

I tried to repeat that same test with an external (USB3) connected disk
with a Linux s/w RAID10 over two partitions... but unfortunately that
behaves rather differently. I assume it is because it is connected
through USB and not SATA. While doing those tests on my RAID10 with the
4 internal SATA3 disks, you can see w/ free that the "used memory" does
explode to most of the RAM and then oscillates back and forth. With the
same test on the external disk through, that does not happen at all. The
used memory stays pretty much constant and only the buffers vary... but
most of the memory is still free in that case.

I hope my persistence on the matter is not annoying and finally leads us
somewhere where the real issue hides.

Any suggestions, opinions and ideas are greatly appreciated as I have
pretty much exhausted mine at this time.

Last but not least: I switched my testing to a OpenSuSE Tumbleweed Live
system (x86_64 w/ kernel 4.6.3) as Rawhide w/ 4.7.0rcX behaves rather
strangely and unstable at times.

Thanks,
Matthias

--
Dipl.-Inf. (FH) Matthias Dahl | Software Engineer | binary-island.eu
services: custom software [desktop, mobile, web], server administration