Re: [PATCH v1 00/11] mm/kasan: support per-page shadow memory to reduce memory consumption

From: Dmitry Vyukov
Date: Tue May 16 2017 - 16:49:36 EST


On Mon, May 15, 2017 at 11:23 PM, Joonsoo Kim <js1304@xxxxxxxxx> wrote:
>> >
>> > Hello, all.
>> >
>> > This is an attempt to reduce memory consumption of KASAN. Please see
>> > the following description for more information.
>> >
>> > 1. What is per-page shadow memory
>>
>> Hi Joonsoo,
>
> Hello, Dmitry.
>
>>
>> First I need to say that this is great work. I wanted KASAN to consume
>
> Thanks!
>
>> 1/8-th of _kernel_ memory rather than total physical memory for a long
>> time.
>>
>> However, this implementation does not work with inline instrumentation. And
>> the inline instrumentation is the main mode for KASAN. Outline
>> instrumentation is merely a rudiment to support gcc 4.9, and it needs
>> to be removed as soon as we stop caring about gcc 4.9 (do we at all?
>> is it the current compiler in any distro? Ubuntu 12 has 4.8, Ubuntu 14
>> already has 5.4. And if you build gcc yourself or get a fresher
>> compiler from somewhere else, you hopefully get something better than
>> 4.9).
>
> Hmm... I don't think that outline instrumentation is something to be
> removed. In the embedded world, there is often a fixed partition table,
> and enlarging the kernel binary would cause problems. Changing that
> table is possible, but it is a really awkward thing to do just for
> debugging. So I think that outline instrumentation has its own merit.

Fair enough. Let's treat both modes as important.
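
For context, the difference between the two modes: with outline
instrumentation the compiler emits a call to __asan_loadN()/
__asan_storeN() for every memory access, while with inline
instrumentation the shadow check itself is emitted at each access site,
which is what bloats the binary. Roughly, the inlined check amounts to
the following (a minimal userspace toy with an illustrative shadow
array, not kernel code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SHADOW_SCALE_SHIFT 3                  /* 1 shadow byte per 8 bytes */
#define MEM_SIZE (1 << 12)

static int8_t shadow[MEM_SIZE >> SHADOW_SCALE_SHIFT];

/* What the compiler conceptually emits before each aligned 8-byte access
 * in inline mode; outline mode calls __asan_load8()/__asan_store8() to do
 * the same check out of line. */
static void check_access_8(unsigned long off)
{
	if (shadow[off >> SHADOW_SCALE_SHIFT])  /* 0 == whole granule valid */
		printf("KASAN-style report: bad 8-byte access at offset %lu\n",
		       off);
}

int main(void)
{
	memset(shadow, 0, sizeof(shadow));      /* everything accessible */
	shadow[8] = -1;                         /* poison bytes 64..71 */

	check_access_8(0);                      /* OK, no report */
	check_access_8(64);                     /* poisoned, reports */
	return 0;
}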

> Anyway, I had missed inline instrumentation completely.
>
> I will attach the fix at the bottom. It doesn't look beautiful,
> since it breaks the layering (some checks are done in the report
> function). However, I think that it's a good trade-off.
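
If I read the idea correctly, the inline-emitted check stays untouched,
and the report path decides whether a hit that came from the shared
black shadow page is a real bug or just a page that never got
byte-granular shadow. A toy model of that decision (all names, values
and states below are invented for illustration, not taken from the
patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLACK_SHADOW ((int8_t)0xF7)    /* placeholder "no byte shadow" marker */

/* Page-granular validity, as the per-page shadow would track it. */
enum pshadow_state { PSHADOW_VALID, PSHADOW_INVALID };

/* Called from the report path after the unchanged inline check fired. */
static bool should_report(enum pshadow_state pstate, int8_t shadow_val)
{
	if (shadow_val == BLACK_SHADOW)
		/* No byte-granular shadow for this page: fall back to
		 * the page-granular state instead of reporting blindly. */
		return pstate == PSHADOW_INVALID;

	return shadow_val != 0;        /* ordinary byte-granular verdict */
}

int main(void)
{
	printf("black shadow, page valid   -> report? %d\n",
	       should_report(PSHADOW_VALID, BLACK_SHADOW));     /* 0 */
	printf("black shadow, page invalid -> report? %d\n",
	       should_report(PSHADOW_INVALID, BLACK_SHADOW));   /* 1 */
	return 0;
}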


I can confirm that inline instrumentation works with that patch.

I can also confirm that it reduces memory usage. I've booted qemu with
2G RAM and run a fixed workload; the lines below are top(1) output, and
the RES column (resident memory, in KB) is the one to compare.

Before:
31853 dvyukov 20 0 3043200 765464 21312 S 366.0 4.7 2:39.53 qemu-system-x86
 7528 dvyukov 20 0 3043200 732444 21676 S 333.3 4.5 2:23.19 qemu-system-x86

After:
 6192 dvyukov 20 0 3043200 394244 20636 S  17.9 2.4 2:32.95 qemu-system-x86
 6265 dvyukov 20 0 3043200 388860 21416 S 399.3 2.4 3:02.88 qemu-system-x86
 9005 dvyukov 20 0 3043200 383564 21220 S 397.1 2.3 2:35.33 qemu-system-x86

However, I see some very significant slowdowns with inline
instrumentation. I did 3 tests:
1. Boot speed: I measured the time for a particular message to appear
on the console. Before:
[ 2.504652] random: crng init done
[ 2.435861] random: crng init done
[ 2.537135] random: crng init done
After:
[ 7.263402] random: crng init done
[ 7.263402] random: crng init done
[ 7.174395] random: crng init done

That's a ~3x slowdown.

2. I've run the bench_readv benchmark:
https://raw.githubusercontent.com/google/sanitizers/master/address-sanitizer/kernel_buildbot/slave/bench_readv.c
as:
while true; do time ./bench_readv bench_readv 300000 1; done

Before:
sys 0m7.299s
sys 0m7.218s
sys 0m6.973s
sys 0m6.892s
sys 0m7.035s
sys 0m6.982s
sys 0m6.921s
sys 0m6.940s
sys 0m6.905s
sys 0m7.006s

After:
sys 0m8.141s
sys 0m8.077s
sys 0m8.067s
sys 0m8.116s
sys 0m8.128s
sys 0m8.115s
sys 0m8.108s
sys 0m8.326s
sys 0m8.529s
sys 0m8.164s
sys 0m8.380s

This is a ~19% slowdown.

3. I've run the bench_pipes benchmark:
https://raw.githubusercontent.com/google/sanitizers/master/address-sanitizer/kernel_buildbot/slave/bench_pipes.c
as:
while true; do time ./bench_pipes 10 10000 1; done

Before:
sys 0m5.393s
sys 0m6.178s
sys 0m5.909s
sys 0m6.024s
sys 0m5.874s
sys 0m5.737s
sys 0m5.826s
sys 0m5.664s
sys 0m5.758s
sys 0m5.421s
sys 0m5.444s
sys 0m5.479s
sys 0m5.461s
sys 0m5.417s

After:
sys 0m8.718s
sys 0m8.281s
sys 0m8.268s
sys 0m8.334s
sys 0m8.246s
sys 0m8.267s
sys 0m8.265s
sys 0m8.437s
sys 0m8.228s
sys 0m8.312s
sys 0m8.556s
sys 0m8.680s

This is a ~52% slowdown.


This does not look acceptable to me. I would be ready to pay, say, 10%
of performance for this. But it seems that this can cause up to a 2-4x
slowdown for some workloads.


Your use-case is embedded devices, where you care a lot about both code
size and memory consumption, right?

I see 2 possible ways forward:
1. Enable this new mode only for outline, but keep the current scheme
for inline. Then outline becomes the "small but slow" type of
configuration.
2. Somehow fix the slowness (at least in inline mode).


> Mapping the zero page for non-kernel memory could cause a false-negative
> problem, since we cannot flush the TLB on all CPUs. We would read a zero
> shadow value in this case even if the actual shadow value is not
> zero. This is one of the reasons that the black page is introduced in this
> patchset.

What makes your current patch work, then?
Say we map a new shadow page and update the page shadow to say that there
is a mapped shadow. Then another CPU loads the page shadow and then
loads from the newly mapped shadow. If we don't flush the TLB, what makes
the second CPU see the newly mapped shadow?
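
For concreteness, the conservative ordering I would expect to be needed
looks something like this (a kernel-flavored sketch only, not code from
the patch; map_kernel_shadow_pte() and set_per_page_shadow_mapped() are
invented names, while flush_tlb_kernel_range() is the existing
shootdown helper):

/* Sketch of publishing a newly allocated shadow page so that other
 * CPUs can safely read through it; helper names are illustrative. */
static void kasan_map_shadow_page(unsigned long shadow_addr, struct page *page)
{
	/* 1. Install the new shadow PTE. */
	map_kernel_shadow_pte(shadow_addr, page);              /* invented */

	/*
	 * 2. Without a shootdown, a CPU that already has a translation
	 *    for shadow_addr cached in its TLB keeps reading the old
	 *    zero/black page.
	 */
	flush_tlb_kernel_range(shadow_addr, shadow_addr + PAGE_SIZE);

	/* 3. Only then publish "this page now has a real shadow", so a
	 *    CPU that observes the per-page state cannot still hit the
	 *    stale mapping.
	 */
	smp_wmb();
	set_per_page_shadow_mapped(shadow_addr);               /* invented */
}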