Re: [PATCH v2 2/3] x86/mm/KASLR: Calculate the actual size of vmemmap region

From: Ingo Molnar
Date: Wed Sep 12 2018 - 02:32:01 EST



* Baoquan He <bhe@xxxxxxxxxx> wrote:

> On 09/11/18 at 08:08pm, Baoquan He wrote:
> > On 09/11/18 at 11:28am, Ingo Molnar wrote:
> > > Yeah, so proper context is still missing, this paragraph appears to assume from the reader a
> > > whole lot of prior knowledge, and this is one of the top comments in kaslr.c so there's nowhere
> > > else to go read about the background.
> > >
> > > For example what is the range of randomization of each region? Assuming the static,
> > > non-randomized description in Documentation/x86/x86_64/mm.txt is correct, in what way does
> > > KASLR modify that layout?
>
> Re-reading this paragraph, I found I missed saying the range for each
> memory region, and in what way KASLR modifies the layout.
>
> > >
> > > All of this is very opaque and not explained very well anywhere that I could find. We need to
> > > generate a proper description ASAP.
> >
> > OK, let me try to give some context based on my understanding, and
> > copy the static layout of memory regions below for reference.
> >
> Here, Documentation/x86/x86_64/mm.txt is correct, and it's the
> guideline for us to manipulate the layout of kernel memory regions.
> Originally the starting address of each region was aligned to 512 GB,
> so that each of them starts at the boundary of a PGD entry in 4-level
> paging mode. Since we are rich enough to have 120 TB of virtual address
> space, they are actually aligned at 1 TB. So the randomness mainly
> comes from three parts:
>
> 1) The direct mapping region for physical memory. 64 TB are reserved
> to cover the maximum supported physical memory. However, most systems
> have much less RAM than 64 TB, often much less than 1 TB. We can take
> the superfluous space and add it to the randomization. This is often
> the biggest part.

So i.e. in the non-KASLR case we have this description (from mm.txt):

ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
... unused hole ...
ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
... unused hole ...
vaddr_end for KASLR
fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
...

The problems start here, this map is already *horribly* confusing:

- we mix size in TB with 'bits'
- we sometimes mention a size in the description and sometimes not
- we sometimes list holes by address, sometimes only as an 'unused hole' line ...

So how about first cleaning up the memory maps in mm.txt and streamlining them, like this:

ffff880000000000 - ffffc7ffffffffff (=46 bits, 64 TB) direct mapping of all phys. memory (page_offset_base)
ffffc80000000000 - ffffc8ffffffffff (=40 bits, 1 TB) ... unused hole
ffffc90000000000 - ffffe8ffffffffff (=45 bits, 32 TB) vmalloc/ioremap space (vmalloc_base)
ffffe90000000000 - ffffe9ffffffffff (=40 bits, 1 TB) ... unused hole
ffffea0000000000 - ffffeaffffffffff (=40 bits, 1 TB) virtual memory map (vmemmap_base)
ffffeb0000000000 - ffffebffffffffff (=40 bits, 1 TB) ... unused hole
ffffec0000000000 - fffffbffffffffff (=44 bits, 16 TB) KASAN shadow memory
fffffc0000000000 - fffffdffffffffff (=41 bits, 2 TB) ... unused hole
vaddr_end for KASLR
fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
...

Please double check all the calculations and ranges, and I'd suggest doing it for the whole
file. Note how I added the global variables describing the base addresses - this makes it very
easy to match the pointers in kaslr_regions[] to the static map, to see the intent of
kaslr_regions[].
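
For reference, the table in question looks roughly like this in arch/x86/mm/kaslr.c (a
sketch from memory - the exact initializers vary between kernel versions, and the fixed
vmemmap size is what this very series changes):

	static __initdata struct kaslr_memory_region {
		unsigned long *base;
		unsigned long size_tb;
	} kaslr_regions[] = {
		{ &page_offset_base, 0 },		/* direct mapping: sized at boot from actual RAM */
		{ &vmalloc_base, VMALLOC_SIZE_TB },	/* 32 TB */
		{ &vmemmap_base, 1 },			/* fixed 1 TB today */
	};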

BTW., isn't that 'vaddr_end for KASLR' entry position inaccurate? In the typical case it could
very well be that by chance all 3 areas end up being randomized into the first 64 TB region,
right?

I.e. vaddr_end could be at any 1 TB boundary in the above ranges. I'd suggest leaving out all
KASLR from this static mappings table - explain it separately in this file, maybe even create
its own memory map. I'll help with the wording.

> 2) The holes between memory regions, even though they are only 1 TB each.

There's a 2 TB hole too.

> 3) The KASAN region takes up 16 TB, but it is not in use when KASLR is
> enabled. This is another big part.

Ok.

> As you can see, among these three memory regions, the physical memory
> mapping region has a variable size depending on the existing system RAM,
> while the remaining two memory regions have fixed sizes: vmalloc is
> 32 TB, vmemmap is 1 TB.
>
> With this superfluous address space, and by changing the starting address
> of each memory region to be PUD-aligned, namely 1 GB aligned, we can have
> thousands of candidate positions at which to locate those three memory
> regions.

Would be nice to provide the maximum number of bits randomized, from which the number of GBs
of physical RAM has to be subtracted.

Because 'thousands' of randomization targets is *excessively* poor randomization - caused by
the ridiculously high rounding to 1 GB. It would be _very_ nice to extend randomization to at
least 2 MB boundaries instead. (If the half cacheline of PTE entries possibly 'wasted' is an
issue we could increase that to 128 MB, but we should start with 2 MB first.)

That would instantly multiply the randomization selection by 512 ...
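
Back-of-the-envelope (assuming the ~118 TB of randomization space between
0xffff880000000000 and 0xfffffe0000000000, and a worst-case 64 TB direct mapping):

	slack        = 118 TB - (64 TB + 32 TB + 1 TB)  = ~21 TB
	slots @ 1 GB = 21 TB / 1 GB                     = ~21,500     (~14 bits)
	slots @ 2 MB = 21 TB / 2 MB                     = ~11,000,000 (~23 bits)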

> Above is for the 4-level paging mode. As for 5-level, since the virtual
> address space is so much bigger, Kirill makes the starting addresses of
> the regions P4D-aligned, namely 512 GB aligned.

512 GB alignment for every region? That's ridiculously poor randomization too: we should
*utilize* the extra randomness and match the randomization on 56-bit CPUs as well, instead of
wasting it!

> When randomizing the layout, the order of the regions is kept: the
> physical memory mapping region is still handled first, then vmalloc
> and vmemmap. Take the physical memory mapping region as an example:
> we limit its starting address to the first 1/3 of the whole available
> virtual address space, which spans from 0xffff880000000000 to
> 0xfffffe0000000000, namely from the original starting address of the
> physical memory mapping region to the starting address of the
> cpu_entry_area mapping region. Once a random address is chosen for the
> physical memory mapping, we jump over the region and add 1 GB to begin
> handling the next region in the remaining available space.

Ok, makes sense now!
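
For readers matching that description to the code, the loop in kernel_randomize_memory()
does roughly the following (simplified from arch/x86/mm/kaslr.c; details differ between
kernel versions):

	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
		unsigned long entropy;

		/* Give each remaining region an even share of the remaining entropy. */
		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
		entropy = (rand % (entropy + 1)) & PUD_MASK;	/* 1 GB granularity */
		vaddr += entropy;
		*kaslr_regions[i].base = vaddr;

		/* Jump over the region, then round up so the next region starts
		 * at least 1 GB (PUD_SIZE) past the end of this one. */
		vaddr += (unsigned long)kaslr_regions[i].size_tb << TB_SHIFT;
		vaddr = round_up(vaddr + 1, PUD_SIZE);
		remain_entropy -= entropy;
	}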

I'd suggest adding an explanation like this to @size_tb:

@size_tb is physical RAM size, rounded up to the next 1 TB boundary so that the base
addresses following this region still start on 1 TB boundaries.

Once we improve randomization to be at the 2 MB granularity this should be renamed
->size_rounded_up or so.
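
I.e. something like this as the field comment (a wording sketch only):

	struct kaslr_memory_region {
		unsigned long *base;

		/*
		 * Size of the region in TB: for the direct mapping this is the
		 * physical RAM size rounded up to the next 1 TB boundary, so
		 * that the regions following it still start on 1 TB boundaries.
		 */
		unsigned long size_tb;
	};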

Would you like to work on this? These would be really nice additions, once the code is cleaned
up to be maintainable and the pending bug fixes you have are merged.

In terms of patch logistics I'd suggest this ordering:

- documentation fixes
- simple cleanups
- fixes
- enhancements

With no more than ~5 patches sent in a series. Feel free to integrate all pending
boot-memory-map fixes and features as well, we'll figure out the right way to do them as they
happen - but let's start with the simple stuff first, ok?

Thanks,

Ingo