Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

From: Catalin Marinas
Date: Tue Nov 10 2020 - 13:17:50 EST


On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > boot table initialization, so move it later in the boot process.
> > > Specifically into mem_init(), this is the last place crashkernel will be
> > > able to reserve the memory before the page allocator kicks in.
> > > There isn't any apparent reason for doing this earlier.
> >
> > It's so that map_mem() can carve it out of the linear/direct map.
> > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > CPUs.
>
> I presume here you refer to arch_kexec_protect_crashkres(), IIUC this will only
> happen further down the line, after having loaded the kdump kernel image. But
> it also depends on the mappings to be PAGE sized (flags == NO_BLOCK_MAPPINGS |
> NO_CONT_MAPPINGS).

IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
not the whole reserved memory that the crashkernel will use. For the
latter, we avoid the linear map by marking it as nomap in map_mem().

> > We also depend on this when skipping the checksum code in purgatory, which can be
> > exceedingly slow.
>
> This one I don't fully understand, so I'll lazily assume the prerequisite is
> the same WRT how memory is mapped. :)
>
> Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> prerequisite.
>
> Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
> having the linear mappings available.

So it looks like reserve_crashkernel() wants to reserve memory before
the linear map is set up, yet with the information about the DMA zones
already in place. That information only becomes available later, once
we can parse the firmware tables.

I wonder whether, instead of not mapping the crashkernel reservation, we
could do an arch_kexec_protect_crashkres() for the whole reservation
after we have created the linear map.

> Let me stress that knowing the DMA constraints in the system before reserving
> crashkernel's regions is necessary if we ever want it to work seamlessly on all
> platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> memory.

Indeed. So we have 3 options (so far):

1. Allow the crashkernel reservation to go into the linear map but set
   it to invalid once allocated.

2. Parse the flattened DT (not sure what we do with ACPI) before
   creating the linear map. We may have to rely on some SoC ID here
   instead of actual DMA ranges.

3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
   reservations and not rely on arm64_dma_phys_limit in
   reserve_crashkernel().

I think we tried hard to avoid (2). Option (3) brings us back to the
issues we had with large crashkernel reservations regressing on some
platforms (though it's been a while since, they mostly went quiet ;)).
However, with Chen's crashkernel patches we end up with two
reservations, one in the low DMA zone and one higher, potentially above
4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
reservations than what we have now.

If (1) works, I'd go for it (James knows this part better than me),
otherwise we can go for (3).
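
To make the arithmetic behind (3) concrete, here is a rough userspace
model, not kernel code and not tested against the real
reserve_crashkernel(). The helper names crash_max_addr() and
crash_find_base() are made up for illustration, and both the 2MB
alignment and counting the 1GB from the start of DRAM are assumptions
of this sketch:

```c
#include <stdint.h>

#define SZ_2M (UINT64_C(2) << 20)
#define SZ_1G (UINT64_C(1) << 30)

/*
 * Hypothetical helper: fixed upper bound for the crashkernel search.
 * Assumes the smallest ZONE_DMA covers 1GB from the start of DRAM,
 * clamped to the end of DRAM on small systems.
 */
static uint64_t crash_max_addr(uint64_t dram_start, uint64_t dram_end)
{
	uint64_t limit = dram_start + SZ_1G;

	return limit < dram_end ? limit : dram_end;
}

/*
 * Hypothetical helper: pick the highest 2MB-aligned base such that
 * [base, base + size) lies inside [dram_start, limit).
 * Returns 0 and fills *base on success, -1 if nothing fits.
 */
static int crash_find_base(uint64_t dram_start, uint64_t dram_end,
			   uint64_t size, uint64_t *base)
{
	uint64_t limit = crash_max_addr(dram_start, dram_end);
	uint64_t candidate;

	if (!size || size > limit - dram_start)
		return -1;

	/* Search top-down, then align the result down to 2MB. */
	candidate = (limit - size) & ~(SZ_2M - 1);
	if (candidate < dram_start)
		return -1;

	*base = candidate;
	return 0;
}
```

The point is only that the search limit no longer depends on
arm64_dma_phys_limit, so it can be computed before the firmware tables
are parsed. A request larger than 1GB simply fails in this model.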

--
Catalin