Re: [PATCH] x86/mm/ident_map: Use full gbpages in identity maps except on UV platform.

From: Dave Hansen
Date: Sun Mar 24 2024 - 14:16:58 EST


On 3/23/24 21:45, Eric W. Biederman wrote:
> Dave Hansen <dave.hansen@xxxxxxxxx> writes:
>> On 3/22/24 09:21, Steve Wahl wrote:
>>> Some systems have ACPI tables that don't include everything that needs
>>> to be mapped for a successful kexec. These systems rely on identity
>>> maps that include the full gigabyte surrounding any smaller region
>>> requested for kexec success. Without this, they fail to kexec and end
>>> up doing a full firmware reboot.
>>
>> I'm still missing something here. Which ACPI tables are we talking
>> about? What don't they map? I normally don't think of ACPI _tables_ as
>> "mapping" things.
>
> Either E820 or ACPI lists which areas of memory are present in a
> machine. Those tables are used to build the identity memory mappings.
>
> Those identity mapped page tables not built with GB pages cause kexec to
> fail for at least 3 people. Presumably because something using those
> page tables accesses memory that is not mapped.

But why is it not mapped? Are the firmware-provided memory maps
inaccurate? Or did the kernel read those maps and then forget to map
something?

Using GB pages could paper over either class of bug.
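
To make the "paper over" effect concrete, here is a small user-space
sketch (not kernel code; the helper names are made up for illustration)
of how much extra address space gets mapped when a small request is
covered with whole 1GB pages:

```c
#include <assert.h>
#include <stdint.h>

#define GB_SIZE (1ULL << 30)	/* 1 GiB, the span of one PUD-level page */

/* Round an address down to the enclosing 1 GiB boundary. */
static uint64_t gb_round_down(uint64_t addr)
{
	return addr & ~(GB_SIZE - 1);
}

/* Round an address up to the next 1 GiB boundary (no-op if aligned). */
static uint64_t gb_round_up(uint64_t addr)
{
	return (addr + GB_SIZE - 1) & ~(GB_SIZE - 1);
}

/*
 * Bytes mapped beyond what was requested when [start, end) is covered
 * with whole 1 GiB pages. Hypothetical helper, for illustration only.
 */
static uint64_t gb_extra_mapped(uint64_t start, uint64_t end)
{
	assert(start < end);
	return (gb_round_up(end) - gb_round_down(start)) - (end - start);
}
```

A 2MB request at 0x7f000000 drags in almost a full gigabyte of
neighboring address space. Anything that was missing from the memory
map but happens to live in that gigabyte gets mapped by accident.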

>> It seems like there's a theory that some ACPI table isn't mapped, but
>> looking through the discussion so far I don't see a smoking gun. Let's
>> say the kernel has a bug and the kernel was actively not mapping
>> something that it should have mapped. The oversized 1GB mappings made
>> the bug harder to hit. If that's the case, we'll just be adding a hack
>> which papers over the bug instead of fixing it properly.
>>
>> I'm kind of leaning to say that we should just revert d794734c9bbf and
>> have the UV folks go back to the nogbpages until we get this properly
>> sorted.
>
> That is exactly what this patch does. It reverts the change except
> on UV systems.

Maybe it's splitting hairs, but I see a difference between reverting the
_commit_ and adding new code that tries to revert the commit's behavior.

I think reverting the commit is more conservative and that's what I was
referring to.

>>> @@ -10,6 +10,7 @@ struct x86_mapping_info {
>>>  	unsigned long page_flag;	/* page flag for PMD or PUD entry */
>>>  	unsigned long offset;		/* ident mapping offset */
>>>  	bool direct_gbpages;		/* PUD level 1GB page support */
>>> +	bool direct_gbpages_always;	/* use 1GB pages exclusively */
>>>  	unsigned long kernpg_flag;	/* kernel pagetable flag override */
>>>  };
>>
>> But let's at least talk about this patch in case we decide to go forward
>> with it. We've really got two things:
>>
>> 1. Can the system use gbpages in the first place?
>> 2. Do the gbpages need to be exact (UV) or sloppy (everything else)?
>>
>> I wouldn't refer to this at all as "always" use gbpages. It's really a
>> be-sloppy-and-paper-over-bugs mode. They might be kernel bugs or
>> firmware bugs, but they're bugs _somewhere_ right?
>
> Is it?
>
> As far as I can tell the UV mode is the be-exact-and-avoid-CPU-bugs mode.

The fact is that there are parts of the physical address space that have
read side effects. If you want to have them mapped, you need to use a
mapping type where speculative accesses won't occur (like UC).

I don't really think these are CPU bugs. They're just a fact of life.

> My sense is that using GB pages for everything (when we want an identity
> mapping) should be much cheaper TLB wise, so we probably want to use GB
> pages for everything if we can.

Sure. But the "if we can" situation is where the physical address space
is uniform underneath that GB page.

It's not at all uncommon to have those goofy, undesirable read
side-effects. We've had several issues around them over the years. You
really can't just map random physical memory and hope for the best.

That means that you are limited to mapping memory that you *know* is
uniform, like "all RAM" or "all PMEM".
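
As a rough user-space sketch of that rule (not the kernel's
implementation; the region struct, type enum and function name are all
hypothetical), a 1GB page is only acceptable when the whole aligned
window is covered by entries of one known type, with no holes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GB_SIZE (1ULL << 30)

/* Hypothetical, simplified memory-map entry; not the kernel's e820 type. */
enum region_type { REGION_RAM, REGION_PMEM, REGION_RESERVED };

struct region {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
	enum region_type type;
};

/*
 * Return true if the 1 GiB-aligned window containing addr is completely
 * covered by entries of the given type. Assumes the map is sorted by
 * start address, entries do not overlap, and the window does not wrap
 * past the end of the 64-bit address space.
 */
static bool gb_window_is_uniform(const struct region *map, size_t n,
				 uint64_t addr, enum region_type type)
{
	uint64_t pos = addr & ~(GB_SIZE - 1);	/* first byte not yet covered */
	uint64_t win_end = pos + GB_SIZE;

	for (size_t i = 0; i < n && pos < win_end; i++) {
		if (map[i].end <= pos)
			continue;	/* entry lies below the uncovered part */
		if (map[i].start > pos)
			return false;	/* hole in the map inside the window */
		if (map[i].type != type)
			return false;	/* mixed types under one 1 GiB page */
		pos = map[i].end;
	}
	return pos >= win_end;
}

/* Made-up example map: 2 GiB RAM, 1 GiB reserved, 512 MiB RAM. */
static const struct region example_map[] = {
	{ 0x00000000ULL, 0x80000000ULL, REGION_RAM },
	{ 0x80000000ULL, 0xC0000000ULL, REGION_RESERVED },
	{ 0xC0000000ULL, 0xE0000000ULL, REGION_RAM },
};
```

With that example map, the gigabyte at 0x40000000 qualifies, the one at
0x80000000 does not (wrong type), and the one at 0xC0000000 does not
either, because the map says nothing about its upper half.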