Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

From: Eric DeVolder
Date: Mon Sep 26 2022 - 15:20:37 EST


Boris,
I've a few questions for you below. With your responses, I am hopeful we can finish this series soon!
Thanks,
eric

On 9/13/22 14:12, Eric DeVolder wrote:
Boris,
Thanks for the feedback! Inline responses below.
eric

On 9/12/22 01:52, Borislav Petkov wrote:
On Fri, Sep 09, 2022 at 05:05:09PM -0400, Eric DeVolder wrote:
For x86_64, when CPU or memory is hot un/plugged, the crash
elfcorehdr, which describes the CPUs and memory in the system,
must also be updated.

When loading the crash kernel via kexec_load or kexec_file_load,

Please end function names with parentheses. Check the whole patch pls.
Done.


the elfcorehdr is identified at run time in
crash_core:handle_hotplug_event().

To update the elfcorehdr for x86_64, a new elfcorehdr must be
generated from the available CPUs and memory. The new elfcorehdr
is prepared into a buffer, and then installed over the top of
the existing elfcorehdr.

In the patch 'kexec: exclude elfcorehdr from the segment digest'
the need to update purgatory due to the change in elfcorehdr was
eliminated.  As a result, no changes to purgatory or boot_params
(as the elfcorehdr= kernel command line parameter pointer
remains unchanged and correct) are needed, just elfcorehdr.

To accommodate a growing number of resources via hotplug, the
elfcorehdr segment must be large enough to accommodate
changes; see the CRASH_MAX_MEMORY_RANGES configure item.

With this change, crash hotplug for kexec_file_load syscall
is supported.

Redundant sentence.
Removed.


The kexec_load is also supported, but also
requires a corresponding change to userspace kexec-tools.

Ditto.
Removed.


Signed-off-by: Eric DeVolder <eric.devolder@xxxxxxxxxx>
Acked-by: Baoquan He <bhe@xxxxxxxxxx>
---
  arch/x86/Kconfig             |  11 ++++
  arch/x86/include/asm/kexec.h |  20 +++++++
  arch/x86/kernel/crash.c      | 102 +++++++++++++++++++++++++++++++++++
  3 files changed, 133 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..cdfc9b2fdf98 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2056,6 +2056,17 @@ config CRASH_DUMP
        (CONFIG_RELOCATABLE=y).
        For more details see Documentation/admin-guide/kdump/kdump.rst
+config CRASH_MAX_MEMORY_RANGES
+    depends on CRASH_DUMP && KEXEC_FILE && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+    int
+    default 32768
+    help
+      For the kexec_file_load path, specify the maximum number of
+      memory regions, eg. as represented by the 'System RAM' entries
+      in /proc/iomem, that the elfcorehdr buffer/segment can accommodate.
+      This value is combined with NR_CPUS and multiplied by Elf64_Phdr
+      size to determine the final buffer size.

If I'm purely a user, I'm left wondering how to determine what to
specify. Do you have a guidance text somewhere you can point to from
here?

This topic was discussed previously at https://lkml.org/lkml/2022/3/3/372.
David points out that terminology is tricky here due to differing behaviors.
And perhaps that is your point in asking for guidance text. It can be
complicated, but it all comes down to System RAM entries.

I could perhaps offer an overly simplified example: with a 1GiB memory block
size, the CRASH_MAX_MEMORY_RANGES default of 32768 would allow for 32TiB
of memory. Would that be sufficient?
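
To put numbers on that simplified example, here is a rough userspace sketch. It assumes the worst case, where every memory block shows up as its own System RAM entry; that is an assumption for illustration only, since adjacent blocks can merge into a single entry and real systems may need far fewer ranges. The helper name max_covered_bytes() is made up for this sketch.

```c
#include <stdint.h>

#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/*
 * Hypothetical helper: amount of memory describable by 'ranges' entries
 * when each entry covers exactly one block of 'block_bytes'
 * (worst case, no merging of adjacent blocks).
 */
uint64_t max_covered_bytes(uint64_t ranges, uint64_t block_bytes)
{
        return ranges * block_bytes;
}
```

With ranges = 32768 and block_bytes = GIB, the result is 32 * TIB; with a 128MiB block size the same 32768 ranges would only cover 4TiB.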


+
  config KEXEC_JUMP
      bool "kexec jump"
      depends on KEXEC && HIBERNATION
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index a3760ca796aa..432073385b2d 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -212,6 +212,26 @@ typedef void crash_vmclear_fn(void);
  extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
  extern void kdump_nmi_shootdown_cpus(void);
+void *arch_map_crash_pages(unsigned long paddr, unsigned long size);
+#define arch_map_crash_pages arch_map_crash_pages
+
+void arch_unmap_crash_pages(void **ptr);
+#define arch_unmap_crash_pages arch_unmap_crash_pages
+
+void arch_crash_handle_hotplug_event(struct kimage *image,
+        unsigned int hp_action);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+
  #endif /* __ASSEMBLY__ */
  #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 9ceb93c176a6..8fc7d678ac72 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -25,6 +25,7 @@
  #include <linux/slab.h>
  #include <linux/vmalloc.h>
  #include <linux/memblock.h>
+#include <linux/highmem.h>
  #include <asm/processor.h>
  #include <asm/hardirq.h>
@@ -397,7 +398,18 @@ int crash_load_segments(struct kimage *image)
      image->elf_headers = kbuf.buffer;
      image->elf_headers_sz = kbuf.bufsz;
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)
+    /* Ensure elfcorehdr segment large enough for hotplug changes */
+    kbuf.memsz =
+        (CONFIG_NR_CPUS_DEFAULT + CONFIG_CRASH_MAX_MEMORY_RANGES) *
+            sizeof(Elf64_Phdr);


    kbuf.memsz  = CONFIG_NR_CPUS_DEFAULT + CONFIG_CRASH_MAX_MEMORY_RANGES;
    kbuf.memsz *= sizeof(Elf64_Phdr);

looks more readable to me.
Done.
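
For a sense of scale, the sizing above can be sketched in userspace. This is only an illustration: struct phdr64 is a local stand-in that mirrors the Elf64_Phdr layout, elfcorehdr_memsz() is a made-up helper, and the CPU count of 64 used in the note below is an assumed value for NR_CPUS_DEFAULT, not something this patch fixes.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in mirroring the layout of Elf64_Phdr (56 bytes). */
struct phdr64 {
        uint32_t p_type;
        uint32_t p_flags;
        uint64_t p_offset;
        uint64_t p_vaddr;
        uint64_t p_paddr;
        uint64_t p_filesz;
        uint64_t p_memsz;
        uint64_t p_align;
};

/* Hypothetical helper: one program header per possible CPU and memory range. */
size_t elfcorehdr_memsz(size_t nr_cpus, size_t max_mem_ranges)
{
        size_t memsz;

        memsz  = nr_cpus + max_mem_ranges;
        memsz *= sizeof(struct phdr64);

        return memsz;
}
```

Assuming 64 CPUs and the default of 32768 memory ranges, this comes to 1838592 bytes, or roughly 1.75MiB for the elfcorehdr segment.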



+    /* Mark as usable to crash kernel, else crash kernel fails on boot */
+    image->elf_headers_sz = kbuf.memsz;
+    image->elfcorehdr_index = image->nr_segments;
+    image->elfcorehdr_index_valid = true;
+#else
      kbuf.memsz = kbuf.bufsz;

Do that initialization at the top where you declare kbuf and get rid of
the #else branch.
The kbuf.bufsz value is obtained via a call to prepare_elf_headers(); I cannot initialize it at its declaration.


+#endif
      kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
      kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
      ret = kexec_add_buffer(&kbuf);
@@ -412,3 +424,93 @@ int crash_load_segments(struct kimage *image)
      return ret;
  }
  #endif /* CONFIG_KEXEC_FILE */
+
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)

This ugly ifdeffery is still here. Why don't you have stubs for the
!defined() cases in the header so that you can drop those here?


I'm at a loss as to what to do differently here. You've raised this issue before and I went back and looked at the suggestions then, but I don't see how they apply to this situation. How is this situation different from the #ifdef CONFIG_KEXEC_FILE that immediately precedes it?

I've included a copy of the current state of this section below for additional markup.

+/*
+ * NOTE: The addresses and sizes passed to this routine have
+ * already been fully aligned on page boundaries. There is no
+ * need for massaging the address or size.
+ */
+void *arch_map_crash_pages(unsigned long paddr, unsigned long size)
+{
+    void *ptr = NULL;
+
+    if (size > 0) {
+        struct page *page = pfn_to_page(paddr >> PAGE_SHIFT);
+
+        ptr = kmap_local_page(page);
+    }
+
+    return ptr;
+}

    if (size > 0)
        return kmap_local_page(pfn_to_page(paddr >> PAGE_SHIFT));
    else
        return NULL;

That's it.
Done.


+
+void arch_unmap_crash_pages(void **ptr)
+{
+    if (ptr) {
+        if (*ptr)
+            kunmap_local(*ptr);
+        *ptr = NULL;
+    }

Oh wow, this is just nuts. Why does it have to pass in a pointer to
pointer which you have to carefully check twice? And why is it a void
**?
This made sense a long time ago, but it no longer does. I've corrected this.


And why are those called arch_ if all I see is the x86 variants? Are
there gonna be other arches? And even if, why can't the other arches do
kmap_local_page() too?
Currently there is a concurrent effort for PPC support by Sourabh Jain, and in that effort arch_map_crash_pages() is using __va(paddr).

I do not know the nuances between kmap_local_page() and __va() well enough to answer the question.

If kmap_local_page() works for all archs, then I'm happy to drop these arch_ variants
and use it directly.


+}
+
+/**
+ * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
+ * @image: the active struct kimage
+ * @hp_action: the hot un/plug action being handled
+ *
+ * To accurately reflect hot un/plug changes, the new elfcorehdr
+ * is prepared in a kernel buffer, and then it is written on top
+ * of the existing/old elfcorehdr.
+ */
+void arch_crash_handle_hotplug_event(struct kimage *image,
+    unsigned int hp_action)

Align arguments on the opening brace.
Done.


+{
+    struct kexec_segment *ksegment;
+    unsigned char *ptr = NULL;
+    unsigned long elfsz = 0;
+    void *elfbuf = NULL;
+    unsigned long mem, memsz;

Please sort function local variables declaration in a reverse christmas
tree order:

    <type A> longest_variable_name;
    <type B> shorter_var_name;
    <type C> even_shorter;
    <type D> i;

Done.

+
+    /*
+     * Elfcorehdr_index_valid checked in crash_core:handle_hotplug_event()

Elfcorehdr_index_valid??
Comment reworked.



+     */
+    ksegment = &image->segment[image->elfcorehdr_index];
+    mem = ksegment->mem;
+    memsz = ksegment->memsz;
+
+    /*
+     * Create the new elfcorehdr reflecting the changes to CPU and/or
+     * memory resources.
+     */
+    if (prepare_elf_headers(image, &elfbuf, &elfsz)) {
+        pr_err("crash hp: unable to prepare elfcore headers");
            ^^^^^^^^

this thing is done with pr_fmt(). Grep the tree for examples.
Done, thanks for pointing that out.
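
For anyone following along, the pr_fmt() convention can be mimicked in userspace as below. This is a sketch, not kernel code: the pr_err() here is a stand-in built on fprintf(), and pr_err_buf() and demo_size_error() are hypothetical helpers added only so the resulting text can be inspected.

```c
#include <stdio.h>

/* Define the prefix once, before any message is printed ... */
#define pr_fmt(fmt) "crash hp: " fmt

/* ... so every pr_err() call picks it up without repeating it. */
#define pr_err(fmt, ...) fprintf(stderr, pr_fmt(fmt), ##__VA_ARGS__)

/* Hypothetical variant: same prefixing, written into a buffer. */
#define pr_err_buf(buf, len, fmt, ...) \
        snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)

/* Formats a size-check message the way the prefixed pr_err() would. */
const char *demo_size_error(unsigned long elfsz, unsigned long memsz)
{
        static char msg[128];

        pr_err_buf(msg, sizeof(msg),
                   "update elfcorehdr elfsz %lu > memsz %lu\n",
                   elfsz, memsz);
        return msg;
}
```

The call sites then carry only the message text, and changing the prefix is a one-line edit.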


+        goto out;
+    }

The three lines above reading ksegment need to be here, where the test
is done.
Done.


+    if (elfsz > memsz) {
+        pr_err("crash hp: update elfcorehdr elfsz %lu > memsz %lu",
+            elfsz, memsz);
+        goto out;
+    }
+
+    /*
+     * At this point, we are all but assured of success.

Who is "we"?

Comment reworked.


Here is a copy of the current state of this code, for determining how to address the question above.

#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_MEMORY_HOTPLUG)

#undef pr_fmt
#define pr_fmt(fmt) "crash hp: " fmt

/*
 * NOTE: The addresses and sizes passed to this routine have
 * already been fully aligned on page boundaries. There is no
 * need for massaging the address or size.
 */
void *arch_map_crash_pages(unsigned long paddr, unsigned long size)
{
        if (size > 0)
                return kmap_local_page(pfn_to_page(paddr >> PAGE_SHIFT));
        else
                return NULL;
}

void arch_unmap_crash_pages(void *ptr)
{
        if (ptr)
                kunmap_local(ptr);
}

/**
 * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
 * @image: the active struct kimage
 * @hp_action: the hot un/plug action being handled
 *
 * To accurately reflect hot un/plug changes, the new elfcorehdr
 * is prepared in a kernel buffer, and then it is written on top
 * of the existing/old elfcorehdr.
 */
void arch_crash_handle_hotplug_event(struct kimage *image,
                                    unsigned int hp_action)
{
        unsigned long mem, memsz;
        unsigned long elfsz = 0;
        void *elfbuf = NULL;
        void *ptr;

        /*
         * Create the new elfcorehdr reflecting the changes to CPU and/or
         * memory resources.
         */
        if (prepare_elf_headers(image, &elfbuf, &elfsz)) {
                pr_err("unable to prepare elfcore headers\n");
                goto out;
        }

        /*
         * Obtain address and size of the elfcorehdr segment, and
         * check it against the new elfcorehdr buffer.
         */
        mem = image->segment[image->elfcorehdr_index].mem;
        memsz = image->segment[image->elfcorehdr_index].memsz;
        if (elfsz > memsz) {
                pr_err("update elfcorehdr elfsz %lu > memsz %lu\n",
                        elfsz, memsz);
                goto out;
        }

        /*
         * Copy new elfcorehdr over the old elfcorehdr at destination.
         */
        ptr = arch_map_crash_pages(mem, memsz);
        if (ptr) {
                /*
                 * Temporarily invalidate the crash image while the
                 * elfcorehdr is updated.
                 */
                xchg(&kexec_crash_image, NULL);
                memcpy_flushcache(ptr, elfbuf, elfsz);
                xchg(&kexec_crash_image, image);
        }
        arch_unmap_crash_pages(ptr);
        pr_debug("re-loaded elfcorehdr at 0x%lx\n", mem);

out:
        /* vfree() accepts NULL, so no check is needed */
        vfree(elfbuf);
}
#endif
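
To make the intent of the update sequence above easier to discuss, here is a rough userspace model of it. It is a sketch under stated assumptions: struct image and crash_image are hypothetical stand-ins for struct kimage and kexec_crash_image, C11 atomic_exchange() stands in for xchg(), and plain memcpy() stands in for memcpy_flushcache(). Page mapping is left out entirely.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the elfcorehdr segment of struct kimage. */
struct image {
        unsigned char *seg;     /* destination segment memory */
        size_t memsz;           /* size reserved for the segment */
};

/* Stand-in for kexec_crash_image; NULL means "no usable crash image". */
_Atomic(struct image *) crash_image;

/* Returns 0 on success, -1 if the new header does not fit the segment. */
int update_elfcorehdr(struct image *img, const void *elfbuf, size_t elfsz)
{
        if (elfsz > img->memsz)
                return -1;

        /* Hide the image while its header is being rewritten. */
        atomic_exchange(&crash_image, (struct image *)NULL);
        memcpy(img->seg, elfbuf, elfsz);
        /* Publish the image again once the copy is complete. */
        atomic_exchange(&crash_image, img);

        return 0;
}
```

The size check comes first so that an oversized header leaves both the segment and the published image untouched.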