[PATCH v25 09/10] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()

From: Eric DeVolder
Date: Thu Jun 29 2023 - 15:24:11 EST


The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
possible CPUs; that is, crash_notes are not allocated dynamically
when CPUs are plugged/unplugged. Thus the crash_notes for each
possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
Changing to for_each_possible_cpu() is valid as the crash_notes
pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
elfcorehdr /proc/vmcore vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
its state in its crash_notes. When all CPUs are shutdown, the
crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
PT_NOTEs to craft a nr_cpus variable, which is reported in a
header but otherwise generally unused. Makedumpfile creates the
vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
symbols and directly examines those structure contents from vmcore
memory. From that information it is able to determine which CPUs
are present and online, and locate the corresponding crash_notes.
Said differently, it appears that 'crash' analyzer does not rely
on the ELF PT_NOTEs for CPUs; rather it obtains the information
directly via kernel symbols and the memory within the vmcore.

(There maybe other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seems to be the most
common solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with kexec_file_load() syscall.

Suggested-by: Sourabh Jain <sourabhjain@xxxxxxxxxxxxx>
Signed-off-by: Eric DeVolder <eric.devolder@xxxxxxxxxx>
Reviewed-by: Sourabh Jain <sourabhjain@xxxxxxxxxxxxx>
Acked-by: Hari Bathini <hbathini@xxxxxxxxxxxxx>
Acked-by: Baoquan He <bhe@xxxxxxxxxx>
---
kernel/crash_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
ehdr->e_ehsize = sizeof(Elf64_Ehdr);
ehdr->e_phentsize = sizeof(Elf64_Phdr);

- /* Prepare one phdr of type PT_NOTE for each present CPU */
- for_each_present_cpu(cpu) {
+ /* Prepare one phdr of type PT_NOTE for each possible CPU */
+ for_each_possible_cpu(cpu) {
phdr->p_type = PT_NOTE;
notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
phdr->p_offset = phdr->p_paddr = notes_addr;
--
2.31.1