Re: [PATCH] riscv: head: use 0 as the default text_offset

From: Palmer Dabbelt
Date: Tue Nov 29 2022 - 13:33:47 EST


On Mon, 28 Nov 2022 22:55:24 PST (-0800), samuel@xxxxxxxxxxxx wrote:
On 11/29/22 00:19, Palmer Dabbelt wrote:
On Mon, 28 Nov 2022 21:04:48 PST (-0800), samuel@xxxxxxxxxxxx wrote:
On 11/28/22 14:11, Atish Kumar Patra wrote:
On Mon, Nov 28, 2022 at 7:34 AM Jisheng Zhang <jszhang@xxxxxxxxxx>
wrote:

Commit 0f327f2aaad6 ("RISC-V: Add an Image header that boot loader can
parse.") adds an image header which "is based on ARM64 boot image
header and provides an opportunity to combine both ARM64 & RISC-V
image headers in future.". At that time, arm64's default text_offset
is 0x80000, this is to give "512 KB of guaranteed BSS space to put
the swapper page tables" as commit cfa7ede20f13 ("arm64: set
TEXT_OFFSET
to 0x0 in preparation for removing it entirely") pointed out, but
riscv doesn't need the space, so use 0 as the default text_offset.

Before this patch, booting linux kernel on Sipeed bl808 M1s Dock
with u-boot booti cmd:
[    0.000000] OF: fdt: Ignoring memory range 0x50000000 - 0x50200000
...
[    0.000000]   DMA32    [mem 0x0000000050200000-0x0000000053ffffff]
As can be seen, 2MB DDR(0x50000000 - 0x501fffff) can't be used by
linux.

After this patch, the 64MB DDR is fully usable by linux
[    0.000000]   DMA32    [mem 0x0000000050000000-0x0000000053ffffff]

Signed-off-by: Jisheng Zhang <jszhang@xxxxxxxxxx>
---
 arch/riscv/kernel/head.S | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index b865046e4dbb..ef95943f7a70 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -38,18 +38,8 @@ ENTRY(_start)
        .word 0
 #endif
        .balign 8
-#ifdef CONFIG_RISCV_M_MODE
-       /* Image load offset (0MB) from start of RAM for M-mode */
+       /* Image load offset (0MB) from start of RAM */
        .dword 0
-#else
-#if __riscv_xlen == 64
-       /* Image load offset(2MB) from start of RAM */
-       .dword 0x200000
-#else
-       /* Image load offset(4MB) from start of RAM */
-       .dword 0x400000
-#endif
-#endif

NACK.
RV64 needs to boot at a 2MB aligned address and RV32 needs to boot at
a 4MB aligned address.
The firmware is assumed to live at the start of DRAM for Linux running
in S-mode.

What needs to happen so we can stop making this assumption? If the SBI
implementation wants to reserve memory, it should use the devicetree to
do so. OpenSBI already does this.

IMO we've really screwed up the boot flow on RISC-V.  Having Linux
reserve space for the firmware is just all backwards, Linux can't know
how much memory the firmware needs (which manifests under large hart
counts in OpenSBI, for example).  Unfortunately there's no specification
that defines these platform-level details, so we're stuck depending on 
unspecified behavior like this.

I think we could fix this by either making Linux's early boot relocation
code work sanely (fix whatever bugs are there, document what can't be
fixed, and then add some sort of Image flag to tell firmware the kernel
can be relocated) or relying on relocatable firmware, but both of those
come with some costs ...

It sounds like Alexandre's patch[1] lets us use memory below this
offset, so we don't have to relocate the kernel to the beginning of RAM.
In fact, we could even increase the offset if we are concerned about the
kernel link address conflicting with the SBI implementation.

[1]:
https://patchwork.kernel.org/project/linux-riscv/patch/20221122084141.1849421-1-alexghiti@xxxxxxxxxxxx/

Throwing away 2 MiB of RAM is quite wasteful considering we have
multiple SoCs (D1s, BL808) that are limited to 64 MiB of in-package RAM.

... and I'd argue that users on systems don't want to pay those costs. 

What does fixing the early relocation code cost? Just longer boot time?
If the bootloader takes care of avoiding reserved-memory regions, and
Linux can run from wherever it gets loaded, that would be ideal to me.

There's a runtime cost to either the relocatable kernel or firmware, but if those VA tricks are enough to make it work then it's pretty much free in terms of performance -- I think we'd even be able to avoid copying the kernel image around if the bootloader aligns it properly?

In fact, I'd argue that systems like that don't want resident firmware
at all.

I would much rather pay 256 KiB for resident firmware than reimplement
all of the power management and PMU logic in Linux. It's not as bad as
losing 2 MiB when I know most of that is unused.

OK, sounds like just having the firmware accurately report the memory it needs to Linux and then reclaiming whatever's left in that 2MiB region is sufficient here? That's way easier.

So let's just add a CONFIG_SBI=n, and then just use direct drivers for
everything.  If the firmware doesn't need to be resident then it's
pretty straight-forward to support these 0 offsets, so we can just add
that as another Kconfig.  Sure this will trip up firmware that depends
on these fixed reservations, but saying "the resident firmware fits in 0
superpages" is just as much of a platform-specific dependency as saying
"the resident firmware fits in 1 superpage".  If firmware can't handle
this field in the Image format then we're going to end up with breakages
at some point, it might as well be now.

If these systems don't have all the ISA bits necessary to avoid M-mode
entirely then we can just implement a tiny M-mode stub in Linux that
gets left around during early boot and then shims stuff to S-mode. 
That'll be a bit of a headache and with some extensions it can be
avoided, the standard stuff won't allow for that until the latest round
of specs is done but if it's possible via whatever custom extensions are
in these things then that's probably the way to go.

I don't think Linux has a choice here, when started in S-mode. And
neither does the bootloader parsing the Image, because it most likely
runs in S-mode as well.

I think there's some options here, but I guess we can leave that up to someone who wants the feature ;)

And when started in M-mode, we already don't use SBI.

Regards,
Samuel