Re: [RFC PATCH 00/22] riscv: s64ilp32: Running 32-bit Linux kernel on 64-bit supervisor mode

From: Arnd Bergmann
Date: Sat May 20 2023 - 06:14:08 EST


On Sat, May 20, 2023, at 04:53, Guo Ren wrote:
> On Sat, May 20, 2023 at 4:20 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>> On Thu, May 18, 2023, at 15:09, guoren@xxxxxxxxxx wrote:
>>
>> I've tried to run the same numbers for the debate about running
>> 32-bit vs 64-bit arm kernels in the past, but focused mostly on
>> slightly larger systems. I looked mainly at the 512MB case,
>> as that is the most cost-efficient DDR3 memory configuration
>> and fairly common.
> 512MB is extravagant, in my opinion. In the IPC market, 32/64MB is for
> 480P/720P/1080p, 128/256MB is for 1080p/2k, and 512/1024MB is for 4K.
> 512MB chips are less than 5% of the total (I guess). Even in 512MB
> chips, the additional memory is for the frame buffer, not the Linux
> system.

This depends a lot on the target application of course. For
a phone or NAS box, 512MB is probably the lower limit.

What I observe in arch/arm/ devicetree submissions, in board-db.org,
and when looking at industrial Arm board vendor websites is that
512MB is the most common configuration, and I think 1GB is still
more common than 256MB even for 32-bit machines. There is of course
a difference between number of individual products, and number of
machines shipped in a given configuration, and I guess you have
a good point that the cheapest ones are also the ones that ship
in the highest volume.

>> What I'd like to understand better in your example is where
>> the 14MB of memory went. I assume this is for 128MB of total
>> RAM, so we know that 1MB went into additional 'struct page'
>> objects (32 bytes * 32768 pages). It would be good to know
>> where the dynamic allocations went and if they are reclaimable
>> (e.g. inodes) or non-reclaimable (e.g. kmalloc-128).
>>
>> For the vmlinux size, is this already a minimal config
>> that one would run on a board with 128MB of RAM, or a
>> defconfig that includes a lot of stuff that is only relevant
>> for other platforms but also grows on 64-bit?
> It's not a minimal config, it's defconfig. So I say it's a rough
> measurement :)
>
> I admit I wanted to exaggerate it a little bit, but that's the
> starting point for cutting down memory usage for most people, right?
> During the past year, we have been convincing our customers to use
> s64lp64 + u32ilp32, but they can't tolerate even a 1% additional memory
> cost in 64MB/128MB scenarios and then chose cortex-a7/a35, which could
> run 32-bit Linux. I think it's too early to talk about throwing 32-bit
> Linux into the garbage, not only because of the memory footprint
> but also because of people's ingrained opinions. Changing their minds
> takes a long time.
>
>>
>> What do you see in /proc/slabinfo, /proc/meminfo, and
>> 'size vmlinux' for the s64ilp32 and s64lp64 kernels here?
> Both s64ilp32 & s64lp64 use the same u32ilp32_rootfs.ext2 binary and
> the same opensbi binary.
> All are opensbi(2MB) + Linux(126MB) memory layout.
>
> Here is the result:
>
> s64ilp32:
> [ 0.000000] Virtual kernel memory layout:
> [ 0.000000] fixmap : 0x9ce00000 - 0x9d000000 (2048 kB)
> [ 0.000000] pci io : 0x9d000000 - 0x9e000000 ( 16 MB)
> [ 0.000000] vmemmap : 0x9e000000 - 0xa0000000 ( 32 MB)
> [ 0.000000] vmalloc : 0xa0000000 - 0xc0000000 ( 512 MB)
> [ 0.000000] lowmem : 0xc0000000 - 0xc7e00000 ( 126 MB)
> [ 0.000000] Memory: 97748K/129024K available (8699K kernel code,
> 8867K rwdata, 4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K
> cma-reserved)

Ok, so it saves only a little bit on .text/.init/.bss/.rodata, but
there is a 4MB difference in rwdata, and a total of 10.4MB difference
in "reserved" size, which I think includes all of the above plus
the mem_map[] array.

89380K/131072K available (8638K kernel code, 4979K rwdata, 4096K rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)
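
To make the comparison concrete, the two "Memory:" lines can be diffed field by field. This is only a minimal Python sketch: it assumes the quoted line is the s64ilp32 boot and the unquoted line is the s64lp64 boot, and the helper name parse_memory_line is made up for illustration.

```python
import re

# Figures copied from the two boot logs in this thread (values in KiB).
ILP32 = ("97748K/129024K available (8699K kernel code, 8867K rwdata, "
         "4096K rodata, 4204K init, 361K bss, 31276K reserved, 0K cma-reserved)")
LP64 = ("89380K/131072K available (8638K kernel code, 4979K rwdata, "
        "4096K rodata, 2191K init, 477K bss, 41692K reserved, 0K cma-reserved)")


def parse_memory_line(line):
    """Hypothetical helper: map each label of a "Memory:" line to KiB."""
    head, _, detail = line.partition("(")
    avail, total = re.match(r"(\d+)K/(\d+)K available", head).groups()
    fields = {"available": int(avail), "total": int(total)}
    # Labels such as "kernel code" or "cma-reserved" follow each "<n>K ".
    for size, label in re.findall(r"(\d+)K ([a-z][a-z -]*[a-z])", detail):
        fields[label] = int(size)
    return fields


ilp32 = parse_memory_line(ILP32)
lp64 = parse_memory_line(LP64)
diff = {key: lp64[key] - ilp32[key] for key in ilp32}

for key, delta in diff.items():
    print(f"{key:14s} ilp32={ilp32[key]:7d}K lp64={lp64[key]:7d}K delta={delta:+7d}K")

# The 'struct page' estimate quoted earlier: 32 extra bytes * 32768 pages.
extra_struct_page_kib = 32 * 32768 // 1024
print("extra struct page:", extra_struct_page_kib, "KiB")
```

Note that the two boots also report a different total (129024K vs 131072K), so the "available" delta is not a pure like-for-like number.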

Oddly, I don't see anywhere close to 8MB in a riscv64 defconfig
build (linux-next, gcc-13), so I don't know where that comes
from:

$ size -A build/tmp/vmlinux | sort -k2 -nr | head
Total              13518684
.text               8896058   18446744071562076160
.rodata             2219008   18446744071576748032
.data                933760   18446744071583039488
.bss                 476080   18446744071584092160
.init.text           264718   18446744071572553728
__ksymtab_strings    183986   18446744071579214312
__ksymtab_gpl        122928   18446744071579091384
__ksymtab            109080   18446744071578982304
__bug_table           98352   18446744071583973248
__bug_table 98352 18446744071583973248



> KReclaimable: 644 kB
> Slab: 4536 kB
> SReclaimable: 644 kB
> SUnreclaim: 3892 kB
> KernelStack: 344 kB

These look like the only notable differences in meminfo:

KReclaimable: 1092 kB
Slab: 6900 kB
SReclaimable: 1092 kB
SUnreclaim: 5808 kB
KernelStack: 688 kB

The largest chunk here is 2MB in non-reclaimable slab allocations,
or a 50% growth of those.

The kernel stacks are doubled as expected, but that's only 344KB
more, and the reclaimable slabs grow by a similar amount (448KB).
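
The deltas behind these statements are easy to recompute. A minimal Python sketch, assuming the quoted meminfo values are from the s64ilp32 boot and the unquoted ones from the s64lp64 boot (the function name growth is illustrative only):

```python
# Values in kB, copied from the two /proc/meminfo excerpts in this thread.
ilp32 = {"KReclaimable": 644, "Slab": 4536, "SReclaimable": 644,
         "SUnreclaim": 3892, "KernelStack": 344}
lp64 = {"KReclaimable": 1092, "Slab": 6900, "SReclaimable": 1092,
        "SUnreclaim": 5808, "KernelStack": 688}


def growth(before, after):
    """Return {field: (absolute growth in kB, growth in percent)}."""
    result = {}
    for key, old in before.items():
        delta = after[key] - old
        result[key] = (delta, round(100 * delta / old))
    return result


deltas = growth(ilp32, lp64)
for key, (kb, pct) in deltas.items():
    print(f"{key:13s} +{kb:5d} kB (+{pct}%)")
```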

> # cat /proc/slabinfo
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
> slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k 28 28 144 28 1 : tunables 0 0
> 0 : slabdata 1 1 0
> p9_req_t 0 0 104 39 1 : tunables 0 0

Did you perhaps miss a few lines while pasting these? It seems
odd that some caches only show up in the ilp32 case (proc_dir_entry,
jbd2_journal_handle, buffer_head, biovec-max, anon_vma_chain, ...) and
some others are only in the lp64 case (UNIX, ext4_prealloc_space,
files_cache, filp, ip_fib_alias, task_struct, uid_cache, ...).

Looking at the ones that are in both and have the largest size
increase, I see

# lp64
# size(KB)  cache               num_objs  objsize
      1788  kernfs_node_cache      14304      128
       590  shmem_inode_cache        646      936
       272  inode_cache              360      776
       153  ext4_inode_cache         105     1496
       250  dentry                  1188      216
       192  names_cache               48     4096
       199  radix_tree_node          350      584
       307  kmalloc-64              4912       64
        60  kmalloc-128              480      128
        47  kmalloc-192              252      192
       204  kmalloc-256              816      256
        72  kmalloc-512              144      512
       840  kmalloc-1k               840     1024

# ilp32
# size(KB)  cache               num_objs  objsize
      1197  kernfs_node_cache      13938       88
       373  shmem_inode_cache        637      600
       174  inode_cache              360      496
        84  ext4_inode_cache          88      984
       177  dentry                  1196      152
        32  names_cache                8     4096
       100  radix_tree_node          338      304
       331  kmalloc-64              5302       64
       132  kmalloc-128             1056      128
        23  kmalloc-192              126      192
        16  kmalloc-256               64      256
       428  kmalloc-512              856      512
        88  kmalloc-1k                88     1024

So sysfs (kernfs_node_cache) has the largest chunk of the
2MB non-reclaimable slab, grown 50% from 1.2MB to 1.8MB.
In some cases, this could be avoided entirely by turning
off sysfs, but most users can't do that.
shmem_inode_cache is probably mostly devtmpfs; the
other inode caches are smaller and likely reclaimable.
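
The per-cache sizes and their growth can be recomputed from the object counts. This is a rough Python sketch that takes size as num_objs * objsize and ignores per-slab overhead; the function names are illustrative, not anything from the kernel tree.

```python
# (num_objs, objsize in bytes) per cache, copied from the two tables above.
LP64 = {
    "kernfs_node_cache": (14304, 128), "shmem_inode_cache": (646, 936),
    "inode_cache": (360, 776), "ext4_inode_cache": (105, 1496),
    "dentry": (1188, 216), "names_cache": (48, 4096),
    "radix_tree_node": (350, 584), "kmalloc-64": (4912, 64),
    "kmalloc-128": (480, 128), "kmalloc-192": (252, 192),
    "kmalloc-256": (816, 256), "kmalloc-512": (144, 512),
    "kmalloc-1k": (840, 1024),
}
ILP32 = {
    "kernfs_node_cache": (13938, 88), "shmem_inode_cache": (637, 600),
    "inode_cache": (360, 496), "ext4_inode_cache": (88, 984),
    "dentry": (1196, 152), "names_cache": (8, 4096),
    "radix_tree_node": (338, 304), "kmalloc-64": (5302, 64),
    "kmalloc-128": (1056, 128), "kmalloc-192": (126, 192),
    "kmalloc-256": (64, 256), "kmalloc-512": (856, 512),
    "kmalloc-1k": (88, 1024),
}


def cache_bytes(table):
    """Approximate bytes held by each cache: num_objs * objsize."""
    return {name: objs * size for name, (objs, size) in table.items()}


def growth_ranking(small, large):
    """Caches sorted by how many more bytes they use in `large`."""
    a, b = cache_bytes(small), cache_bytes(large)
    return sorted(((name, b[name] - a[name]) for name in a),
                  key=lambda item: item[1], reverse=True)


ranking = growth_ranking(ILP32, LP64)
for name, delta in ranking:
    print(f"{name:20s} {delta / 1024:+8.1f} KiB")
```

With the figures from this thread, kmalloc-1k and kernfs_node_cache come out on top, while kmalloc-512 shrinks by a large amount on lp64.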

It's interesting how the largest kmalloc cache ends up
being the kmalloc-1k cache (840 1K objects) on lp64,
but the kmalloc-512 cache (856 512B objects) on ilp32.
My guess is that the majority of this is from a single
callsite that has an allocation growing just beyond 512B.
This alone seems significant enough to need further
investigation; I would hope we can completely avoid
these by adding a custom slab cache. I don't see this
effect on an arm64 boot though: for me the 512B allocations
are much higher than the 1K ones.
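
To illustrate why an object growing just past 512B is so costly, here is a small Python sketch of kmalloc-style size-class rounding. The list of size classes is an assumption (the exact set depends on architecture and configuration), and the 500-byte and 520-byte object sizes are hypothetical, not measured.

```python
import bisect

# Assumed kmalloc size classes in bytes; the real set varies by kernel config.
KMALLOC_CLASSES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]


def kmalloc_class(request):
    """Return the smallest assumed size class that fits `request` bytes."""
    index = bisect.bisect_left(KMALLOC_CLASSES, request)
    if index == len(KMALLOC_CLASSES):
        raise ValueError("request larger than the biggest size class")
    return KMALLOC_CLASSES[index]


def wasted_bytes(request, count):
    """Padding lost when `count` objects of `request` bytes are allocated."""
    return (kmalloc_class(request) - request) * count


# Hypothetical structure: 500 bytes with 32-bit longs/pointers, 520 with 64-bit.
for size in (500, 520):
    print(size, "->", kmalloc_class(size), "wasted for 840 objects:",
          wasted_bytes(size, 840), "bytes")
```

A dedicated kmem_cache sized for the object would avoid most of that padding, which is the motivation for the custom slab cache idea above.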

Maybe you can identify the culprit using the boot-time traces
as listed in https://elinux.org/Kernel_dynamic_memory_analysis#Dynamic
That might help everyone running a 64-bit kernel on
low-memory configurations, though it would of course slightly
weaken your argument for an ilp32 kernel ;-)
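
Once a kmalloc trace has been captured, grouping the requests that land in the 1K cache by call site should point at the culprit. The following Python sketch assumes trace lines in the kmem:kmalloc event format with call_site=, bytes_req= and bytes_alloc= fields (the exact field set differs between kernel versions); the sample lines and the symbol names in them are invented for illustration.

```python
import re
from collections import Counter

FIELD = re.compile(r"(\w+)=(\S+)")


def wasteful_callsites(lines, lower=512, bucket=1024):
    """Count kmalloc events that request more than `lower` bytes but
    are served from the `bucket`-byte cache, grouped by call site."""
    hits = Counter()
    for line in lines:
        if "kmalloc:" not in line:
            continue  # skip other events (kfree, kmem_cache_alloc, ...)
        fields = dict(FIELD.findall(line))
        try:
            req = int(fields["bytes_req"])
            alloc = int(fields["bytes_alloc"])
        except (KeyError, ValueError):
            continue  # line does not match the assumed format
        if alloc == bucket and lower < req < bucket:
            hits[fields.get("call_site", "?")] += 1
    return hits


# Invented sample lines, only to show the assumed format.
SAMPLE = [
    "swapper/0-1 [000] 0.51: kmalloc: call_site=foo_alloc+0x20/0x80 "
    "ptr=00000000a1 bytes_req=520 bytes_alloc=1024 gfp_flags=GFP_KERNEL",
    "swapper/0-1 [000] 0.52: kmalloc: call_site=foo_alloc+0x20/0x80 "
    "ptr=00000000a2 bytes_req=520 bytes_alloc=1024 gfp_flags=GFP_KERNEL",
    "swapper/0-1 [000] 0.53: kmalloc: call_site=bar_init+0x10/0x40 "
    "ptr=00000000a3 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_KERNEL",
    "swapper/0-1 [000] 0.54: kfree: call_site=bar_exit+0x8/0x20 ptr=00000000a3",
]

top = wasteful_callsites(SAMPLE).most_common(3)
for site, count in top:
    print(count, site)
```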

Arnd