Re: [PATCH] mm/sparsemem: fix race in accessing memory_section->usage

From: David Hildenbrand
Date: Wed Oct 18 2023 - 03:52:58 EST


On 16.10.23 15:38, Charan Teja Kalla wrote:
Thanks Andrew/David,

On 10/16/2023 1:53 PM, David Hildenbrand wrote:
The below race is observed on a PFN which falls into the device memory
region with the system memory configuration where PFN's are such that
[ZONE_NORMAL ZONE_DEVICE  ZONE_NORMAL]. Since normal zone start and
end pfn contains the device memory PFN's as well, the compaction
triggered will try on the device memory PFN's too though they end up in
NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory
sections). When from other core, the section mappings are being removed
for the ZONE_DEVICE region, that the PFN in question belongs to,
on which compaction is currently being operated is resulting into the
kernel crash with CONFIG_SPASEMEM_VMEMAP enabled.

Seems this bug is four years old, yes?  It must be quite hard to hit.

From the description, it's not quite clear to me if this was actually
hit -- usually people include the dmesg bug/crash info.

On Snapdragon SoC, with the mentioned memory configuration of PFN's as
[ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of
issues daily while testing on a device farm.

I note that from next time on wards will send the demsg bug/crash info
for these type of issues. For this particular issue below is the log.
Though the below log is not directly pointing to the
pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32
lauterbach tool, it is pointing.

[ 540.578056] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
[ 540.578068] Mem abort info:
[ 540.578070] ESR = 0x0000000096000005
[ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits
[ 540.578077] SET = 0, FnV = 0
[ 540.578080] EA = 0, S1PTW = 0
[ 540.578082] FSC = 0x05: level 1 translation fault
[ 540.578085] Data abort info:
[ 540.578086] ISV = 0, ISS = 0x00000005
[ 540.578088] CM = 0, WnR = 0
[ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS
BTYPE=--)
[ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579454] lr : compact_zone+0x994/0x1058
[ 540.579460] sp : ffffffc03579b510
[ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:
000000000000000c
[ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:
ffffffc03579b640
[ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:
0000000000000000
[ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:
ffffffdebf66d140
[ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:
00000000009f4bff
[ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:
0000000000000001
[ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :
ffffff897d2cd440
[ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :
ffffffc03579b5b4
[ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :
0000000000000001
[ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :
0000000000235800
[ 540.579524] Call trace:
[ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c
[ 540.579533] compact_zone+0x994/0x1058
[ 540.579536] try_to_compact_pages+0x128/0x378
[ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0
[ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10
[ 540.579547] __alloc_pages+0x250/0x2d0
[ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc
[ 540.579561] iommu_dma_alloc+0xa0/0x320
[ 540.579565] dma_alloc_attrs+0xd4/0x108

Fix this issue by the below steps:
a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage.
b) RCU protected read side critical section will either return NULL when
SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage.
c) Synchronize the rcu on the write side and free the ->usage. No
attempt will be made to access ->usage after this as the
SECTION_HAS_MEM_MAP is cleared thus valid_section() return false.


This affects any kind of memory hotunplug. When hotunplugging memory we
will end up calling synchronize_rcu() for each and every memory section,
which sounds extremely wasteful.

Can't we find a way to kfree_rcu() that thing and read/write the pointer
using READ?ONCE?WRITE_ONCE instead?

I am inspired to use the synchronize_rcu() because of [1] where we did
use it in offline_page_ext(). And my limited understanding is that, a
user can trigger the offline operation more often than the unplug operation.

In theory yes. In practice there are not many use cases where we do that.

Further, page_ext is already not used that frequently (page owner, young, idle tracking only), especially in most production (!debug) environments.


I agree here that there is a scope to use kfree_rcu() unlike in [1]. Let
me check for a way to use it.

Exactly, if we could use that, it would be ideal.

--
Cheers,

David / dhildenb