Re: [next] [dragonboard 410c] Unable to handle kernel paging request at virtual address 00000000007c4240

From: Jani Nikula
Date: Thu Oct 21 2021 - 04:40:22 EST


On Thu, 21 Oct 2021, Vlastimil Babka <vbabka@xxxxxxx> wrote:
> On 10/20/21 20:24, Naresh Kamboju wrote:
>> The following kernel crash was noticed on the linux-next 20211020 tag
>> while booting an arm64 dragonboard 410c device.
>>
>> I see the following config is enabled in the 20211020 tag builds:
>> CONFIG_STACKDEPOT=y
>>
>> Crash log,
>> [ 18.583097] Unable to handle kernel paging request at virtual
>> address 00000000007c4240
>> [ 18.583521] Mem abort info:
>> [ 18.590286] ESR = 0x96000004
>> [ 18.592920] EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 18.596103] SET = 0, FnV = 0
>> [ 18.601512] EA = 0, S1PTW = 0
>> [ 18.604384] FSC = 0x04: level 0 translation fault
>> [ 18.607447] Data abort info:
>> [ 18.612296] ISV = 0, ISS = 0x00000004
>> [ 18.615451] CM = 0, WnR = 0
>> [ 18.618990] user pgtable: 4k pages, 48-bit VAs, pgdp=000000008b4c7000
>> [ 18.622054] [00000000007c4240] pgd=0000000000000000, p4d=0000000000000000
>> [ 18.628974] Internal error: Oops: 96000004 [#1] SMP
>> [ 18.635073] Modules linked in: adv7511 cec snd_soc_lpass_apq8016
>> snd_soc_lpass_cpu snd_soc_lpass_platform snd_soc_msm8916_digital
>> qcom_camss qrtr snd_soc_apq8016_sbc videobuf2_dma_sg qcom_pon
>> qcom_spmi_vadc snd_soc_qcom_common qcom_q6v5_mss qcom_vadc_common
>> rtc_pm8xxx qcom_spmi_temp_alarm msm qcom_pil_info v4l2_fwnode
>> qcom_q6v5 snd_soc_msm8916_analog qcom_sysmon qcom_common v4l2_async
>> qnoc_msm8916 qcom_rng gpu_sched qcom_glink_smem venus_core
>> videobuf2_memops icc_smd_rpm qmi_helpers drm_kms_helper v4l2_mem2mem
>> mdt_loader display_connector i2c_qcom_cci videobuf2_v4l2 crct10dif_ce
>> videobuf2_common socinfo drm rmtfs_mem fuse
>> [ 18.672948] CPU: 0 PID: 178 Comm: kworker/u8:3 Not tainted
>> 5.15.0-rc6-next-20211020 #1
>> [ 18.695000] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
>> [ 18.695012] Workqueue: events_unbound deferred_probe_work_func
>> [ 18.695033] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 18.715282] pc : __stack_depot_save+0x13c/0x4e0
>> [ 18.722130] lr : stack_depot_save+0x14/0x20
>> [ 18.726641] sp : ffff800014a23500
>> [ 18.730801] x29: ffff800014a23500 x28: 00000000000f8848 x27: ffff800013acdf68
>> [ 18.734294] x26: 0000000000000000 x25: 00000000007c4240 x24: ffff800014a23780
>> [ 18.741413] x23: 0000000000000008 x22: ffff800014a235b8 x21: 0000000000000008
>> [ 18.748530] x20: 00000000c32f8848 x19: ffff00001038cc18 x18: ffffffffffffffff
>> [ 18.755649] x17: ffff80002d9f8000 x16: ffff800010004000 x15: 000000000000c426
>> [ 18.762767] x14: 0000000000000000 x13: ffff800014a23780 x12: 0000000000000000
>> [ 18.769885] x11: ffff00001038cc80 x10: ffff8000136e9ba0 x9 : ffff800014a235f4
>> [ 18.777003] x8 : 0000000000000001 x7 : 00000000b664620b x6 : 0000000011a58b4a
>> [ 18.784121] x5 : 000000001aa43464 x4 : 000000009e7d8b67 x3 : 0000000000000001
>> [ 18.791239] x2 : 0000000000002800 x1 : ffff800013acd000 x0 : 00000000f2d429d8
>> [ 18.798358] Call trace:
>> [ 18.805451] __stack_depot_save+0x13c/0x4e0
>> [ 18.807716] stack_depot_save+0x14/0x20
>> [ 18.811881] __drm_stack_depot_save+0x44/0x70 [drm]
>> [ 18.815710] modeset_lock.part.0+0xe0/0x1a4 [drm]
>> [ 18.820571] drm_modeset_lock_all_ctx+0x2d4/0x334 [drm]
>
> This stack_depot_save() path appears to be new, coming from Jani's
> commit cd06ab2fd48f ("drm/locking: add backtrace for locking contended
> locks without backoff").
> And there's a semantic conflict with my patch in mmotm:
> - sha1 (valid only in next-20211020) 5e6d063de5cd ("lib/stackdepot: allow
> optional init and stack_table allocation by kvmalloc()")
> - lore: https://lore.kernel.org/all/20211013073005.11351-1-vbabka@xxxxxxx/
> - patchwork: https://patchwork.freedesktop.org/series/95549/#rev3
>
> With my patch, would-be callers of stack_depot_save() need to call
> stack_depot_init() at least once, to avoid unnecessary runtime overhead
> otherwise. I have added such calls in three DRM contexts in my patch,
> but hadn't seen cd06ab2fd48f yet at the time.

Ouch, I did see your changes fly by, but overlooked this type of
conflict with my patch. Sorry for the trouble.

> This one seems a bit more tricky, and I could really use some advice.
> cd06ab2fd48f adds stackdepot usage to drm_modeset_lock, which itself has
> a number of different users, and requiring those to call
> stack_depot_init() would likely be error prone. Would it be OK to add a
> call to stack_depot_init() (guarded by #ifdef
> CONFIG_DRM_DEBUG_MODESET_LOCK) to drm_modeset_lock_init()? It will do a
> mutex_lock()/unlock(), and a kvmalloc() on the first call.
> I don't know how much of a hotpath this is, but hopefully it should be
> acceptable in a debug config. Or do you have a better suggestion? Thanks.

I think that should be fine.

Maybe add __drm_stack_depot_init() in the existing #if
IS_ENABLED(CONFIG_DRM_DEBUG_MODESET_LOCK) block, similar to the other
__drm_stack_depot_*() functions, with an empty stub for
CONFIG_DRM_DEBUG_MODESET_LOCK=n, and call it unconditionally in
drm_modeset_lock_init().
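
Something like this, perhaps (completely untested; the
drm_modeset_lock_init() body below is from memory, and this assumes
your stack_depot_init() is safe to call from here in the first place):

#if IS_ENABLED(CONFIG_DRM_DEBUG_MODESET_LOCK)
static void __drm_stack_depot_init(void)
{
	stack_depot_init();
}
#else /* CONFIG_DRM_DEBUG_MODESET_LOCK */
static void __drm_stack_depot_init(void)
{
}
#endif /* CONFIG_DRM_DEBUG_MODESET_LOCK */

void drm_modeset_lock_init(struct drm_modeset_lock *lock)
{
	ww_mutex_init(&lock->mutex, &crtc_ww_class);
	INIT_LIST_HEAD(&lock->head);
	__drm_stack_depot_init();
}

That way the drm_modeset_lock users don't need to know or care about
stackdepot at all.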

> Then we have to figure out how to order a fix between DRM and mmotm...

That is the question! The problem exists only in the merge of the
two. On the current DRM side, stack_depot_init() exists, but it's
__init and does not look safe to call multiple times. And obviously my
changes don't exist at all in mmotm.

I guess one (admittedly hackish) option is to first add a patch in
drm-next (or drm-misc-next) that makes it safe to call
stack_depot_init() multiple times in a non-init context, along the
lines of the sketch below. It would be dropped in favour of your
changes once the trees get merged together.
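
Completely untested sketch of what I mean, roughly mirroring what your
mmotm patch already does (this assumes stack_table gets converted from
a static array to a kvmalloc()'d pointer; names from memory):

static DEFINE_MUTEX(stack_depot_init_mutex);

int stack_depot_init(void)
{
	mutex_lock(&stack_depot_init_mutex);
	if (!stack_table) {
		/*
		 * kvcalloc() instead of the early memblock allocation,
		 * so this also works long after boot; the mutex makes
		 * repeated calls from several drivers idempotent.
		 */
		stack_table = kvcalloc(STACK_HASH_SIZE,
				       sizeof(struct stack_record *),
				       GFP_KERNEL);
	}
	mutex_unlock(&stack_depot_init_mutex);

	return stack_table ? 0 : -ENOMEM;
}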

Or is there some way for __drm_stack_depot_init() to detect whether it
should call stack_depot_init() or not, i.e. whether your changes are
there or not?


BR,
Jani.


>
>> [ 18.825435] drm_client_firmware_config.constprop.0.isra.0+0xc0/0x5d0 [drm]
>> [ 18.830478] drm_client_modeset_probe+0x328/0xbb0 [drm]
>> [ 18.837413] __drm_fb_helper_initial_config_and_unlock+0x54/0x5b4
>> [drm_kms_helper]
>> [ 18.842633] drm_fb_helper_initial_config+0x5c/0x70 [drm_kms_helper]
>> [ 18.850266] msm_fbdev_init+0x98/0x100 [msm]
>> [ 18.856767] msm_drm_bind+0x650/0x720 [msm]
>> [ 18.861021] try_to_bring_up_master+0x230/0x320
>> [ 18.864926] __component_add+0xc8/0x1c4
>> [ 18.869435] component_add+0x20/0x30
>> [ 18.873253] mdp5_dev_probe+0xe0/0x11c [msm]
>> [ 18.877077] platform_probe+0x74/0xf0
>> [ 18.881328] really_probe+0xc4/0x470
>> [ 18.884883] __driver_probe_device+0x11c/0x190
>> [ 18.888534] driver_probe_device+0x48/0x110
>> [ 18.892786] __device_attach_driver+0xa4/0x140
>> [ 18.896869] bus_for_each_drv+0x84/0xe0
>> [ 18.901380] __device_attach+0xe4/0x1c0
>> [ 18.905112] device_initial_probe+0x20/0x30
>> [ 18.908932] bus_probe_device+0xac/0xb4
>> [ 18.913098] deferred_probe_work_func+0xc8/0x120
>> [ 18.916920] process_one_work+0x280/0x6a0
>> [ 18.921780] worker_thread+0x80/0x454
>> [ 18.925683] kthread+0x178/0x184
>> [ 18.929326] ret_from_fork+0x10/0x20
>> [ 18.932634] Code: d37d4e99 92404e9c f940077a 8b190359 (c8dfff33)
>> [ 18.936203] ---[ end trace 3e289b724840642d ]---
>>
>> Full log,
>> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20211020/testrun/6177937/suite/linux-log-parser/test/check-kernel-oops-3786583/log
>> https://lkft.validation.linaro.org/scheduler/job/3786583#L2549
>>
>> Build config:
>> https://builds.tuxbuild.com/1zlLlQrUyHVr1MQ1gcler3dKaE6/config
>>
>> Reported-by: Linux Kernel Functional Testing <lkft@xxxxxxxxxx>
>>
>> steps to reproduce:
>> 1) https://builds.tuxbuild.com/1zlLlQrUyHVr1MQ1gcler3dKaE6/tuxmake_reproducer.sh
>> 2) Boot db410c device
>>
>> --
>> Linaro LKFT
>> https://lkft.linaro.org
>>
>

--
Jani Nikula, Intel Open Source Graphics Center