Re: [PATCH] drm/scheduler: Add missing RCU flag to fence slab

From: Rob Clark
Date: Tue Jul 11 2023 - 10:50:23 EST


On Tue, Jul 11, 2023 at 12:46 AM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> Am 10.07.23 um 23:15 schrieb Luben Tuikov:
> > On 2023-07-10 16:56, Rob Clark wrote:
> >> From: Rob Clark <robdclark@xxxxxxxxxxxx>
> >>
> >> Fixes the KASAN splat:
> >>
> >> ==================================================================
> >> BUG: KASAN: use-after-free in msm_ioctl_wait_fence+0x31c/0x7b0
> >> Read of size 4 at addr ffffff808cb7c2f8 by task syz-executor/12236
> >> CPU: 6 PID: 12236 Comm: syz-executor Tainted: G W 5.15.119-lockdep-19932-g4a017c53fe63 #1 b15455e5b94c63032dd99eb0190c27e582b357ed
> >> Hardware name: Google Homestar (rev3) (DT)
> >> Call trace:
> >> dump_backtrace+0x0/0x4e8
> >> show_stack+0x34/0x50
> >> dump_stack_lvl+0xdc/0x11c
> >> print_address_description+0x30/0x2d8
> >> kasan_report+0x178/0x1e4
> >> kasan_check_range+0x1b0/0x1b8
> >> __kasan_check_read+0x44/0x54
> >> msm_ioctl_wait_fence+0x31c/0x7b0
> >> drm_ioctl_kernel+0x214/0x418
> >> drm_ioctl+0x524/0xbe8
> >> __arm64_sys_ioctl+0x154/0x1d0
> >> invoke_syscall+0x98/0x278
> >> el0_svc_common+0x214/0x274
> >> do_el0_svc+0x9c/0x19c
> >> el0_svc+0x5c/0xc0
> >> el0t_64_sync_handler+0x78/0x108
> >> el0t_64_sync+0x1a4/0x1a8
> >> Allocated by task 12224:
> >> kasan_save_stack+0x38/0x68
> >> __kasan_slab_alloc+0x6c/0x88
> >> kmem_cache_alloc+0x1b8/0x428
> >> drm_sched_fence_alloc+0x30/0x94
> >> drm_sched_job_init+0x7c/0x178
> >> msm_ioctl_gem_submit+0x2b8/0x5ac4
> >> drm_ioctl_kernel+0x214/0x418
> >> drm_ioctl+0x524/0xbe8
> >> __arm64_sys_ioctl+0x154/0x1d0
> >> invoke_syscall+0x98/0x278
> >> el0_svc_common+0x214/0x274
> >> do_el0_svc+0x9c/0x19c
> >> el0_svc+0x5c/0xc0
> >> el0t_64_sync_handler+0x78/0x108
> >> el0t_64_sync+0x1a4/0x1a8
> >> Freed by task 32:
> >> kasan_save_stack+0x38/0x68
> >> kasan_set_track+0x28/0x3c
> >> kasan_set_free_info+0x28/0x4c
> >> ____kasan_slab_free+0x110/0x164
> >> __kasan_slab_free+0x18/0x28
> >> kmem_cache_free+0x1e0/0x904
> >> drm_sched_fence_free_rcu+0x80/0x9c
> >> rcu_do_batch+0x318/0xcf0
> >> rcu_nocb_cb_kthread+0x1a0/0xc4c
> >> kthread+0x2e4/0x3a0
> >> ret_from_fork+0x10/0x20
> >> Last potentially related work creation:
> >> kasan_save_stack+0x38/0x68
> >> kasan_record_aux_stack+0xd4/0x114
> >> __call_rcu_common+0xd4/0x1478
> >> call_rcu+0x1c/0x28
> >> drm_sched_fence_release_scheduled+0x108/0x158
> >> dma_fence_release+0x178/0x564
> >> drm_sched_fence_release_finished+0xb4/0x124
> >> dma_fence_release+0x178/0x564
> >> __msm_gem_submit_destroy+0x150/0x488
> >> msm_job_free+0x9c/0xdc
> >> drm_sched_main+0xec/0x9ac
> >> kthread+0x2e4/0x3a0
> >> ret_from_fork+0x10/0x20
> >> Second to last potentially related work creation:
> >> kasan_save_stack+0x38/0x68
> >> kasan_record_aux_stack+0xd4/0x114
> >> __call_rcu_common+0xd4/0x1478
> >> call_rcu+0x1c/0x28
> >> drm_sched_fence_release_scheduled+0x108/0x158
> >> dma_fence_release+0x178/0x564
> >> drm_sched_fence_release_finished+0xb4/0x124
> >> dma_fence_release+0x178/0x564
> >> drm_sched_entity_fini+0x170/0x238
> >> drm_sched_entity_destroy+0x34/0x44
> >> __msm_file_private_destroy+0x60/0x118
> >> msm_submitqueue_destroy+0xd0/0x110
> >> __msm_gem_submit_destroy+0x384/0x488
> >> retire_submits+0x6a8/0xa14
> >> recover_worker+0x764/0xa50
> >> kthread_worker_fn+0x3f4/0x9ec
> >> kthread+0x2e4/0x3a0
> >> ret_from_fork+0x10/0x20
> >> The buggy address belongs to the object at ffffff808cb7c280
> >> The buggy address is located 120 bytes inside of
> >> The buggy address belongs to the page:
> >> page:000000008b01d27d refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10cb7c
> >> head:000000008b01d27d order:1 compound_mapcount:0
> >> flags: 0x8000000000010200(slab|head|zone=2)
> >> raw: 8000000000010200 fffffffe06844d80 0000000300000003 ffffff80860dca00
> >> raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
> >> page dumped because: kasan: bad access detected
> >> Memory state around the buggy address:
> >> ffffff808cb7c180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ffffff808cb7c200: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
> >> >ffffff808cb7c280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ^
> >> ffffff808cb7c300: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
> >> ffffff808cb7c380: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
> >> ==================================================================
> >>
> >> Suggested-by: Alexander Potapenko <glider@xxxxxxxxxx>
> >> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> >> ---
> >> drivers/gpu/drm/scheduler/sched_fence.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> >> index ef120475e7c6..b624711c6e03 100644
> >> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> >> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> >> @@ -35,7 +35,7 @@ static int __init drm_sched_fence_slab_init(void)
> >> {
> >> sched_fence_slab = kmem_cache_create(
> >> "drm_sched_fence", sizeof(struct drm_sched_fence), 0,
> >> - SLAB_HWCACHE_ALIGN, NULL);
> >> + SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU, NULL);
> >> if (!sched_fence_slab)
> >> return -ENOMEM;
> >>
> > Reviewed-by: Luben Tuikov <luben.tuikov@xxxxxxx>
> >
> > But let it simmer for 24 hours so Christian can see it too (CC-ed).
>
> Well that won't work like this, using the the SLAB_TYPESAFE_BY_RCU flag
> was suggested before and is not allowed for dma_fence objects.
>
> The flag SLAB_TYPESAFE_BY_RCU can only be used if the objects in the
> slab can't be reused within an RCU time period or if that reuse doesn't
> matter for the logic. And exactly that is not guaranteed for dma_fence
> objects.

I think that is only true because of the drm_sched_fence_free() path?
But that could also use call_rcu(). (It looks like it is only an
error path.)

> It should also not be necessary because the scheduler fences are
> released using call_rcu:
>
> static void drm_sched_fence_release_scheduled(struct dma_fence *f)
> {
> struct drm_sched_fence *fence = to_drm_sched_fence(f);
>
> dma_fence_put(fence->parent);
> call_rcu(&fence->finished.rcu, drm_sched_fence_free_rcu);
> }
>
> That looks more like you have a reference count problem in MSM.

Possibly I need to use rcu_read_lock()? But I think the idr_lock
which protected dma_fence_get_rcu() (and is held until the fence is
removed from fence_idr, before it's reference is dropped in
__msm_gem_submit_destroy()) makes that unnecessary.

BR,
-R

> Regards,
> Christian.