Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

From: Christian König
Date: Wed Apr 19 2023 - 04:12:33 EST


Am 19.04.23 um 09:00 schrieb Mikhail Gavrilov:
Christian?

I'm already looking into this, but can't figure out why we run into problems here.

What happens is that a CS is aborted without sending the job to the scheduler and in this case the cleanup function doesn't seem to work.

Christian.


❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line
/lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug
drm_sched_job_cleanup+0x9a
drm_sched_job_cleanup+0x9a/0x130:
drm_sched_job_cleanup at
/usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808
(discriminator 3)

❯ cat -s -n /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c
| head -818 | tail -20
799 /* drm_sched_job_arm() has been called */
800 dma_fence_put(&job->s_fence->finished);
801 } else {
802 /* aborted job before committing to run it */
803 drm_sched_fence_free(job->s_fence);
804 }
805
806 job->s_fence = NULL;
807
808 xa_for_each(&job->dependencies, index, fence) {
809 dma_fence_put(fence);
810 }
811 xa_destroy(&job->dependencies);
812
813 }
814 EXPORT_SYMBOL(drm_sched_job_cleanup);
815
816 /**
817 * drm_sched_ready - is the scheduler ready
818 *

git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-17 10:49:16 +0200 800)
dma_fence_put(&job->s_fence->finished);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-17 10:49:16 +0200 801) } else {
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-17 10:49:16 +0200 802) /* aborted job
before committing to run it */
d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c (Boris
Brezillon 2021-09-03 14:05:54 +0200 803)
drm_sched_fence_free(job->s_fence);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-17 10:49:16 +0200 804) }
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-17 10:49:16 +0200 805)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
Masetty 2018-10-29 15:02:28 +0530 806) job->s_fence = NULL;
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-05 12:46:49 +0200 807)
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-05 12:46:49 +0200 808)
xa_for_each(&job->dependencies, index, fence) {
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-05 12:46:49 +0200 809)
dma_fence_put(fence);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-05 12:46:49 +0200 810) }
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-05 12:46:49 +0200 811)
xa_destroy(&job->dependencies);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
Vetter 2021-08-05 12:46:49 +0200 812)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
Masetty 2018-10-29 15:02:28 +0530 813) }
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
Masetty 2018-10-29 15:02:28 +0530 814)
EXPORT_SYMBOL(drm_sched_job_cleanup);
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
Masetty 2018-10-29 15:02:28 +0530 815)
e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian
König 2015-08-20 17:01:01 +0200 816) /**
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
Deshmukh 2018-05-29 11:23:07 +0530 817) * drm_sched_ready - is the
scheduler ready
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
Deshmukh 2018-05-29 11:23:07 +0530 818) *
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
Deshmukh 2018-05-29 11:23:07 +0530 819) * @sched: scheduler instance

Daniel, because Christian, looks a little busy. Can you help? The git
blame says that you are the author of code which KASAN mentions in its
report.
The issue is reproducible on all available AMD hardware: 6800M, 6900XT, 7900XTX.