Re: Crash in crc32_le during kmemleak_scan()

From: vigneshr
Date: Wed May 20 2015 - 09:16:54 EST


Hi,

Sharing an update: we got back the test results. We did not observe the
original bug, but we hit another side effect.

After kmemleak was disabled due to low memory:

96.825883: <6> kmemleak: Cannot allocate a kmemleak_object structure
96.825902: <6> kmemleak: Cannot allocate a kmemleak_object structure
96.825923: <6> kmemleak: Kernel memory leak detector disabled


We saw the following warning in dmesg:

97.023202: <6> kmemleak: Trying to color unknown object at
0xffffffc01f0b2c80 as Grey
97.029735: <6> CPU: 0 PID: 49 Comm: kworker/u8:3 Tainted: G W
O 3.10.49-gde07ba9-00027-g9ce640d #1
97.029751: <2> Workqueue: diag_wq diag_read_smd_work_fn
97.029758: <2> Call trace:
97.029769: <2> [<ffffffc000206a2c>] dump_backtrace+0x0/0x270
97.029777: <2> [<ffffffc000206cac>] show_stack+0x10/0x1c
97.029786: <2> [<ffffffc000bf0eec>] dump_stack+0x1c/0x28
97.029794: <2> [<ffffffc0002f1da8>] paint_ptr+0x64/0xb8
97.029801: <2> [<ffffffc000be885c>] kmemleak_not_leak+0x64/0x98
97.029808: <2> [<ffffffc00056bd94>] diagmem_alloc+0x148/0x174
97.029816: <2> [<ffffffc00056b780>] diag_usb_write+0x80/0x180
97.029823: <2> [<ffffffc00056ac38>] diag_mux_write+0x38/0x4c
97.029830: <2> [<ffffffc0005682b0>]
diag_process_smd_read_data+0x270/0x2e4
97.029837: <2> [<ffffffc00056889c>] diag_smd_send_req+0x578/0x5b0
97.029844: <2> [<ffffffc0005688e0>] diag_read_smd_work_fn+0xc/0x18
97.029852: <2> [<ffffffc00023589c>] process_one_work+0x258/0x3d0
97.029858: <2> [<ffffffc0002368dc>] worker_thread+0x1f0/0x340
97.029867: <2> [<ffffffc00023bdc8>] kthread+0xac/0xb8

This should be fixed by adding the following to the patch that was
provided earlier:

--------------------- 8< --------------------------------


diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5ec8b71..7d67c87 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
 {
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
 		make_gray_object((unsigned long)ptr);
 	else if (kmemleak_early_log)
 		log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);






--------------------- 8< --------------------------------
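For clarity, this is how kmemleak_not_leak() would look with the extra
check applied (the body is taken from the hunk above; the closing brace
is assumed from the usual shape of the function):

void __ref kmemleak_not_leak(const void *ptr)
{
	pr_debug("%s(0x%p)\n", __func__, ptr);

	if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
		make_gray_object((unsigned long)ptr);
	else if (kmemleak_early_log)
		log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);
}

The intent, as I understand it, is that once kmemleak has hit an error
and scheduled its cleanup, kmemleak_not_leak() should no longer try to
repaint objects whose tracking metadata may already have been freed,
which is what produced the "Trying to color unknown object" warning
above.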



However, the more concerning issue is that we crashed later, at around 100
seconds. I have pasted a snippet of the crash log below:



100.946902: <6> Unable to handle kernel NULL pointer dereference at
virtual address 00000000
100.953953: <6> pgd = ffffffc00007d000
100.957338: <2> [00000000] *pgd=000000008c307003,
*pmd=000000008c308003, *pte=00e000000b000407
100.965587: <6> Internal error: Oops: 96000006 [#1] PREEMPT SMP
100.971139: <2> Modules linked in: wlan(O)
100.974876: <6> CPU: 0 PID: 246 Comm: kworker/0:2 Tainted: G W
O 3.10.49-gde07ba9-00027-g9ce640d #1
100.984254: <2> Workqueue: events kmemleak_do_cleanup
100.988936: <6> task: ffffffc02e7d9580 ti: ffffffc02da0c000 task.ti:
ffffffc02da0c000
100.996404: <2> PC is at rb_erase+0x224/0x344
101.000393: <2> LR is at __delete_object+0x3c/0xc8
101.004820: <2> pc : [<ffffffc000444e48>] lr : [<ffffffc0002f1afc>]
pstate: 600001c5
101.012198: <2> sp : ffffffc02da0fcb0
101.015497: <2> x29: ffffffc02da0fcb0 x28: 0000000000000009
101.020792: <2> x27: ffffffc0015f9000 x26: 0000000000000000
101.026086: <2> x25: ffffffc034ad2a40 x24: ffffffc02da0c000
101.031380: <2> x23: ffffffc034ad7b00 x22: 0000000000000000
101.036676: <2> x21: 0000000000000140 x20: ffffffc0014e26b8
101.041972: <2> x19: ffffffc0236b03c0 x18: 0000000000000000
101.047266: <2> x17: 0000007f973e7f8c x16: ffffffc00030447c
101.052561: <2> x15: 0000000000000000 x14: 0000000066666667
101.057857: <2> x13: 000000000000000a x12: 0000007f7c95e9f0
101.063152: <2> x11: 731a976167c4c606 x10: 0000000000000013
101.068447: <2> x9 : 0000000000127500 x8 : 0000001780764fe3
101.073743: <2> x7 : ffffffc0236b1bc0 x6 : 0000000000000000
101.079038: <2> x5 : ffffffc02eeb6f41 x4 : 0000000000000000
101.084333: <2> x3 : 0000000000000000 x2 : ffffffc02eeb6f40
101.089628: <2> x1 : ffffffc0016e05b8 x0 : 0000000000000000


<SNIP .................>

101.454125: <6> Process kworker/0:2 (pid: 246, stack limit =
0xffffffc02da0c058)
101.461155: <2> Call trace:
101.463590: <2> [<ffffffc000444e48>] rb_erase+0x224/0x344
101.468624: <2> [<ffffffc0002f1c6c>] delete_object_full+0x1c/0x34
101.474353: <2> [<ffffffc0002f1cbc>] __kmemleak_do_cleanup+0x38/0x54
101.480342: <2> [<ffffffc0002f1d10>] kmemleak_do_cleanup+0x38/0x6c
101.486160: <2> [<ffffffc00023589c>] process_one_work+0x258/0x3d0
101.491887: <2> [<ffffffc0002368dc>] worker_thread+0x1f0/0x340
101.497359: <2> [<ffffffc00023bdc8>] kthread+0xac/0xb8
101.502132: <6> Code: f9000046 14000040 f9400843 aa0303e0 (f9400064)


Analysis so far:

I haven't been able to figure out exactly how the dots connect, or how
this relates to the change provided earlier, but the following is what I
could put together:

When I checked the post-mortem analysis, I saw the following in the
crashing stack:

-018|rb_erase(
| [locdesc] node = ?,
| [X1] root = 0xFFFFFFC0016E05B8)
| [X1] root = 0xFFFFFFC0016E05B8
| [locdesc] node = ?
| [X2] parent = 0xFFFFFFC02EEB6F40
| [X4] node = 0x0
| [X3] sibling = 0x0
-019|__delete_object(
| [X19] object = 0xFFFFFFC0236B03C0)
| [X21] flags = 0x0140
-020|delete_object_full(
| ?)
| [X19] object = 0xFFFFFFC0236B03C0
-021|__kmemleak_do_cleanup()
| [X19] object = 0xFFFFFFC0236B03C0 -> (
| [0xFFFFFFC0236B03C0] lock = ([0xFFFFFFC0236B03C0] rlock =
([0xFFFFFFC0236B03C0] raw_lock = ([0xFFFFFFC0236B03C0] owner
| [0xFFFFFFC0236B03D8] flags = 0x0,
| [0xFFFFFFC0236B03E0] object_list = ([NSD:0xFFFFFFC0236B03E0] next
= 0xFFFFFFC0236B1920, [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),
| [0xFFFFFFC0236B03F0] gray_list = ([0xFFFFFFC0236B03F0] next =
0x00100100, [0xFFFFFFC0236B03F8] prev = 0x00200200),
[0xFFFFFFC0236B0400] rb_node = (
[0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
[0xFFFFFFC0236B0408] rb_right = 0x0,
[0xFFFFFFC0236B0410] rb_left = 0x0


The node becomes 0x0 here, and we crash in the following piece of code in
rb_erase():


309 } else {
310 sibling = parent->rb_left;
311 if (rb_is_red(sibling)) {

and rb_is_red(sibling) resolves to:

96 #define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color)



1. Here the parent ptr refers to the object residing at
0xFFFFFFC0236B03C0, and we can see that rb_node->rb_left is 0x0 (this is
from the values observed in the crashed stack):

[0xFFFFFFC0236B0400] rb_node = (
[0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
[0xFFFFFFC0236B0408] rb_right = 0x0,
[0xFFFFFFC0236B0410] rb_left = 0x0

2. Therefore the sibling will take the value 0x0

3. Then, when the rb->__rb_parent_color access takes place, we crash
because rb is 0x0 and cannot be dereferenced, as sketched below.
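As a minimal sketch of why that dereference faults (struct layout and
macros copied from include/linux/rbtree.h and lib/rbtree.c, alignment
attribute omitted; the main() wrapper is purely illustrative, not kernel
code):

/* __rb_parent_color is the first field, i.e. at offset 0 of struct rb_node */
struct rb_node {
	unsigned long  __rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
};

/* colour test macros as in lib/rbtree.c */
#define __rb_color(pc)   ((pc) & 1)
#define __rb_is_red(pc)  (!__rb_color(pc))
#define rb_is_red(rb)    __rb_is_red((rb)->__rb_parent_color)

int main(void)
{
	struct rb_node parent = { 0 };              /* rb_left == NULL */
	struct rb_node *sibling = parent.rb_left;   /* sibling == NULL */

	/* rb_is_red(sibling) loads sibling->__rb_parent_color, i.e. a
	 * read from offset 0 of a NULL pointer -- the same access that
	 * faults at virtual address 00000000 in the crash log above. */
	return rb_is_red(sibling);
}

If I am reading the rbtree invariants correctly, sibling should never be
NULL at this point in a consistent tree, so the NULL looks like a symptom
of the tree already being corrupted or stale rather than a bug in
rb_erase() itself.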


So the question is how this node ended up in this state.

1. I checked object_list via a list walk-through, and it did not show any
corruption.

2. However, for the object in question from our call stack, the
object->object_list pointers look a bit amiss:

object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920,
[NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),

But it's not clear to me how it ended up in this state.
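Purely for reference: assuming POISON_POINTER_DELTA is 0 in this build,
those values match the list poison constants that the list helpers write
into an entry when it is unlinked (definitions as in
include/linux/poison.h, include/linux/list.h and include/linux/rculist.h):

#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)

static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;	/* 0x00100100 */
	entry->prev = LIST_POISON2;	/* 0x00200200 */
}

static inline void list_del_rcu(struct list_head *entry)
{
	__list_del_entry(entry);
	entry->prev = LIST_POISON2;	/* only prev is poisoned */
}

So gray_list (next = 0x00100100, prev = 0x00200200) looks as if it went
through a full list_del(), and object_list (prev = 0x00200200, next still
pointing at another object) looks as if it went through list_del_rcu() --
if I am reading this right, the object may already have been unlinked
once before the cleanup path tried to erase it again, but I have not
confirmed that yet.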


I will continue to debug what went wrong and share my analysis. Please let
me know if anything obvious stands out that I can try.

====================
Thanks and regards,
Vignesh Radhakrishnan


QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.




