Re: [PATCH 1/4] cachefiles: Fix assertion "6 == 5 is false" at fs/fscache/operation.c:494

From: Vegard Nossum
Date: Fri Jul 06 2018 - 04:31:35 EST


On 6 July 2018 at 01:45, NeilBrown <neilb@xxxxxxxx> wrote:
> On Thu, Jul 05 2018, David Howells wrote:
>
>> From: kiran modukuri <kiran.modukuri@xxxxxxxxx>
>>
>> There is a potential race in fscache operation enqueuing for reading and
>> copying multiple pages from cachefiles to netfs.
>> On a heavily loaded system, this happens very often.
>>
>> If this race occurs, an oops similar to the following is seen:
>>
>> kernel BUG at fs/fscache/operation.c:69!
>> invalid opcode: 0000 [#1] SMP
>> ...
>> #0 [ffff883fff0838d8] machine_kexec at ffffffff81051beb
>> #1 [ffff883fff083938] crash_kexec at ffffffff810f2542
>> #2 [ffff883fff083a08] oops_end at ffffffff8163e1a8
>> #3 [ffff883fff083a30] die at ffffffff8101859b
>> #4 [ffff883fff083a60] do_trap at ffffffff8163d860
>> #5 [ffff883fff083ab0] do_invalid_op at ffffffff81015204
>> #6 [ffff883fff083b60] invalid_op at ffffffff8164701e
>> [exception RIP: fscache_enqueue_operation+246]
>> RIP: ffffffffa0b793c6 RSP: ffff883fff083c18 RFLAGS: 00010046
>> RAX: 0000000000000019 RBX: ffff8832ed1a9ec0 RCX: 0000000000000006
>> RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
>> RBP: ffff883fff083c20 R8: 0000000000000086 R9: 000000000000178f
>> R10: ffffffff816aeb00 R11: ffff883fff08392e R12: ffff8802f0525620
>> R13: ffff88407ffc01d8 R14: 0000000000000000 R15: 0000000000000003
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
>> #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
>> #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
>> #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028
>>
>> Reported-by: Lei Xue <carmark.dlut@xxxxxxxxx>
>> Reported-by: Vegard Nossum <vegard.nossum@xxxxxxxxx>
>> Reported-by: Anthony DeRobertis <aderobertis@xxxxxxxxxxx>
>> Reported-by: NeilBrown <neilb@xxxxxxxx>
>> Reported-by: Daniel Axtens <dja@xxxxxxxxxx>
>> Reported-by: KiranKumar Modukuri <kiran.modukuri@xxxxxxxxx>
>> Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
>> ---

[...]

> Thanks - I like this approach. Taking the extra reference makes it a
> lot clearer what is happening and why.

The changelog is a bit sparse, no? We have more info here:

https://lkml.org/lkml/2018/5/8/520
https://lkml.org/lkml/2018/7/3/1184

Why not crib some of that and explain the issue properly (or at
minimum link the previous threads)?

Thanks,


Vegard