Re: [Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

From: alex chen
Date: Thu Dec 28 2017 - 01:13:49 EST




On 2017/12/28 10:48, Gang He wrote:
> Hi Alex,
>
>
>>>>
>> Hi Gang,
>>
>> On 2017/12/27 18:37, Gang He wrote:
>>> Hi Jun,
>>>
>>>
>>>>>>
>>>> Hi Gang,
>>>>
>>>> Do you mean that too many retrys in loop cast losts of CPU-time and
>>>> block page-fault interrupt? We should not add any delay in
>>>> ocfs2_fault(), right? And I still feel a little confused why your
>>>> method can solve this problem.
>>> You can see the related code in function filemap_fault(), if ocfs2 fails to
>> read a page since
>>> it can not get a inode lock with non-block mode, the VFS layer code will
>> invoke ocfs2
>>> read page call back function circularly, this will lead to a softlockup
>> problem (like the below back trace).
>>> So, we should get a blocking lock to let the dlm lock to this node and also
>> can avoid CPU loop,
>> Can we use 'cond_resched()' to allow the thread to release the CPU
>> temperately for solving this softlockup?
> Yes, we can use cond_resched() function to avoid this softlockup.
> In fact, if the kernel is configured with CONFIG_PREEMPT=y, this softlockup does not happen since the kernel can help.
> But, this way still leads to CPU resource waste, CPU usage can reach about 80% - 100% when
> multiple nodes read/write/mmap-access the same file concurrently, and more, the read/write/mmap-access
> speed is more lower (50% decrease).
> Why?
> Because we need to get DLM lock for each node, before one node gets DLM lock, another node has
> to down-convert this DLM lock, that means flushing the memory data to the disk before DLM lock down-conversion.
> this disk IO operation is very slow compared with CPU cycle, that means the node which want to get DLM lock,
> will do lots of reties before another node complete down-converting this DLM lock, actual, these retries do not make
> sense, just waste CPU cycle.
> So, if we add a blocking lock/unlock here, we will avoid these unnecessary reties, especially in case slow-speed disk and more ocfs2 nodes(>=3).
> I did the ocfs2 test case (multi_mmap in multiple_run.sh), after applied this patch, the CPU rate on each node was about 40%-50%, and the test case
> execution time reduced by half.
> the full command is as below,
> multiple_run.sh -i eth0 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n nd1,nd2,nd3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> the shared storage is a iscsi disk.
>
OK, I think it is more better if you can add you test method and result in change log.

Thanks,
Alex
> Thanks
> Gang
>
>>
>>> second, base on my testing, the patch also can improve the efficiency in
>> case modifying the same
>>> file frequently from multiple nodes, since the lock acquisition chance is
>> more fair.
>>> In fact, the code was modified by a patch 1cce4df04f37 ("ocfs2: do not
>> lock/unlock() inode DLM lock"),
>>> before that patch, the code is the same, this patch can be considered to
>> revert that patch, except adding more
>>> clear comments.
>> In patch 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock"),
>> Goldwyn says blocking lock and unlock will only make
>> the performance worse where contention over the locks is high, which is the
>> opposite of your described above.
>> IMO, blocking lock and unlock here is indeed unnecessary.
>>
>> Thanks,
>> Alex
>>>
>>> Thanks
>>> Gang
>>>
>>>
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>> On 2017/12/27 17:29, Gang He wrote:
>>>>> If we can't get inode lock immediately in the function
>>>>> ocfs2_inode_lock_with_page() when reading a page, we should not
>>>>> return directly here, since this will lead to a softlockup problem.
>>>>> The method is to get a blocking lock and immediately unlock before
>>>>> returning, this can avoid CPU resource waste due to lots of retries,
>>>>> and benefits fairness in getting lock among multiple nodes, increase
>>>>> efficiency in case modifying the same file frequently from multiple
>>>>> nodes.
>>>>> The softlockup problem looks like,
>>>>> Kernel panic - not syncing: softlockup: hung tasks
>>>>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>>>>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>>> Call Trace:
>>>>> <IRQ>
>>>>> dump_stack+0x5c/0x82
>>>>> panic+0xd5/0x21e
>>>>> watchdog_timer_fn+0x208/0x210
>>>>> ? watchdog_park_threads+0x70/0x70
>>>>> __hrtimer_run_queues+0xcc/0x200
>>>>> hrtimer_interrupt+0xa6/0x1f0
>>>>> smp_apic_timer_interrupt+0x34/0x50
>>>>> apic_timer_interrupt+0x96/0xa0
>>>>> </IRQ>
>>>>> RIP: 0010:unlock_page+0x17/0x30
>>>>> RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
>>>>> RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
>>>>> RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
>>>>> RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
>>>>> R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
>>>>> R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
>>>>> ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>>>> ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>>>> ? pagecache_get_page+0x30/0x200
>>>>> filemap_fault+0x12b/0x5c0
>>>>> ? recalc_sigpending+0x17/0x50
>>>>> ? __set_task_blocked+0x28/0x70
>>>>> ? __set_current_blocked+0x3d/0x60
>>>>> ocfs2_fault+0x29/0xb0 [ocfs2]
>>>>> __do_fault+0x1a/0xa0
>>>>> __handle_mm_fault+0xbe8/0x1090
>>>>> handle_mm_fault+0xaa/0x1f0
>>>>> __do_page_fault+0x235/0x4b0
>>>>> trace_do_page_fault+0x3c/0x110
>>>>> async_page_fault+0x28/0x30
>>>>> RIP: 0033:0x7fa75ded638e
>>>>> RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
>>>>> RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
>>>>> RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
>>>>> RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
>>>>> R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
>>>>> R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
>>>>>
>>>>> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
>>>>> Signed-off-by: Gang He <ghe@xxxxxxxx>
>>>>> ---
>>>>> fs/ocfs2/dlmglue.c | 9 +++++++++
>>>>> 1 file changed, 9 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>>>> index 4689940..5193218 100644
>>>>> --- a/fs/ocfs2/dlmglue.c
>>>>> +++ b/fs/ocfs2/dlmglue.c
>>>>> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>>>> ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>>>> if (ret == -EAGAIN) {
>>>>> unlock_page(page);
>>>>> + /*
>>>>> + * If we can't get inode lock immediately, we should not return
>>>>> + * directly here, since this will lead to a softlockup problem.
>>>>> + * The method is to get a blocking lock and immediately unlock
>>>>> + * before returning, this can avoid CPU resource waste due to
>>>>> + * lots of retries, and benefits fairness in getting lock.
>>>>> + */
>>>>> + if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>>>> + ocfs2_inode_unlock(inode, ex);
>>>>> ret = AOP_TRUNCATED_PAGE;
>>>>> }
>>>>>
>>>>>
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel@xxxxxxxxxxxxxx
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>
>>> .
>>>
>
> .
>