Re: next-20090310: ext4 hangs

From: Alexander Beregalov
Date: Wed Mar 25 2009 - 18:38:53 EST


2009/3/25 Jan Kara <jack@xxxxxxx>:
> On Wed 25-03-09 20:07:46, Alexander Beregalov wrote:
>> 2009/3/25 Jan Kara <jack@xxxxxxx>:
>> > On Wed 25-03-09 18:29:10, Alexander Beregalov wrote:
>> >> 2009/3/25 Jan Kara <jack@xxxxxxx>:
>> >> > On Wed 25-03-09 18:18:43, Alexander Beregalov wrote:
>> >> >> 2009/3/25 Jan Kara <jack@xxxxxxx>:
>> >> >> >> > So, I think I need to try it on 2.6.29-rc7 again.
>> >> >> >> Â I've looked into this. Obviously, what's happenning is that we delete
>> >> >> >> an inode and jbd2_journal_release_jbd_inode() finds inode is just under
>> >> >> >> writeout in transaction commit and thus it waits. But it gets never woken
>> >> >> >> up and because it has a handle from the transaction, every one eventually
>> >> >> >> blocks on waiting for a transaction to finish.
>> >> >> >> Â But I don't really see how that can happen. The code is really
>> >> >> >> straightforward and everything happens under j_list_lock... Strange.
>> >> >> > ÂBTW: Is the system SMP?
>> >> >> No, it is UP system.
>> >> > ÂEven stranger. And do you have CONFIG_PREEMPT set?
>> >> >
>> >> >> The bug exists even in 2.6.29, I posted it with a new topic.
>> >> > ÂOK, I've sort-of expected this.
>> >>
>> >> CONFIG_PREEMPT_RCU=y
>> >> CONFIG_PREEMPT_RCU_TRACE=y
>> >> # CONFIG_PREEMPT_NONE is not set
>> >> # CONFIG_PREEMPT_VOLUNTARY is not set
>> >> CONFIG_PREEMPT=y
>> >> CONFIG_DEBUG_PREEMPT=y
>> >> # CONFIG_PREEMPT_TRACER is not set
>> >>
>> >> config is attached.
>> > ÂThanks for the data. I still don't see how the wakeup can get lost. The
>> > process even cannot be preempted when we are in the section protected by
>> > j_list_lock... Can you send me a disassembly of functions
>> > jbd2_journal_release_jbd_inode() and journal_submit_data_buffers() so that
>> > I can see whether the compiler has not reordered something unexpectedly?
> ÂThanks for the disassembly...
>
>> By default gcc inlines journal_submit_data_buffers()
>> Here is -fno-inline version. Default version is in attach.
>> ====
>>
>> static int journal_submit_data_buffers(journal_t *journal,
>> Â Â Â Â Â Â Â Â transaction_t *commit_transaction)
>> {
>> Â Â Â 9c: Â Â Â 9d e3 bf 40 Â Â save Â%sp, -192, %sp
>> Â Â Â a0: Â Â Â 11 00 00 00 Â Â sethi Â%hi(0), %o0
>> Â Â Â Â struct jbd2_inode *jinode;
>> Â Â Â Â int err, ret = 0;
>> Â Â Â Â struct address_space *mapping;
>>
>> Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â a4: Â Â Â a4 06 25 70 Â Â add Â%i0, 0x570, %l2
>> Â* our inode list. We use JI_COMMIT_RUNNING flag to protect inode we currently
>> Â* operate on from being released while we write out pages.
>> Â*/
>> static int journal_submit_data_buffers(journal_t *journal,
>> Â Â Â Â Â Â Â Â transaction_t *commit_transaction)
>> {
>> Â Â Â a8: Â Â Â 90 12 20 00 Â Â mov Â%o0, %o0
>> Â Â Â ac: Â Â Â 40 00 00 00 Â Â call Âac <journal_submit_data_buffers+0x10>
>> Â Â Â b0: Â Â Â b0 10 20 00 Â Â clr Â%i0
>> Â Â Â Â struct jbd2_inode *jinode;
>> Â Â Â Â int err, ret = 0;
>> Â Â Â Â struct address_space *mapping;
>>
>> Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â Â list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> Â Â Â b4: Â Â Â a6 06 60 60 Â Â add Â%i1, 0x60, %l3
>> {
>> Â Â Â Â struct jbd2_inode *jinode;
>> Â Â Â Â int err, ret = 0;
>> Â Â Â Â struct address_space *mapping;
>>
>> Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â b8: Â Â Â 40 00 00 00 Â Â call Âb8 <journal_submit_data_buffers+0x1c>
>> Â Â Â bc: Â Â Â 90 10 00 12 Â Â mov Â%l2, %o0
>> Â Â Â Â list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>>    c0:    10 68 00 1d   b Â%xcc, 134 <journal_submit_data_buffers+0x98>
>> Â Â Â c4: Â Â Â c2 5e 60 60 Â Â ldx Â[ %i1 + 0x60 ], %g1
>> Â Â Â Â Â Â Â Â mapping = jinode->i_vfs_inode->i_mapping;
>> Â Â Â Â Â Â Â Â jinode->i_flags |= JI_COMMIT_RUNNING;
>> Â Â Â Â Â Â Â Â spin_unlock(&journal->j_list_lock);
>> Â Â Â c8: Â Â Â 90 10 00 12 Â Â mov Â%l2, %o0
>> Â Â Â Â struct address_space *mapping;
>>
>> Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â Â list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> Â Â Â Â Â Â Â Â mapping = jinode->i_vfs_inode->i_mapping;
>> Â Â Â Â Â Â Â Â jinode->i_flags |= JI_COMMIT_RUNNING;
>> Â Â Â cc: Â Â Â c2 04 60 28 Â Â ld Â[ %l1 + 0x28 ], %g1
> ÂHere we load jbd2_inode->i_flags into %g1.
>
>> Â Â Â Â int err, ret = 0;
>> Â Â Â Â struct address_space *mapping;
>>
>> Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â Â list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> Â Â Â Â Â Â Â Â mapping = jinode->i_vfs_inode->i_mapping;
>> Â Â Â d0: Â Â Â e0 58 a1 e0 Â Â ldx Â[ %g2 + 0x1e0 ], %l0
>> Â Â Â Â Â Â Â Â jinode->i_flags |= JI_COMMIT_RUNNING;
>> Â Â Â d4: Â Â Â 82 10 60 01 Â Â or Â%g1, 1, %g1
> ÂHere we set JI_COMMIT_RUNNING.
>
>> Â Â Â Â Â Â Â Â spin_unlock(&journal->j_list_lock);
>> Â Â Â d8: Â Â Â 40 00 00 00 Â Â call Âd8 <journal_submit_data_buffers+0x3c>
> ÂHere we seem to call preempt_disable() (it would be useful if we could
> confirm that - easiest option I know is compiling JBD2 into a kernel but
> some object file trickery should be able to find it out as well...)
55bab0: 82 10 60 01 or %g1, 1, %g1
spin_unlock(&journal->j_list_lock);
55bab4: 40 06 4b 20 call 6ee734 <_spin_unlock>
55bab8: c2 24 e0 28 st %g1, [ %l3 + 0x28 ]
<..>
void __lockfunc _spin_unlock(spinlock_t *lock)
{
6ee734: 9d e3 bf 40 save %sp, -192, %sp
6ee738: 11 00 1f f9 sethi %hi(0x7fe400), %o0
6ee73c: 7f fb 36 59 call 5bc0a0 <_mcount>
6ee740: 90 12 22 40 or %o0, 0x240, %o0 ! 7fe640
<rt_trace_on+0xc8>
spin_release(&lock->dep_map, 1, _RET_IP_);
6ee744: 94 10 00 1f mov %i7, %o2
6ee748: 92 10 20 01 mov 1, %o1
6ee74c: 7f f6 21 18 call 476bac <lock_release>
6ee750: 90 06 20 18 add %i0, 0x18, %o0
_raw_spin_unlock(lock);
6ee754: 7f fb 7f 91 call 5ce598 <_raw_spin_unlock>
6ee758: 90 10 00 18 mov %i0, %o0
preempt_enable();
6ee75c: 40 00 05 fd call 6eff50 <sub_preempt_count>
6ee760: 90 10 20 01 mov 1, %o0
6ee764: c2 59 a0 08 ldx [ %g6 + 8 ], %g1
6ee768: 82 08 60 08 and %g1, 8, %g1
6ee76c: 02 c8 40 04 brz %g1, 6ee77c <_spin_unlock+0x48>
6ee770: 01 00 00 00 nop
6ee774: 7f ff f4 f1 call 6ebb38 <preempt_schedule>
6ee778: 01 00 00 00 nop
6ee77c: 81 cf e0 08 rett %i7 + 8
6ee780: 01 00 00 00 nop
}


>
>> Â Â Â dc: Â Â Â c2 24 60 28 Â Â st Â%g1, [ %l1 + 0x28 ]
> ÂAnd here we store the register back to memory - but we could be already
> preempted here which could cause bugs...
>
>> Â Â Â Â Â Â Â Â Â* submit the inode data buffers. We use writepage
>> Â Â Â Â Â Â Â Â Â* instead of writepages. Because writepages can do
>> Â Â Â Â Â Â Â Â Â* block allocation Âwith delalloc. We need to write
>> Â Â Â Â Â Â Â Â Â* only allocated blocks here.
>> Â Â Â Â Â Â Â Â Â*/
>> Â Â Â Â Â Â Â Â err = journal_submit_inode_data_buffers(mapping);
>> Â Â Â e0: Â Â Â 7f ff ff d3 Â Â call Â2c <journal_submit_inode_data_buffers>
>> Â Â Â e4: Â Â Â 90 10 00 10 Â Â mov Â%l0, %o0
>> Â Â Â Â Â Â Â Â if (!ret)
>> Â Â Â e8: Â Â Â 80 a6 20 00 Â Â cmp Â%i0, 0
>> Â Â Â ec: Â Â Â b1 64 40 08 Â Â move Â%icc, %o0, %i0
>> Â Â Â Â Â Â Â Â Â Â Â Â ret = err;
>> Â Â Â Â Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â f0: Â Â Â 40 00 00 00 Â Â call Âf0 <journal_submit_data_buffers+0x54>
>> Â Â Â f4: Â Â Â 90 10 00 12 Â Â mov Â%l2, %o0
>> Â Â Â Â Â Â Â Â J_ASSERT(jinode->i_transaction == commit_transaction);
>> Â Â Â f8: Â Â Â c2 5c 40 00 Â Â ldx Â[ %l1 ], %g1
>> Â Â Â fc: Â Â Â 80 a0 40 19 Â Â cmp Â%g1, %i1
>>   Â100:    22 68 00 07   be,a  %xcc, 11c
>> <journal_submit_data_buffers+0x80>
>> Â Â Â104: Â Â Â c2 04 60 28 Â Â ld Â[ %l1 + 0x28 ], %g1
> ÂAgain, here we load jinode->i_flags.
>
>> Â Â Â108: Â Â Â 11 00 00 00 Â Â sethi Â%hi(0), %o0
>> Â Â Â10c: Â Â Â 92 10 21 04 Â Â mov Â0x104, %o1
>> Â Â Â110: Â Â Â 40 00 00 00 Â Â call Â110 <journal_submit_data_buffers+0x74>
>> Â Â Â114: Â Â Â 90 12 20 00 Â Â mov Â%o0, %o0
>> Â Â Â118: Â Â Â 91 d0 20 05 Â Â ta Â5
>> Â Â Â Â Â Â Â Â jinode->i_flags &= ~JI_COMMIT_RUNNING;
>> Â Â Â Â Â Â Â Â wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>> Â Â Â11c: Â Â Â 90 04 60 28 Â Â add Â%l1, 0x28, %o0
>> Â Â Â120: Â Â Â 92 10 20 00 Â Â clr Â%o1
>> Â Â Â Â Â Â Â Â err = journal_submit_inode_data_buffers(mapping);
>> Â Â Â Â Â Â Â Â if (!ret)
>> Â Â Â Â Â Â Â Â Â Â Â Â ret = err;
>> Â Â Â Â Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â Â Â Â Â Â J_ASSERT(jinode->i_transaction == commit_transaction);
>> Â Â Â Â Â Â Â Â jinode->i_flags &= ~JI_COMMIT_RUNNING;
>>   Â124:    82 08 7f fe   and Â%g1, -2, %g1
> ÂHere we go &= ~JI_COMMIT_RUNNING
>
>> Â Â Â Â Â Â Â Â wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>> Â Â Â128: Â Â Â 40 00 00 00 Â Â call Â128 <journal_submit_data_buffers+0x8c>
>> Â Â Â12c: Â Â Â c2 24 60 28 Â Â st Â%g1, [ %l1 + 0x28 ]
> ÂAnd only here we store it back to memory...

spin_lock(&journal->j_list_lock);
55c104: 40 06 48 3c call 6ee1f4 <_spin_lock>
55c108: 90 10 00 12 mov %l2, %o0
jinode->i_flags &= ~JI_COMMIT_RUNNING;
55c10c: c2 04 20 28 ld [ %l0 + 0x28 ], %g1
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
55c110: 90 04 20 28 add %l0, 0x28, %o0
55c114: 92 10 20 00 clr %o1
if (!ret)
ret = err;
}
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
55c118: 82 08 7f fe and %g1, -2, %g1
wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
55c11c: 7f fc 27 6e call 465ed4 <wake_up_bit>
55c120: c2 24 20 28 st %g1, [ %l0 + 0x28 ]
<..>
void __lockfunc _spin_lock(spinlock_t *lock)
{
6ee1f4: 9d e3 bf 30 save %sp, -208, %sp
6ee1f8: 11 00 1f f9 sethi %hi(0x7fe400), %o0
6ee1fc: 7f fb 37 a9 call 5bc0a0 <_mcount>
6ee200: 90 12 21 c8 or %o0, 0x1c8, %o0 ! 7fe5c8
<rt_trace_on+0x50>
preempt_disable();
6ee204: 40 00 07 83 call 6f0010 <add_preempt_count>
6ee208: 90 10 20 01 mov 1, %o0
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
6ee20c: 92 10 20 00 clr %o1
6ee210: 90 06 20 18 add %i0, 0x18, %o0
6ee214: fe 73 a8 af stx %i7, [ %sp + 0x8af ]
6ee218: 94 10 20 00 clr %o2
6ee21c: 96 10 20 00 clr %o3
6ee220: 98 10 20 02 mov 2, %o4
6ee224: 7f f6 21 a5 call 4768b8 <lock_acquire>
6ee228: 9a 10 20 00 clr %o5
LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
6ee22c: 7f fb 81 33 call 5ce6f8 <_raw_spin_lock>
6ee230: 90 10 00 18 mov %i0, %o0
}
6ee234: 81 cf e0 08 rett %i7 + 8
6ee238: 01 00 00 00 nop
<..>
void wake_up_bit(void *word, int bit)
{
465ed4: 9d e3 bf 40 save %sp, -192, %sp
465ed8: 11 00 1f f2 sethi %hi(0x7fc800), %o0
465edc: 40 05 58 71 call 5bc0a0 <_mcount>
465ee0: 90 12 23 c8 or %o0, 0x3c8, %o0 ! 7fcbc8
<kthread_stop_lock+0x88>
__wake_up_bit(bit_waitqueue(word, bit), word, bit);
465ee4: 92 10 00 19 mov %i1, %o1
465ee8: 7f ff ff c5 call 465dfc <bit_waitqueue>
465eec: 90 10 00 18 mov %i0, %o0
465ef0: 92 10 00 18 mov %i0, %o1
465ef4: 7f ff ff e7 call 465e90 <__wake_up_bit>
465ef8: 94 10 00 19 mov %i1, %o2
}
465efc: 81 cf e0 08 rett %i7 + 8
465f00: 01 00 00 00 nop


>
>> Â Â Â Â struct jbd2_inode *jinode;
>> Â Â Â Â int err, ret = 0;
>> Â Â Â Â struct address_space *mapping;
>>
>> Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â Â list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>> Â Â Â130: Â Â Â c2 5c 60 10 Â Â ldx Â[ %l1 + 0x10 ], %g1
>> Â Â Â134: Â Â Â a2 00 7f f0 Â Â add Â%g1, -16, %l1
>> Â Â Â Â Â* prefetches into the prefetch-cache which only is accessible
>> Â Â Â Â Â* by floating point operations in UltraSPARC-III and later.
>> Â Â Â Â Â* By contrast, "#one_write" prefetches into the L2 cache
>> Â Â Â Â Â* in shared state.
>> Â Â Â Â Â*/
>> Â Â Â Â __asm__ __volatile__("prefetch [%0], #one_write"
>> Â Â Â138: Â Â Â c2 5c 60 10 Â Â ldx Â[ %l1 + 0x10 ], %g1
>> Â Â Â13c: Â Â Â c7 68 40 00 Â Â prefetch Â[ %g1 ], #one_write
>> Â Â Â140: Â Â Â 82 04 60 10 Â Â add Â%l1, 0x10, %g1
>> Â Â Â144: Â Â Â 80 a4 c0 01 Â Â cmp Â%l3, %g1
>>   Â148:    32 6f ff e0   bne,a  %xcc, c8
>> <journal_submit_data_buffers+0x2c>
>> Â Â Â14c: Â Â Â c4 5c 60 20 Â Â ldx Â[ %l1 + 0x20 ], %g2
>> Â Â Â Â Â Â Â Â spin_lock(&journal->j_list_lock);
>> Â Â Â Â Â Â Â Â J_ASSERT(jinode->i_transaction == commit_transaction);
>> Â Â Â Â Â Â Â Â wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
>> Â Â Â Â }
>> Â Â Â Â spin_unlock(&journal->j_list_lock);
>> Â Â Â150: Â Â Â 90 10 00 12 Â Â mov Â%l2, %o0
>> Â Â Â154: Â Â Â 40 00 00 00 Â Â call Â154 <journal_submit_data_buffers+0xb8>
>> Â Â Â158: Â Â Â b1 3e 20 00 Â Â sra Â%i0, 0, %i0
>> Â Â Â Â return ret;
>> }
>> Â Â Â15c: Â Â Â 81 cf e0 08 Â Â rett Â%i7 + 8
>> Â Â Â160: Â Â Â 01 00 00 00 Â Â nop
> ÂSo the compiled code looks a bit suspitious to me. Having the disassembly
> with symbols properly resolved would help confirm it. I'm adding sparc list
> to CC just in case someone sees the problem...
>
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂHonza
> --
> Jan Kara <jack@xxxxxxx>
> SUSE Labs, CR
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/