Re: next-20090310: ext4 hangs

From: Jan Kara
Date: Wed Mar 25 2009 - 15:43:44 EST


On Wed 25-03-09 20:07:46, Alexander Beregalov wrote:
> 2009/3/25 Jan Kara <jack@xxxxxxx>:
> > On Wed 25-03-09 18:29:10, Alexander Beregalov wrote:
> >> 2009/3/25 Jan Kara <jack@xxxxxxx>:
> >> > On Wed 25-03-09 18:18:43, Alexander Beregalov wrote:
> >> >> 2009/3/25 Jan Kara <jack@xxxxxxx>:
> >> >> >> > So, I think I need to try it on 2.6.29-rc7 again.
> >> >> >>   I've looked into this. Obviously, what's happening is that we delete
> >> >> >> an inode and jbd2_journal_release_jbd_inode() finds inode is just under
> >> >> >> writeout in transaction commit and thus it waits. But it never gets woken
> >> >> >> up and because it has a handle from the transaction, everyone eventually
> >> >> >> blocks on waiting for a transaction to finish.
> >> >> >>   But I don't really see how that can happen. The code is really
> >> >> >> straightforward and everything happens under j_list_lock... Strange.
> >> >> >  BTW: Is the system SMP?
> >> >> No, it is UP system.
> >> >  Even stranger. And do you have CONFIG_PREEMPT set?
> >> >
> >> >> The bug exists even in 2.6.29, I posted it with a new topic.
> >> >  OK, I've sort-of expected this.
> >>
> >> CONFIG_PREEMPT_RCU=y
> >> CONFIG_PREEMPT_RCU_TRACE=y
> >> # CONFIG_PREEMPT_NONE is not set
> >> # CONFIG_PREEMPT_VOLUNTARY is not set
> >> CONFIG_PREEMPT=y
> >> CONFIG_DEBUG_PREEMPT=y
> >> # CONFIG_PREEMPT_TRACER is not set
> >>
> >> config is attached.
> >  Thanks for the data. I still don't see how the wakeup can get lost. The
> > process even cannot be preempted when we are in the section protected by
> > j_list_lock... Can you send me a disassembly of functions
> > jbd2_journal_release_jbd_inode() and journal_submit_data_buffers() so that
> > I can see whether the compiler has not reordered something unexpectedly?
Thanks for the disassembly...

> By default gcc inlines journal_submit_data_buffers()
> Here is -fno-inline version. Default version is in attach.
> ====
>
> static int journal_submit_data_buffers(journal_t *journal,
> transaction_t *commit_transaction)
> {
> 9c: 9d e3 bf 40 save %sp, -192, %sp
> a0: 11 00 00 00 sethi %hi(0), %o0
> struct jbd2_inode *jinode;
> int err, ret = 0;
> struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> a4: a4 06 25 70 add %i0, 0x570, %l2
> * our inode list. We use JI_COMMIT_RUNNING flag to protect inode we currently
> * operate on from being released while we write out pages.
> */
> static int journal_submit_data_buffers(journal_t *journal,
> transaction_t *commit_transaction)
> {
> a8: 90 12 20 00 mov %o0, %o0
> ac: 40 00 00 00 call ac <journal_submit_data_buffers+0x10>
> b0: b0 10 20 00 clr %i0
> struct jbd2_inode *jinode;
> int err, ret = 0;
> struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> b4: a6 06 60 60 add %i1, 0x60, %l3
> {
> struct jbd2_inode *jinode;
> int err, ret = 0;
> struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> b8: 40 00 00 00 call b8 <journal_submit_data_buffers+0x1c>
> bc: 90 10 00 12 mov %l2, %o0
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> c0: 10 68 00 1d b %xcc, 134 <journal_submit_data_buffers+0x98>
> c4: c2 5e 60 60 ldx [ %i1 + 0x60 ], %g1
> mapping = jinode->i_vfs_inode->i_mapping;
> jinode->i_flags |= JI_COMMIT_RUNNING;
> spin_unlock(&journal->j_list_lock);
> c8: 90 10 00 12 mov %l2, %o0
> struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> mapping = jinode->i_vfs_inode->i_mapping;
> jinode->i_flags |= JI_COMMIT_RUNNING;
> cc: c2 04 60 28 ld [ %l1 + 0x28 ], %g1
Here we load jbd2_inode->i_flags into %g1.

> int err, ret = 0;
> struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> mapping = jinode->i_vfs_inode->i_mapping;
> d0: e0 58 a1 e0 ldx [ %g2 + 0x1e0 ], %l0
> jinode->i_flags |= JI_COMMIT_RUNNING;
> d4: 82 10 60 01 or %g1, 1, %g1
Here we set JI_COMMIT_RUNNING.

> spin_unlock(&journal->j_list_lock);
> d8: 40 00 00 00 call d8 <journal_submit_data_buffers+0x3c>
Here we seem to call preempt_disable(). It would be useful to confirm
that: the easiest option I know is compiling JBD2 into the kernel, but
some object file trickery should be able to find it out as well...

> dc: c2 24 60 28 st %g1, [ %l1 + 0x28 ]
And here we store the register back to memory, but we could already have
been preempted by this point, which could cause bugs...

> * submit the inode data buffers. We use writepage
> * instead of writepages. Because writepages can do
> * block allocation with delalloc. We need to write
> * only allocated blocks here.
> */
> err = journal_submit_inode_data_buffers(mapping);
> e0: 7f ff ff d3 call 2c <journal_submit_inode_data_buffers>
> e4: 90 10 00 10 mov %l0, %o0
> if (!ret)
> e8: 80 a6 20 00 cmp %i0, 0
> ec: b1 64 40 08 move %icc, %o0, %i0
> ret = err;
> spin_lock(&journal->j_list_lock);
> f0: 40 00 00 00 call f0 <journal_submit_data_buffers+0x54>
> f4: 90 10 00 12 mov %l2, %o0
> J_ASSERT(jinode->i_transaction == commit_transaction);
> f8: c2 5c 40 00 ldx [ %l1 ], %g1
> fc: 80 a0 40 19 cmp %g1, %i1
> 100: 22 68 00 07 be,a %xcc, 11c <journal_submit_data_buffers+0x80>
> 104: c2 04 60 28 ld [ %l1 + 0x28 ], %g1
Again, here we load jinode->i_flags.

> 108: 11 00 00 00 sethi %hi(0), %o0
> 10c: 92 10 21 04 mov 0x104, %o1
> 110: 40 00 00 00 call 110 <journal_submit_data_buffers+0x74>
> 114: 90 12 20 00 mov %o0, %o0
> 118: 91 d0 20 05 ta 5
> jinode->i_flags &= ~JI_COMMIT_RUNNING;
> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
> 11c: 90 04 60 28 add %l1, 0x28, %o0
> 120: 92 10 20 00 clr %o1
> err = journal_submit_inode_data_buffers(mapping);
> if (!ret)
> ret = err;
> spin_lock(&journal->j_list_lock);
> J_ASSERT(jinode->i_transaction == commit_transaction);
> jinode->i_flags &= ~JI_COMMIT_RUNNING;
> 124: 82 08 7f fe and %g1, -2, %g1
Here we do &= ~JI_COMMIT_RUNNING.

> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
> 128: 40 00 00 00 call 128 <journal_submit_data_buffers+0x8c>
> 12c: c2 24 60 28 st %g1, [ %l1 + 0x28 ]
And only here do we store it back to memory...

> struct jbd2_inode *jinode;
> int err, ret = 0;
> struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> 130: c2 5c 60 10 ldx [ %l1 + 0x10 ], %g1
> 134: a2 00 7f f0 add %g1, -16, %l1
> * prefetches into the prefetch-cache which only is accessible
> * by floating point operations in UltraSPARC-III and later.
> * By contrast, "#one_write" prefetches into the L2 cache
> * in shared state.
> */
> __asm__ __volatile__("prefetch [%0], #one_write"
> 138: c2 5c 60 10 ldx [ %l1 + 0x10 ], %g1
> 13c: c7 68 40 00 prefetch [ %g1 ], #one_write
> 140: 82 04 60 10 add %l1, 0x10, %g1
> 144: 80 a4 c0 01 cmp %l3, %g1
> 148: 32 6f ff e0 bne,a %xcc, c8 <journal_submit_data_buffers+0x2c>
> 14c: c4 5c 60 20 ldx [ %l1 + 0x20 ], %g2
> spin_lock(&journal->j_list_lock);
> J_ASSERT(jinode->i_transaction == commit_transaction);
> wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
> }
> spin_unlock(&journal->j_list_lock);
> 150: 90 10 00 12 mov %l2, %o0
> 154: 40 00 00 00 call 154 <journal_submit_data_buffers+0xb8>
> 158: b1 3e 20 00 sra %i0, 0, %i0
> return ret;
> }
> 15c: 81 cf e0 08 rett %i7 + 8
> 160: 01 00 00 00 nop
So the compiled code looks a bit suspicious to me. Having the disassembly
with symbols properly resolved would help confirm it. I'm adding the sparc
list to CC just in case someone sees the problem...

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR