Re: [RFC] page lock ordering and OCFS2

From: Andrew Morton
Date: Mon Oct 17 2005 - 20:25:46 EST


Zach Brown <zach.brown@xxxxxxxxxx> wrote:
>
> > So where is the lock inversion?
> >
> > Perhaps if you were to cook up one of those little threadA/threadB ascii
> > diagrams we could see where the inversion occurs?
>
> Yeah, let me give that a try. I'll try to trim it down to the relevant
> bits. First let's start with a totally fresh node and have a read get a
> read DLM lock and populate the page cache on this node:
>
> sys_read
> generic_file_aio_read
> ocfs2_readpage
> ocfs2_data_lock
> block_read_full_page
> ocfs2_data_unlock
>
> So it was only allowed to proceed past ocfs2_data_lock() towards
> block_read_full_page() once the DLM granted it a read lock. The call to
> ocfs2_data_unlock() only drops this caller's local reference on the
> lock. The lock itself still exists on that node, and it stays valid,
> holding data in the page cache, until a network message arrives saying
> that another node, probably one that wants to write, would like the
> lock dropped.
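>
> In pseudo-C the pattern looks something like this (just a sketch from
> memory, not code copied from the tree):
>
>	static int ocfs2_readpage(struct file *file, struct page *page)
>	{
>		struct inode *inode = page->mapping->host;
>		int ret;
>
>		/* blocks until the DLM grants this node a read lock */
>		ret = ocfs2_data_lock(inode, 0);
>		if (ret) {
>			unlock_page(page);
>			return ret;
>		}
>
>		ret = block_read_full_page(page, ocfs2_get_block);
>
>		/* drops only our local reference; the DLM lock stays
>		 * cached on this node until another node wants it */
>		ocfs2_data_unlock(inode, 0);
>		return ret;
>	}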
>
> DLM kernel threads respond to the network messages and truncate the page
> cache. While a thread is busy with this inode's lock, other paths on
> that node won't be able to get locks. Say one of those messages arrives.
> While a local DLM thread is invalidating the page cache, another user
> thread tries to read:
>
>  user thread                          dlm thread
>
>                                       kthread
>                                       ...
>                                       ocfs2_data_convert_worker

I assume there's an ocfs2_data_lock
hereabouts?

>                                       truncate_inode_pages
>  sys_read
>  generic_file_aio_read
>   * gets page lock
>  ocfs2_readpage
>  ocfs2_data_lock
>   (stuck waiting for dlm)
>                                       lock_page
>                                        (stuck waiting for page)
>

Why does ocfs2_readpage() need to take ocfs2_data_lock? (Is
ocfs2_data_lock a range-based read-lock thing, or what?)

> The user task holds a page lock while waiting for the DLM to allow it to
> proceed. The DLM thread can't make lock-granting progress because it is
> waiting for the page lock that the user task holds.
>
> I don't know how far to go in explaining what led up to laying out the
> locking like this. It is typical (and OCFS2 used to do this) to acquire
> the DLM locks up in file->{read,write} and pin them for the duration
> of the IO. This avoids the page lock and DLM lock inversion, but it
> suffers from a host of other problems -- most fatally, needing that vma
> walking to govern holding multiple DLM locks during an IO.
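>
> Schematically, the old scheme looked something like this (again a
> sketch, not what was actually in the tree):
>
>	static ssize_t ocfs2_file_read(struct file *filp, char __user *buf,
>				       size_t count, loff_t *ppos)
>	{
>		struct inode *inode = filp->f_mapping->host;
>		ssize_t ret;
>
>		/* pin the cluster lock across the whole IO */
>		ret = ocfs2_data_lock(inode, 0);
>		if (ret)
>			return ret;
>
>		ret = generic_file_read(filp, buf, count, ppos);
>
>		ocfs2_data_unlock(inode, 0);
>		return ret;
>	}
>
> so ->readpage never had to take a cluster lock while holding a page
> lock, but only at the cost of the problems above.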

Oh.

Have you considered using invalidate_inode_pages() instead of
truncate_inode_pages()? If that leaves any pages behind, drop the read
lock, sleep a bit, try again - something klunky like that might get you out
of trouble, dunno.
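
Something with that shape, maybe (a sketch only - I'm guessing at the
ocfs2 lock calls, and it could spin if pages stay dirty or mapped):

	/* in the downconvert worker, instead of truncate_inode_pages() */
	for (;;) {
		/* skips locked, dirty and mapped pages instead of
		 * blocking on the page lock */
		invalidate_inode_pages(inode->i_mapping);
		if (inode->i_mapping->nrpages == 0)
			break;
		/* something was left behind: drop the read lock, sleep
		 * a bit, try again */
		ocfs2_data_unlock(inode, 0);
		msleep(10);
		ocfs2_data_lock(inode, 0);
	}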

