Re: [PATCH] writeback: Fix broken sync writeback

From: Linus Torvalds
Date: Tue Feb 16 2010 - 22:37:05 EST

On Wed, 17 Feb 2010, Jan Kara wrote:
>
> I've read the code. Maybe I'm missing something but look:
> writeback_inodes_wb(nr_to_write = 1024)
> -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
> ...
> writeback_single_inode()
> ...writes 1024 pages.
> if we haven't written everything in the inode (more than 1024 dirty
> pages) we end up doing either requeue_io() or redirty_tail(). In the
> first case the inode is put to b_more_io list, in the second case to
> the tail of b_dirty list. In either case it will not receive further
> writeout until we go through all other members of current b_io list.
>
> So I claim we currently *do* switch to another inode after 4 MB. That
> is a fact.
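
[Illustration, not part of the original message and not kernel code: a toy Python model of the sequence quoted above. It assumes 4 KiB pages (so 1024 pages is the 4 MB mentioned), models only the requeue_io() path, and uses made-up inode names. The deques merely stand in for wb->b_io and wb->b_more_io.]

```python
from collections import deque

MAX_WRITEBACK_PAGES = 1024  # per-inode page budget named in the thread
PAGE_SIZE = 4096            # assumption: 4 KiB pages, so 1024 pages = 4 MB


def simulate_writeback(dirty_pages, chunk=MAX_WRITEBACK_PAGES):
    """Return the order in which (inode, pages_written) chunks are issued.

    dirty_pages maps a hypothetical inode name to its dirty page count.
    This is a simplified model of the behaviour described in the quoted
    text, not a transcription of fs/fs-writeback.c.
    """
    b_io = deque(dirty_pages.items())  # stands in for wb->b_io
    b_more_io = deque()                # stands in for wb->b_more_io
    trace = []
    while b_io or b_more_io:
        if not b_io:
            # Requeued inodes get another turn only once b_io is drained.
            b_io, b_more_io = b_more_io, deque()
        inode, remaining = b_io.popleft()
        written = min(chunk, remaining)
        trace.append((inode, written))
        if remaining > written:
            # Models requeue_io(): the inode waits behind the rest of b_io.
            b_more_io.append((inode, remaining - written))
    return trace


if __name__ == "__main__":
    for inode, pages in simulate_writeback(
        {"big": 2500, "small_a": 10, "small_b": 10}
    ):
        print(f"{inode}: {pages} pages ({pages * PAGE_SIZE // 1024} KiB)")
```

[In this model the 2500-page inode is written as 1024 + 1024 + 452 pages, with both small inodes serviced after its first chunk. Passing a chunk larger than any inode's dirty count makes each inode go out in one piece, which is roughly what dropping the limit would look like in this simplified picture.]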

Ok, I think that's the bug. I do agree that it may well be intentional,
but considering the performance impact, I suspect it's been "intentional
without any performance numbers".

Which just makes me very unhappy to just paper it over for the sync case,
and leave the now known-broken state alone for the async case. That really
isn't how we want to do things.

That said, if we've done this forever, I can certainly see the allure to
just keep doing it, and then handle the sync case separately.

> I do find this design broken as well as you likely do and think that the
> livelock issue described in the above paragraph should be solved differently
> (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix.

Hmm. The thing is, the new radix tree bit you propose also sounds like
overdesigning things.

If we really do switch inodes (which I obviously didn't expect, even if I
may have been aware of it many years ago), then the max rate limiting is
just always bad.

If it's bad for synchronous syncs, then it's bad for background syncing
too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely -
since the whole latency argument goes away if we don't always honor it
("Oh, we have good latency - _except_ if you do 'sync()' to synchronously
write something out" - that's just insane).

> The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation would you accept some fix along the lines of original Jens'
> fix?

What is affected if we just remove MAX_WRITEBACK_PAGES entirely (as
opposed to the patch under discussion that effectively removes it for
WB_SYNC_ALL)?

I see balance_dirty_pages -> bdi_start_writeback, but that if anything
would be something that I think would be better off with efficient
writeback, and doesn't seem like it should try to round-robin over inodes
for latency reasons.

But I guess we can do it in stages, if it's about "minimal changes for
2.6.32/33".

Linus