Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb

From: Wu Fengguang
Date: Thu Sep 24 2009 - 04:34:04 EST


On Mon, Sep 14, 2009 at 07:17:21PM +0800, Jan Kara wrote:
> On Thu 10-09-09 17:49:10, Peter Zijlstra wrote:
> > On Wed, 2009-09-09 at 16:23 +0200, Jan Kara wrote:
> > > Well, what I imagined we could do is:
> > > Have a per-bdi variable 'pages_written' - that would reflect the number of
> > > pages written to the bdi since boot (OK, we'd have to handle overflows but
> > > that's doable).
> > >
> > > There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> > > in balance_dirty_pages() because we are over the limits, it kicks the
> > > writeback thread and does:
> > > to_wait = max(pages_waited, pages_written) + sync_dirty_pages() (or
> > > whatever number we decide)
> > > pages_waited = to_wait
> > > sleep until pages_written reaches to_wait or we drop below dirty limits.
> > >
> > > That will make sure each thread will sleep until writeback threads have done
> > > their duty for the writing thread.
> > >
> > > If we make sure sleeping threads are properly ordered on the wait queue,
> > > we could always wake up just the first one and thus avoid the herding
> > > effect. When we drop below the dirty limits, we would just wake up the
> > > whole wait queue.
> > >
> > > Does this sound reasonable?
> >
> > That seems to go wrong when there are multiple tasks waiting on the same
> > bdi: you'd count each page at 1/n of its weight.
> >
> > Suppose pages_written = 1024, and 4 tasks block and compute their
> > to_wait as pages_written + 256 = 1280; then we'd release all 4 of them
> > after 256 pages are written, instead of after 4*256, which would be
> > pages_written = 2048.
> Well, there's some locking needed of course. The intent is to stack
> demands as they come. So in case pages_written = 1024, pages_waited = 1024
> we would do:
> THREAD 1:
>
> spin_lock
> to_wait = 1024 + 256
> pages_waited = 1280
> spin_unlock
>
> THREAD 2:
>
> spin_lock
> to_wait = 1280 + 256
> pages_waited = 1536
> spin_unlock
>
> So the weight of each page will be kept. The fact that the second thread
> effectively waits until the first thread has its demand satisfied looks
> strange at first sight, but we don't do better currently and I think
> it's fine - if they were two writer threads, the thread released first
> would soon queue behind the thread still waiting, so in the long term the
> behavior should be fair.

Yeah, FIFO queuing should be good enough.
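
For reference, the per-bdi bookkeeping you describe would look roughly
like this (a sketch only; the lock and field names are illustrative):

        spin_lock(&bdi->wait_lock);
        to_wait = max(bdi->pages_waited, bdi->pages_written) +
                                        sync_writeback_pages();
        bdi->pages_waited = to_wait;
        spin_unlock(&bdi->wait_lock);

        /*
         * then sleep until bdi->pages_written reaches to_wait,
         * or until we drop below the dirty limits
         */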

I'd like to propose one more data structure for evaluation :)

- bdi->throttle_lock
- bdi->throttle_list:  pages to sync for each waiting task, taken from sync_writeback_pages()
- bdi->throttle_pages: (counted down) pages to sync for the head task; shall be atomic_t
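
In struct backing_dev_info this could look roughly like the below (only a
sketch; the entry struct and field names are illustrative, not from an
actual patch):

        struct bdi_throttle_entry {             /* one per throttled task */
                struct list_head        list;           /* on bdi->throttle_list */
                struct task_struct      *task;          /* who to wake up */
                long                    nr_to_sync;     /* pages this task waits for */
        };

        /* new fields in struct backing_dev_info */
        spinlock_t              throttle_lock;          /* protects throttle_list */
        struct list_head        throttle_list;          /* FIFO of waiting tasks */
        atomic_t                throttle_pages;         /* countdown for the head task */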

In balance_dirty_pages(), it would do

        nr_to_sync = sync_writeback_pages()
        if (list_empty(bdi->throttle_list))     # I'm the only task
                bdi->throttle_pages = nr_to_sync
        append nr_to_sync to bdi->throttle_list
        kick off background writeback
        wait
        remove itself from bdi->throttle_list and wait list
        set bdi->throttle_pages for new head task (or LONG_MAX)
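
Spelled out as rough, untested C against the fields sketched above (the
function name is made up and the sleep/locking details are simplified):

        static void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_to_sync)
        {
                struct bdi_throttle_entry te = {
                        .task           = current,
                        .nr_to_sync     = nr_to_sync,
                };

                spin_lock(&bdi->throttle_lock);
                if (list_empty(&bdi->throttle_list))
                        /* we become the head task: arm the countdown */
                        atomic_set(&bdi->throttle_pages, nr_to_sync);
                list_add_tail(&te.list, &bdi->throttle_list);
                __set_current_state(TASK_UNINTERRUPTIBLE);
                spin_unlock(&bdi->throttle_lock);

                /* kick off background writeback for this bdi here */

                schedule();     /* real code would loop and recheck dirty limits */

                spin_lock(&bdi->throttle_lock);
                list_del(&te.list);
                if (!list_empty(&bdi->throttle_list)) {
                        struct bdi_throttle_entry *head;

                        head = list_first_entry(&bdi->throttle_list,
                                                struct bdi_throttle_entry, list);
                        /* hand the countdown over to the new head task */
                        atomic_set(&bdi->throttle_pages, head->nr_to_sync);
                } else {
                        /* no waiters left: effectively "infinite" */
                        atomic_set(&bdi->throttle_pages, INT_MAX);
                }
                spin_unlock(&bdi->throttle_lock);
        }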

In __bdi_writeout_inc(), it would do

        if (--bdi->throttle_pages <= 0)
                check and wake up head task
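
Roughly, with IRQ and locking details glossed over (only the head task is
ever woken here):

        /* in __bdi_writeout_inc(), after accounting the completed page */
        if (atomic_dec_return(&bdi->throttle_pages) <= 0 &&
            !list_empty(&bdi->throttle_list)) {
                struct bdi_throttle_entry *head;

                spin_lock(&bdi->throttle_lock);
                if (!list_empty(&bdi->throttle_list)) {
                        head = list_first_entry(&bdi->throttle_list,
                                                struct bdi_throttle_entry, list);
                        wake_up_process(head->task);
                }
                spin_unlock(&bdi->throttle_lock);
        }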

In wb_writeback(), it would do

        if (args->for_background && exiting)
                wake up all throttled tasks
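
Something along these lines (again only a sketch):

        /* in wb_writeback(), when background writeback is about to stop */
        if (args->for_background && exiting) {
                struct bdi_throttle_entry *te;

                spin_lock(&bdi->throttle_lock);
                list_for_each_entry(te, &bdi->throttle_list, list)
                        wake_up_process(te->task);
                spin_unlock(&bdi->throttle_lock);
        }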

To avoid waking up too many tasks at the same time, the background
threshold can be relaxed a bit, so that __bdi_writeout_inc() becomes the
only wake-up point in normal cases:

        if (args->for_background && !list_empty(bdi->throttle_list) &&
            over background_thresh - background_thresh / 32)
                keep writing pages;
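
In the wb_writeback() loop the check might read something like the below,
with nr_dirty standing for the global dirty page count that
balance_dirty_pages() already computes (the 1/32 margin is of course
arbitrary):

        /* somewhere in the wb_writeback() writeback loop */
        if (args->for_background && !list_empty(&bdi->throttle_list) &&
            nr_dirty > background_thresh - background_thresh / 32)
                continue;       /* keep writing pages */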

Thanks,
Fengguang
