Re: [RFC] writeback and cgroup

From: Jan Kara
Date: Thu Apr 19 2012 - 16:27:29 EST

Next message: J. Bruce Fields: "nfsd bugfixes for 3.4"
Previous message: H. Peter Anvin: "Re: [PATCH 3/3] x86, extable: Handle early exceptions"
In reply to: Tejun Heo: "Re: [RFC] writeback and cgroup"
Next in thread: Fengguang Wu: "Re: [RFC] writeback and cgroup"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu 19-04-12 22:23:43, Wu Fengguang wrote:
> For one instance, splitting the request queues will give rise to
> PG_writeback pages. Those pages have been the biggest source of
> latency issues in the various parts of the system.
Well, if we allow more requests to be in flight in total then yes, number
of PG_Writeback pages can be higher as well.

> It's not uncommon for me to see filesystems sleep on PG_writeback
> pages during heavy writeback, within some lock or transaction, which in
> turn stall many tasks that try to do IO or merely dirty some page in
> memory. Random writes are especially susceptible to such stalls. The
> stable page feature also vastly increase the chances of stalls by
> locking the writeback pages.
>
> Page reclaim may also block on PG_writeback and/or PG_dirty pages. In
> the case of direct reclaim, it means blocking random tasks that are
> allocating memory in the system.
>
> PG_writeback pages are much worse than PG_dirty pages in that they are
> not movable. This makes a big difference for high-order page allocations.
> To make room for a 2MB huge page, vmscan has the option to migrate
> PG_dirty pages, but for PG_writeback it has no better choices than to
> wait for IO completion.
>
> The difficulty of THP allocation goes up *exponentially* with the
> number of PG_writeback pages. Assume PG_writeback pages are randomly
> distributed in the physical memory space. Then we have formula
>
> P(reclaimable for THP) = 1 - P(hit PG_writeback)^256
Well, this implicitely assumes that PG_Writeback pages are scattered
across memory uniformly at random. I'm not sure to which extent this is
true... Also as a nitpick, this isn't really an exponential growth since
the exponent is fixed (256 - actually it should be 512, right?). It's just
a polynomial with a big exponent. But sure, growth in number of PG_Writeback
pages will cause relatively steep drop in the number of available huge
pages.

...
> It's worth to note that running multiple flusher threads per bdi means
> not only disk seeks for spin disks, smaller IO size for SSD, but also
> lock contentions and cache bouncing for metadata heavy workloads and
> fast storage.
Well, this heavily depends on particular implementation (and chosen
data structures). But yes, we should have that in mind.

...
> > > To me, balance_dirty_pages() is *the* proper layer for buffered writes.
> > > It's always there doing 1:1 proportional throttling. Then you try to
> > > kick in to add *double* throttling in block/cfq layer. Now the low
> > > layer may enforce 10:1 throttling and push balance_dirty_pages() away
> > > from its balanced state, leading to large fluctuations and program
> > > stalls.
> >
> > Just do the same 1:1 inside each cgroup.
>
> Sure. But the ratio mismatch I'm talking about is inter-cgroup.
> For example there are only 2 dd tasks doing buffered writes in the
> system. Now consider the mismatch that cfq is dispatching their IO
> requests at 10:1 weights, while balance_dirty_pages() is throttling
> the dd tasks at 1:1 equal split because it's not aware of the cgroup
> weights.
>
> What will happen in the end? The 1:1 ratio imposed by
> balance_dirty_pages() will take effect and the dd tasks will progress
> at the same pace. The cfq weights will be defeated because the async
> queue for the second dd (and cgroup) constantly runs empty.
Yup. This just shows that you have to have per-cgroup dirty limits. Once
you have those, things start working again.

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: J. Bruce Fields: "nfsd bugfixes for 3.4"
Previous message: H. Peter Anvin: "Re: [PATCH 3/3] x86, extable: Handle early exceptions"
In reply to: Tejun Heo: "Re: [RFC] writeback and cgroup"
Next in thread: Fengguang Wu: "Re: [RFC] writeback and cgroup"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]