Re: [RFC 0/8] Cpuset aware writeback

From: Andrew Morton
Date: Tue Jan 16 2007 - 18:41:29 EST


> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <clameter@xxxxxxx> wrote:
>
> ...
>
> > > This may result in a large percentage of a cpuset
> > > becoming dirty without writeout being triggered. Under NFS
> > > this can lead to OOM conditions.
> >
> > OK, a big question: is this patchset a performance improvement or a
> > correctness fix? Given the above, and the lack of benchmark results I'm
> > assuming it's for correctness.
>
> It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom? Because it allocates potentially-unbounded
> > numbers of requests in the writeback path?
> >
> > It was able to go oom on non-numa machines before dirty-page-tracking
> > went in. So a general problem has now become specific to some NUMA
> > setups.
>
>
> Right. The issue is that large portions of memory become dirty or go under
> writeback, since no writeback is triggered when dirty limits are not
> checked per cpuset. NFS then attempts writeout during LRU scans but is
> unable to allocate memory.
>
> > So an obvious, equivalent and vastly simpler "fix" would be to teach
> > the NFS client to go off-cpuset when trying to allocate these requests.
>
> Yes we can fix these allocations by allowing processes to allocate from
> other nodes. But then the container function of cpusets is no longer
> there.

But that's what your patch already does!

It asks pdflush to write the pages instead of the direct-reclaim caller.
The only reason pdflush doesn't go oom is that pdflush lives outside the
direct-reclaim caller's cpuset and is hence able to obtain those nfs
requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is fixed?)
>
> NFS is okay as far as I can tell. Dirty throttling works fine in
> non-cpuset environments because we throttle if 40% of memory becomes
> dirty or under writeback.

Repeat: NFS shouldn't go oom. It should fail the allocation, recover, wait
for existing IO to complete. Back that up with a mempool for NFS requests
and the problem is solved, I think?
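
As a rough illustration of the "fail the allocation, recover, wait" behaviour
backed by a reserve, here is a userspace sketch loosely modelled on how a
mempool behaves. It is not the kernel mempool API: struct request, struct
req_pool, RESERVE_SIZE, the alloc_fn hook and failing_alloc are all
hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_SIZE 4          /* hypothetical: guaranteed minimum of requests */

struct request { int id; };     /* stand-in for an NFS write request */

struct req_pool {
    struct request *reserve[RESERVE_SIZE];
    int nr_free;                /* reserve elements currently available */
};

/* Hypothetical allocator hook, so allocation failure can be simulated. */
typedef void *(*alloc_fn)(size_t);

/* Preallocate the reserve while memory is still plentiful. */
static int pool_init(struct req_pool *p)
{
    p->nr_free = 0;
    for (int i = 0; i < RESERVE_SIZE; i++) {
        struct request *r = malloc(sizeof(*r));
        if (!r) {
            /* Undo partial setup rather than leak. */
            while (p->nr_free > 0)
                free(p->reserve[--p->nr_free]);
            return -1;
        }
        p->reserve[p->nr_free++] = r;
    }
    return 0;
}

/*
 * Try the normal allocator first. If it fails, fall back to the reserve.
 * A NULL return means "no memory and reserve empty": the caller should
 * wait for in-flight IO to complete and retry, not go oom.
 */
static struct request *pool_alloc(struct req_pool *p, alloc_fn alloc)
{
    struct request *r = alloc(sizeof(*r));
    if (r)
        return r;
    if (p->nr_free > 0)
        return p->reserve[--p->nr_free];
    return NULL;
}

/* On IO completion: refill the reserve first, otherwise really free. */
static void pool_free(struct req_pool *p, struct request *r)
{
    if (p->nr_free < RESERVE_SIZE)
        p->reserve[p->nr_free++] = r;
    else
        free(r);
}

/* Simulates an allocator that always fails (memory exhausted). */
static void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}
```

With failing_alloc standing in for an exhausted cpuset, the first
RESERVE_SIZE calls to pool_alloc() still succeed from the reserve, the next
one returns NULL, and a single pool_free() makes one more request available.
That bounds the number of outstanding requests instead of letting them grow
until oom.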

> > I don't understand why the proposed patches are cpuset-aware at all. This
> > is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> > more general. For example, i386 machines can presumably get into trouble
> > if all of ZONE_DMA or ZONE_NORMAL get dirty. A good implementation would
> > address that problem as well. So I think it should all be per-zone?
>
> No. A zone can be completely dirty as long as we are allowed to allocate
> from other zones.

But we also can get into trouble if a *zone* is all-dirty. Any solution to
the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks? When page reclaim encounters a
> > dirty page on the zone LRU, we automatically know that page->mapping->host
> > has at least one dirty page in this zone, yes? We could immediately ask
>
> Yes, but when we enter reclaim most of the pages of a zone may already be
> dirty/writeback so we fail.

No. If the dirty limits become per-zone then no zone will ever have >40%
dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
reduction in that zone, throttling the dirtying process. I suspect this
would work very badly in common situations with, say, typical i386 boxes.
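
To make the difference between a global and a per-zone check concrete, here
is a small sketch. Only the 40% figure comes from the discussion above;
struct zone_stats, the helper names and any page counts fed to them are
hypothetical and are not taken from the kernel or from the patchset.

```c
#include <stdbool.h>

#define DIRTY_RATIO_PCT 40      /* the 40% figure from the discussion */

/* Hypothetical per-zone counters. */
struct zone_stats {
    unsigned long nr_pages;     /* total pages in the zone */
    unsigned long nr_dirty;     /* pages dirty or under writeback */
};

/* Per-zone check: has this zone hit the dirty limit? */
static bool zone_over_dirty_limit(const struct zone_stats *z)
{
    if (z->nr_pages == 0)
        return false;
    return z->nr_dirty * 100 >= z->nr_pages * DIRTY_RATIO_PCT;
}

/* Global check: sums all zones, so one small all-dirty zone is hidden. */
static bool global_over_dirty_limit(const struct zone_stats *zones, int n)
{
    unsigned long pages = 0, dirty = 0;

    for (int i = 0; i < n; i++) {
        pages += zones[i].nr_pages;
        dirty += zones[i].nr_dirty;
    }
    if (pages == 0)
        return false;
    return dirty * 100 >= pages * DIRTY_RATIO_PCT;
}

/* Index of the first zone that should trigger throttling, or -1 if none. */
static int first_zone_to_throttle(const struct zone_stats *zones, int n)
{
    for (int i = 0; i < n; i++)
        if (zone_over_dirty_limit(&zones[i]))
            return i;
    return -1;
}
```

Given a small zone of 4096 pages with 4000 dirty next to two large, mostly
clean zones, global_over_dirty_limit() stays false while
first_zone_to_throttle() picks out the small zone. The same arithmetic hints
at why a strict per-zone limit could throttle writers very early on machines
with small zones.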

