Re: [PATCH 1/6] cpuset write dirty map

From: Christoph Lameter
Date: Mon Sep 17 2007 - 15:11:18 EST


On Fri, 14 Sep 2007, Andrew Morton wrote:

> It'd be nice to see some discussion regarding the memory consumption of
> this patch and the associated tradeoffs.

On small NUMA system < 64 nodes you basically have an additional word
added to the address_space structure. On large NUMA we may have to
allocate very large per node arrays of up to 1024 nodes which may result
in 128 bytes used. However, the very large systems are rare thus we want
to do the allocation dynamically.

> afacit there is no code comment and no changelog text which explains the
> above design decision? There should be, please.

Hmmmm... There was an extensive discussion on that one early in the year
and I had it in the overview of the earlier releases. Ethan: Maybe get
that piece into the overview?

> There is talk of making cpusets available with CONFIG_SMP=n. Will this new
> feature be available in that case? (it should be).

Yes. Cpuset with CONFIG_SMP=n has been run for years on IA64 with Tony
Luck's special configurations.

> > EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
> >
> > +#if MAX_NUMNODES > BITS_PER_LONG
>
> waah. In other places we do "MAX_NUMNODES <= BITS_PER_LONG"

So

#if !(MAX_NUMNODES <= BITS_PER_LONG)

> > +/*
> > + * Special functions for NUMA systems with a large number of nodes.
> > + * The nodemask is pointed to from the address space structures.
> > + * The attachment of the dirty_node mask is protected by the
> > + * tree_lock. The nodemask is freed only when the inode is cleared
> > + * (and therefore unused, thus no locking necessary).
> > + */
>
> hmm, OK, there's a hint as to wghat's going on.
>
> It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
> we might want to tweak that in the future. Yet another argument for
> centralising this comparison.

A pointer has the size of BITS_PER_LONG. If we have less nodes then we
just waste memory and processing by having a pointer there.


> > +void cpuset_update_dirty_nodes(struct address_space *mapping,
> > + struct page *page)
> > +{
> > + nodemask_t *nodes = mapping->dirty_nodes;
> > + int node = page_to_nid(page);
> > +
> > + if (!nodes) {
> > + nodes = kmalloc(sizeof(nodemask_t), GFP_ATOMIC);
>
> Does it have to be atomic? atomic is weak and can fail.
>
> If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
> should at least pass the gfp_t into this function so it can do the stronger
> allocation when possible.

This is called from __set_page_dirty_nobuffers. There is no allocation
context available at that point. If the allocation fails then we do not
allocate a tracking structure which means that no targeted writeout occurs.

> > +void cpuset_clear_dirty_nodes(struct address_space *mapping)
> > +{
> > + nodemask_t *nodes = mapping->dirty_nodes;
> > +
> > + if (nodes) {
> > + mapping->dirty_nodes = NULL;
> > + kfree(nodes);
> > + }
> > +}
>
> Can this race with cpuset_update_dirty_nodes()? And with itself? If not,
> a comment which describes the locking requirements would be good.

You just quoted the comment above that explains the locking requirements.
Here it is again:

/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space structures.
+ * The attachment of the dirty_node mask is protected by the
+ * tree_lock. The nodemask is freed only when the inode is cleared
+ * (and therefore unused, thus no locking necessary).
+ */

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/