Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()

From: Linus Torvalds
Date: Fri Jan 08 2010 - 12:23:28 EST

On Fri, 8 Jan 2010, Peter Zijlstra wrote:

> On Tue, 2010-01-05 at 20:20 -0800, Linus Torvalds wrote:
> >
> > Yeah, I should have looked more at your callchain. That's nasty. Much
> > worse than the per-mm lock. I thought the page buffering would avoid the
> > zone lock becoming a huge problem, but clearly not in this case.
>
> Right, so I ran some numbers on a multi-socket (2) machine as well:
>
>                                pf/min
>
> -tip                         56398626
> -tip + xadd                 174753190
> -tip + speculative          189274319
> -tip + xadd + speculative   200174641
>
> [ variance is around 0.5% for this workload, ran most of these numbers
> with --repeat 5 ]

That's a huge jump. It's clear that the spinlock-based rwsems simply
suck. The speculation gets rid of some additional mmap_sem contention,
but at least for two sockets it looks like the rwsem implementation was
the biggest problem by far.

> At both the xadd/speculative point the workload is dominated by the
> zone->lock, the xadd+speculative removes some of the contention, and
> removing the various RSS counters could yield another few percent
> according to the profiles, but then we're pretty much there.

I don't know if worrying about a few percent is worth it. "Perfect is the
enemy of good", and the workload is pretty dang artificial with the whole
"remove pages and re-fault them as fast as you can".

So the benchmark is pointless and extreme, and I think it's not worth
worrying too much about details. Especially when compared to just the
*three-fold* jump from just the fairly trivial rwsem implementation change
(with speculation on top of it then adding another 15% improvement -
nothing to sneeze at, but it's still in a different class).
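
For anyone who wants to check the ratios, here is a minimal sketch of the
arithmetic. The pf/min figures are copied from the quoted table; the
variable names and the gain() helper are made up for illustration only.

```python
# pf/min figures copied from the table quoted earlier in this mail
tip = 56398626
xadd = 174753190
speculative = 189274319
xadd_speculative = 200174641


def gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


# Speedup from the rwsem (xadd) change alone, relative to plain -tip
xadd_ratio = xadd / tip

# Extra improvement from adding speculation on top of xadd
spec_on_top = gain(xadd, xadd_speculative)

print(f"xadd vs -tip:              {xadd_ratio:.2f}x")
print(f"speculative on top of it:  {spec_on_top:.1f}%")
```

That comes out at a bit over 3x for the rwsem change and roughly 14.5% for
speculation on top, which is where the "three-fold" and "15%" above come
from.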

Of course, larger numbers of sockets will likely change the situation, but
at the same time I do suspect that workloads designed for hundreds of
cores will need to try to behave better than that benchmark anyway ;)

> One way around those RSS counters is to track it per task, a quick grep
> shows it's only the oom-killer and proc that use them.
>
> A quick hack removing them gets us: 203158058

Yeah, well.. After that 200% and 15% improvement, a 1.5% improvement on a
totally artificial benchmark looks less interesting.

Because let's face it - if your workload does several million page faults
per second, you're just doing something fundamentally _wrong_.
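
To put those numbers in per-second terms, a quick sketch of the conversion,
again using only the figures quoted in this thread (the variable names are
illustrative):

```python
# pf/min figures quoted in this thread
xadd_speculative = 200174641   # -tip + xadd + speculative
no_rss_counters = 203158058    # same, with the RSS counter hack

# Convert the per-minute rate into page faults per second
faults_per_sec = xadd_speculative / 60

# Relative improvement from removing the RSS counters, in percent
rss_gain = (no_rss_counters / xadd_speculative - 1.0) * 100.0

print(f"page faults per second: {faults_per_sec:,.0f}")
print(f"gain from RSS hack:     {rss_gain:.2f}%")
```

So the benchmark is running at somewhat over three million page faults per
second, and the RSS counter hack buys about 1.5% on top of that.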

Linus