Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

From: Mike Travis
Date: Wed Aug 27 2008 - 10:35:40 EST


Nick Piggin wrote:
> On Wednesday 27 August 2008 06:01, Mike Travis wrote:
>> Dave Jones wrote:
>> ...
>>
>>> But yes, for this to be even remotely feasible, there has to be a
>>> negligible performance cost associated with it, which right now, we
>>> clearly don't have. Given that the number of people running 4096 CPU
>>> boxes even in a few years time will still be tiny, punishing the common
>>> case is obviously absurd.
>>>
>>> Dave
>> I did do some fairly extensive benchmarking between configs of NR_CPUS =
>> 128 and 4096 and most performance hits were in the neighborhood of < 5% on
>> systems with 8 cpus and 4GB of memory (our most common test system).
>
> 5% is a pretty nasty performance hit... what sort of benchmarks are we
> talking about here?

It's been a while now, so I should go back and check my notes. Many of the
benchmarks showed no change. I believe the ones that were right on the
edge of paging were affected because less memory was available.
>
> I just made some pretty crazy changes to the VM to get "only" around 5
> or so % performance improvement in some workloads.
>
> What places are making heavy use of cpumasks that causes such a slowdown?
> Hopefully callers can mostly be improved so they don't need to use cpumasks
> for common cases.

That's another study I did. It seemed that maybe 95% of the functions
would not be affected by passing pointers to cpumasks instead of the cpumasks
themselves, because the data is processed by a cpu_xxx function that
uses a pointer. The most common pattern was creating a temp cpumask with
cpus_and(temp_mask, callers_mask, cpu_online_map). Using nr_cpu_ids instead
of NR_CPUS as the bound in the traversal functions sped things up quite a
bit. Applying the same method in the cpus_xxx functions would speed things
up further, as would allocating cpumasks sized by nr_cpu_ids rather than by
NR_CPUS, which is what the current cpumask_t definition specifies.
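
A rough userspace sketch of the two ideas above: pass masks by pointer, and
bound the work by the number of possible cpu ids rather than the compile-time
maximum. All sketch_* names and macros here are hypothetical stand-ins, not
the kernel's cpumask API, and the nr_ids argument only plays the role that
nr_cpu_ids plays in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration only; not kernel definitions. */
#define SKETCH_NR_CPUS   4096   /* compile-time maximum, like NR_CPUS=4096 */
#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS(n)    (((n) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Sized by the compile-time maximum: 512 bytes when CHAR_BIT is 8. */
struct sketch_cpumask {
	unsigned long bits[MASK_WORDS(SKETCH_NR_CPUS)];
};

static void sketch_clear(struct sketch_cpumask *m)
{
	memset(m, 0, sizeof(*m));
}

static void sketch_set_cpu(struct sketch_cpumask *m, unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	m->bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

/*
 * dst = a & b. Masks are passed by pointer, so nothing is copied on the
 * stack, and only the words covering nr_ids cpus are touched. Words past
 * that point in dst are left as they were, so the caller should clear dst
 * first or only read bits below nr_ids.
 */
static void sketch_and(struct sketch_cpumask *dst,
		       const struct sketch_cpumask *a,
		       const struct sketch_cpumask *b,
		       unsigned int nr_ids)
{
	size_t i, words = MASK_WORDS(nr_ids);

	assert(nr_ids <= SKETCH_NR_CPUS);
	for (i = 0; i < words; i++)
		dst->bits[i] = a->bits[i] & b->bits[i];
}

/* Count the cpus set in m, scanning only ids below nr_ids. */
static unsigned int sketch_weight(const struct sketch_cpumask *m,
				  unsigned int nr_ids)
{
	unsigned int cpu, n = 0;

	assert(nr_ids <= SKETCH_NR_CPUS);
	for (cpu = 0; cpu < nr_ids; cpu++)
		if (m->bits[cpu / BITS_PER_WORD] &
		    (1UL << (cpu % BITS_PER_WORD)))
			n++;
	return n;
}
```

With a 64-bit unsigned long and nr_ids = 8, sketch_and() touches one word
per mask instead of the 64 words that a loop bounded by SKETCH_NR_CPUS would
visit. The struct itself is still sized for the maximum, which is the
allocation cost mentioned in the parenthetical above.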

>
> Until then, it would be kind of sad for a distro to ship a generic x86
> kernel and lose 5% performance because it is set to 4096 CPUs...
>
> But if I misunderstand and you're talking about specific microbenchmarks to
> find the worst case for huge cpumasks, then I take that back.

Yes, at the time I was trying to determine how many of the cpumask functions
were actually exercised by user tasks, so I was zeroing in on those (cpusets,
rescheds, etc.)

>
>
>> [But
>> changing cpumask_t's to be pointers instead of values will likely increase
>> this.] I've tried to be very sensitive to this issue with all my previous
>> changes, so convincing the distros to set NR_CPUS=4096 would be as painless
>> for them as possible. ;-)
>>
>> Btw, huge count cpu systems I don't think are that far away. I believe the
>> nextgen Larrabee chips will be geared towards HPC applications [instead of
>> just GFX apps], and putting 4 of these chips on a motherboard would add up
>> to 512 cpu threads (1024 if they support hyperthreading.)
>
> It would be quite interesting if they make them cache coherent / MP capable.
> Will they be?

There hasn't been much info available yet, but I think the 128 cores will
share at least an L2 cache + memory controller. How the APICs interact is
another big question. And most likely some standard system controller
CPU will be needed, but that could be a tiny VIA processor... ;-)

Thanks,
Mike