Re: mmotm 2009-04-10-02-21 uploaded - forkbombed by work_on_cpu

From: Linus Torvalds
Date: Mon Apr 13 2009 - 12:09:45 EST




On Sat, 11 Apr 2009, Valdis.Kletnieks@xxxxxx wrote:
>
> Probable cause for my problem:
>
> arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c calls work_on_cpu(). We get into a
> state where we have enough activity to kick us to a high CPU speed, and then
> the activity of writing 90 acct records per sec keeps us there - with continual
> callbacks to see if we can drop the CPU speed.

Ok, I think that that work_on_cpu() commit is broken, but I _also_ think
that cpufreq is doing something fairly insane.

This behavior seems to be triggered by the "ondemand" policy case, btw,
and what it does is basically:

  dbs_check_cpu:
	for_each_cpu(j, policy->cpus)
		...
		freq_avg = __cpufreq_driver_getavg(policy, j);

where "__cpufreq_driver_getavg()" will do "freq->getavg(policy, cpu)" and
then acpi-cpufreq.c will do that "work_on_cpu()" as part of the call to
"get_measured_perf()".

So pretty much _every_ pass through that loop is effectively going to do a
broadcast "work on each cpu" thing. That's always going to be pretty damn
expensive.

And there's no _reason_. As far as I can tell, that ACPI cpufreq thing
doesn't _need_ any "process context". That "get_measured_perf()" will
just do a single read_measured_perf_ctrs() call, and all that does is two
'rdmsr()' calls.

So afaik, acpi-cpufreq.c should not use "work_on_cpu()" for that at all.
It should just do a smp_call_function_single().

So I do think Andrew's commit is broken and we should think about it a bit
more, but I also think that Valdis' problem comes from acpi-cpufreq just
being damn stupid. Doing a smp_call_function_single() to read two MSRs is
going to be a _lot_ more efficient than doing that crazy work_on_cpu() for
that.
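
Something like the following is roughly the shape I mean. It is only a
userspace mock-up, not the actual acpi-cpufreq code: mock_rdmsr() and
mock_smp_call_function_single() are stand-ins for the kernel's rdmsr() and
smp_call_function_single() so that it compiles outside the kernel, "struct
perf_ctrs" and get_perf_ctrs_on() are made-up names, and I'm assuming the
two MSRs in question are the APERF/MPERF pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_MPERF 0xe7
#define MSR_IA32_APERF 0xe8
#define MOCK_NR_CPUS   4	/* hypothetical CPU count for the mock */

static int mock_current_cpu;	/* which CPU the mock is "running on" */

/* Stand-in for rdmsr(msr, lo, hi): returns fake per-CPU values. */
static void mock_rdmsr(uint32_t msr, uint32_t *lo, uint32_t *hi)
{
	*lo = msr + 0x1000u * (uint32_t)mock_current_cpu;
	*hi = (uint32_t)mock_current_cpu;
}

/*
 * Stand-in with the same argument shape as the kernel's
 * smp_call_function_single(cpu, func, info, wait). The real one sends
 * an IPI and runs func in interrupt context on the target CPU, so func
 * must not sleep. Here we just pretend to be on that CPU.
 */
static int mock_smp_call_function_single(int cpu, void (*func)(void *info),
					 void *info, int wait)
{
	int saved = mock_current_cpu;

	(void)wait;
	assert(func != NULL);
	if (cpu < 0 || cpu >= MOCK_NR_CPUS)
		return -1;	/* the real call also fails for a CPU that is not online */

	mock_current_cpu = cpu;
	func(info);
	mock_current_cpu = saved;
	return 0;
}

/* Hypothetical container for the two counters. */
struct perf_ctrs {
	uint64_t aperf;
	uint64_t mperf;
};

/* Runs on the target CPU: just two MSR reads, no process context needed. */
static void read_measured_perf_ctrs(void *info)
{
	struct perf_ctrs *c = info;
	uint32_t lo, hi;

	mock_rdmsr(MSR_IA32_APERF, &lo, &hi);
	c->aperf = ((uint64_t)hi << 32) | lo;
	mock_rdmsr(MSR_IA32_MPERF, &lo, &hi);
	c->mperf = ((uint64_t)hi << 32) | lo;
}

/* One cross-call to one CPU, waiting for the result. */
static int get_perf_ctrs_on(int cpu, struct perf_ctrs *out)
{
	return mock_smp_call_function_single(cpu, read_measured_perf_ctrs, out, 1);
}
```

The point is that the callback does nothing that could sleep, so there is
nothing that needs a worker thread on the target CPU.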

So the _real_ problem came in through commits like

    cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
    cpumask: use work_on_cpu in acpi-cpufreq.c for read_measured_perf_ctrs

that were meant to reduce stack usage with big cpu masks. And sure, the
_old_ way of doing it was also stupid (it rescheduled the process to the
other CPU by using set_cpus_allowed_ptr()).

Mike, Ingo?

Linus