Re: find_busiest_group using lots of CPU

From: Peter Zijlstra
Date: Mon Oct 05 2009 - 08:29:56 EST


On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> Hi,
>
> I stuffed a few more SSDs into my test box. Running a simple workload
> that just does streaming reads from 10 processes (throughput is around
> 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
> a 64 thread box.
>
> The top two profile entries are:
>
> 10.86% fio [kernel] [k] find_busiest_group
> |
> |--99.91%-- thread_return
> | io_schedule
> | sys_io_getevents
> | system_call_fastpath
> | 0x7f4b50b61604
> | |
> | --100.00%-- td_io_getevents
> | io_u_queued_complete
> | thread_main
> | run_threads
> | main
> | __libc_start_main
> --0.09%-- [...]
>
> 5.78% fio [kernel] [k] cpumask_next_and
> |
> |--67.21%-- thread_return
> | io_schedule
> | sys_io_getevents
> | system_call_fastpath
> | 0x7f4b50b61604
> | |
> | --100.00%-- td_io_getevents
> | io_u_queued_complete
> | thread_main
> | run_threads
> | main
> | __libc_start_main
> |
> --32.79%-- find_busiest_group
> thread_return
> io_schedule
> sys_io_getevents
> system_call_fastpath
> 0x7f4b50b61604
> |
> --100.00%-- td_io_getevents
> io_u_queued_complete
> thread_main
> run_threads
> main
> __libc_start_main
>
> This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled, I just tried with
> both disabled but that yields the same result (well actually worse, 22%
> spent in there. dunno if that's normal "fluctuation"). GROUP_SCHED is
> not set. This seems way excessive!

io_schedule() straight into find_busiest_group() leads me to think this
could be SD_BALANCE_NEWIDLE, does something like:

for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
	val=$(cat "$i")
	echo $((val & ~0x02)) > "$i"
done

[ assuming SCHED_DEBUG=y ]

Cure things?

If so, then it's spending time looking for work, which there might not be
on your machine, since everything is waiting for IO or somesuch.
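
Before flipping anything it may help to see which domains currently have
the bit set. A minimal read-only sketch, assuming the same
/proc/sys/kernel/sched_domain layout as the loop above; show_newidle and
SCHED_DOMAIN_ROOT are made-up names for illustration, not existing
tools:

```shell
# Sketch: report, without writing anything, what the loop would change.
# SCHED_DOMAIN_ROOT and show_newidle are illustrative names.
SCHED_DOMAIN_ROOT=${SCHED_DOMAIN_ROOT:-/proc/sys/kernel/sched_domain}

show_newidle() {
	root=${1:-$SCHED_DOMAIN_ROOT}
	for f in "$root"/cpu*/domain*/flags; do
		[ -r "$f" ] || continue         # glob matched nothing, or unreadable
		val=$(cat "$f")
		new=$(( $val & ~0x02 ))         # same mask as the loop
		if [ "$val" -ne "$new" ]; then state=set; else state=clear; fi
		echo "$f $val -> $new ($state)"
	done
}
```

Running show_newidle with no argument lists every flags file with its
current value and the value it would have after the loop; keeping that
output around also gives you the numbers to write back afterwards.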

Not really sure what to do about it though. This is a quad-socket
Nehalem, right? We could possibly disable SD_BALANCE_NEWIDLE at the NODE
level, but that would again decrease throughput in things like kbuild.
