Re: [PATCH 1/1] cputime: Make the reported utime+stime correspond to the actual runtime.

From: Fredrik Markström
Date: Mon Jun 15 2015 - 11:34:54 EST


Hello Peter, your patch helps with some of the cases but not all:

(the "called with.." below means cputime_adjust() is called with the
values specified in its struct task_cputime argument.)

It helps when called with:

sum_exec_runtime=1000000000 utime=0 stime=1
... followed by...
sum_exec_runtime=1010000000 utime=100 stime=1

It doesn't help when called with:

sum_exec_runtime=1000000000 utime=1 stime=0
... followed by...
sum_exec_runtime=1010000000 utime=1 stime=100
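
The two sequences above can be replayed against a small user-space model
of cputime_adjust(). This is only a sketch under stated assumptions: the
names prev_model, advance() and model_adjust() are made up for
illustration, scale_stime() is reduced to a plain multiply and divide,
the stime clamp from the proposed fix is included, and rtime is passed in
already converted to ticks (10000 and 10100, matching the tot= values in
the test output further down).

```c
#include <stdint.h>

/* Hypothetical stand-in for the previously reported values. */
struct prev_model {
    uint64_t utime;
    uint64_t stime;
};

/* Models cputime_advance(): a counter only ever moves forward. */
static void advance(uint64_t *counter, uint64_t val)
{
    if (val > *counter)
        *counter = val;
}

/*
 * Simplified model of cputime_adjust() with the stime clamp applied.
 * rtime is in ticks; utime and stime are the tick-sampled values.
 * Returns the reported utime + stime.
 */
static uint64_t model_adjust(struct prev_model *prev, uint64_t rtime,
                             uint64_t utime, uint64_t stime)
{
    /* Nothing new to report if the exported sum already covers rtime. */
    if (prev->utime + prev->stime >= rtime)
        return prev->utime + prev->stime;

    if (utime == 0) {
        stime = rtime;
    } else if (stime == 0) {
        utime = rtime;
    } else {
        /* Plain integer scaling; the kernel's scale_stime() is more careful. */
        stime = stime * rtime / (stime + utime);
        if (stime < prev->stime)    /* the clamp from the proposed fix */
            stime = prev->stime;
        utime = rtime - stime;
    }

    advance(&prev->stime, stime);
    advance(&prev->utime, utime);
    return prev->utime + prev->stime;
}
```

Starting from a zeroed prev_model, calling model_adjust() with
(10000, 0, 1) and then (10100, 100, 1) reports 10100 in total. Calling it
with (10000, 1, 0) and then (10100, 1, 100) reports 20000: the clamp only
guards stime, so the stale utime of 10000 survives next to the newly
scaled stime of 10000.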

Also if we get a call with:

sum_exec_runtime=1000000000 utime=1 stime=1

... then get preempted after your proposed fix but before we are done
with the calls to cputime_advance(), and cputime_adjust() is then called
again (from a different thread) with:

sum_exec_runtime=1010000000 utime=100 stime=1

... it still breaks.
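
The interleaving can be modelled by splitting the work into a "compute"
step and a "publish" step and running them out of order by hand. Again
this is a sketch, not kernel code: compute(), publish(), advance() and
the two structs are hypothetical names, only the branch where both utime
and stime are non-zero is covered, the early "rtime already reported"
check is left out, and rtime is given in ticks.

```c
#include <stdint.h>

/* Hypothetical stand-in for the previously reported values. */
struct prev_model {
    uint64_t utime;
    uint64_t stime;
};

/* Values a caller has computed but not yet published. */
struct pending {
    uint64_t utime;
    uint64_t stime;
};

/* Models cputime_advance(): a counter only ever moves forward. */
static void advance(uint64_t *counter, uint64_t val)
{
    if (val > *counter)
        *counter = val;
}

/*
 * Step 1: scale stime, apply the clamp against the prev values visible
 * right now, and derive utime. Both utime and stime must be non-zero.
 */
static struct pending compute(const struct prev_model *prev, uint64_t rtime,
                              uint64_t utime, uint64_t stime)
{
    struct pending p;

    stime = stime * rtime / (stime + utime);
    if (stime < prev->stime)
        stime = prev->stime;
    p.stime = stime;
    p.utime = rtime - stime;
    return p;
}

/* Step 2: publish; this is where a preempted caller resumes. */
static void publish(struct prev_model *prev, struct pending p)
{
    advance(&prev->stime, p.stime);
    advance(&prev->utime, p.utime);
}
```

With a zeroed prev_model, let the first caller run compute() for
(10000, 1, 1) and stop there. The second caller then runs compute() and
publish() for (10100, 100, 1), leaving utime=10000 and stime=100. When the
first caller finally runs publish() with its stale 5000/5000 pair, stime
advances to 5000 while utime stays at 10000, so the reported total is
15000 for a runtime of 10100.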

I think there might be additional concurrency problems before, between
and/or possibly after the calls to cputime_advance(), at least if we
want to guarantee that sys+user should stay sane. I believe my
proposed patch eliminates those potential problems in a pretty
straightforward way.

I tried to come up with a lock free solution but didn't find a simple
one. Since, from what I understand, scalability issues are unlikely
here, I felt that simplicity was preferred. Also the current
implementation has two cmpxchg operations, and my proposal a single
spinlock, so on some setups I bet it's more efficient (like mine with
a lousy interconnect and preempt-rt, but I'm on thin ice here).
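
To make the trade-off concrete, here is a rough user-space sketch of the
lock-based idea. It is not the patch under discussion (its body is not
quoted in this mail); it only illustrates one way a single lock could
keep the utime/stime pair consistent. All names are hypothetical, a C11
atomic_flag stands in for a kernel spinlock, the scaling is a plain
multiply and divide, and rtime is in ticks.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical previously reported values plus the lock guarding them. */
struct prev_model {
    atomic_flag lock;
    uint64_t utime;
    uint64_t stime;
};

static void model_lock(struct prev_model *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void model_unlock(struct prev_model *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Compute and publish under one lock so that both values only move
 * forward and utime + stime equals rtime after every update.
 * Returns the reported utime + stime.
 */
static uint64_t locked_adjust(struct prev_model *prev, uint64_t rtime,
                              uint64_t utime, uint64_t stime)
{
    uint64_t sum;

    model_lock(prev);
    if (prev->utime + prev->stime < rtime) {
        if (utime == 0)
            stime = rtime;
        else if (stime != 0)
            stime = stime * rtime / (stime + utime);
        /* stime == 0 with utime != 0 stays 0: all of rtime goes to utime */

        if (stime < prev->stime)        /* keep stime monotonic */
            stime = prev->stime;
        utime = rtime - stime;
        if (utime < prev->utime) {      /* keep utime monotonic too */
            utime = prev->utime;
            stime = rtime - utime;
        }
        prev->utime = utime;
        prev->stime = stime;
    }
    sum = prev->utime + prev->stime;
    model_unlock(prev);
    return sum;
}
```

A prev_model initialised as { ATOMIC_FLAG_INIT, 0, 0 } and fed
(10000, 1, 0) followed by (10100, 1, 100) ends up with utime=10000 and
stime=100, a total of 10100. Because compute and publish happen inside
the same critical section, a second caller cannot slip in between them.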

Below is the output from my test application (it's too much of a hack
to post publicly as is), but I'd be happy to clean it up and post it if
necessary.

/Fredrik


#<test>.<step> <input> => <test>.<step> <output> [=====> FAILED]

0.0 sum_exec=100000000000 utime=0 stime=1 => 0.0 tot=10000 user=0 sys=10000
0.1 sum_exec=101000000000 utime=100 stime=1 => 0.1 tot=10100 user=100 sys=10000

1.0 sum_exec=100000000000 utime=1 stime=0 => 1.0 tot=10000 user=10000 sys=0
1.1 sum_exec=101000000000 utime=1 stime=100 => 1.1 tot=20000 user=10000 sys=10000 =====> FAILED

2.0 sum_exec=100000000000 utime=1 stime=1 => 2.0 tot=10000 user=5000 sys=5000
2.1 sum_exec=101000000000 utime=100 stime=1 => 2.1 tot=10100 user=5100 sys=5000

3.0 sum_exec=100000000000 utime=1 stime=1 => <<PREEMPT>>
3.1 sum_exec=101000000000 utime=100 stime=1 => 3.1 tot=10100 user=10000 sys=100
<<SWITCH BACK>> 3.0 tot=15000 user=10000 sys=5000 =====> FAILED


On Fri, Jun 12, 2015 at 1:01 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Jun 12, 2015 at 12:16:57PM +0200, Peter Zijlstra wrote:
>> On Fri, 2015-06-12 at 10:55 +0200, Fredrik Markstrom wrote:
>> > The scaling mechanism might sometimes cause top to report >100%
>> > (sometimes > 1000%) cpu usage for a single thread. This patch makes
>> > sure that stime+utime corresponds to the actual runtime of the thread.
>>
>> This Changelog is inadequate, it does not explain the actual problem.
>>
>> > +static DEFINE_SPINLOCK(prev_time_lock);
>>
>> global (spin)locks are bad.
>
> Since you have a proglet handy to test this; does something like the
> below help anything?
>
> ---
> kernel/sched/cputime.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f5a64ffad176..3d3f60a555a0 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -613,6 +613,10 @@ static void cputime_adjust(struct task_cputime *curr,
>
>  		stime = scale_stime((__force u64)stime,
>  				    (__force u64)rtime, (__force u64)total);
> +
> +		if (stime < prev->stime)
> +			stime = prev->stime;
> +
>  		utime = rtime - stime;
>  	}
>



--
/Fredrik