Re: [PATCH 2/4] Introduce a new fields "gtime" and "cgtime" in task_struct and signal_struct

From: Ingo Molnar
Date: Wed Aug 05 2009 - 03:00:11 EST



* Laurent Vivier <Laurent.Vivier@xxxxxxxx> wrote:

> [PATCH 2/4] like for cpustat, introduce the "gtime" (guest time of
> the task) and "cgtime" (guest time of the task children) fields
> for the tasks. Modify signal_struct and task_struct. Modify
> /proc/<pid>/stat to display these new fields.

> --- kvm.orig/include/linux/sched.h 2007-08-20 11:11:30.000000000 +0200
> +++ kvm/include/linux/sched.h 2007-08-20 13:00:02.000000000 +0200
> @@ -515,6 +515,10 @@ struct signal_struct {
> * in __exit_signal, except for the group leader.
> */
> cputime_t utime, stime, cutime, cstime;
> +#ifdef CONFIG_GUEST_ACCOUNTING
> + cputime_t gtime;
> + cputime_t cgtime;
> +#endif

A handful of general (and less general) observations about these
patches:

1- The code is very ugly due to being an #ifdef fest. Please
always try to avoid #ifdefs.

2- cputime_t is very coarse on x86: it is measured in jiffies. This
means that with a default HZ of 250 we'll have units of 4 msecs.
That's almost useless to rely on in new instrumentation: an irq
can come in and out without the accounting noticing it, etc. If
we add new statistics then they should be a lot finer than
jiffies granularity.

3- stime of vcpu tasks/threads already approximates 'guest time'
adequately (as Jeremy observed as well). Yes, it mixes 'true
guest mode' and 'host mode' system time, but then again due to
the jiffies granularity we have a _far_ bigger skew going on
already.

4- namespace collision: 'gtime' is already used as 'group time' in
a few places. One of the two things needs to be renamed.

5- tracepoints and perfcounters could be used to measure guest time
precisely, in a low-overhead mode.
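
To make the arithmetic in #2 concrete, here is a minimal user-space
sketch (not kernel code). The helper names jiffy_usecs() and
ticks_seen() are made up for illustration, and the model assumes
ticks fire at exact multiples of the jiffy length:

```c
#include <assert.h>
#include <stdint.h>

/* Length of one jiffy in microseconds for a given tick rate (HZ). */
static uint64_t jiffy_usecs(unsigned int hz)
{
	assert(hz > 0);
	return 1000000ULL / hz;
}

/*
 * Hypothetical model of tick-based accounting: ticks fire at j, 2j,
 * 3j, ... microseconds, and each tick charges one whole jiffy to
 * whatever is running. Returns how many ticks land in the interval
 * (start_us, end_us], i.e. how many jiffies the interval is charged.
 */
static uint64_t ticks_seen(uint64_t start_us, uint64_t end_us,
			   unsigned int hz)
{
	uint64_t j = jiffy_usecs(hz);

	assert(end_us >= start_us);
	return end_us / j - start_us / j;
}
```

With HZ=250, jiffy_usecs() gives 4000 usecs. An interval from 100 to
3900 usecs sees no tick and is charged nothing, while a 200 usec
interval that happens to straddle the tick at 4000 usecs is charged a
full jiffy.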

These issues need to be addressed in a meaningful way. #2 probably
means revamping cputime_t handling on x86 as a whole, not just
for gtime. But #3 is worth keeping in mind as well.

I think #5 is the most capable solution by a wide margin - we need
just a single tracepoint to emit 'nsecs spent in guest mode'
information and that's it. It would be a far smaller patch.
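
As a rough user-space illustration of what such a tracepoint would
carry (this is not the kernel tracepoint API, and struct
guest_time_model plus both helpers are hypothetical names), the
per-exit payload is just the difference of two nanosecond timestamps
taken at guest entry and guest exit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu state; not a real kernel structure. */
struct guest_time_model {
	uint64_t entered_ns;	/* timestamp of the last guest entry */
	uint64_t total_ns;	/* accumulated time in guest mode */
	int in_guest;		/* nonzero between entry and exit */
};

/* Record the timestamp at which the vcpu switches to guest mode. */
static void model_guest_enter(struct guest_time_model *m, uint64_t now_ns)
{
	assert(!m->in_guest);
	m->entered_ns = now_ns;
	m->in_guest = 1;
}

/*
 * Returns the nanoseconds spent in guest mode for this entry/exit
 * pair: the single value a tracepoint would need to emit.
 */
static uint64_t model_guest_exit(struct guest_time_model *m, uint64_t now_ns)
{
	uint64_t delta;

	assert(m->in_guest && now_ns >= m->entered_ns);
	delta = now_ns - m->entered_ns;
	m->total_ns += delta;
	m->in_guest = 0;
	return delta;
}
```

A 500 nsec stay in guest mode shows up as 500 here, whereas
jiffies-based accounting would round it to either zero or a whole
jiffy.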

The tracepoint might even sample the guest RIP and hence could be
used as a VM-exit profiler and 'perf record -e kvm:vm_exit + perf
report' could be used to examine/profile/trace guest exit reasons.

Ingo