Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

From: Dongli Zhang
Date: Mon Oct 30 2017 - 20:16:20 EST


Hi Boris,

On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because the steal value returned by
>> xen_steal_clock() might be less than this_rq()->prev_steal_time, which is
>> derived from the previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu 198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu 200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and are cleared by the xen hypervisor
>> during live migration, the idea of this patch is to accumulate runstate
>> times into global percpu variables before the live migration suspend. Once
>> the guest VM is resumed, xen_get_runstate_snapshot_cpu() always returns the
>> sum of the new runstate times and the previously accumulated times stored
>> in the global percpu variables.
>>
>> A similar but more severe issue impacts Linux 4.8-4.10, as discussed by
>> Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> where the steal time can overflow and lead to 100% st usage in the top
>> command. A backport of this patch would fix that issue.
>>
>> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>>
>> ---
>> Changed since v1:
>> * relocate modification to xen_get_runstate_snapshot_cpu
>>
>> Changed since v2:
>> * accumulate runstate times before live migration
>>
>> Changed since v3:
>> * do not accumulate times in the case of guest checkpointing
>>
>> Changed since v4:
>> * allocate array of vcpu_runstate_info to reduce number of memory allocation
>>
>> ---
>> drivers/xen/manage.c | 2 ++
>> drivers/xen/time.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
>> include/xen/interface/vcpu.h | 2 ++
>> include/xen/xen-ops.h | 1 +
>> 4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>> index c425d03..3dc085d 100644
>> --- a/drivers/xen/manage.c
>> +++ b/drivers/xen/manage.c
>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>> }
>>
>> gnttab_suspend();
>> + xen_accumulate_runstate_time(-1);
>> xen_arch_pre_suspend();
>>
>> /*
>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>> : 0);
>>
>> xen_arch_post_suspend(si->cancelled);
>> + xen_accumulate_runstate_time(si->cancelled);
>
> I am not convinced that the comment above HYPERVISOR_suspend() is
> correct. The call can return an error code and so if it returns -EPERM
> (which AFAICS it can't now but might in the future) then
> xen_accumulate_runstate_time() will do wrong thing.

I would split xen_accumulate_runstate_time() into two functions to avoid the
-EPERM issue: one for saving the times before suspend and one for accumulating
them after resume.

Otherwise, would it be acceptable to call xen_accumulate_runstate_time(2) for
saving before suspend and xen_accumulate_runstate_time(si->cancelled) after
resume?

>
>
>> gnttab_resume();
>>
>> if (!si->cancelled) {
>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index ac5f23f..cf3afb9 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -19,6 +19,9 @@
>> /* runstate info updated by Xen */
>> static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>
>> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
>> +static struct vcpu_runstate_info *runstate_delta;
>
> I'd move this inside xen_accumulate_runstate_time() since that's the

If we split xen_accumulate_runstate_time() into two functions, we would still
keep runstate_delta as a global static variable.

> only function that uses it. And why does it need to be
> vcpu_runstate_info and not u64[4]?

This was suggested by Juergen in order to avoid the allocation and reclaim of
the second-dimension arrays, as was done in v4 of this patch.

Or would you prefer to allocate sizeof(u64[4]) * num_possible_cpus() and
emulate the 2-d array with this 1-d allocation, moving the pointer forward by
sizeof(u64[4]) in each iteration?

>
>> +
>> /* return an consistent snapshot of 64-bit time/counter value */
>> static u64 get64(const u64 *p)
>> {
>> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>> return ret;
>> }
>>
>> -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> - unsigned int cpu)
>> +static void xen_get_runstate_snapshot_cpu_delta(
>> + struct vcpu_runstate_info *res, unsigned int cpu)
>> {
>> u64 state_time;
>> struct vcpu_runstate_info *state;
>> @@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> (state_time & XEN_RUNSTATE_UPDATE));
>> }
>>
>> +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> + unsigned int cpu)
>> +{
>> + int i;
>> +
>> + xen_get_runstate_snapshot_cpu_delta(res, cpu);
>> +
>> + for (i = 0; i < RUNSTATE_max; i++)
>> + res->time[i] += per_cpu(old_runstate_time, cpu)[i];
>> +}
>> +
>> +void xen_accumulate_runstate_time(int action)
>> +{
>> + struct vcpu_runstate_info state;
>> + int cpu, i;
>> +
>> + switch (action) {
>> + case -1: /* backup runstate time before suspend */
>> + WARN_ON_ONCE(unlikely(runstate_delta));
>
> pr_warn_once(), to be consistent with the rest of the file. And then
> should you return if this is true?

I would prefer not to return if it is true, but just warn the administrator
that there is a memory leak, while letting the runstate accumulation proceed
normally.

>
>> +
>> + runstate_delta = kcalloc(num_possible_cpus(),
>> + sizeof(*runstate_delta),
>> + GFP_KERNEL);
>> + if (unlikely(!runstate_delta)) {
>> + pr_alert("%s: failed to allocate runstate_delta\n",
>> + __func__);
>
> pr_warn() should be sufficient. Below too.
>
> Also, as a side question --- can we do kmalloc() at this point?

Yes. kmalloc_array() is better than kcalloc(), unless we have a
two-dimensional array and need to guarantee that every element is
zero-initialized.

>
>> + return;
>> + }
>> +
>> + for_each_possible_cpu(cpu) {
>> + xen_get_runstate_snapshot_cpu_delta(&state, cpu);
>> + memcpy(runstate_delta[cpu].time, state.time,
>> + RUNSTATE_max * sizeof(*runstate_delta[cpu].time));
>
> sizeof(runstate_delta[cpu].time).
>
>> + }
>> +
>> + break;
>> +
>> + case 0: /* backup runstate time after resume */
>> + if (unlikely(!runstate_delta)) {
>> + pr_alert("%s: cannot accumulate runstate time as runstate_delta is NULL\n",
>> + __func__);
>> + return;
>> + }
>> +
>> + for_each_possible_cpu(cpu) {
>> + for (i = 0; i < RUNSTATE_max; i++)
>> + per_cpu(old_runstate_time, cpu)[i] +=
>> + runstate_delta[cpu].time[i];
>> + }
>> + break;
>> +
>> + default: /* do not accumulate runstate time for checkpointing */
>> + break;
>> + }
>> +
>> + if (action != -1 && runstate_delta) {
>> + kfree(runstate_delta);
>> + runstate_delta = NULL;
>> + }
>> +}
>> +
>> /*
>> * Runstate accounting
>> */
>> diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
>> index 98188c8..85e81ce 100644
>> --- a/include/xen/interface/vcpu.h
>> +++ b/include/xen/interface/vcpu.h
>> @@ -110,6 +110,8 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_runstate_info);
>> */
>> #define RUNSTATE_offline 3
>>
>> +#define RUNSTATE_max 4
>
> This file is part of Xen ABI. While this macro technically doesn't
> change anything I'd rather have those updates first appear in Xen code.
>
> Besides, this change leaves vcpu_runstate_info.time[4] definition
> intact. I think all RUNSTATE_* macros would need to be moved above
> vcpu_runstate_info definition.
>
> TBH, for purposes of this patch I'd use 4.
>
>
> -boris
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> https://lists.xen.org/xen-devel
>

Thank you very much for your suggestions!

Dongli Zhang