Re: [PREEMPT-RT] Oops in rapl_cpu_prepare()

From: Charles (Chas) Williams
Date: Wed Nov 02 2016 - 05:33:57 EST


On 10/28/2016 04:03 AM, Sebastian Andrzej Siewior wrote:
On 2016-10-27 15:00:32 [-0400], Charles (Chas) Williams wrote:
I assume "init_rapl_pmus: maxpkg 4" is from init_rapl_pmus() returning
topology_max_packages(). So it says 4 but then returns 65535 for CPU 2
and 3. That -1 comes probably from topology_update_package_map(). Could
you please send a complete boot log and try the following patch? This
one should fix your boot problem and disable RAPL if the info is
invalid.
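
A minimal userspace sketch of the check described above, assuming the package id is held in a 16-bit field so that a stored -1 reads back as 65535. The names PKGID_INVALID, store_pkgid() and pkgid_in_range() are hypothetical and are not taken from the kernel or from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel: what -1 looks like once stored in a 16-bit field. */
#define PKGID_INVALID ((uint16_t)-1)

/* Documents the assumption behind the 65535 seen in the log. */
static_assert(PKGID_INVALID == 65535, "(u16)-1 must read back as 65535");

/* Hypothetical helper: truncate a signed lookup result to 16 bits. */
static uint16_t store_pkgid(int raw)
{
	return (uint16_t)raw;
}

/* Hypothetical check: a package id is usable only if it is below maxpkg. */
static bool pkgid_in_range(uint16_t pkgid, unsigned int maxpkg)
{
	return pkgid < maxpkg;
}
```

With maxpkg 4, pkgid_in_range(store_pkgid(-1), 4) is false, which is the condition under which RAPL would be disabled rather than indexing past the per-package array.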

But sometimes the topology info is correct and if I get lucky, the
package id could be valid for all the CPUs. Given the behavior I have
seen so far, it makes me think the RAPL isn't being emulated.
So even if I did boot onto a "valid" set of cores, would I always be
certain that I will be on those cores?

I don't know what vmware does here. Nor do they ship source to check. So if
you have a big HW box with say two packages, it might make sense to give
this information to the guest _if_ the CPUs are pinned and the guest
never migrates.

Yes, I agree _if_. That's why it simply isn't clear to me that we should
attempt to do any RAPL at all for VMWare. The current behavior doesn't seem
to make sense and I don't expect it to suddenly start acting reasonable.
Since I don't understand why some package ids are valid and others
are not, I would prefer not to trust any of the information as far as
enabling/disabling the RAPL monitoring.
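
The "trust none of it" policy can be sketched as an all-or-nothing scan, shown here as a userspace illustration under assumptions rather than as kernel code. The function all_pkgids_valid() and its parameters are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical policy check: return true only if every CPU reports a
 * package id below maxpkg. A single bad entry rejects the whole set.
 */
static bool all_pkgids_valid(const uint16_t *pkgid, size_t ncpus,
			     unsigned int maxpkg)
{
	assert(pkgid != NULL);

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		if (pkgid[cpu] >= maxpkg)
			return false;
	}
	return true;
}
```

For example, an array such as {0, 1, 65535, 65535} with maxpkg 4 is rejected as a whole. The ids for CPU 0 and 1 in that example are made up for illustration; only the 65535 for CPU 2 and 3 comes from the report.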


Per your request in your next email:

One thing I forgot to ask: Could you please check if you get the same
pkgid reported for cpu 0-3 on a pre-v4.8 kernel? (before the hotplug
rework).

Our previous kernel was 4.4, and didn't use the logical package id:
I see.

Did the patch I sent fix it for you, or were you not able to test?

Yes, it does prevent RAPL from starting and loading. From the boot log:

[ 2.711481] RAPL PMU: rapl pmu error: max package: 4 but CPU2 belongs to 65535
[ 2.711639] rapl pmu error: max package: 4 but CPU2 belongs to 65535

This was consistent across several reboots. I poked around in the
VM settings. Apparently this guest is configured for four virtual
sockets with one core per socket. Testing with two virtual sockets,
one core per socket:

[ 2.163177] RAPL PMU: rapl pmu error: max package: 2 but CPU1 belongs to 65535
[ 2.163304] rapl pmu error: max package: 2 but CPU1 belongs to 65535

Booting with 1 virtual socket, 1 core per socket:

[ 1.750311] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[ 1.750312] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[ 1.750313] RAPL PMU: hw unit of domain package 2^-0 Joules
[ 1.750314] RAPL PMU: hw unit of domain dram 2^-0 Joules

Booting with 1 virtual socket, 4 cores per socket:

[ 3.527298] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[ 3.527302] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[ 3.527304] RAPL PMU: hw unit of domain package 2^-0 Joules
[ 3.527307] RAPL PMU: hw unit of domain dram 2^-0 Joules

So, it looks like VMWare tends to always get something wrong if you have
more than one virtual socket. The above behavior was consistent across
several reboots.