Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

From: Liang, Kan
Date: Thu Sep 17 2020 - 17:58:36 EST




On 9/17/2020 5:24 PM, Dave Hansen wrote:
On 9/17/20 2:16 PM, Liang, Kan wrote:
One last concern as I look at this: I wish it was a bit more
future-proof.  There are lots of weird things folks are trying to do
with the page tables, like Address Space Isolation.  For instance, if
you get a perf NMI when running userspace, current->mm->pgd is
*different* than the PGD that was in use when userspace was running.
It's close enough today, but it might not stay that way.  But I can't
think of any great ways to future proof this code, other than spitting
out an error message if too many of the page table walks fail when they
shouldn't.


If the page table walks fail, a page size of 0 is returned. So the worst
case is that the page size is not available to users, which is not a
fatal error.

Right, it's not a fatal error. It will just more or less silently break
this feature.

If my understanding is correct, when the above case happens, there is
nothing we can do for now (because we have no idea what it will become),
except disable the page size support and throw an error/warning.

From the user's perspective, throwing an error message or marking the
page size unavailable should be the same. I think we may leave the code
as-is. We can fix the future case later separately.

The only thing I can think of is to record the number of consecutive
page walk errors without a success. Occasional failures are OK and
expected, such as if reclaim zeroes a PTE and a later perf event goes
and looks at it. But a *LOT* of consecutive errors indicates a real
problem somewhere.

Maybe if you have 10,000 or 1,000,000 successive walk failures, you do a
WARN_ON_ONCE().
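
For illustration only, the consecutive-failure idea could look roughly like the user-space sketch below. The names (walk_stats, record_walk_result) and the threshold value are hypothetical and not part of the patch. In the kernel, the one-shot flag would be replaced by WARN_ON_ONCE(), and the counter would need to be per-CPU or otherwise safe to update from NMI context.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the thread suggests 10,000 or 1,000,000. */
#define WALK_FAIL_WARN_THRESHOLD 10000u

struct walk_stats {
	uint64_t consecutive_failures;	/* failures since the last success */
	bool warned;			/* stands in for WARN_ON_ONCE() */
};

/*
 * Record the result of one page table walk. A page_size of 0 means the
 * walk failed. Returns true exactly once: on the call where the failure
 * streak first reaches the threshold.
 */
bool record_walk_result(struct walk_stats *st, uint64_t page_size)
{
	if (page_size != 0) {
		/* Occasional failures are expected; any success resets the streak. */
		st->consecutive_failures = 0;
		return false;
	}

	st->consecutive_failures++;
	if (st->consecutive_failures >= WALK_FAIL_WARN_THRESHOLD && !st->warned) {
		st->warned = true;
		return true;
	}
	return false;
}
```

Resetting on every success is what separates the expected occasional failure (e.g. reclaim zeroing a PTE) from a long unbroken run of failures.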

The user-space perf tool looks like a better place for this kind of
warning. The perf tool knows the total number of samples, and it also
knows how many of them have page size 0. We can set a threshold, e.g.,
90%: if 90% or more of the samples have page size 0, the perf tool
prints a warning message.

The problem is that a warning from the perf tool usually includes some
hint about the cause of the warning, or a possible workaround/fix. What
message should we deliver to the users?
"Warning: Too many error page size. Address space isolation feature may be enabled, please check"?
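
A minimal sketch of the tool-side check, assuming the tool has already counted the total samples and the samples with page size 0. The function name and the macro are hypothetical and are not existing perf tool code. The 90% value is just the example threshold from above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Example threshold from the discussion: warn at 90% or more. */
#define ZERO_PAGE_SIZE_WARN_PCT 90u

/*
 * Return true if the share of samples with page size 0 is at or above
 * the threshold. Integer arithmetic avoids floating point; the counts
 * are assumed to be small enough that multiplying by 100 cannot
 * overflow a uint64_t.
 */
bool too_many_zero_page_sizes(uint64_t nr_samples, uint64_t nr_zero)
{
	if (nr_samples == 0)
		return false;	/* nothing sampled, nothing to warn about */

	return nr_zero * 100 >= nr_samples * ZERO_PAGE_SIZE_WARN_PCT;
}
```

The check would run once after all samples are processed, so the warning is printed at most once per report.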


Thanks,
Kan