Re: [PATCH] kfence: Replace local_clock() with ktime_get_boot_fast_ns()

From: Marco Elver
Date: Wed Nov 22 2023 - 17:20:21 EST


On Wed, 22 Nov 2023 at 22:36, Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
>
> On 2023/11/23 4:35, Marco Elver wrote:
> > On Wed, 22 Nov 2023 at 21:01, Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
> >>
> >> The time obtained by local_clock() is the local CPU time, which may
> >> drift between CPUs and is not suitable for comparison across CPUs.
> >>
> >> It is possible for allocation and free to occur on different CPUs,
> >> and using local_clock() to record timestamps may cause confusion.
> >
> > The same problem exists with printk logging.
> >
> >> ktime_get_boot_fast_ns() is based on clock sources and can be used
> >> reliably and accurately for comparison across CPUs.
> >
> > You may be right here, however, the choice of local_clock() was
> > deliberate: it's the same timestamp source that printk uses.
> >
> > Also, on systems where there is drift, the arch selects
> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK (like on x86) and the drift is
> > generally bounded.
> >
> >> Signed-off-by: Juntong Deng <juntong.deng@xxxxxxxxxxx>
> >> ---
> >> mm/kfence/core.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> >> index 3872528d0963..041c03394193 100644
> >> --- a/mm/kfence/core.c
> >> +++ b/mm/kfence/core.c
> >> @@ -295,7 +295,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex
> >> track->num_stack_entries = num_stack_entries;
> >> track->pid = task_pid_nr(current);
> >> track->cpu = raw_smp_processor_id();
> >> - track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
> >> + track->ts_nsec = ktime_get_boot_fast_ns();
> >
> > You have ignored the comment placed here - now it's no longer the same
> > source as printk timestamps. I think not being able to correlate
> > information from KFENCE reports with timestamps in lines from printk
> > is worse.
> >
> > For now, I have to Nack: Unless you can prove that
> > ktime_get_boot_fast_ns() can still be correlated with timestamps from
> > printk timestamps, I think this change only trades one problem for
> > another.
> >
> > Thanks,
> > -- Marco
>
> Honestly, the possibility of accurately matching a message in the printk
> log by the timestamp in the kfence report is very low, since allocation
> and free do not directly correspond to a certain event.

It's about being able to compare the timestamps. I don't want to match
an exact event, but be able to figure out which event happened
before/after an allocation or free, i.e. the logical ordering of
events.

With CONFIG_PRINTK_CALLER we can see the CPU ID in printk lines and
are therefore able to accurately compare printk lines with information
given by KFENCE alloc/free info.

> Since time drifts across CPUs, timestamps may be different even if
> allocation and free can correspond to a certain event.

This is not a problem with CONFIG_PRINTK_CALLER.

> If we really need to find the relevant printk logs by the timestamps in
> the kfence report, all we can do is to look for messages that are within
> a certain time range.
>
> If we are looking for messages in a certain time range, there is not
> much difference between local_clock() and ktime_get_boot_fast_ns().
>
> Also, this patch is in preparation for my next patch.
>
> My next patch is to show the PID, CPU number, and timestamp when the
> error occurred, in this case time drift from different CPUs can
> cause confusion.

It's not quite clear how there's a dependency between this patch and a
later patch, but generally it's good practice to send related patches
as a patch series. That way it's easier to see what the overall
changes are and provide feedback as a whole - as is, it's difficult to
provide feedback.

However, from what you say this information is already given.
dump_stack_print_info() shows this - e.g this bit here is printed by
where the error occurred:

| CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2
04/01/2014

And if the printk log has timestamps, then these lines are prefixed
with the timestamp where the error occurred.

> For example, use-after-free caused by a subtle race condition, in which
> the time between the free and the error occur will be very close.
>
> Time drift from different CPUs may cause it to appear in the report that
> the error timestamp precedes the free timestamp.

That doesn't matter. I recommend that you go through a hypothetical
debugging scenario:
1. We are not interested in the absolute timings of events, but the
logical ordering between them.

2. The logical ordering of events is inherent from how KFENCE
operates: an error _always_ follows an allocation and/or free. From a
debugging point of view, the timestamps do not have any value here.

3. The timestamps _do_ add value when trying to figure out the logical
ordering between allocation, free, or the erroneous access _with
other_ events in the system. A stream of other events is always shown
in the kernel log (printk). Other streams of events can be obtained
via e.g. ftrace (which also uses local_clock(), see
kernel/trace/trace_clock.c).

So, the timestamp that KFENCE should show is the one that most likely
allows us to deduce the logical ordering with other events in the
system.