Re: [RFC PATCH 2/3 v3] perf: Implement Nehalem uncore pmu

From: Stephane Eranian
Date: Mon Dec 13 2010 - 11:42:28 EST


On Mon, Dec 13, 2010 at 9:27 AM, Lin Ming <ming.m.lin@xxxxxxxxx> wrote:
> On Sat, 2010-12-11 at 13:49 +0800, Stephane Eranian wrote:
>> Hi,
>>
>> Ok, so I have an explanation for what we are seeing. What bothered me
>> in all of this is that I did not recall ever running into this
>> double-interrupt problem with HT when I implemented the perfmon support
>> for uncore sampling. The reason is actually quite simple.
>>
>> You are getting interrupts on both threads because you have enabled the
>> uncore PMU on all CPUs, in uncore_cpu_starting():
>> +       rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
>> +       wrmsrl(MSR_IA32_DEBUGCTLMSR, val | DEBUGCTLMSR_ENABLE_UNCORE_PMI);
>
> Yeah! Thanks for the catch.
>
I think what you want there is something like:

if (cpu == cpumask_first(topology_core_siblings(cpu))) {
        rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
        wrmsrl(MSR_IA32_DEBUGCTLMSR, val | DEBUGCTLMSR_ENABLE_UNCORE_PMI);
}
I have verified that with something like that in place, you get only
one interrupt.
But it seems some counter initialization is missing somewhere, because
running perf stat, I got bogus counts back from the uncore fixed
counter.
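
For illustration only, the missing setup might look something like the
sketch below, assuming the Nehalem uncore MSR layout from the SDM (fixed
counter at 0x394, its control at 0x395, global control at 0x391 with the
fixed-counter enable in bit 32). The function name and constants are made
up for the example, not taken from the posted patches:

#define NHM_UNC_PERF_GLOBAL_CTL		0x391	/* uncore global control */
#define NHM_UNC_FIXED_CTR		0x394	/* uncore fixed counter */
#define NHM_UNC_FIXED_CTR_CTRL		0x395	/* fixed counter control */
#define NHM_UNC_GLOBAL_CTL_EN_FC	(1ULL << 32)
#define NHM_UNC_FIXED_CTR_CTL_EN	(1ULL << 0)

static void uncore_fixed_ctr_init(void)
{
	/* Zero the count so the first read is meaningful. */
	wrmsrl(NHM_UNC_FIXED_CTR, 0);
	/* Enable the counter in its own control register... */
	wrmsrl(NHM_UNC_FIXED_CTR_CTRL, NHM_UNC_FIXED_CTR_CTL_EN);
	/*
	 * ...and in the uncore global control. A real driver would
	 * OR this into the existing value instead of overwriting the
	 * general-purpose counter enables.
	 */
	wrmsrl(NHM_UNC_PERF_GLOBAL_CTL, NHM_UNC_GLOBAL_CTL_EN_FC);
}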

>>
>> You need to do that only on ONE of the two siblings. In fact, you want
>> that only ONCE per socket. You can do this on the first CPU to use the
>> uncore, to make it a bit more dynamic and to make sure you have some
>> control over where the overhead is applied. Once you do that, only one
>> CPU per socket will get the interrupt and all will be fine.
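
(As a sketch of that dynamic, once-per-socket arming: the per-socket
structure and refcount below are invented for the example, and both
helpers would have to run on the CPU chosen to take the interrupt,
since IA32_DEBUGCTL is a per-logical-CPU register.)

struct uncore_socket {
	atomic_t	nr_active;	/* uncore events live on this socket */
};

/* Call on the designated CPU when its first uncore event starts. */
static void uncore_pmi_arm(struct uncore_socket *us)
{
	u64 val;

	if (atomic_inc_return(&us->nr_active) > 1)
		return;		/* PMI already armed for this socket */

	rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
	wrmsrl(MSR_IA32_DEBUGCTLMSR, val | DEBUGCTLMSR_ENABLE_UNCORE_PMI);
}

/* Call on the same CPU when the last uncore event goes away. */
static void uncore_pmi_disarm(struct uncore_socket *us)
{
	u64 val;

	if (!atomic_dec_and_test(&us->nr_active))
		return;		/* other events still need the PMI */

	rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
	wrmsrl(MSR_IA32_DEBUGCTLMSR, val & ~DEBUGCTLMSR_ENABLE_UNCORE_PMI);
}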
>
> I'll update the patches.
>
> Thanks,
> Lin Ming
>
>>
>>
>>
>> On Fri, Dec 10, 2010 at 11:47 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>> > On Fri, 2010-12-10 at 00:46 +0100, Stephane Eranian wrote:
>> >> Hi,
>> >>
>> >> So I have tested this patch a bit on WSM and, as I expected, there
>> >> are issues with sampling.
>> >>
>> >> When HT is on, both sibling CPUs get the interrupt. The HW does not
>> >> allow you to direct the interrupt to only one HT thread (CPU).
>> >
>> > Egads, how ugly :/
>> >
>> >> I did verify that indeed both threads get the interrupt and that you
>> >> have a race condition. Both sibling CPUs stop the uncore and read the
>> >> status. They may get the same overflow status. Both will pass the
>> >> uncore->active_mask check because it's shared between siblings. Thus,
>> >> you have a race for the whole interrupt handler execution.
>> >>
>> >> You need some serialization in there, but the patch does not address
>> >> this. The problem is different from the back-to-back interrupt issue
>> >> that Don worked on; the per-cpu marked/handled trick cannot work to
>> >> avoid this problem.
>> >>
>> >> You cannot simply say "the lowest indexed" CPU of a sibling pair
>> >> handles the interrupt, because you don't know if this is an uncore
>> >> interrupt, a core interrupt, or something else. You need to check.
>> >> That means each HT thread needs to check the uncore ovfl_status. If
>> >> the status is zero, then return. Otherwise, you need to do a 2nd-level
>> >> check before you can execute the handler: you need to know if the
>> >> sibling CPU has already "consumed" that interrupt.
>> >>
>> >> I think you need some sort of generation counter per physical core
>> >> and per HT thread. On interrupt, you could do something along the
>> >> lines of:
>> >>
>> >>     if (mycpu->intr_count == mysibling->intr_count) {
>> >>             mycpu->intr_count++;
>> >>             execute intr_handler();
>> >>     } else {
>> >>             mycpu->intr_count++;
>> >>             return;
>> >>     }
>> >> Of course, the above needs some atomicity and some locking
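
(Cast into compilable form, that scheme might look like the sketch
below; struct core_sync and its fields are invented for the example,
and the read/increment pair still has the atomicity hole noted just
above, which is what the rest of the thread is about.)

struct core_sync {
	atomic_t intr_count[2];		/* one slot per HT sibling */
};

/* Returns true if this sibling should run the uncore handler. */
static bool uncore_claim_intr(struct core_sync *cs, int me)
{
	int mine = atomic_read(&cs->intr_count[me]);
	int theirs = atomic_read(&cs->intr_count[me ^ 1]);

	atomic_inc(&cs->intr_count[me]);

	/*
	 * Equal counts: the sibling has not consumed this interrupt,
	 * so we take it. Unequal: the sibling got there first, so just
	 * eat the NMI. The compare and the increment can still race
	 * against the sibling doing the same, hence the need for real
	 * serialization.
	 */
	return mine == theirs;
}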
>> >
>> > Does that guarantee that the same sibling handles all interrupts?
>> > Since a lot of the infrastructure uses local*_t, we're not good with
>> > cross-cpu stuff.
>> >
>> > Damn what a mess.. we need to serialize enough for both cpus to at least
>> > see the overflow bit.. maybe something like:
>> >
>> >
>> > struct intel_percore {
>> >        ...
>> >        atomic_t uncore_barrier;
>> > };
>> >
>> > void uncore_barrier(void)
>> > {
>> >        struct intel_percore *percore = this_cpu_ptr(&cpu_hw_events)->percore;
>> >        int armed;
>> >
>> >        armed = atomic_cmpxchg(&percore->uncore_barrier, 0, 1) == 0;
>> >        if (armed) {
>> >                /* we armed it, now wait for completion */
>> >                while (atomic_read(&percore->uncore_barrier))
>> >                        cpu_relax();
>> >        } else {
>> >                /* our sibling must have armed it, decrement it */
>> >                if (atomic_cmpxchg(&percore->uncore_barrier, 1, 0) != 1)
>> >                        BUG();
>> >        }
>> > }
>> >
>> > Then have something like:
>> >
>> > static int handle_uncore_interrupt(void)
>> > {
>> >        u64 overflow;
>> >        int cpu;
>> >
>> >        rdmsrl(MSR_UNCORE_PERF_GLOBAL_OVF_STATUS, overflow);
>> >        if (!overflow)
>> >                return 0; /* not our interrupt to handle */
>> >
>> >        uncore_barrier(); /* wait so our sibling will also observe the overflow */
>> >
>> >        cpu = smp_processor_id();
>> >        if (cpu != cpumask_first(topology_thread_cpumask(cpu)))
>> >                return 1; /* our sibling will handle it, eat the NMI */
>> >
>> >        /* OK, we've got an overflow and we're the first CPU in the thread mask */
>> >
>> >        ... do fancy stuff ...
>> >
>> >        return 1; /* we handled it, eat the NMI */
>> > }
>> >
>> >
>> >> (but I don't think you can use locks in NMI context).
>> >
>> > You can, as long as they're never used from !NMI. It's icky, but it
>> > works.
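
(Something along these lines, say; the lock and handler names are
hypothetical, and the whole trick rests on the comment:)

/* Safe only because this lock is NEVER taken outside NMI context,
 * so a holder can never be interrupted by another path taking it
 * on the same CPU. */
static DEFINE_RAW_SPINLOCK(uncore_nmi_lock);

static int uncore_nmi_handler(void)
{
	raw_spin_lock(&uncore_nmi_lock);
	/* ... read and consume the shared overflow status ... */
	raw_spin_unlock(&uncore_nmi_lock);
	return 1;	/* eat the NMI */
}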
>> >
>> >> This makes me wonder if vectoring uncore to NMI is really needed,
>> >> given you cannot correlate it to an IP, incl. a kernel IP. If we were
>> >> to vector to a dedicated (lower prio) vector, then we could use the
>> >> trick of saying the lowest indexed CPU in a pair handles the interrupt
>> >> (because we would already know this is an uncore interrupt). This
>> >> would be much simpler. Price: no samples in the kernel's critical
>> >> sections. But those are useless anyway with uncore events.
>> >
>> > But the uncore uses the same PMI line, right? You cannot point the
>> > uncore to another vector. /me goes find the docs -- ok, it's too early
>> > in the morning to clearly parse that...
>> >
>> > Besides, people asked for the sampling thing, didn't they? (Also, we
>> > need it to fold the count to avoid overflow of the 48-bit counters.)
>> > And the PAPI people even want per-task uncore counters, because that's
>> > the only thing PAPI can do.
>> >
>
>
>