Re: [PATCH] perf_events: AMD event scheduling (v2)

From: Stephane Eranian
Date: Thu Feb 04 2010 - 11:05:31 EST


On Thu, Feb 4, 2010 at 3:55 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, 2010-02-01 at 22:15 +0200, Stephane Eranian wrote:
>>
>>      This patch adds correct AMD Northbridge event scheduling.
>>      It must be applied on top of tip-x86 + the hw_perf_enable() fix.
>>
>>      NB events are events measuring L3 cache and HyperTransport
>>      traffic. They are identified by an event code >= 0xe0.
>>      They measure events on the Northbridge, which is shared
>>      by all cores on a package. NB events are counted on a
>>      shared set of counters. When a NB event is programmed
>>      in a counter, the data actually comes from a shared
>>      counter. Thus, access to those counters needs to be
>>      synchronized.
>>
>>      We implement the synchronization such that no two cores
>>      can be measuring NB events using the same counters. Thus,
>>      we maintain a per-NB allocation table. The available slot
>>      is propagated using the event_constraint structure.
>>
>>      This 2nd version takes into account the changes in how
>>      constraints are stored by the scheduling code.
>>
>>      The patch also takes care of CPU hotplug.
>>
>>      Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
>
> Please run the patch through checkpatch, there's lots of trivial coding
> style errors (spaces instead of tabs, for(i=0; etc..)
>
Sorry about that.

>> @@ -2250,10 +2261,144 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event
>>      return &unconstrained;
>>  }
>>
>> +/*
>> + * AMD64 events are detected based on their event codes.
>> + */
>> +static inline int amd_is_nb_event(struct hw_perf_event *hwc)
>> +{
>> +     u64 val = hwc->config;
>
> & K7_EVNTSEL_EVENT_MASK ?

Yes, except that:
-#define K7_EVNTSEL_EVENT_MASK 0x7000000FFULL
+#define K7_EVNTSEL_EVENT_MASK 0xF000000FFULL

>
>> +     /* event code : bits [35-32] | [7-0] */
>> +     val = (val >> 24) | ( val & 0xff);
>> +     return val >= 0x0e0;
>> +}
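
With the corrected mask applied, the helper would look like this
(untested sketch):

static inline int amd_is_nb_event(struct hw_perf_event *hwc)
{
	u64 val = hwc->config & K7_EVNTSEL_EVENT_MASK;

	/* event code: bits [35-32] | [7-0] */
	val = (val >> 24) | (val & 0xff);
	return val >= 0x0e0;
}
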
>> +
>> +static void amd_put_event_constraints(struct cpu_hw_events *cpuc,
>> +                                   struct perf_event *event)
>> +{
>> +     struct hw_perf_event *hwc = &event->hw;
>> +     struct perf_event *old;
>> +     struct amd_nb *nb;
>> +     int i;
>> +
>> +     /*
>> +      * only care about NB events
>> +      */
>> +     if(!amd_is_nb_event(hwc))
>> +             return;
>> +
>> +     /*
>> +      * NB not initialized
>> +      */
>> +     nb = cpuc->amd_nb;
>> +     if (!nb)
>> +             return;
>> +
>> +     if (hwc->idx == -1)
>> +             return;
>> +
>> +     /*
>> +      * need to scan whole list because event may not have
>> +      * been assigned during scheduling
>> +      */
>> +     for(i=0; i < x86_pmu.num_events; i++) {
>> +             if (nb->owners[i] == event) {
>> +                     old = cmpxchg(nb->owners+i, event, NULL);
>
> might want to validate old is indeed event.
>
It was in there during debugging ;->
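
For reference, the check could look like this (debugging aid;
the WARN_ON_ONCE() is a suggestion, not what v2 does):

	for(i=0; i < x86_pmu.num_events; i++) {
		if (nb->owners[i] == event) {
			/* we must be the current owner of this slot */
			old = cmpxchg(nb->owners+i, event, NULL);
			WARN_ON_ONCE(old != event);
			break;
		}
	}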

>> +      * event can already be present yet not assigned (in hwc->idx)
>> +      * because of successive calls to x86_schedule_events() from
>> +      * hw_perf_group_sched_in() without hw_perf_enable()
>> +      */
>> +     for(i=0; i < max; i++) {
>> +             /*
>> +              * keep track of first free slot
>> +              */
>> +             if (k == -1 && !nb->owners[i])
>> +                     k = i;
>
> break?
>
No, we need to look for the event; that's the main purpose
of the loop. We simply overlap this with looking for an
empty slot (which may come before the event).
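
Schematically, the first pass does this (illustration, not the exact
patch code):

	/*
	 * scan all slots: remember the first free slot seen,
	 * but keep going in case the event itself shows up later
	 */
	for(i=0; i < max; i++) {
		if (k == -1 && !nb->owners[i])
			k = i;		/* first free slot */
		if (nb->owners[i] == event)
			break;		/* event already owns slot i */
	}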

>> +      *
>> +      * try to allocate same counter as before if
>> +      * event has already been assigned once. Otherwise,
>> +      * try to use free counter k obtained during the 1st
>> +      * pass above.
>> +      */
>> +     i = j = hwc->idx != -1 ? hwc->idx : (k == -1 ? 0 : k);
>
> That's patently unreadable, and I'm not sure what happens if we failed
> to find an eligible spot in the above loop, should we not somehow jump
> out and return emptyconstraint?
>
I have clarified the nested if. The goal of the first loop is to check
whether the event is already present (the comment explains why this is
possible). We may not have scanned all the slots. Furthermore, other
CPUs may be scanning concurrently. The first pass tries to find an
empty slot; it does not reserve it. The second loop is the actual
allocation: we speculate that the slot found in the first pass is still
available. If the second loop fails, we return emptyconstraint.

>> +     do {
>> +             old = cmpxchg(nb->owners+i, NULL, event);
>> +             if (!old)
>> +                     break;
>> +             if (++i == x86_pmu.num_events)
>> +                     i = 0;
>> +     } while (i != j);
>> +skip:
>> +     if (!old)
>> +             return &nb->event_constraints[i];
>> +     return &emptyconstraint;
>> +}
>>
>>  static int x86_event_sched_in(struct perf_event *event,
>
>> @@ -2561,6 +2707,96 @@ static __init int intel_pmu_init(void)
>>      return 0;
>>  }
>>
>> +static struct amd_nb *amd_alloc_nb(int cpu, int nb_id)
>> +{
>> +        struct amd_nb *nb;
>> +     int i;
>> +
>> +        nb= vmalloc_node(sizeof(struct amd_nb), cpu_to_node(cpu));
>
> $ pahole -C amd_nb build/arch/x86/kernel/cpu/perf_event.o
> struct amd_nb {
>         int                        nb_id;                 /*     0     4 */
>         int                        refcnt;                /*     4     4 */
>         struct perf_event *        owners[64];            /*     8   512 */
>         /* --- cacheline 8 boundary (512 bytes) was 8 bytes ago --- */
>         struct event_constraint    event_constraints[64]; /*   520  1536 */
>         /* --- cacheline 32 boundary (2048 bytes) was 8 bytes ago --- */
>
>         /* size: 2056, cachelines: 33 */
>         /* last cacheline: 8 bytes */
> };
>
> Surely we can kmalloc that?
>
Ok, I can switch that.
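
Something like this, then (untested sketch; the constraint
initialization is from memory of the event_constraint layout):

	static struct amd_nb *amd_alloc_nb(int cpu, int nb_id)
	{
		struct amd_nb *nb;
		int i;

		nb = kzalloc_node(sizeof(struct amd_nb), GFP_KERNEL,
				  cpu_to_node(cpu));
		if (!nb)
			return NULL;

		nb->nb_id = nb_id;

		/* one constraint per counter slot */
		for (i = 0; i < x86_pmu.num_events; i++) {
			__set_bit(i, nb->event_constraints[i].idxmsk);
			nb->event_constraints[i].weight = 1;
		}
		return nb;
	}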

>> +
>> +     /*
>> +      * function may be called too early in the
>> +      * boot process, in which case nb_id is bogus
>> +      *
>> +      * for BSP, there is an explicit call from
>> +      * amd_pmu_init()
>> +      */
>
> I keep getting flash-backs to doom's graphics engine every time I see
> BSP..
>
So what do you call the initial boot processor?


>> +     nb_id = amd_get_nb_id(cpu);
>> +     if (nb_id == BAD_APICID)
>> +             return;
>> +
>> +     cpu1 = &per_cpu(cpu_hw_events, cpu);
>> +     cpu1->amd_nb = NULL;
>> +
>> +     raw_spin_lock(&amd_nb_lock);
>> +
>> +     for_each_online_cpu(i) {
>> +             cpu2 = &per_cpu(cpu_hw_events, i);
>> +             nb = cpu2->amd_nb;
>> +             if (!nb)
>> +                     continue;
>> +             if (nb->nb_id == nb_id)
>> +                     goto found;
>> +     }
>> +
>> +     nb = amd_alloc_nb(cpu, nb_id);
>> +     if (!nb) {
>> +             pr_err("perf_events: failed to allocate NB storage for CPU%d\n", cpu);
>> +             raw_spin_unlock(&amd_nb_lock);
>> +             return;
>> +     }
>> +found:
>> +     nb->refcnt++;
>> +     cpu1->amd_nb = nb;
>> +
>> +     raw_spin_unlock(&amd_nb_lock);
>
> Can't this be simplified by using the cpu to node mask?

You mean to find the NB that corresponds to a CPU?

> Also, I think this is buggy in that:
>
>  perf_disable();
>  event->pmu->disable(event);
>  ...
>  event->pmu->enable(event);
>  perf_enable();
>
> can now fail, I think we need to move the put_event_constraint() from
> x86_pmu_disable() into x86_perf_enable() or something.

Constraints are reserved during x86_schedule_events(), not during enable().
So if there were a conflict, it would have been detected earlier than that.
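
The flow is roughly:

	hw_perf_group_sched_in()
	  -> x86_schedule_events()
	       -> amd_get_event_constraints()	/* NB slot reserved here */
	...
	hw_perf_enable()			/* only writes out the assignment */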


--
Stephane Eranian | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00