Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

From: Johannes Weiner
Date: Thu Jul 19 2018 - 13:51:21 EST


On Thu, Jul 19, 2018 at 08:08:20AM -0700, Linus Torvalds wrote:
> On Wed, Jul 18, 2018 at 5:03 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > And as said before, we can compress the state from 12 bytes, to 6 bits
> > (or 1 byte), giving another 11 bytes for 59 bytes free.
> >
> > Leaving us just 5 bytes short of needing a single cacheline :/
>
> Do you actually need 64 bits for the times?
>
> That's the big cost. And it seems ridiculous, if you actually care about size.
>
> You already have a 64-bit start time. Everything else is some
> cumulative relative time. Do those really need 64-bit and nanosecond
> resolution?
>
> Maybe a 32-bit microsecond would be ok - would you ever account more
> than 35 minutes of anything without starting anew?

D'oh, you're right, the per-cpu buckets don't need to be this big at
all. In fact, we flush those deltas out every 2 seconds when there is
activity to maintain the running averages. Since we get 4.2s worth of
nanoseconds into a u32, we don't even need to divide in the hotpath.
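
As a quick sanity check of that arithmetic, here is a minimal userspace sketch. The helper names (u32_ns_capacity_ms, ns_delta, check_flush_period_fits) are hypothetical and exist only for illustration; the 2 second flush period is the one mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Longest interval, in whole milliseconds, that fits in a u32 of nanoseconds */
static uint64_t u32_ns_capacity_ms(void)
{
	return (uint64_t)UINT32_MAX / 1000000ULL;
}

/*
 * Truncating difference of two 64-bit nanosecond timestamps. This is only
 * exact while the real interval is shorter than 2^32 ns (about 4.29s).
 */
static uint32_t ns_delta(uint64_t now, uint64_t last)
{
	return (uint32_t)(now - last);
}

/* A 2s flush period (assumed from the text above) leaves ample headroom */
static void check_flush_period_fits(void)
{
	assert(2 * NSEC_PER_SEC <= UINT32_MAX);
}
```

The truncation is what avoids the divide: the bucket stays in nanoseconds and simply relies on being flushed before it can wrap.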

Something along the lines of this here should work:

static void psi_group_change(struct psi_group *group, int cpu, u64 now,
			     unsigned int clear, unsigned int set)
{
	struct psi_group_cpu *groupc;
	unsigned int *tasks;
	unsigned int t;
	u32 delta;

	groupc = per_cpu_ptr(group->cpus, cpu);
	tasks = groupc->tasks;

	/* Time since last task change on this runqueue */
	delta = now - groupc->last_time;
	groupc->last_time = now;

	/* Tasks waited for IO? */
	if (tasks[NR_IOWAIT]) {
		if (!tasks[NR_RUNNING])
			groupc->full_time[PSI_IO] += delta;
		else
			groupc->some_time[PSI_IO] += delta;
	}

	/* Tasks waited for memory? */
	if (tasks[NR_MEMSTALL]) {
		if (!tasks[NR_RUNNING] ||
		    (cpu_curr(cpu)->flags & PF_MEMSTALL))
			groupc->full_time[PSI_MEM] += delta;
		else
			groupc->some_time[PSI_MEM] += delta;
	}

	/* Tasks waited for the CPU? */
	if (tasks[NR_RUNNING] > 1)
		groupc->some_time[PSI_CPU] += delta;

	/* Tasks were generally non-idle? To weigh the CPU in summaries */
	if (tasks[NR_RUNNING] || tasks[NR_IOWAIT] || tasks[NR_MEMSTALL])
		groupc->nonidle_time += delta;

	/* Update task counts according to the set/clear bitmasks */
	for (t = 0; clear; clear &= ~(1 << t), t++)
		if (clear & (1 << t))
			groupc->tasks[t]--;
	for (t = 0; set; set &= ~(1 << t), t++)
		if (set & (1 << t))
			groupc->tasks[t]++;

	/* Kick the stats aggregation worker if it's gone to sleep */
	if (!delayed_work_pending(&group->clock_work))
		schedule_delayed_work(&group->clock_work, PSI_FREQ);
}
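
To make the set/clear loops easier to follow in isolation, here is a rough userspace sketch of just that step. The function name apply_task_masks and the ordering of the enum are assumptions for illustration; only the loop shape is taken from the code above.

```c
#include <assert.h>

/* Index order is an assumption; only the names appear in the code above */
enum { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, NR_COUNTS };

/*
 * Walk each mask from bit 0 upwards, dropping bits as they are handled, so
 * the loop ends as soon as no bits remain. Bit t maps to counter tasks[t].
 */
static void apply_task_masks(unsigned int tasks[NR_COUNTS],
			     unsigned int clear, unsigned int set)
{
	unsigned int t;

	/* Sketch-only guard: masks must not name counters that don't exist */
	assert(!(clear >> NR_COUNTS) && !(set >> NR_COUNTS));

	for (t = 0; clear; clear &= ~(1u << t), t++)
		if (clear & (1u << t))
			tasks[t]--;
	for (t = 0; set; set &= ~(1u << t), t++)
		if (set & (1u << t))
			tasks[t]++;
}
```

For example, a task leaving iowait and becoming runnable would pass clear = 1 << NR_IOWAIT and set = 1 << NR_RUNNING, moving one count from the first counter to the last.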

And then we can pack it down to one cacheline:

struct psi_group_cpu {
	/* States of the tasks belonging to this group */
	unsigned int tasks[NR_PSI_TASK_COUNTS]; // 3

	/* Time sampling bucket for pressure states - no FULL for CPU */
	u32 some_time[NR_PSI_RESOURCES];
	u32 full_time[NR_PSI_RESOURCES - 1];

	/* Time sampling bucket for non-idle state (ns) */
	u32 nonidle_time;

	/* Time of last task change in this group (rq_clock) */
	u64 last_time;
};
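
The size claim can be checked at compile time with a userspace mirror of the struct. This is only a sketch under assumptions: NR_PSI_RESOURCES = 3 is inferred from the IO, memory and CPU indices used above, NR_PSI_TASK_COUNTS = 3 from the comment in the struct, and the 64 byte cacheline and the _sketch name are mine.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PSI_TASK_COUNTS	3	/* from the "// 3" comment above */
#define NR_PSI_RESOURCES	3	/* assumed: IO, memory, CPU */
#define CACHELINE_BYTES		64	/* assumed cacheline size */

struct psi_group_cpu_sketch {
	unsigned int tasks[NR_PSI_TASK_COUNTS];		/* 12 bytes */
	uint32_t some_time[NR_PSI_RESOURCES];		/* 12 bytes */
	uint32_t full_time[NR_PSI_RESOURCES - 1];	/*  8 bytes */
	uint32_t nonidle_time;				/*  4 bytes */
	uint64_t last_time;				/*  8 bytes */
};

/* 44 bytes of payload; alignment padding before last_time may add a few */
static_assert(sizeof(struct psi_group_cpu_sketch) <= CACHELINE_BYTES,
	      "psi_group_cpu sketch should fit in one cacheline");
```

Whether the real struct also starts on a cacheline boundary depends on how the per-cpu allocation is aligned, which this sketch does not cover.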

I'm going to go test with this.

Thanks