[RFD] CAT user space interface revisited

From: Thomas Gleixner
Date: Wed Nov 18 2015 - 13:25:58 EST


Folks!

After rereading the mail flood on CAT and staring into the SDM for a
while, I think we all should sit back and look at it from scratch
again w/o our preconceptions - I certainly had to put my own away.

Let's look at the properties of CAT again:

- It's a per socket facility

- CAT slots can be associated with external hardware. This
  association is per socket as well, so different sockets can have
  different behaviour. I missed that detail when staring at it the
  first time, thanks for the pointer!

- The association itself is per CPU. The COS selection happens on a
  CPU, while the set of masks which are selected via COS are shared
  by all CPUs on a socket (see the sketch below).
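
For reference, a minimal sketch of the hardware side, assuming the
MSR layout from the SDM (IA32_PQR_ASSOC at 0xc8f with the COS id in
bits 63:32, the mask MSRs starting at 0xc90) and the kernel's
wrmsrl() helper. This is illustration, not a proposed patch:

/* Sketch only; MSR numbers per the SDM */
#define MSR_IA32_PQR_ASSOC      0x0c8f
#define MSR_IA32_L3_QOS_MASK(n) (0x0c90 + (n))

/* Per CPU: select which COS this CPU uses (RMID bits ignored here) */
static void cat_select_cos(u32 cosid)
{
        wrmsrl(MSR_IA32_PQR_ASSOC, (u64)cosid << 32);
}

/* Per socket: program the capacity bitmask behind a COS id. Must run
 * on a CPU of that socket and affects all CPUs on it. */
static void cat_set_mask(u32 cosid, u64 cbm)
{
        wrmsrl(MSR_IA32_L3_QOS_MASK(cosid), cbm);
}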

There are restrictions which CAT imposes in terms of configurability:

- The bits which select a cache partition need to be consecutive

- The number of possible cache association masks is limited

Let's look at the configurations (CDP omitted and size restricted)

Default:   1 1 1 1 1 1 1 1
           1 1 1 1 1 1 1 1
           1 1 1 1 1 1 1 1
           1 1 1 1 1 1 1 1

Shared:    1 1 1 1 1 1 1 1
           0 0 1 1 1 1 1 1
           0 0 0 0 1 1 1 1
           0 0 0 0 0 0 1 1

Isolated:  1 1 1 1 0 0 0 0
           0 0 0 0 1 1 0 0
           0 0 0 0 0 0 1 0
           0 0 0 0 0 0 0 1

Or any combination thereof. Surely some combinations will not make any
sense, but we really should not make any restrictions on the stupidity
of a sysadmin. The worst outcome might be L3 disabled for everything,
so what?
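
To make that concrete: realizing the "Isolated" layout above means
programming four mask MSRs per socket, e.g. with the cat_set_mask()
sketch from above (reading the leftmost bit of the diagrams as the
highest mask bit):

        cat_set_mask(0, 0xf0);  /* 1 1 1 1 0 0 0 0 */
        cat_set_mask(1, 0x0c);  /* 0 0 0 0 1 1 0 0 */
        cat_set_mask(2, 0x02);  /* 0 0 0 0 0 0 1 0 */
        cat_set_mask(3, 0x01);  /* 0 0 0 0 0 0 0 1 */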

Now that gets even more convoluted if CDP comes into play and we
really need to look at CDP right now. We might end up with something
which looks like this:

   1 1 1 1 0 0 0 0   Code
   1 1 1 1 0 0 0 0   Data
   0 0 0 0 0 0 1 0   Code
   0 0 0 0 1 1 0 0   Data
   0 0 0 0 0 0 0 1   Code
   0 0 0 0 1 1 0 0   Data
or
   0 0 0 0 0 0 0 1   Code
   0 0 0 0 1 1 0 0   Data
   0 0 0 0 0 0 0 1   Code
   0 0 0 0 0 1 1 0   Data
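
As far as I can decode the SDM, CDP pairs the mask MSRs up, which is
also why the number of usable COS ids is cut in half: for COS n the
even MSR holds the data mask and the odd one the code mask. Sketch,
again with the helpers from above:

/* Sketch; CDP mask pairing per my reading of the SDM */
static void cdp_set_masks(u32 cosid, u64 code_cbm, u64 data_cbm)
{
        wrmsrl(MSR_IA32_L3_QOS_MASK(2 * cosid), data_cbm);
        wrmsrl(MSR_IA32_L3_QOS_MASK(2 * cosid + 1), code_cbm);
}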

Let's look at partitioning itself. We have two options:

1) Per task partitioning

2) Per CPU partitioning

So far we only talked about #1, but I think that #2 has value as
well. Let me give you a simple example.

Assume that you have isolated a CPU and run your important task on
it. You give that task a slice of cache. Now that task needs kernel
services which run in kernel threads on that CPU. We really don't want
to (and cannot) hunt down random kernel threads (think cpu bound
worker threads, softirq threads ....) and give them another slice of
cache. What we really want is:

   1 1 1 1 0 0 0 0   <- Default cache
   0 0 0 0 1 1 1 0   <- Cache for important task
   0 0 0 0 0 0 0 1   <- Cache for CPU of important task

It would even be sufficient for particular use cases to just
associate a piece of cache with a given CPU and not bother with tasks
at all.

We really need to make this as configurable as possible from
userspace without imposing random restrictions on it. I played around
with it on my new Intel toy, and the restriction to 16 COS ids
(that's 8 with CDP enabled) makes it really useless if we force the
ids to have the same meaning on all sockets and restrict it to per
task partitioning.

Even if next generation systems have more COS ids available, there
are not going to be enough for a system wide consistent view unless
we have COS ids > nr_cpus.

Aside from that, I don't think that a system wide consistent view is
useful at all.

- If a task migrates between sockets, it's going to suffer anyway.
Real sensitive applications will simply pin tasks on a socket to
avoid that in the first place. If we make the whole thing
configurable enough then the sysadmin can set it up to support
even the nonsensical case of identical cache partitions on all
sockets and let tasks use the corresponding partitions when
migrating.

- The number of cache slices is going to be limited no matter what,
so one still has to come up with a sensible partitioning scheme.

- Even if we have enough COS ids, the system wide view will not make
  the configuration problem any simpler as it remains per socket.

It's hard. Policies are hard by definition, but this one is harder
than most other policies due to the inherent limitations.

So now to the interface part. Unfortunately we need to expose this
very close to the hardware implementation as there are really no
abstractions which allow us to express the various bitmap
combinations. Any abstraction I tried to come up with renders that
thing completely useless.

I was not able to identify any existing infrastructure where this
really fits in. I chose a directory/file based representation. We
certainly could do the same with a syscall, but that's just an
implementation detail.

At top level:

xxxxxxx/cat/max_cosids    <- Assume that all CPUs are the same
xxxxxxx/cat/max_maskbits  <- Assume that all CPUs are the same
xxxxxxx/cat/cdp_enable    <- Depends on CDP availability

Per socket data:

xxxxxxx/cat/socket-0/
...
xxxxxxx/cat/socket-N/l3_size
xxxxxxx/cat/socket-N/hwsharedbits
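
For completeness: the values behind these files come from CPUID leaf
0x10 as far as I can read the SDM (subleaf 1 for L3: EAX[4:0] holds
the mask length minus one, EBX the hardware shared bits, EDX[15:0]
the maximum COS id). Sketch:

/* Sketch; CPUID layout per my reading of the SDM */
static unsigned int max_cosids, max_maskbits, hwsharedbits;

static void cat_enumerate(void)
{
        unsigned int eax, ebx, ecx, edx;

        cpuid_count(0x10, 1, &eax, &ebx, &ecx, &edx);
        max_maskbits = (eax & 0x1f) + 1;
        hwsharedbits = ebx;
        max_cosids   = (edx & 0xffff) + 1;
}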

Per socket mask data:

xxxxxxx/cat/socket-N/cos-id-0/
...
xxxxxxx/cat/socket-N/cos-id-N/inuse
                             /cat_mask
                             /cdp_mask  <- Data mask if CDP enabled

Per cpu default cos id for the cpus on that socket:

xxxxxxx/cat/socket-N/cpu-x/default_cosid
...
xxxxxxx/cat/socket-N/cpu-N/default_cosid

The above allows a simple cpu based partitioning. All tasks which do
not have a cache partition assigned on a particular socket use the
default one of the cpu they are running on.
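
In scheduler terms that boils down to something like this at context
switch time (sketch with made up field names, reusing the
cat_select_cos() sketch from above; COSID_DEFAULT is explained
below):

/* Sketch only: resolve the effective COS id for the incoming task */
static void cat_switch_to(struct task_struct *next, int socket)
{
        u32 cosid = next->cat_cosid[socket];

        if (cosid == COSID_DEFAULT)
                cosid = this_cpu_read(cat_default_cosid);

        cat_select_cos(cosid);
}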

Now for the task(s) partitioning:

xxxxxxx/cat/partitions/

Under that directory one can create partitions:

xxxxxxx/cat/partitions/p1/tasks
                         /socket-0/cosid
                         ...
                         /socket-n/cosid

The default value for the per socket cosid is COSID_DEFAULT, which
causes the task(s) to use the per cpu default id.
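
Usage from userspace would then be plain file operations. A purely
hypothetical sequence in C (paths illustrative, "xxxxxxx" being
wherever this ends up living, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        write(fd, val, strlen(val));
        close(fd);
}

int main(void)
{
        /* Create a partition, point it at cos-id 1 on socket 0 and
         * move task 4711 into it (pid made up, obviously) */
        mkdir("xxxxxxx/cat/partitions/p1", 0755);
        write_str("xxxxxxx/cat/partitions/p1/socket-0/cosid", "1");
        write_str("xxxxxxx/cat/partitions/p1/tasks", "4711");
        return 0;
}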

Thoughts?

Thanks,

tglx