Integrating cpusets and cpu isolation [was Re: [CPUISOL] CPU isolation extensions]

From: Max Krasnyanskiy
Date: Thu Jan 31 2008 - 14:13:47 EST


Paul Jackson wrote:
Max wrote:
So far it seems that extending cpu_isolated_map is the more natural way of propagating this
notion to the rest of the kernel, since it's very similar to the cpu_online_map concept and is
easy to integrate with the code that already uses it.

If it were just realtime support, then I suspect I'd agree that
extending cpu_isolated_map makes more sense.

But some people use realtime on systems that are also heavily
managed using cpusets. The two have to work together. I have
customers with systems running realtime on a few CPUs, at the
same time that they have a large batch scheduler (which is layered
on top of cpusets) managing jobs on a few hundred other CPUs.
Hence with the cpuset 'sched_load_balance' flag I think I've already done one part of what
your patches achieve by extending the cpu_isolated_map.

This is a common situation with "resource management" mechanisms such
as cpusets (and more recently cgroups and the subsystem modules it
supports.) They cut across existing core kernel code that manages such
key resources as CPUs and memory. As best we can, they have to work
with each other.

Hi Paul,

I thought some more about your proposal to use the sched_load_balance flag in cpusets instead
of extending cpu_isolated_map. I looked at cpusets, cgroups, and the latest thread started by
Peter (about sched domains and related work), and here are my thoughts on this.

Here is the list of issues with the sched_load_balance flag from the CPU isolation perspective:
--
(1) Boot time isolation is not possible. There is currently no way to set up a cpuset at
boot time. For example, we won't be able to isolate cpus from irqs and workqueues at boot.
Not a major issue, but still an inconvenience.
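(For reference, this is why boot-time isolation is easy with cpu_isolated_map: the existing
isolcpus= boot option already populates the map before scheduler domains are built. Roughly
what kernel/sched.c does today, paraphrased rather than copied verbatim:)

/* Parse "isolcpus=1,3" from the kernel command line and mark those
 * cpus in cpu_isolated_map early in boot, before domains are built. */
static int __init isolated_cpu_setup(char *str)
{
	int ints[NR_CPUS], i;

	str = get_options(str, ARRAY_SIZE(ints), ints);
	cpus_clear(cpu_isolated_map);
	for (i = 1; i <= ints[0]; i++)
		if (ints[i] < NR_CPUS)
			cpu_set(ints[i], cpu_isolated_map);
	return 1;
}
__setup("isolcpus=", isolated_cpu_setup);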

--
(2) There is currently no easy way to figure out which cpuset a cpu belongs to in order to
query its sched_load_balance flag. To do that we'd need a method that iterates over all active
cpusets and checks their cpus_allowed masks, which implies holding the cgroup and cpuset
mutexes. It's not clear whether that is ok from the contexts CPU isolation happens in (apic,
sched, workqueue). The cgroup/cpuset API seems to be designed for top-down access, ie adding a
cpu to a set and then recomputing domains, which makes perfect sense for the common cpuset
usecase but is not what cpu isolation needs.
In other words I think it's much simpler and cleaner to use cpu_isolated_map for isolation
purposes.
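To make the contrast concrete, here is roughly what the check would look like from, say,
workqueue or irq setup code. The cpumask test is what my patches use today; the cpuset-side
helper is purely hypothetical and would have to take the cgroup/cpuset locks internally:

/* cpu_isolated_map: one lockless cpumask test, safe from any context. */
if (cpu_isset(cpu, cpu_isolated_map))
	return;		/* skip this cpu */

/* cpusets: a hypothetical helper that would have to walk all active
 * cpusets under the cgroup/cpuset mutexes, find the one whose
 * cpus_allowed contains this cpu, and return its sched_load_balance
 * flag -- not obviously safe from apic/sched/workqueue contexts. */
if (!cpuset_cpu_is_load_balanced(cpu))	/* hypothetical */
	return;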

--
(3) cpusets are a bit too dynamic :). What I mean by this is that the sched_load_balance flag
can be changed at any time without bringing a CPU offline. That means we'd need some notifier
mechanism for killing and restarting workqueue threads when the flag changes. We'd also need
logic that makes sure a user does not disable load balancing on all cpus, because that would
effectively kill workqueues on all the cpus.
This particular case is already handled very nicely in my patches. The isolated bit can be set
only while a cpu is offline, and it cannot be set on the first online cpu. Workqueues and other
subsystems already handle cpu hotplug events nicely and can easily ignore isolated cpus when
they come online.
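The offline-only restriction is enforced where the bit gets flipped; a rough sketch of the
idea (not the exact code from the patches, and treating cpu0 as the "first online cpu"):

/* Sketch: per-cpu "isolated" sysfs attribute. The bit may only change
 * while the cpu is offline, and cpu0 can never be isolated, so at
 * least one cpu always runs workqueues and other per-cpu threads. */
static ssize_t store_isolated(struct sys_device *dev, const char *buf,
			      size_t count)
{
	unsigned int cpu = dev->id;
	int isolated = simple_strtoul(buf, NULL, 10);

	if (cpu_online(cpu))
		return -EBUSY;		/* may only change while offline */
	if (isolated && cpu == 0)
		return -EINVAL;		/* the boot cpu stays non-isolated */

	if (isolated)
		cpu_set(cpu, cpu_isolated_map);
	else
		cpu_clear(cpu, cpu_isolated_map);
	return count;
}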

-----

#1 is probably unfixable. #2 and #3 can be fixed but at the expense of extra complexity across
the board. I seriously doubt that I'll be able to push that through the reviews ;-).

Also, personally I still think cpusets and cpu isolation attack two different problems.
cpusets is about partitioning cpus and memory nodes and managing tasks; most of the
cgroups/cpuset APIs are designed to deal with tasks. CPU isolation is much simpler and sits at
a lower layer: it deals with IRQs, kernel per-cpu threads, etc. The only intersection I see is
that both features affect scheduling domains (and cpu isolation is again simple here, it just
puts cpus into null domains, which is existing logic in sched.c, nothing new).
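(The existing logic I'm referring to looks roughly like this, paraphrased from
arch_init_sched_domains() in kernel/sched.c:)

/* Isolated cpus are excluded before scheduler domains are built, so
 * they end up attached to a null domain and are never load balanced. */
cpumask_t domain_map;

cpus_andnot(domain_map, cpu_online_map, cpu_isolated_map);
build_sched_domains(&domain_map);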
So here are some proposals for how we can make the two play nicely with each other.

--
(A) Make cpusets aware of isolated cpus.
All we have to do here is change guarantee_online_cpus() and common_cpu_mem_hotplug_unplug()
to exclude cpu_isolated_map from cpu_online_map before using it, and change update_cpumasks()
to simply ignore isolated cpus.

That way, if a cpu is isolated it'll be ignored by the cpuset logic, which I believe is the
correct behavior. We're talking about a trivial ~5 line patch which will be a noop if cpu
isolation is disabled.
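A sketch of what that would look like for guarantee_online_cpus() (a hypothetical change
against the current kernel/cpuset.c; the other call sites would get the same treatment):

/* kernel/cpuset.c, sketch of option (A): mask isolated cpus out of the
 * online map before the cpuset code uses it. With CPU isolation
 * disabled cpu_isolated_map is empty, so this is a noop. */
static void guarantee_online_cpus(const struct cpuset *cs, cpumask_t *pmask)
{
	cpumask_t usable;

	cpus_andnot(usable, cpu_online_map, cpu_isolated_map);	/* new */

	while (cs && !cpus_intersects(cs->cpus_allowed, usable))
		cs = cs->parent;
	if (cs)
		cpus_and(*pmask, cs->cpus_allowed, usable);
	else
		*pmask = usable;
	BUG_ON(!cpus_intersects(*pmask, usable));
}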

(B) Ignore the isolated map in cpusets. That's the current state of affairs with my patches
applied. It looks like your customers are happy with what they have now, so they probably won't
enable cpu isolation anyway :).

(C) Introduce a cpu_usable_map. That map would be recomputed on hotplug events; essentially
it'd be cpu_online_map AND ~cpu_isolated_map. Convert things like cpusets to use that map
instead of the online map.
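A sketch of (C), with all names hypothetical:

/* Hypothetical cpu_usable_map: recomputed whenever a cpu goes
 * on/offline or its isolated bit changes. */
cpumask_t cpu_usable_map __read_mostly;

static void update_cpu_usable_map(void)
{
	cpus_andnot(cpu_usable_map, cpu_online_map, cpu_isolated_map);
}

/* cpusets and friends would then test cpu_usable_map where they
 * currently test cpu_online_map. */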

We can probably come up with other options. My preference would be option (A).
I can cook up a patch for this and re-send the patch series. What do you think?

Btw, my impression is that we're talking about very different use cases here. You're talking
about big machines with lots of cpus, and I'm guessing you're probably talking soft RT, perhaps
RT networking services or something like that.

The use case I'm talking about is a dedicated machine for a certain task, like an HW simulator,
a wireless base station with a SW MAC, etc. For this, in any foreseeable future the most common
configuration will be 2-8 cores. cpusets are probably overkill here because apps will want to
manage thread affinities themselves anyway (for example, right now we bind soft-RT threads to
CPU0 and hard-RT threads to CPU1).

Sorry for the typos :)
Max