RE: cPCI Hot Swap for Linux (system-level issues)

William R. Kerr
Wed, 4 Mar 1998 22:53:47 -0800 (PST)

Several people have written asking for a discussion of the system-level issues
in implementing cPCI Hot Swap in an operating system. So here goes.

Some of the issues are fairly obvious: when a card is live-inserted
during operation, an interrupt is generated. This needs to activate some
system-level code that can arrange to locate the new card, and to notify
some device driver that it should attach to a new instance of the device
(or devices, for multi-function cards). Similarly, when the operator
activates the switch built in to the ejector latch, an interrupt will
be generated. Some system-level code needs to identify the card that
originated the signal, and then be able to notify the driver that all
activity currently in progress should be completed, and no new operations
started. When the card is "quiet", the software can illuminate a blue
LED on the faceplate of the card, which indicates to the operator that
software has detached, and the card can be pulled.

For Unix systems, some of the problems that have to be solved at this
layer are (1) devising some means for registering drivers with the
system-level "bus driver" so that devices can be mapped onto drivers and
(2) devising an API whereby drivers can receive event notification, and
(3) mundane details like arranging for device minor numbers and the
corresponding names in the /dev directory to work properly in a more
fluidly dynamic system environment.
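As a very rough sketch of what item (1) and (2) might look like together, here is a toy registration interface in plain C. Every name below (hs_register_driver, hs_match, the struct layouts) is invented for illustration; none of this is an existing kernel interface, and a real design would need locking, reference counting, and multi-function card handling.

```c
/* Hypothetical registration API for a hot-swap "bus driver".
   All identifiers are invented for illustration. */
#include <assert.h>
#include <stddef.h>

enum hs_event { HS_INSERTED, HS_EXTRACTION_REQUESTED };

struct hs_driver {
    const char *name;
    unsigned short vendor_id, device_id;   /* IDs this driver claims   */
    int (*attach)(int slot);               /* new card instance found  */
    int (*detach)(int slot);               /* quiesce before extraction */
};

#define MAX_DRIVERS 16
static struct hs_driver *registry[MAX_DRIVERS];
static int nr_drivers;

/* Drivers call this at load time, so the bus driver can map a newly
   inserted device onto a driver when the insertion interrupt fires. */
int hs_register_driver(struct hs_driver *drv)
{
    if (nr_drivers == MAX_DRIVERS)
        return -1;
    registry[nr_drivers++] = drv;
    return 0;
}

/* Called from the insertion path: find a driver claiming the
   (vendor, device) pair read from the new card's config header. */
struct hs_driver *hs_match(unsigned short vendor, unsigned short device)
{
    for (int i = 0; i < nr_drivers; i++)
        if (registry[i]->vendor_id == vendor &&
            registry[i]->device_id == device)
            return registry[i];
    return NULL;    /* no driver registered for this card */
}
```

The extraction side would use the same registry in reverse: look up the driver attached to the slot that raised the interrupt, call its detach hook, and light the blue LED once it returns.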

All issues of this type can be resolved, and I would urge people with
more recent experience in kernel workings than I to begin drafting
some plans for how this could be cleanly laid out. The goal would be
to write a portable "bus driver" and a set of clearly specified conventions
that writers of device drivers could follow so that drivers could be
written that would work both with normal PCI implementations of the
device as well as with Hot Swap cPCI implementations. I'm not really
qualified to do this, though I would be happy to help in the reviewing
process. Another opportunity for the "bazaar model" of free software!

There is another layer of system-level issues, and I alluded previously to
some of the decisions at this layer that have been made by Microsoft for
NT 5.0. This layer is the basic one of allocating system resources to a
newly inserted card: IRQ's, I/O ports, DMA addresses, etc. First, some
background.

(==== tedious, pedantic background information below ====)

The PCI specification lays out a "configuration space" for devices. This
space is 256 bytes in size, with a standardized layout for the first
0x40 bytes. This header has several read-only fields for manufacturer
ID, device ID, revision number, device class (network class, mass
storage class, bridge, etc.) and so on. There are also several
read/write fields that may be used for configuring the device, i.e.,
allocating system resources for use by the device driver. Some of the
fields have a mixture of read and write bits by which one can both
read the size of the device's status/control register space and also
write the assigned location. For example, on machines with "in" and
"out" instructions, such as x86 processors, one could determine that
a device needed 128 I/O ports allocated in I/O space, and then one
could assign, say, 0x7c00 as the base address of these ports. A similar
mechanism exists for memory-mapping the ports (for machines that
don't have "in" and "out", or by preference).
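The size-then-assign handshake is worth seeing concretely. The write-all-ones-and-read-back behaviour below is what the PCI spec mandates for Base Address Registers (the device hardwires the low address bits to zero, so reading back reveals the decode size); the simulated device model and helper names are my own, standing in for real config-space accesses.

```c
/* Simulated BAR sizing: write all ones, read back, compute the size
   from the lowest writable address bit.  The device model is made up;
   the masking behaviour follows the PCI spec's BAR rules. */
#include <assert.h>
#include <stdint.h>

#define BAR_IO_SPACE 0x1u     /* bit 0 set => this BAR decodes I/O space */
#define IO_SIZE      128u     /* our pretend device needs 128 ports      */

static uint32_t bar;          /* simulated config-space BAR register     */

static void bar_write(uint32_t v)
{
    /* Only address bits at or above the decode size are writable;
       the device hardwires the rest, which is how sizing works. */
    bar = (v & ~(IO_SIZE - 1)) | BAR_IO_SPACE;
}

static uint32_t bar_read(void) { return bar; }

/* The enumerator's side of the handshake. */
static uint32_t probe_bar_size(void)
{
    bar_write(0xFFFFFFFFu);              /* write all ones            */
    uint32_t v = bar_read() & ~0x3u;     /* mask off the flag bits    */
    return ~v + 1;                       /* lowest writable bit = size */
}
```

Running probe_bar_size() on this device yields 128, after which the enumerator can do bar_write(0x7C00) to assign the base address, exactly the 128-ports-at-0x7C00 example above.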

Once the bus has been enumerated in this fashion, and non-overlapping
system resources assigned, device drivers can attach. They read the
resource assignments from the configuration header, and proceed with
initialization.

In PC architecture machines, this allocation is normally performed by
the BIOS, as part of system initialization. It doesn't have to be,
though; any system-level code that has global knowledge of unassigned
resources may perform it. The only tricky part is supporting the
very early PCI devices, manufactured before the PCI spec. was firmed up.
These require special knowledge, rather than the generalized knowledge
of configuration header layout. As far as I know, these older devices are
primarily VGA devices, and it is unlikely that any more machines will be
manufactured that incorporate them.

Next: PCI/PCI bridge devices, which connect two PCI busses. This
class of devices also has a standard configuration header layout, but
it differs from ordinary devices. A bridge must pass PCI cycles through to
the subordinate bus, so it must be configured with the range of resources
allocated to devices on the subordinate bus. Or subordinate busses,
actually, since a device on the lower bus might be another bridge. In
fact, one normally considers that a system has a *tree* of PCI busses,
and the process of enumerating PCI space is actually a recursive activity.
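To make the recursion concrete, here is a toy, user-space model of that enumeration: walk the tree depth-first, hand each device a tightly packed I/O range, and record for each bridge the window covering everything beneath it. The data structures are invented, and real bridge windows have alignment granularity that this model ignores.

```c
/* Toy recursive PCI enumeration over a tree of busses.
   Structures and field names are invented for illustration. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    int is_bridge;
    uint32_t io_size;               /* devices: ports needed          */
    uint32_t io_base;               /* assigned by the enumerator     */
    uint32_t win_base, win_limit;   /* bridges: forwarding window     */
    struct node *child[4];
};

/* Returns the next free address after enumerating this (sub)tree. */
static uint32_t enumerate(struct node *n, uint32_t next_free)
{
    if (!n->is_bridge) {
        n->io_base = next_free;
        return next_free + n->io_size;
    }
    n->win_base = next_free;
    for (int i = 0; i < 4 && n->child[i]; i++)
        next_free = enumerate(n->child[i], next_free);
    n->win_limit = next_free - 1;   /* contiguous by construction */
    return next_free;
}
```

Note that the contiguity of each bridge's window falls out of the depth-first walk for free at boot time; the trouble described below starts only when a device arrives after the walk is done.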

The point I want to make here is that all of the resources allocated
to devices on a subordinate tree of PCI busses must be *contiguous*,
because the bridge must be configured so that it can pass cycles through.
The same goes for DMA space, in which the cycles originate on the devices
and the bridge must decide whether to pass them on to the superior
bus.

Normally, when PCI space is enumerated at system startup, system resources
are allocated to be tightly packed. This creates a problem when, after
a system is up and devices are running, someone inserts a new device.
There might be lots of unallocated resources in the system as a whole,
but there are not likely to be any resources contiguous with those
already configured into a bridge device. In other words, we're stuck.

If this is a bit obscure and murky, let's try a picture.

Bus 0 ------Bridge A-------------------------Bridge B----------------
               |                                |
             Bus 1                            Bus 2
               |                                |
             Device (I/O base 7C00)           Device (I/O base 7D00)
               |
             Device (I/O base 7C80)
               |
           Empty Slot

At boot time two bridges were found, and the bus subordinate to the
bridge labelled "A" had two devices. I/O addresses 7C00 and 7C80 were
allocated for the two devices, and Bridge A was configured to pass
I/O cycles in the address range 7C00-7CFF. The devices on the bus
subordinate to Bridge B were also initialized, and B configured as well.

Now, suppose that Bus 1 is a cPCI Hot Swap bus, and that the operator
live-inserts a board into the Empty Slot. Assuming that I/O space
below 7C00 is allocated to some device not shown in the figure, there
is no unallocated space in the range configured into Bridge A.

As long as I've come this far, one more point: the bus numbers shown
are *assigned* during the initial enumeration, and must also be configured
into the bridges. Bus numbers must be assigned such that all numbers
subordinate to a bridge are contiguous. Now, consider that
some cPCI adapter cards contain a bridge device themselves! (For
example, some cards contain two ethernet controllers on a local PCI
bus, with a bridge being the device presented to the bus into which
the card is plugged.) Now we have to assign a new PCI bus number, but
as you can see from the figure, there are no unassigned numbers adjacent
to "1".

(==== end of tedious, pedantic background information ====)

There are two ways to solve the problem:
1. Dynamic re-enumeration during system operation.
2. Pre-allocation of resources for a set of cards that could
   be live-inserted during operation.

Microsoft has decided on method 1 for NT 5.0, and I think that for the
bulk of their target market, this is the right choice as it is completely
generalized and can, in theory, support any new device just bought at
the local discount electronic outlet. (Of course, one could easily argue
that the MS target market is the entire universe, but that's another
matter.)

In the new PnP (plug and play) architecture, device drivers are to consider
that all bus resources are dynamic, and may be shifted. Microsoft is
developing a new class of "bus drivers", of which the "PCI bus driver" will
be one. The PCI bus driver will be responsible for allocating resources
to devices, and (perhaps not initially, but eventually) will be capable
of dynamically re-enumerating the tree of busses to accommodate inserted
cards.

To support this model, PnP drivers will be told to *suspend operations* until
allowed to restart with a different set of system resources, including
bus number, I/O ports, DMA addresses, and IRQs.
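I can only guess at how that contract looks to a driver, but the shape is roughly this (all names invented; this is not the NT interface or any existing Linux one):

```c
/* Guess at the suspend/restart contract a PnP driver would see
   under model 1.  Everything here is a hypothetical interface. */
#include <assert.h>
#include <stdint.h>

struct resources {
    int bus, irq;
    uint32_t io_base;
};

struct pnp_driver {
    struct resources res;
    int suspended;      /* while set, the driver may issue no I/O */
};

/* Bus driver: stop the device so resources can be shuffled. */
static void pnp_suspend(struct pnp_driver *d)
{
    d->suspended = 1;
}

/* Bus driver: hand over the new assignments and restart.  The driver
   must re-read everything; none of the old values remain valid. */
static void pnp_restart(struct pnp_driver *d, struct resources newres)
{
    d->res = newres;
    d->suspended = 0;
}
```

The key point for what follows is that between suspend and restart the device is dead to its users, which is precisely the behavior real-time and high-availability systems cannot tolerate.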

As I mentioned in my previous post, I don't think that this model is
necessarily appropriate for real-time systems, or other high-availability
applications (like, e.g., telephone switching). Frankly, I think that
most embedded applications would prefer that hot swapping boards not
affect ongoing operations.

On the other hand, end users of embedded systems would probably be
quite satisfied with model 2. In this model, the system-level code that
initially enumerates PCI space and allocates resources could be
configured to reserve a certain amount of resources on one or more
busses, and to configure the affected bridges with the pre-allocated
resources. In this model, devices could be inserted and have PCI
resources assigned without disrupting the operation of other devices.
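A sketch of model 2, again with invented names and an invented configuration format: the enumerator is told, per hot-swap bus, how much I/O space to hold in reserve, sizes the bridge window to cover live devices plus the reserve, and later carves insertions out of the reserve without touching anything else.

```c
/* Sketch of pre-allocation (model 2).  Structure and names invented. */
#include <assert.h>
#include <stdint.h>

struct hotswap_bus {
    uint32_t live_ports;      /* ports used by devices found at boot */
    uint32_t reserve_ports;   /* configured headroom for insertions  */
    uint32_t win_base, win_limit;
    uint32_t next_free;       /* cursor for cards inserted later     */
};

/* At boot: size the bridge window as live + reserve. */
static void preallocate(struct hotswap_bus *b, uint32_t base)
{
    b->win_base  = base;
    b->win_limit = base + b->live_ports + b->reserve_ports - 1;
    b->next_free = base + b->live_ports;
}

/* At insertion time: satisfy the new card from the reserve, with no
   impact on any other device or bridge in the system. */
static int hot_add(struct hotswap_bus *b, uint32_t size, uint32_t *out)
{
    if (b->next_free + size - 1 > b->win_limit)
        return -1;            /* reserve exhausted */
    *out = b->next_free;
    b->next_free += size;
    return 0;
}
```

The obvious cost is that the reserve is a static guess: insert more (or hungrier) cards than were provisioned for and hot_add fails, at which point you are back to model 1. For embedded systems with a known set of field-replaceable boards, that trade seems entirely acceptable.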

[ Disclaimer: I don't *know* that Microsoft's plans don't include
provisions for configurable resource pre-allocation. On one side is
the fact that MS would like to own the universe; on the other side is
the fact that the new PnP architecture is a pretty formidable chunk
of work in its own right.]

Anyway, for PC architecture machines, I've gotten some strong indications
that the BIOS vendors aren't considering adding configurable pre-allocation
to their PCI enumerator code. Their eyes are firmly on Redmond.

But that's okay. The PC architecture is far from the only game in town
in the embedded market, or in the Linux world.

What I would propose is a configurable PCI enumerator that runs very
early during the boot process, before PCI drivers for devices on busses other
than bus 0 attach to devices, and is capable of reserving space for a set
of boards that may be inserted while the system is operational. This,
coupled with the well-thought-out driver API and portable "bus driver"
discussed above, could move Linux firmly into the embedded market.

This sort of project is of moderate scope; not huge, but not something that
can be knocked off in a couple of caffeine-crazed all-nighters, either.
Since it requires cooperative participation from all fine folks who write
device drivers, it really needs to be a group effort. I'm willing to help.
Anyone else want to get the ball rolling?
-- bilker, only occasionally known to the Tristero. Muted post horn, indeed.
