Cache flushing...

David S. Miller (davem@caip.rutgers.edu)
Sat, 1 Jul 1995 23:32:36 -0400


I'm in the process of adding finer grained cache/TLB flushing hooks to
the Linux memory management (the Sparc port will need this as will
other ports which have to deal with virtually indexed caches and
extended TLB replacement). I jotted down the various situations in
which a cache needs to be flushed, and how much so that I could more
easily attack the code. I present here my little document. It is by
no means complete, nor is it completely Linux specific (I even venture
into non-COW based paging systems) Enjoy.

Later,
David S. Miller
davem@caip.rutgers.edu

----------------------
This document is seperated by type of cache:

1) Pure Virtual Caches

Here the cache line entries are indexed and tagged by virtual
addresses and as such cause many headaches for the kernel.

a) Context Switch - If a write-through cache is present
all that needs to be done is an invalidation of every
cache line during the switch. If a write-back (AKA
copy-back) cache is used, the a validation of main
memory must occur then an invalidation of all the
cache lines as in the write-through cache. Most
write-back caches provide a means to do these two
operations in one flush.

b) Fork - If COW is used, then as the parents pages must be
flushed from the cache. If they were not then if the
parent (not the child) were the next to run, and it invoked
a data access that flushed out a previously modified line
in the cache the cache flush with the modified line would
cause a bogus COW fault to happen, and the parent would
lose the modification it just made. I call these VAC COW
data bombs. Note that with a COW kernel invalidation of
these parent cache lines is not necessary, only a
validation (flush) to main memory of modified data. If
the kernel does not support COW, then the temporary pages
the kernel virtually maps to copy the parent address space
into the childs will have to be flushed before the
temporary mappings are taken away. Context switch flushes
take care of any other ambiguities and bad aliases that
could occur during a fork().

c) Exec - A full cache invalidation of the processes address
space must be performed as to prevent the new program from
generating hits with old data/instructions.

d) Exit - The last operation of exit() is to context switch
which will flush the cache. However, since we have wiped
out all the mappings of this process there is no place to
flush the modified pages in a write-back cache. Therefore
it is necessary to perform a complete invalidation of the
cache at the end of exit() without main memory validation.

e) Sbrk/brk - Increasing the processes BSS area during these
calls requires no flushing at all. However, when areas
become unmapped, an invalidation of those cache lines
need to be done lest the process incur hits on unmapped
virtual regions.

f) Shared Memory - Bad aliasing can occur if one process
maps a shared region in two places in it's virtual
mappings. If the cache is write-back in nature and uses
write-allocate for replacement, this can be avoided by
making sure that all mappings of the same shared region
within a process will index to the same cache lines. In
this scenerio if the same pages within this shared region
are accessed, the cache's line replacement will insure data
consistancy. If the cache is write-back without
write-allocate, then the only way around this is to only
allow one of the shared region instances to be mapped at
one time, incurring a fault for the process if one of the
other instances is accessed thus flushing the cache, and
unmapping the previously present shared region instance.
This of course incurs a huge performance penalty.
Detaching a shared region requires the same flushing
necessary for a shrinking bss segment during sbrk/brk
system calls as outlined above.

g) System Calls - If the cache makes no differentiation
between cache lines filled by the user and the kernel
then an explicit flush of the entire cache must be done
upon return from a system call to userland. Also, if
a supervisor bit is provided in the cache lines and
the cache is write-back in nature, then any data the
kernel copies into user space during the system calls
must be explicitly flushed because during line replace-
ment the user will not be able to translate the kernel
virtual addresses into the physical address where the
line needs to be written to main memory.

2) Virtual caches with process ID's in the tags.

These types of virtual caches add extra bits to the line tags
which identify which process the line belongs to. Usually there are a
limited number of these 'process keys' thus they have to be allocated
and deallocated on a 'need to use' basis when there are more processes
than keys available which is usually the case. Main point is that
this disallows a hit on the cache for one processes for a line filled
by another process. Usually in this setup the MMU uses the same key
set for sets of address translations so that line replacement works
for each process. If this is not the case, things become very
complicated.

a) Context Switch - If a write-through cache is being used, no
explicit flush is necessary at all during a context
switch. However, if a write-back cache is used, a
validation of main memory is required before each switch so
that modified lines are written back to main memory.
Regardless of whether the cache is write-back or
write-through in nature, when process keys are re-allocated
from one process to another the kernel must first flush all
modified data in the cache back to main memory (for a
write-back cache) then invalidate all the lines owned by
the process losing it's key (for both types of caches)
before giving the key to the new process.

b) Fork - If COW is not employed by the kernel and a
write-back cache is used, an explicit flush of the cache is
necessary after the parents address space is copied the
childs by the kernel since it is not always necessary to
flush during context switch time. Even if COW is used, the
parents cache lines must be flushed when using a write-back
cache to prevent unwanted VAC flush COW faults.

c) Exec - Just like the pure virtual cache, the cache lines
corresponding to the processes old mappings must be
invalidated completely from the cache. Since this prevents
any unwanted stale cache entries assosciated with the
processes key, the key can be safely reused in the new
mappings since they will initially be all misses when it
begins execution.

d) Exit - Since flushing doesn't always happen during a
context switch, an explicit invalidation of the processes
cache lines must be done by the kernel as the last step of
the exit().

e) Brk/Sbrk - Refer to the pure virtual cache techniques for
sbrk/brk handling above, they apply the same for virtual
caches with process key tags.

f) Shared memory - For shared mappings pointing to the same
physical pages within one process, the same handling
techniques apply as for the pure virtual cache. Since
context switches don't necessarily flush the cache, shared
mappings shared between processes can produce bad aliases
and inconsistancies. The simplest method to eliminate this
problem is to flush the cache during context switch time
(either the whole cache, or just the shared regions of the
current process, whichever is smaller). Or, the shared
regions can be made uncacheable in their page table
entries, this is also a performance lose. If the cache is
direct mapped, the kernel could eliminate the context
switch flushing by making sure that every shared mapping
would index the same lines in the cache (ie. the beginning
or each shared region within a virtual address space is
either the same or the difference between any two mappings
is modulo the caches size thus guarenteeing a cache miss
and thus a replacement). This will not work for write-back
caches as data inconsistancies will happen, a context
switch flush or making the shared pages non-cacheable will
be necessary.

3) Virtually indexed caches, tagged by physical address

These caches have their lines indexed by virtual address,
however the tag bits are from the physical address after the virtual
address is translated in the MMU.

a) Context switch - A flush should not be necessary at all
during context switch time. However, under certain
circumstances it will be necessary for shared mappings.

b) Fork - No flushing necessary.

c) Exec - No flushing necessary.

d) Exit - No flushing necessary.

e) Shared memory - Shared regions between processes must
either be mapped at the same exact virtual addresses or
addresses that are exactly some number of the cache size
apart from each other. If this is not enforced, an
explicit cache flush during context switches is necessary.
For mappings that occur twice within the same processes
address space are handled the same way as for pure virtual
caches.

4) Physical caches

Here, the lines are indexed by physical addresses, and the
tags are based upon the physical address too. Usually these caches
are external or are part of a multi-level caching architecture.

a) Context switch - No flushing.

b) Fork - No flushing.

c) Exec - No flushing.

d) Exit - No flushing.

e) Brk/sbrk - No flushing.

f) Shared memory - No flushing, and no restrictions as to the
virtual at which a shared region can be mapped.

As you can see, as you add more information into the cache
line tags, the less often an explicit flush is needed by the operating
system. Some further optimizations are possible to decrease the
flushing necessary by the kernel.

First, for physically indexed caches it is beneficial to use a
technique called page coloring. Every physical page the operating
system has in it's free page pool is assigned a 'color' which is
usually just an integer. Pages that map into the same cache line will
have the same color. The number of colors is bounded by the number of
pages that could fit in the cache at once. When allocating free pages
to a process you iterate through the color numbers and try to allocate
pages with different colors each time, per process. This can greatly
reduce cache thrashing for physical caches.

Secondly, cache aligning data structures within the kernel can
help greatly. Particularly the process table entries should be
aligned on cache lines if the size of one entry does not exceed the
size of a line. Otherwise the extra space wasted for padding is
probably not warranted.

Lastly, bounding cache flushes by the size of the cache can
greatly reduce the cache flushing overhead. If a region that needs to
be flushed from the cache exceeds the total size of the cache, only
an entire cache flush is necessary. This is pretty straight forward.