{free|clear}_page_tables

Jakub Jelinek (jj@sunsite.ms.mff.cuni.cz)
Tue, 13 Jan 1998 18:50:05 +0100 (MET)


Hi!

A few days ago I wrote new page tables for sparc64. The new scheme uses
32bit pgd_t and pmd_t instead of the former 64bit ones, and the TLB miss
handler was sped up, so I expected the system to get measurably faster.
But what I found (lat_proc is your friend) is that although several things
did get faster, fork+exit went from 239usec to 263usec, which is way worse.
So I did some tests on where the time is spent and found that about 160usec
were spent in free_page_tables.
Just to test it, I wrote a small hack in free_page_tables which does the
following: for 32bit processes, free only the page tables of the 32bit
address space, not the full 44bit address space, and read the pmd entries
two by two (as they are now 32bit), testing the combined 64bit value to see
whether both are pmd_none, which generally is the case. With this, fork+exit
went down to 173usec, which is SIGNIFICANTLY better. Now, if I look at what
free_page_tables (and clear_page_tables likewise) does, it BLOWS up the whole
D-cache by reading all the used pgd and pmd tables fully into core, only to
find out that usually nearly the whole PAGE_SIZE is full of pgd_none resp.
pmd_none entries.
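
Just for illustration, here is a minimal sketch of that two-by-two test,
assuming 32bit pmd entries stored back to back and pmd_none meaning an
all-zero entry (pmd_pair_none and scan_pmd_page are made-up names, not the
actual patch):

#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch is self-contained; the real
 * pmd_t and PTRS_PER_PMD come from the sparc64 headers. */
typedef uint32_t pmd_t;
#define PTRS_PER_PMD	2048

/* Two adjacent 32bit pmd slots are both pmd_none iff the combined
 * 64bit word is zero. */
static int pmd_pair_none(const pmd_t *pmd)
{
	uint64_t pair;

	memcpy(&pair, pmd, sizeof(pair));	/* one 64bit load's worth */
	return pair == 0;
}

/* Walk a pmd page two entries at a time, skipping the common empty case. */
static void scan_pmd_page(pmd_t *pmd_page)
{
	int i;

	for (i = 0; i < PTRS_PER_PMD; i += 2) {
		if (pmd_pair_none(pmd_page + i))
			continue;	/* both entries empty, nothing to free */
		/* ... otherwise free whatever these two entries point to ... */
	}
}
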
My initial idea was to clear only those regions which have vm_areas covering
them (something like a free_page_range), but apparently zap_page_range does
not clean up after itself, so this cannot be used.
But is there any reason why zap_page_range does not clean up after itself?

In the current code on most platforms, yes, there is a reason.
But David Miller wrote a nice trick for us on sparc64 for
pgd_alloc/pmd_alloc/pte_alloc, which makes all these things
a) much faster
b) use less memory for the page tables of a process which mmaps a large
piece of something and then munmaps it.
How it basically works: when the current mm code pgd/pmd/pte_frees a piece
of memory, that memory is already initialized to the state where it is full
of pgd/pmd/pte_none entries, which is exactly what pgd/pmd/pte_alloc expects.
So he keeps a simple, fast cache of the last pgd/pmd/pte pages freed, serves
them on demand, and only allocates and initializes a new page when no more
entries are left in the cache. If the pgd/pmd/pte caches are per-CPU, this
can be done with no locking at all, and it is really a four-instruction
piece of code, so it should not make things slower for zap_page_range.
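
A minimal sketch of that per-CPU cache idea as I understand it (the names
and the NR_CPUS value here are made up, this is not DaveM's actual code):

#define NR_CPUS	32	/* placeholder, the real constant is per-arch */

/* Freed page-table pages are still in the "all entries none" state, so
 * the allocator can hand them straight back without re-clearing them.
 * One list per CPU means no locking is needed on the fast path. */
struct pt_cache {
	unsigned long *head;	/* pages linked through their first word */
	unsigned long count;
};

static struct pt_cache pgd_cache[NR_CPUS];

/* Fast-path alloc: pop a ready-to-use page, or return 0 so the caller
 * falls back to get_free_page() plus a full clear. */
static unsigned long *pgd_alloc_fast(int cpu)
{
	struct pt_cache *c = &pgd_cache[cpu];
	unsigned long *page = c->head;

	if (page) {
		c->head = (unsigned long *)*page;	/* unlink */
		*page = 0;	/* restore the "none" entry used as the link */
		c->count--;
	}
	return page;
}

/* Fast-path free: the page is still full of none entries, so just
 * thread it onto the list through its first word. */
static void pgd_free_fast(int cpu, unsigned long *page)
{
	struct pt_cache *c = &pgd_cache[cpu];

	*page = (unsigned long)c->head;
	c->head = page;
	c->count++;
}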

So, now my questions:
If I write the pgd/pmd/pte per-CPU caches for all the other ports
(alpha-sparc), would it be possible to have zap_page_range pmd/pte_free
after itself (by checking the previous and next vm areas and taking their
vm_end resp. vm_start into account), or are there any special reasons why
this is a bad idea?
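
To make concrete what I mean by taking the neighbours into account, here is
a rough sketch with a made-up helper and made-up PMD_* values: when
unmapping [start, end), a pte table may only be freed if neither the
previous nor the next vm_area still maps into the address block it covers
(pass 0 resp. ~0UL if there is no neighbour).

#define PMD_SHIFT	23UL			/* example value only */
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define PMD_MASK	(~(PMD_SIZE - 1))

static void free_pte_tables_in_range(unsigned long start, unsigned long end,
				     unsigned long prev_vm_end,
				     unsigned long next_vm_start)
{
	/* The previous vma reaches into the block containing start:
	 * keep that block's pte table and begin at the next boundary. */
	if (prev_vm_end > (start & PMD_MASK))
		start = (start & PMD_MASK) + PMD_SIZE;
	else
		start &= PMD_MASK;

	/* The next vma starts inside the block containing end:
	 * keep that block's pte table and stop at its start. */
	if (next_vm_start < ((end + PMD_SIZE - 1) & PMD_MASK))
		end &= PMD_MASK;
	else
		end = (end + PMD_SIZE - 1) & PMD_MASK;

	for (; start < end; start += PMD_SIZE) {
		/* ... clear the pmd entry for this block and pte_free()
		 *     the table it pointed to ... */
	}
}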

If we do that, then we can get rid of the costly
free_page_tables/clear_page_tables and thus, on my fairly slow 167MHz Ultra,
get to about 100usec fork+exit latency, possibly faster, as we won't blow up
the caches (the 173usec I reported was with the D-cache bloat, as it still
read all the pmd pages fully from memory).

Or, if there are reasons why this is a bad idea, can we at least have
architecture-specific clear/free_page_tables, so architectures can do some
ugly tricks to make them faster? Or should I try to code that trick in some
way that is generally usable for other 64bit architectures (I doubt it is
possible, but I might try)?

Cheers,
Jakub
___________________________________________________________________
Jakub Jelinek | jj@sunsite.mff.cuni.cz | http://sunsite.mff.cuni.cz
Administrator of SunSITE Czech Republic, MFF, Charles University
___________________________________________________________________
Ultralinux - first 64bit OS to take full power of the UltraSparc
Linux version 2.0.32 on a sparc machine (291.64 BogoMips).
___________________________________________________________________