From markhe@veritas.com Thu Aug 16 20:14:26 2001
Date: Tue, 3 Jul 2001 12:20:47 +0100 (BST)
From: markhe@veritas.com
Subject: Re: [Lse-tech] Dbench scalability results on 2.4.0

Hi Bill,

I've been doing some more work on the kmap_lock.

The first part isn't very interesting - it modifies highmem.c to use a freelist linkage. Under "normal" usage patterns, I wouldn't expect this to give much of an improvement.

The second part is more interesting. :)

Currently, I don't have access to an 8-way box (only a 4-way), but I'm guessing the kmap_lock contention is caused by the flush_tlb_all() which is done with the kmap_lock held.

Now, flush_tlb_all() sends an IPI to all engines (processors) and _waits_ for them all to perform the shootdown. If any of the engines have interrupts blocked, then flush_tlb_all() busy-waits until the interrupt (IPI) is delivered and processed. This can be a significant number of CPU cycles - especially as locks which require interrupts to be blocked spin with ints disabled (imagine such a lock which has contention - it can be quite some time until the last contender breaks out of the critical region and enables ints).

Analysing the usage of flush_tlb_all() shows that it does not need to busy-wait - it can simply send the "shootdown" IPI and continue; it doesn't even need to busy-wait for an ack from the other engines. i.e. flush_tlb_all() can become asynchronous.

For example, think about the flush_tlb_all() for highmem. New mappings cannot be created with interrupts disabled (else the original flush_tlb_all() could deadlock the system), nor from within an interrupt handler. The same goes for "dropping" a mapping, and for gaining a reference to an existing mapping.

In fact, an engine's TLB doesn't need to be flushed until its next call to kmap_high() or until a context-switch occurs on the engine. As the highmem TLB entries are marked "global", we'd need to add an extra test in schedule(), which I'd rather stay away from.
So, instead, we can wait for the flush on an engine to occur when it enables interrupts. This works for the highmem case, and for the other uses of flush_tlb_all(). At least, I believe it does - can anyone find an existing case where it doesn't?

Ingo, you know the relevant code better than anyone else. Is the idea sound?

Does this sound fragile? Yes, it is, but if it improves scalability and is well documented, then it is worth doing.

I've attached a patch against 2.4.5. The original code was pulled from a highly modified tree, but I don't think I've made any mistakes...

Mark
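
For illustration only, and not the attached patch: the deferred flush described above can be modelled in a few lines of userspace C. This is a minimal sketch under stated assumptions. Every name in it (struct engine, the model_* functions, NR_ENGINES) is hypothetical and is not a kernel API, and a real IPI is reduced to a per-engine "pending" flag that is acted on when the engine enables interrupts.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>

#define NR_ENGINES 4

struct engine {
    bool irqs_enabled;   /* can this engine take the IPI right now? */
    bool flush_pending;  /* shootdown IPI posted but not yet handled */
    bool tlb_stale;      /* TLB may still hold old kmap translations */
};

static struct engine engines[NR_ENGINES];

/* What the IPI handler would do: drop the stale translations. */
static void model_local_flush(struct engine *e)
{
    e->tlb_stale = false;
    e->flush_pending = false;
}

static void model_irq_disable(int cpu)
{
    engines[cpu].irqs_enabled = false;
}

/* Enabling interrupts is the point where a pending IPI gets delivered. */
static void model_irq_enable(int cpu)
{
    engines[cpu].irqs_enabled = true;
    if (engines[cpu].flush_pending)
        model_local_flush(&engines[cpu]);
}

/* kmap slots were recycled: every engine's TLB is now suspect. */
static void model_invalidate_mappings(void)
{
    for (int i = 0; i < NR_ENGINES; i++)
        engines[i].tlb_stale = true;
}

/*
 * Asynchronous flush: post the IPI to every other engine and return
 * without waiting for an ack. Returns how many engines deferred it.
 */
static int model_flush_tlb_all_async(int self)
{
    int deferred = 0;

    model_local_flush(&engines[self]);
    for (int i = 0; i < NR_ENGINES; i++) {
        if (i == self)
            continue;
        engines[i].flush_pending = true;
        if (engines[i].irqs_enabled)
            model_local_flush(&engines[i]);  /* IPI taken at once */
        else
            deferred++;                      /* taken at irq enable */
    }
    return deferred;
}

/* Walk through the scenario; returns the number of deferred engines. */
static int model_demo(void)
{
    for (int i = 0; i < NR_ENGINES; i++)
        model_irq_enable(i);
    model_irq_disable(2);          /* engine 2 spins on an irq-safe lock */

    model_invalidate_mappings();
    int deferred = model_flush_tlb_all_async(0);

    assert(engines[2].tlb_stale);  /* stale, but it cannot kmap now */
    model_irq_enable(2);           /* the pending flush runs here */
    assert(!engines[2].tlb_stale);
    return deferred;
}
```

In this model the sender never spins on engine 2. The stale window on that engine is harmless only because of the rule stated above: no mapping can be created, dropped or referenced while interrupts are disabled.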