Re: new IRQ scalability changes in 2.3.48

From: Ingo Molnar (mingo@chiara.csoma.elte.hu)
Date: Tue Mar 14 2000 - 16:32:17 EST


On Tue, 14 Mar 2000 yodaiken@fsmlabs.com wrote:

> But the SMP system should do n-times as many user<->kernel transitions
> and so should still run into preemption points at a high speed: even
> if each processor schedules less often.

yes, but as i outlined in the dual-CPU case, this is a statistical thing.
There are workloads which are in high-latency code paths, and there having
2 CPUs does not help - 100msec or 50msec doesnt matter much, the sound
buffer is getting starved. There will also be still 100msec peaks at a 50%
chance (compared to the nonpreemptive 1-CPU case). I dont think this needs
many arguing, mixing two bad latency kernel 'point of executions' just
does not help, it decreases noise but not conceptually. Latencies will get
shorter by sqrt(2) (~50%) on average, if i remember correctly. With N CPUs
it's getting better by sqrt(N), so a 4-CPU box will have twice as good
average latencies than a 1-CPU box. The peaks will be still just as bad,
and users will just hear half as many sound skipping as before, but it
wont be gone.

> > of course it has more footprint than 0 instructions [ == the current SMP
> > kernel].
>
> I thought that we had had this conversation in another setting and you
> had argued that cache bloat was absolutely critical path.

yes, thats possible :-) but here there is a (i believe big) quality
difference, and in that case i believe the argument was about some minor
microoptimization. But i do agree that doing the 'complete' preemptible
solution adds more per-spinlock-operation (cache footprint) overhead than
i'd like to have. Changing the icache footprint of spinlocks by 1 byte
adds about 5k more kernel image size, so we are very sensitive on that
one. One solution would be to make spinlocks a (highly optimized) function
call [ducking down]:

 de8: 68 00 00 00 00 pushl $0x0
 ded: e8 e2 ff ff ff call dd4 <__spin_lock>
 de8: 68 00 00 00 00 pushl $0x0
 ded: e8 e2 ff ff ff call dd4 <__spin_unlock>

20 bytes icache footprint. (__spin_lock is a special function
auto-cleaning the stack so the addl $4, %esp is not needed.)

A typical spinlock aquire+release currently is:

 de8: f0 0f ba 2d 00 00 00 lock btsl $0x0,0x0
 def: 00 00
 df1: 0f 82 a0 00 00 00 jb e97 <dummyy+0xaf>
 df7: f0 0f ba 35 00 00 00 lock btrl $0x0,0x0
 dfe: 00 00

24 bytes. So a function-call based spin lock/unlock has ~20% less icache
footprint. (but obviously it has higher overhead)

the function-call based spinlock implementation could be made even lower
overhead for global spinlocks, where a helper function could do the
lock/unlock:

        spin_lock_tasklist();
        spin_unlock_tasklist();

this variant is half the icache footprint of inlined spinlocks.

        Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed Mar 15 2000 - 21:00:28 EST