Re: Linux 2.6.17-rc2

From: Piet Delaney
Date: Thu Apr 20 2006 - 19:39:24 EST


On Thu, 2006-04-20 at 15:20 -0700, Linus Torvalds wrote:
>
> On Thu, 20 Apr 2006, Piet Delaney wrote:
> >
> > What about marking the pages Read-Only while it's being used by the
> > kernel
>
> NO!
>
> That's a huge mistake, and anybody that does it that way (FreeBSD) is
> totally incompetent.

Yea, we're not using it either.

>
> Once you play games with page tables, you are generally better off copying
> the data. The cost of doing page table updates and the associated TLB
> invalidates is simply not worth it, both from a performance standpoing and
> a complexity standpoint.

I once wrote some code to find the PTE entries for user buffers;
and as I recall the code was only about 20 lines of code. I thought
only a small part of the TLB had to be invalidated. I never tested
or profiled it and didn't consider the multi-threading issues.

Instead of COW, I just returned information in recvmsg control
structure indicating that the buffer wasn't being use by the kernel
any longer.

I kept the list of pages involved in the zero copy in a structure
and when the kernel was done with the pages it decremented the page
count via a callback, similar to what yzy <yzy@xxxxxxxxxxxxx> discussed
two weeks ago on the linux-net mailing list.

I thought this structure could have pointers to the PTE's and
mmu context to clear the PTE entries. Unfortunately it gets
messy if the zero copy's overlap onto a shared page.

I didn't study the BSD implementation well enough to appreciate
how their COW implementation worked.

>
> Basically, if you want the highest possible performance, you do not want
> to do TLB invalidates. And if you _don't_ want the highest possible
> performance, you should just use regular write(), which is actually good
> enough for most uses, and is portable and easy.

We use a zero copy, and also don't mess with the TLB. In our application
99.99% of the data is looked at but not modified (we are looking through
TCP streams for a security exploitations).

>
> The thing is, the cost of marking things COW is not just the cost of the
> initial page table invalidate: it's also the cost of the fault eventually
> when you _do_ write to the page, even if at that point you decide that the
> page is no longer shared, and the fault can just mark the page writable
> again.

Right, it's difficult for the kernel code to change the involved PTE's
when it's done with a page. Then flushing the TLB's of involved CPU's
adds to the problem.

>
> That cost is _bigger_ than the cost of just copying the page in the first
> place.
>
> The COW approach does generate some really nice benchmark numbers, because
> the way you benchmark this thing is that you never actually write to the
> user page in the first place, so you end up having a nice benchmark loop
> that has to do the TLB invalidate just the _first_ time, and never has to
> do any work ever again later on.
>
> But you do have to realize that that is _purely_ a benchmark load. It has
> absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
> life, COW-faulting overhead is expensive. In real life, TLB invalidates
> (with a threaded program, and all users of this had better be threaded, or
> they are leaving more performance on the floor) are expensive.

Yea, your right, the multi-threading it a real problem,
you would have to send a interrupt with information about which part
of the TLB needs to be invalidated to each CPU.

>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
> Playing games with VM is bad. memory copies are _also_ bad, but quite
> frankly, memory copies often have _less_ downside than VM games, and
> bigger caches will only continue to drive that point home.

Yep, both of the zero copy implementations that I've worked on have
used non-VM techniques to synchronize socket buffer state between the
kernel and user space.

-piet

>
> Linus
--
---
piet@xxxxxxxxxxxx

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/