Re: Integration of SCST in the mainstream Linux kernel

From: Nicholas A. Bellinger
Date: Mon Feb 04 2008 - 14:20:29 EST


On Mon, 2008-02-04 at 11:06 -0800, Nicholas A. Bellinger wrote:
> On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote:
> >
> > On Mon, 4 Feb 2008, James Bottomley wrote:
> > >
> > > The way a user space solution should work is to schedule mmapped I/O
> > > from the backing store and then send this mmapped region off for target
> > > I/O.
> >
> > mmap'ing may avoid the copy, but the overhead of a mmap operation is
> > quite often much *bigger* than the overhead of a copy operation.
> >
> > Please do not advocate the use of mmap() as a way to avoid memory copies.
> > It's not realistic. Even if you can do it with a single "mmap()" system
> > call (which is not at all a given, considering that block devices can
> > easily be much larger than the available virtual memory space), the fact
> > is that page table games along with the fault (and even just TLB miss)
> > overhead is easily more than the cost of copying a page in a nice
> > streaming manner.
> >
> > Yes, memory is "slow", but dammit, so is mmap().
> >
> > > You also have to pull tricks with the mmap region in the case of writes
> > > to prevent useless data being read in from the backing store. However,
> > > none of this involves data copies.
> >
> > "data copies" is irrelevant. The only thing that matters is performance.
> > And if avoiding data copies is more costly (or even of a similar cost)
> > than the copies themselves would have been, there is absolutely no upside,
> > and only downsides due to extra complexity.
> >
>
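Just to put a rough shape on the tradeoff being described above, a crude
userspace sketch along these lines (the backing file path and 1 MiB chunk
size are made-up placeholders, and on a warm page cache the mmap() pass is
measuring the per-page fault and TLB cost rather than the disk) times a
plain read() copy loop against an mmap() walk that only touches one byte
per page:

/* Crude timing sketch: read() copy loop vs. an mmap() walk that only
 * faults pages in.  File path and chunk size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (1 << 20)			/* 1 MiB per read() */

static double seconds(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/tmp/backing.img";
	long page = sysconf(_SC_PAGESIZE);
	char *buf = malloc(CHUNK);
	int fd = open(path, O_RDONLY);
	volatile char sink = 0;
	struct stat st;
	char *map;
	off_t off;
	double t;

	if (fd < 0 || !buf || fstat(fd, &st) < 0)
		return 1;

	/* Copy path: one kernel->user copy per chunk, nicely streaming. */
	t = seconds();
	while (read(fd, buf, CHUNK) > 0)
		;
	printf("read() copy:  %.3f sec\n", seconds() - t);

	/* mmap path: no copy, but a page fault + TLB cost per page touched. */
	t = seconds();
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	for (off = 0; off < st.st_size; off += page)
		sink += map[off];
	munmap(map, st.st_size);
	printf("mmap() fault: %.3f sec\n", seconds() - t);

	close(fd);
	free(buf);
	return 0;
}
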
> The iSER spec (RFC 5046) has the following to say about direct data
> placement in the TCP case:
>
> " Out-of-order TCP segments in the Traditional iSCSI model have to be
> stored and reassembled before the iSCSI protocol layer within an end
> node can place the data in the iSCSI buffers. This reassembly is
> required because not every TCP segment is likely to contain an iSCSI
> header to enable its placement, and TCP itself does not have a
> built-in mechanism for signaling Upper Level Protocol (ULP) message
> boundaries to aid placement of out-of-order segments. This TCP
> reassembly at high network speeds is quite counter-productive for the
> following reasons: wasted memory bandwidth in data copying, the need
> for reassembly memory, wasted CPU cycles in data copying, and the
> general store-and-forward latency from an application perspective."
>
> While this does not bear directly on the kernel vs. userspace discussion
> for a target mode storage engine, the scaling and latency case is easy
> enough to make once we are talking about scaling TCP to 10 Gb/sec
> storage fabrics.
>
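To make that concrete, header-driven placement over a plain TCP socket
looks roughly like the sketch below (a hedged illustration only, not SCST
or core-iscsi code; the BHS field offsets are from RFC 3720, everything
else, including the names, is invented, and digests/AHS are ignored).
The receiver cannot know where a payload byte belongs until it has the
48-byte header in hand, which is exactly why out-of-order segments have
to be reassembled and buffered by TCP before this layer ever sees them:

#include <stdint.h>
#include <unistd.h>

#define ISCSI_BHS_LEN		48
#define ISCSI_OP_SCSI_DATA_IN	0x25

/* Pull exactly len bytes out of the (already reassembled) TCP stream. */
static int read_full(int sock, void *buf, size_t len)
{
	char *p = buf;

	while (len) {
		ssize_t n = read(sock, p, len);

		if (n <= 0)
			return -1;
		p += n;
		len -= n;
	}
	return 0;
}

/* Place one SCSI Data-In PDU's payload into the command's I/O buffer. */
int recv_data_in_pdu(int sock, unsigned char *io_buf, size_t io_buf_len)
{
	unsigned char bhs[ISCSI_BHS_LEN];
	uint32_t dlen, offset;

	if (read_full(sock, bhs, sizeof(bhs)) < 0)
		return -1;
	if ((bhs[0] & 0x3f) != ISCSI_OP_SCSI_DATA_IN)
		return -1;

	/* DataSegmentLength: bytes 5-7, Buffer Offset: bytes 40-43. */
	dlen   = ((uint32_t)bhs[5] << 16) | (bhs[6] << 8) | bhs[7];
	offset = ((uint32_t)bhs[40] << 24) | ((uint32_t)bhs[41] << 16) |
		 (bhs[42] << 8) | bhs[43];
	if (offset > io_buf_len || dlen > io_buf_len - offset)
		return -1;

	/* Placement is only possible now that the header has been seen. */
	return read_full(sock, io_buf + offset, dlen);
}
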
> > If you want good performance for a service like this, you really generally
> > *do* need to be in kernel space. You can play games in user space, but you're
> > fooling yourself if you think you can do as well as doing it in the
> > kernel. And you're *definitely* fooling yourself if you think mmap()
> > solves performance issues. "Zero-copy" does not equate to "fast". Memory
> > speeds may be slower than core CPU speeds, but not infinitely so!
> >
>
> From looking at this problem from a kernel space perspective for a
> number of years, I am inclined to believe this is true for both the
> software and hardware data-path cases. The benefit of moving the
> various control state machines for something like traditional iSCSI to
> userspace has always been debatable. Authentication is the most
> obvious candidate for userspace, especially if something more complex
> than CHAP is involved. However, recovery from failures of a
> communication path (an iSCSI connection) or of an entire nexus (an
> iSCSI session) has always struck me as very problematic to handle
> there, because it means potentially pushing I/O state down to
> userspace.
>
> The protocol and fabric specific state machines (CSM-E and CSM-I from
> connection recovery in iSCSI and iSER are the obvious ones) are the
> best candidates for residing in kernel space.
>
> > (That said: there *are* alternatives to mmap, like "splice()", that really
> > do potentially solve some issues without the page table and TLB overheads.
> > But while splice() avoids the costs of paging, I strongly suspect it would
> > still have easily measurable latency issues. Switching between user and
> > kernel space multiple times is definitely not going to be free, although
> > it's probably not a huge issue if you have big enough requests).
> >
>
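For completeness, the splice() plumbing being referred to looks something
like the following userspace sketch (the function name, chunk size and
file descriptors are invented, and it is not code from any existing
target stack): bulk data is shuttled from a backing file to an
already-connected socket through a pipe, so the payload never crosses
into a userspace buffer.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define SPLICE_CHUNK	(256 * 1024)	/* arbitrary per-splice length */

int splice_file_to_socket(int file_fd, int sock_fd, loff_t *off, size_t len)
{
	int pipefd[2];

	if (pipe(pipefd) < 0)
		return -1;

	while (len) {
		size_t chunk = len < SPLICE_CHUNK ? len : SPLICE_CHUNK;
		/* file -> pipe: pages are referenced, not copied to user. */
		ssize_t in = splice(file_fd, off, pipefd[1], NULL, chunk,
				    SPLICE_F_MOVE | SPLICE_F_MORE);

		if (in <= 0)
			break;

		/* pipe -> socket: drain what was just queued, still with no
		 * userspace copy of the payload itself. */
		while (in > 0) {
			ssize_t out = splice(pipefd[0], NULL, sock_fd, NULL,
					     in, SPLICE_F_MOVE | SPLICE_F_MORE);

			if (out <= 0)
				goto out;
			in -= out;
			len -= out;
		}
	}
out:
	close(pipefd[0]);
	close(pipefd[1]);
	return len ? -1 : 0;
}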

Then again, having part of the bulk I/O data-path of a storage fabric
protocol / state machine in userspace would be really interesting for
something like an SPU-enabled engine on the Cell Broadband Architecture.

--nab


