Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

From: Andy Isaacson
Date: Tue May 03 2005 - 13:45:14 EST


On Fri, Apr 29, 2005 at 05:31:44PM -0700, Caitlin Bestler wrote:
> Attempting to provide *any* support for applications that fork children
> after doing RDMA registrations is a ratshole best avoided. The general
> rule that application developers should follow is to do RDMA *only*
> in the child processes.

I think it's unreasonable to *prohibit* fork-after-registration; for one
thing, there's lots of code that forks under the covers. Setuid helpers
like getpty just assume that they're going to be able to fork. Even
stuff like get*by*(3) can potentially fork. And with site-configured
stuff like PAM, you end up with things that work on the developer's
system but break in deployment.

I think it's exceedingly reasonable to say "RDMA doesn't work in
children". But the child should get a sane memory image: at least
zeros in fully-registered pages, and preferably copies of
partially-registered pages. Differentiating between fully-registered
and partially-registered pages avoids (I think) the pathological case of
having to copy a GB of data just to system("/bin/ls > /tmp/tmpfile").
You can still go pathological if you've partially-registered gigabytes
of address space (for example a linked list where each node is allocated
with malloc and then registered) but that's a case of "Well, don't do
that then".

Rather than replacing the fully-registered pages with pages of zeros,
you could simply unmap them.

A consistent statement would be

After fork(2), any regions which were registered are UNDEFINED.
Region boundaries are byte-accurate; a registration can cover just
part of a page, in which case the non-registered part of the page
has normal fork COW semantics.

Probably the most sane solution is to simply unmap the fully-registered
pages at fork time, and copy any partially-registered pages. But the
statement above does not require this.

> Keep in mind that it is not only the memory regions that must be dealt
> with, but control data invisible to the user (the QP context, etc.). This
> data frequently is interlinked between kernel residente and user resident
> data (such as a QP context has the PD ID somewhere on-chip or in
> kernel, which the Send Queue ring needs to be in user memory). Having
> two different user processes that both think they have the user half to
> this type of split data structure is just asking for trouble, even if you
> manage to get the copy on write bit timing problems all solved.

Obviously, calling *any* RDMA-userland-stuff in the child is completely
undefined [1]. One place where I can see a potential problem is in
atexit()-type handlers registered by the RDMA library. Since those
aren't performance-critical they can and should do sanity checks with
getpid() and/or checking with the kernel driver.

[1] You might want to allow the child to start a completely new RDMA
context, but I don't see that as necessary.

-andy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/