Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

From: Timur Tabi
Date: Mon Apr 18 2005 - 11:28:37 EST


Andrew Morton wrote:
Roland Dreier <roland@xxxxxxxxxxx> wrote:

Troy> Do we even need the mlock in userspace then?

Yes, because the kernel may go through and unmap pages from userspace
while trying to swap. Since we have the page locked in the kernel,
the physical page won't go anywhere, but userspace might end up with a
different page mapped at the same virtual address.


That shouldn't happen. If get_user_pages() has elevated the refcount on a
page then the following can happen:

- The VM may decide to add the page to swapcache (if it's not mmapped
from a file).

- Once the page is backed by either swapcache of a (mmapped) file, the VM
may decide the unmap the application's pte's. A later minor fault by the
app will cause the same physical page to be remapped.

That's not what we're seeing. We have hardware that does DMA over the network (much like the Infiniband stuff), and we have a testcase that fails if get_user_pages() is used, but not if mlock() is used. Consider two computers on a network, X and Y. Both have our hardware, which can transfer a page of memory from a given physical address on X to a physical address on Y.

1) Application on X allocates a block of memory, and passes the virtual address to the driver.
2) Driver on X calls get_user_pages() and then obtains a physical address for the memory.
3) Application and driver on Y do the same thing.
4) App X fills memory with some data D.
5) App X then allocates as much memory as it possibly can. It touches every page in this memory, and then frees the memory. This will force other pages to be swapped out, including the supposedly pinned memory.
6) App X then tells Driver X to transfer data D to computer Y.
7) App Y compares data D and finds that it doesn't match with it's supposed to.

Conclusion: during step 5, the data in pinned memory is swapped out or something. I'm not sure where it goes.

We can only demonstrate this problem using our hardware, because you need the ability to transfer memory without using the CPU. We were going to prepare a test case and ship same hardware to a few kernel developers to prove our point, but now that we're able to call mlock() in non-user processes, we decided it wasn't worth our time. Actually, I discovered that I can call cap_raise() and set the ulimit structure, which gives me the ability to call mlock() on any amount of memory from any process in 2.4 and 2.6 kernels, which we need to support. If I had thought of that earlier, I wouldn't have needed all those hacks to call sys_mlock() from the driver.




--
Timur Tabi
Staff Software Engineer
timur.tabi@xxxxxxxxxxx
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/