[PATCH][RFC 23/23]: Support for zero-copy TCP transmit of user spacedata

From: Vladislav Bolkhovitin
Date: Wed Dec 10 2008 - 14:05:16 EST


This patch implements support for zero-copy TCP transmit of user space
data. It is necessary in iSCSI-SCST target driver for transmitting data
from user space buffers, supplied by user space backend handlers. In
this case SCST core needs to know when TCP finished transmitting the
data, so the corresponding buffers can be reused or freed. Without this
patch it isn't possible, so iSCSI-SCST has to use data copying to TCP
send buffers function sock_sendpage(). ISCSI-SCST also works without
this patch, but that this patch gives a nice performance improvement.

In the chosen approach new optional field void *net_priv was added to
struct page. It is enclosed by

#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION),

so if one doesn't need this functionality, net_priv won't consume space
in struct page.

Then, 2 new global callbacks net_get_page_callback and
net_put_page_callback together with 2 new inline functions
net_get_page() and net_put_page() were added. If
CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION not defined
net_get_page() and net_put_page() effectively become get_page() and
put_page() correspondingly.

Those functions, if the corresponding net_get_page_callback or
net_put_page_callback assigned, call it, then do get_page() or put_page().

Then in net/ subdirectory all get_page() calls were replaced by
net_get_page() and put_page() - by net_put_page().

How it works. ISCSI-SCST assigns net_get_page_callback and
net_put_page_callback to its internal functions. Each page before being
sent to TCP's sendpage has net_priv field set to pointer to the
corresponding iSCSI command. Then in each net_get_page_callback handler
reference counter for that command increased and in each
net_put_page_callback - decreased. When it reaches zero, then all the
data for this command were transferred, so the command and its buffer
can be freed.

You can find how it used in the iSCSI-SCST patch (number 21 in this series).

Global callbacks were chosen, because this is the simplest and most
performance effective approach, fully following section 2 subsection 4
of SubmittingPatches file: "Don't over-design". If accepted, iSCSI-SCST
will be the only user of this functionality. Requirements to call
net_set_get_put_page_callbacks() (see comment in the patch) allows to
not protect those callbacks anyhow. Then, if in the future there is
another user of that functionality, it will be possible to convert those
callbacks to RCU-protected list of callbacks. But for now there's no
need to overcomplicate the code.

During development the following approaches were also examined and rejected:

1. Add net_priv analog in struct sk_buff, not in struct page. But then
it would be required that all the pages in each skb must be from the
same originator, i.e. with the same net_priv. It is unpractical to
change all the operations with skb's to forbid merging them, if they
have different net_priv. I tried, but quickly gave up. There are too
many such places in very not obvious code pieces.

2. Have in iSCSI-SCST a hashed list to translate page to iSCSI cmd by a
simple search function. This approach was rejected, because to copy a
page a modern CPU needs using MMX about 1500 ticks. It was observed,
that each page can be referenced by TCP during transmit about 20 times
or even more. So, if each search needs, say, 20 ticks, the overall
search time will be 20*20*2 (to get() and put()) = 800 ticks. So, this
approach would considerably worse performance-wise to the chosen
approach and provide not too much benefit.

Please, if you reject this approach, advice any other way to implement
the required functionality.

Signed-off-by: Vladislav Bolkhovitin <vst@xxxxxxxx>
---
include/linux/mm_types.h | 12 +++++++++++
include/linux/net.h | 40 ++++++++++++++++++++++++++++++++++++++
net/Kconfig | 12 +++++++++++
net/core/skbuff.c | 14 ++++++-------
net/ipv4/Makefile | 1
net/ipv4/ip_output.c | 4 +--
net/ipv4/tcp.c | 8 +++----
net/ipv4/tcp_output.c | 2 -
net/ipv4/tcp_zero_copy.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
net/ipv6/ip6_output.c | 2 -
10 files changed, 129 insertions(+), 15 deletions(-)

diff -upr linux-2.6.26/include/linux/mm_types.h linux-2.6.26/include/linux/mm_types.h
--- linux-2.6.26/include/linux/mm_types.h 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/linux/mm_types.h 2008-07-22 20:30:21.000000000 +0400
@@ -92,6 +92,18 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
+
+#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
+ /*
+ * Used to implement support for notification on zero-copy TCP transfer
+ * completion. It might look as not good to have this field here and
+ * it's better to have it in struct sk_buff, but it would make the code
+ * much more complicated and fragile, since all skb then would have to
+ * contain only pages with the same value in this field.
+ */
+ void *net_priv;
+#endif
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long page_cgroup;
#endif
diff -upr linux-2.6.26/include/linux/net.h linux-2.6.26/include/linux/net.h
--- linux-2.6.26/include/linux/net.h 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/include/linux/net.h 2008-07-29 20:48:07.000000000 +0400
@@ -57,6 +57,7 @@ typedef enum {
#include <linux/random.h>
#include <linux/wait.h>
#include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/mm.h>

struct poll_table_struct;
struct pipe_inode_info;
@@ -354,5 +354,44 @@ extern int net_msg_cost;
extern struct ratelimit_state net_ratelimit_state;
#endif

+#if defined(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION)
+/* Support for notification on zero-copy TCP transfer completion */
+typedef void (*net_get_page_callback_t)(struct page *page);
+typedef void (*net_put_page_callback_t)(struct page *page);
+
+extern net_get_page_callback_t net_get_page_callback;
+extern net_put_page_callback_t net_put_page_callback;
+
+extern int net_set_get_put_page_callbacks(
+ net_get_page_callback_t get_callback,
+ net_put_page_callback_t put_callback);
+
+/*
+ * See comment for net_set_get_put_page_callbacks() why those functions
+ * don't need any protection.
+ */
+static inline void net_get_page(struct page *page)
+{
+ if (page->net_priv != 0)
+ net_get_page_callback(page);
+ get_page(page);
+}
+static inline void net_put_page(struct page *page)
+{
+ if (page->net_priv != 0)
+ net_put_page_callback(page);
+ put_page(page);
+}
+#else
+static inline void net_get_page(struct page *page)
+{
+ get_page(page);
+}
+static inline void net_put_page(struct page *page)
+{
+ put_page(page);
+}
+#endif /* CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_NET_H */
diff -upr linux-2.6.26/net/core/skbuff.c linux-2.6.26/net/core/skbuff.c
--- linux-2.6.26/net/core/skbuff.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/core/skbuff.c 2008-07-22 20:28:41.000000000 +0400
@@ -319,7 +319,7 @@ static void skb_release_data(struct sk_b
if (skb_shinfo(skb)->nr_frags) {
int i;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
}

if (skb_shinfo(skb)->frag_list)
@@ -658,7 +658,7 @@ struct sk_buff *pskb_copy(struct sk_buff

for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
- get_page(skb_shinfo(n)->frags[i].page);
+ net_get_page(skb_shinfo(n)->frags[i].page);
}
skb_shinfo(n)->nr_frags = i;
}
@@ -721,7 +721,7 @@ int pskb_expand_head(struct sk_buff *skb
sizeof(struct skb_shared_info));

for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- get_page(skb_shinfo(skb)->frags[i].page);
+ net_get_page(skb_shinfo(skb)->frags[i].page);

if (skb_shinfo(skb)->frag_list)
skb_clone_fraglist(skb);
@@ -990,7 +990,7 @@ drop_pages:
skb_shinfo(skb)->nr_frags = i;

for (; i < nfrags; i++)
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);

if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);
@@ -1159,7 +1159,7 @@ pull_pages:
k = 0;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
if (skb_shinfo(skb)->frags[i].size <= eat) {
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
eat -= skb_shinfo(skb)->frags[i].size;
} else {
skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
@@ -1916,7 +1916,7 @@ static inline void skb_split_no_header(s
* where splitting is expensive.
* 2. Split is accurately. We make this.
*/
- get_page(skb_shinfo(skb)->frags[i].page);
+ net_get_page(skb_shinfo(skb)->frags[i].page);
skb_shinfo(skb1)->frags[0].page_offset += len - pos;
skb_shinfo(skb1)->frags[0].size -= len - pos;
skb_shinfo(skb)->frags[i].size = len - pos;
@@ -2284,7 +2284,7 @@ struct sk_buff *skb_segment(struct sk_bu
BUG_ON(i >= nfrags);

*frag = skb_shinfo(skb)->frags[i];
- get_page(frag->page);
+ net_get_page(frag->page);
size = frag->size;

if (pos < offset) {
diff -upr linux-2.6.26/net/ipv4/ip_output.c linux-2.6.26/net/ipv4/ip_output.c
--- linux-2.6.26/net/ipv4/ip_output.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/ip_output.c 2008-07-22 20:28:41.000000000 +0400
@@ -1007,7 +1007,7 @@ alloc_new_skb:
err = -EMSGSIZE;
goto error;
}
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
frag = &skb_shinfo(skb)->frags[i];
}
@@ -1165,7 +1165,7 @@ ssize_t ip_append_page(struct sock *sk,
if (skb_can_coalesce(skb, i, page, offset)) {
skb_shinfo(skb)->frags[i-1].size += len;
} else if (i < MAX_SKB_FRAGS) {
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, offset, len);
} else {
err = -EMSGSIZE;
diff -upr linux-2.6.26/net/ipv4/Makefile linux-2.6.26/net/ipv4/Makefile
--- linux-2.6.26/net/ipv4/Makefile 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/Makefile 2008-07-22 20:35:05.000000000 +0400
@@ -50,6 +50,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION) += tcp_zero_copy.o

obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
xfrm4_output.o
diff -upr linux-2.6.26/net/ipv4/tcp.c linux-2.6.26/net/ipv4/tcp.c
--- linux-2.6.26/net/ipv4/tcp.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp.c 2008-07-22 20:28:41.000000000 +0400
@@ -712,7 +712,7 @@ new_segment:
if (can_coalesce) {
skb_shinfo(skb)->frags[i - 1].size += copy;
} else {
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, offset, copy);
}

@@ -917,7 +917,7 @@ new_segment:
goto new_segment;
} else if (page) {
if (off == PAGE_SIZE) {
- put_page(page);
+ net_put_page(page);
TCP_PAGE(sk) = page = NULL;
off = 0;
}
@@ -958,9 +958,9 @@ new_segment:
} else {
skb_fill_page_desc(skb, i, page, off, copy);
if (TCP_PAGE(sk)) {
- get_page(page);
+ net_get_page(page);
} else if (off + copy < PAGE_SIZE) {
- get_page(page);
+ net_get_page(page);
TCP_PAGE(sk) = page;
}
}
diff -upr linux-2.6.26/net/ipv4/tcp_output.c linux-2.6.26/net/ipv4/tcp_output.c
--- linux-2.6.26/net/ipv4/tcp_output.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp_output.c 2008-07-22 20:28:41.000000000 +0400
@@ -854,7 +854,7 @@ static void __pskb_trim_head(struct sk_b
k = 0;
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
if (skb_shinfo(skb)->frags[i].size <= eat) {
- put_page(skb_shinfo(skb)->frags[i].page);
+ net_put_page(skb_shinfo(skb)->frags[i].page);
eat -= skb_shinfo(skb)->frags[i].size;
} else {
skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i];
diff -upr linux-2.6.26/net/ipv4/tcp_zero_copy.c linux-2.6.26/net/ipv4/tcp_zero_copy.c
--- linux-2.6.26/net/ipv4/tcp_zero_copy.c 2008-07-22 20:12:35.000000000 +0400
+++ linux-2.6.26/net/ipv4/tcp_zero_copy.c 2008-07-31 21:21:13.000000000 +0400
@@ -0,0 +1,49 @@
+/*
+ * Support routines for TCP zero copy transmit
+ *
+ * Created by Vladislav Bolkhovitin
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/skbuff.h>
+
+net_get_page_callback_t net_get_page_callback __read_mostly;
+EXPORT_SYMBOL(net_get_page_callback);
+
+net_put_page_callback_t net_put_page_callback __read_mostly;
+EXPORT_SYMBOL(net_put_page_callback);
+
+/*
+ * Caller of this function must ensure that at the moment when it's called
+ * there are no pages in the system with net_priv field set to non-zero
+ * value. Hence, this function, as well as net_get_page() and net_put_page(),
+ * don't need any protection.
+ */
+int net_set_get_put_page_callbacks(
+ net_get_page_callback_t get_callback,
+ net_put_page_callback_t put_callback)
+{
+ int res = 0;
+
+ if ((net_get_page_callback != NULL) && (get_callback != NULL) &&
+ (net_get_page_callback != get_callback)) {
+ res = -EBUSY;
+ goto out;
+ }
+
+ if ((net_put_page_callback != NULL) && (put_callback != NULL) &&
+ (net_put_page_callback != put_callback)) {
+ res = -EBUSY;
+ goto out;
+ }
+
+ net_get_page_callback = get_callback;
+ net_put_page_callback = put_callback;
+
+out:
+ return res;
+}
+EXPORT_SYMBOL(net_set_get_put_page_callbacks);
diff -upr linux-2.6.26/net/ipv6/ip6_output.c linux-2.6.26/net/ipv6/ip6_output.c
--- linux-2.6.26/net/ipv6/ip6_output.c 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/ipv6/ip6_output.c 2008-07-22 20:28:41.000000000 +0400
@@ -1349,7 +1349,7 @@ alloc_new_skb:
err = -EMSGSIZE;
goto error;
}
- get_page(page);
+ net_get_page(page);
skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
frag = &skb_shinfo(skb)->frags[i];
}
diff -upr linux-2.6.26/net/Kconfig linux-2.6.26/net/Kconfig
--- linux-2.6.26/net/Kconfig 2008-07-14 01:51:29.000000000 +0400
+++ linux-2.6.26/net/Kconfig 2008-07-29 21:15:39.000000000 +0400
@@ -59,6 +59,18 @@ config INET

Short answer: say Y.

+config TCP_ZERO_COPY_TRANSFER_COMPLETION_NOTIFICATION
+ bool "TCP/IP zero-copy transfer completion notification"
+ depends on INET
+ default SCST_ISCSI
+ ---help---
+ Adds support for sending a notification upon completion of a
+ zero-copy TCP/IP transfer. This can speed up certain TCP/IP
+ software. Currently this is only used by the iSCSI target driver
+ iSCSI-SCST.
+
+ If unsure, say N.
+
if INET
source "net/ipv4/Kconfig"
source "net/ipv6/Kconfig"


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/