Re: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT.

From: Maciej Żenczykowski
Date: Tue Apr 27 2021 - 17:55:53 EST


On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@xxxxxxxxxxxx> wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
>
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.

This is IMHO a userspace bug.

You should never be closing or creating new SO_REUSEPORT sockets on a
running server (listening port).

There are at least three ways to accomplish this.

One involves a shim parent process that takes care of creating the
sockets (without close-on-exec), then fork-execs the actual server
process[es] (which will use the already opened listening fds), and can
thus re-fork-exec a new child while using the same set of sockets.
Here the old server can terminate before the new one starts.

(one could even envision systemd being modified to support this...)
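
A minimal sketch of the shim-parent idea, in Python for brevity. The
names (CHILD_SOURCE, make_listener, spawn_server) are made up for
illustration, the "server" is a stand-in that handles a single connection,
and a plain listening socket is used since fd inheritance does not depend
on SO_REUSEPORT. It assumes Linux and Python 3.7+.

```python
import socket
import subprocess
import sys

# Hypothetical stand-in for the real server binary: it adopts the inherited
# listening fd (argv[1]), accepts one connection, and replies with argv[2].
CHILD_SOURCE = """
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
conn.sendall(sys.argv[2].encode())
conn.close()
"""


def make_listener():
    """Create the listening socket once, in the shim parent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    sock.listen(16)
    return sock


def spawn_server(listener, tag):
    """fork+exec one server generation that inherits the already-open fd."""
    fd = listener.fileno()
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(fd), tag],
        pass_fds=(fd,),  # keep this fd open (not close-on-exec) in the child
    )


def demo():
    listener = make_listener()
    port = listener.getsockname()[1]

    old = spawn_server(listener, "old")
    with socket.create_connection(("127.0.0.1", port), timeout=10) as c:
        first = c.recv(64).decode()
    old.wait()  # the old generation is gone; nobody is accepting right now

    # Connect during the gap. The shim still holds the listener, so the
    # handshake completes in the kernel and the connection waits in the
    # accept queue until the next generation picks it up.
    pending = socket.create_connection(("127.0.0.1", port), timeout=10)
    new = spawn_server(listener, "new")
    second = pending.recv(64).decode()
    pending.close()
    new.wait()

    listener.close()
    return first, second


if __name__ == "__main__":
    print(demo())
```

The shim never closes the listener between generations, which is why the
connection made while no server is running is served rather than reset.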

The second involves the old running server fork-execing the new server
and handing off the non-CLOEXEC sockets that way.

The third approach involves unix fd passing of sockets to hand off the
listening sockets from the old process/thread(s) to the new
process/thread(s). Once handed off the old server can stop accept'ing
on the listening sockets and close them (the real copies are in the
child), finish processing any still active connections (or time them
out) and terminate.
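
A rough sketch of the fd-passing handoff, assuming Python 3.9+ (for
socket.send_fds / socket.recv_fds) on Linux. To keep it short, both "old"
and "new" sides live in one process and talk over a socketpair; between
real processes the same calls would run over a connected AF_UNIX socket.
The function names and the b"listener" tag are illustrative only.

```python
import socket


def hand_off(listener, channel):
    """Old server side: pass the listening fd along, then drop our copy."""
    socket.send_fds(channel, [b"listener"], [listener.fileno()])
    # Not the final copy: the fd in flight keeps the socket alive until
    # the receiver picks it up.
    listener.close()


def take_over(channel):
    """New server side: receive the fd and wrap it in a socket object."""
    msg, fds, _flags, _addr = socket.recv_fds(channel, 1024, 1)
    if msg != b"listener" or len(fds) != 1:
        raise RuntimeError("unexpected handoff message")
    return socket.socket(fileno=fds[0])


def demo():
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: kernel picks a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    # A client connects before the handoff and sits in the accept queue.
    client = socket.create_connection(("127.0.0.1", port), timeout=5)

    hand_off(listener, old_end)
    new_listener = take_over(new_end)
    same_port = new_listener.getsockname()[1] == port

    # The queued connection survives the handoff and is served by the
    # new owner of the socket.
    conn, _ = new_listener.accept()
    conn.sendall(b"served by new")
    reply = client.recv(64)

    for s in (conn, client, new_listener, old_end, new_end):
        s.close()
    return reply, same_port


if __name__ == "__main__":
    print(demo())
```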

Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
count), nor closing the final copy of a given socket.
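
To illustrate the "final copy" point, here is a small sketch (helper names
are made up, loopback on Linux is assumed): closing one descriptor of a
listening socket is harmless while a dup is still open, and only closing
the last copy makes the port stop accepting.

```python
import socket


def can_connect(port):
    """Return True if a TCP connect to 127.0.0.1:port succeeds."""
    try:
        socket.create_connection(("127.0.0.1", port), timeout=2).close()
        return True
    except OSError:
        return False


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: kernel picks a free port
    listener.listen(16)
    port = listener.getsockname()[1]

    copy = listener.dup()  # second descriptor for the same socket
    listener.close()       # closes one descriptor, not the socket itself
    while_copy_open = can_connect(port)

    copy.close()           # final copy gone: the listener is torn down
    after_final_close = can_connect(port)
    return while_copy_open, after_final_close


if __name__ == "__main__":
    print(demo())
```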

This is basically the same thing that was needed not to lose incoming
connections in a pre-SO_REUSEPORT world.
(No, SO_REUSEADDR by itself doesn't prevent an incoming SYN from
triggering a RST during the server restart; it just makes the window
when RSTs happen shorter.)

This is how it was envisioned to work from day one (I reported to Tom
and worked with him on the very initial distribution function),
and we (Google) have always used it with unix fd handoff to support
transparent restart.