Re: [PATCH v1 2/2] tests/pid_namespace: add pid_max tests
From: Christian Brauner
Date: Fri Feb 23 2024 - 11:25:26 EST
On Thu, Feb 22, 2024 at 09:54:08AM -0700, Tycho Andersen wrote:
> On Thu, Feb 22, 2024 at 05:09:15PM +0100, Alexander Mikhalitsyn wrote:
> > +static int pid_max_nested_limit_inner(void *data)
> > +{
> > + int fret = -1, nr_procs = 400;
> > + int fd, ret;
> > + pid_t pid;
> > + pid_t pids[1000];
> > +
> > + ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0);
> > + if (ret) {
> > + fprintf(stderr, "%m - Failed to make rootfs private mount\n");
> > + return fret;
> > + }
> > +
> > + umount2("/proc", MNT_DETACH);
> > +
> > + ret = mount("proc", "/proc", "proc", 0, NULL);
> > + if (ret) {
> > + fprintf(stderr, "%m - Failed to mount proc\n");
> > + return fret;
> > + }
> > +
> > + fd = open("/proc/sys/kernel/pid_max", O_RDWR | O_CLOEXEC | O_NOCTTY);
> > + if (fd < 0) {
> > + fprintf(stderr, "%m - Failed to open pid_max\n");
> > + return fret;
> > + }
> > +
> > + ret = write(fd, "500", sizeof("500") - 1);
> > + close(fd);
> > + if (ret < 0) {
> > + fprintf(stderr, "%m - Failed to write pid_max\n");
> > + return fret;
> > + }
> > +
> > + for (nr_procs = 0; nr_procs < 500; nr_procs++) {
> > + pid = fork();
> > + if (pid < 0)
> > + break;
> > +
> > + if (pid == 0)
> > + exit(EXIT_SUCCESS);
> > +
> > + pids[nr_procs] = pid;
> > + }
> > +
> > + if (nr_procs >= 400) {
> > + fprintf(stderr, "Managed to create processes beyond the configured outer limit\n");
> > + goto reap;
> > + }
>
> A small quibble, but I wonder about the semantics here. "You can write
> whatever you want to this file, but we'll ignore it sometimes" seems
> weird to me. What if someone (CRIU) wants to spawn a pid numbered 450
> in this case? I suppose they read pid_max first, they'll be able to
> tell it's impossible and can exit(1), but returning E2BIG from write()
> might be more useful.
That's a good idea. But it's a bit tricky. The straightforward thing is
to walk upwards through all ancestor pid namespaces and use the lowest
pid_max value as the upper bound for the current pid namespace. This
will guarantee that you get an error when you try to write a limit that
you wouldn't be able to reach anyway. The same logic should probably
apply to ns_last_pid as well.
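
To make the walk concrete, here is a minimal userspace sketch. It is not
kernel code: struct pidns_model and both helpers are hypothetical names
invented for illustration, and -E2BIG is used only because that is the
error suggested above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace model, not the kernel's struct pid_namespace. */
struct pidns_model {
	struct pidns_model *parent;	/* NULL for the initial pid namespace */
	int pid_max;
};

/* Lowest pid_max among all ancestors of ns, or INT_MAX if it has none. */
static int ancestor_pid_max_bound(const struct pidns_model *ns)
{
	int bound = INT_MAX;

	assert(ns != NULL);
	for (const struct pidns_model *p = ns->parent; p; p = p->parent)
		if (p->pid_max < bound)
			bound = p->pid_max;

	return bound;
}

/* Model of a pid_max write that rejects values above the ancestor bound. */
static int pidns_model_set_pid_max(struct pidns_model *ns, int val)
{
	if (val > ancestor_pid_max_bound(ns))
		return -E2BIG;

	ns->pid_max = val;
	return 0;
}
```

With an outer namespace limited to 400, as in the test, writing 500 in
the nested namespace would fail in this model and leave the previous
value untouched, while writing 400 or less would succeed.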
However, that still leaves cases where the current pid namespace writes
a pid_max limit that is allowed (IOW, all ancestor pid namespaces are
above that limit) but then immediately afterwards an ancestor pid
namespace lowers its own pid_max limit. So you can always end up in a
scenario where the value you wrote is not the limit that actually
applies.
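
A small userspace sketch of that ordering problem, again purely
illustrative: struct ns_model and effective_pid_max() are hypothetical
names, and the model assumes the limit in force is the minimum over the
namespace and all of its ancestors, evaluated when it is queried rather
than when pid_max was written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a pid namespace hierarchy. */
struct ns_model {
	struct ns_model *parent;	/* NULL for the initial pid namespace */
	int pid_max;			/* value last written in this namespace */
};

/* Limit in force right now: minimum over ns and every ancestor. */
static int effective_pid_max(const struct ns_model *ns)
{
	int limit;

	assert(ns != NULL);
	limit = ns->pid_max;
	for (const struct ns_model *p = ns->parent; p; p = p->parent)
		if (p->pid_max < limit)
			limit = p->pid_max;

	return limit;
}
```

If the nested namespace writes 500 while the ancestor allows more, and
the ancestor then lowers its own limit to 400, the nested namespace
still reads back 500 in this model while effective_pid_max() returns
400. A check at write time alone cannot rule that out.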