Re: pselect/etc semantics

From: Eric W. Biederman
Date: Thu May 30 2019 - 09:05:12 EST


Oleg Nesterov <oleg@xxxxxxxxxx> writes:

> Al, Linus, Eric, please help.
>
> The previous discussion was very confusing, we simply can not understand each
> other.
>
> To me everything looks very simple and clear, but perhaps I missed something
> obvious? Please correct me.

If I have read this thread correctly the core issue is that ther is a
program that used to work and that fails now. The question is why.

There are two ways the semantics for a sigmask changing system call
can be defined. The naive way and by reading posix. I will pick
on pselect.

The naive way:
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
fd_set *exceptfds, const struct timespec *timeout,
const sigset_t *sigmask)
{
sigset_t oldmask;
int ret;

if (sigmask)
sigprocmask(SIG_SETMASK, sigmask, &oldmask);

ret = select(nfds, readfds, writefds, exceptfds, timeout);

if (sigmask)
sigprocmask(SIG_SETMASK, &oldmask, NULL);

return ret;
}

The standard for pselect behavior at
https://pubs.opengroup.org/onlinepubs/009695399/functions/pselect.html
says:
> If sigmask is not a null pointer, then the pselect() function shall
> replace the signal mask of the caller by the set of signals pointed to
> by sigmask before examining the descriptors, and shall restore the
> signal mask of the calling thread before returning.

...

> On failure, the objects pointed to by the readfds, writefds, and
> errorfds arguments shall not be modified. If the timeout interval
> expires without the specified condition being true for any of the
> specified file descriptors, the objects pointed to by the readfds,
> writefds, and errorfds arguments shall have all bits set to 0.


So the standard specified behavior is if the return value is -EINTR you
know you were interrupted by a signal, for any other return value you
don't know.

An error value just indicates that the file descriptor sets have not
been updated.


That is what the standard and reasonable people will expect from the
system call. Looking at set_user_sigmask and restore_user_sigmask
I believe we are don't violate those semantics today.

Which means I believe we have a semantically valid change in behavior
that is causing a regression.

>From my inspection of the code the change in behavior is from these
two pieces of code:

>From the v4.18 epoll_pwait.

/*
* If we changed the signal mask, we need to restore the original one.
* In case we've got a signal while waiting, we do not restore the
* signal mask yet, and we allow do_signal() to deliver the signal on
* the way back to userspace, before the signal mask is restored.
*/
if (sigmask) {
if (error == -EINTR) {
memcpy(&current->saved_sigmask, &sigsaved,
sizeof(sigsaved));
set_restore_sigmask();
} else
set_current_blocked(&sigsaved);
}

/*
* restore_user_sigmask:
* usigmask: sigmask passed in from userland.
* sigsaved: saved sigmask when the syscall started and changed the sigmask to
* usigmask.
*
* This is useful for syscalls such as ppoll, pselect, io_pgetevents and
* epoll_pwait where a new sigmask is passed in from userland for the syscalls.
*/
void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved)
{

if (!usigmask)
return;
/*
* When signals are pending, do not restore them here.
* Restoring sigmask here can lead to delivering signals that the above
* syscalls are intended to block because of the sigmask passed in.
*/
if (signal_pending(current)) {
current->saved_sigmask = *sigsaved;
set_restore_sigmask();
return;
}

/*
* This is needed because the fast syscall return path does not restore
* saved_sigmask when signals are not pending.
*/
set_current_blocked(sigsaved);
}

Which has been reported results in a return value of 0, and a signal
delivered, where previously that did not happen.

They only way I can see that happening is that set_current_blocked in
__set_task_blocked clears pending signals (that will be blocked) and
calls retarget_shared_pending.

Frankly the only reason this appears to be worth touching is that we
have a userspace regression. Otherwise I would call the current
behavior more correct and desirable than ignoring the signal longer.

If I am reading sitaution properly I suggest we go back to resoring the
sigmask by hand in epoll_pwait, and put in a big fat comment about a
silly userspace program depending on that behavior.

That will be easier to backport and it will just fix the regression and
not overfix the problem for reasonable applications.

Oleg your cleanup also seems reasonable. But I would like to make
certain we understand and fix the regression first. We seem to be
talking very small windows for both cases.

Eric