Re: rseq with syscall as the last instruction

From: Dmitry Vyukov
Date: Fri Oct 01 2021 - 08:55:47 EST


On Thu, 30 Sept 2021 at 16:01, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Sep 28, 2021 at 11:09:24AM +0200, Dmitry Vyukov wrote:
> > Hi rseq maintainers,
> >
> > I wonder if rseq can be used in the following scenario (or extended to be used).
> > I want to pass extra arguments to syscalls using a kind of
> > side-channel, for example, to say "do fault injection for the next
> > system call", or "trace the next system call". But what counts as the
> > "next" system call must be atomic with respect to signals.
> > Let's say there is shared per-task memory location known to the kernel
> > where these arguments can be stored:
> >
> > __thread struct trace_descriptor desk;
> > prctl(REGISTER_PER_TASK_TRACE_DESCRIPTOR, &desk);
> >
> > then before a system call I can set up the descriptor to enable tracing:
> >
> > desk = ...
> > SYSCALL;
> >
> > The problem is that if a signal arrives between setting up desk and
> > the SYSCALL instruction, we will actually trace some unrelated syscall
> > in the signal handler.
> > Potentially the kernel could switch/restore 'desk' around signal
> > delivery, but it becomes tricky/impossible for signal handlers that do
> > longjmp or mess with PC in other ways; and also would require
> > extending ucontext to include the descriptor information (not sure if
> > it's feasible).
> >
> > So instead the idea is to protect this sequence with rseq that will be
> > restarted on signal delivery:
> >
> > enter rseq critical section with end right after SYSCALL instruction;
> > desk = ...
> > SYSCALL;
> >
> > Then the kernel can simply clear 'desk' on syscall entry.
> >
> > rseq docs seem to suggest that this can work:
> >
> > https://lwn.net/Articles/774098/
> > +Restartable sequences are atomic with respect to preemption (making it
> > +atomic with respect to other threads running on the same CPU), as well
> > +as signal delivery (user-space execution contexts nested over the same
> > +thread). They either complete atomically with respect to preemption on
> > +the current CPU and signal delivery, or they are aborted.
> >
> > But the doc also says that the sequence must not do syscalls:
> >
> > +Restartable sequences must not perform system calls. Doing so may result
> > +in termination of the process by a segmentation fault.
> >
> > The question is:
> > Can this restriction be weakened to allow syscalls as the last instruction?
> > For flags in this case we would pass
> > RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT and
> > RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE, but no
> > RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL.
> >
> > I don't see any fundamental reasons why this couldn't work b/c if we
> > restart only on signals, then once we reach the syscall, rseq critical
> > section is committed, right?
> >
> > Do you have any feeling of how hard it would be to support or if there
> > can be some implementation issues?
>
> IIRC the only enforcement of this constraint is rseq_syscall() (which is
> a NOP when !CONFIG_DEBUG_RSEQ, because performance).
>
> However, since we use regs->ip, which for SYSCALL points to right
> *after* the SYSCALL instruction (for obvious reasons), it will not in
> fact match in_rseq_cs().
>
> And as such, I think your scheme should just work as is. Did you try?

Well, no, I did not try (wasn't sure how to interpret results).
Thanks, we will consider this option as well then.
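
For reference, here is a minimal user-space sketch of the range check
described above. It is a model under stated assumptions, not kernel code:
struct rseq_cs_model is a hypothetical, trimmed-down stand-in for the
kernel's struct rseq_cs, the helper names other than in_rseq_cs() are made
up for illustration, and the 2-byte SYSCALL length assumes x86-64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct rseq_cs:
 * only the fields needed for the range check are modeled. */
struct rseq_cs_model {
	uint64_t start_ip;           /* address of the first instruction */
	uint64_t post_commit_offset; /* length of the critical section in bytes */
	uint64_t abort_ip;           /* where to resume after an abort */
};

/* Same shape as the kernel's check: the critical section is the half-open
 * range [start_ip, start_ip + post_commit_offset). Unsigned subtraction
 * wraps for ip < start_ip, so such addresses also compare as outside. */
static bool in_rseq_cs(uint64_t ip, const struct rseq_cs_model *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/* Assumption: x86-64, where SYSCALL encodes as 0f 05 (2 bytes). */
enum { SYSCALL_INSN_LEN = 2 };

/* Address of a SYSCALL placed as the last instruction of the section.
 * A signal arriving with ip here is still inside the section. */
static uint64_t ip_of_last_syscall(const struct rseq_cs_model *cs)
{
	return cs->start_ip + cs->post_commit_offset - SYSCALL_INSN_LEN;
}

/* regs->ip as seen on syscall entry: it points just past SYSCALL,
 * which is exactly the first address outside the section. */
static uint64_t ip_after_last_syscall(const struct rseq_cs_model *cs)
{
	return cs->start_ip + cs->post_commit_offset;
}
```

With a 16-byte section starting at 0x1000, in_rseq_cs() is true for the
address of the SYSCALL itself (0x100e) and false for the address right
after it (0x1010), so under this model the syscall entry would not be
treated as happening inside the critical section.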