Re: rseq event_counter field

From: Mathieu Desnoyers
Date: Mon Oct 30 2017 - 15:20:50 EST


----- On Oct 30, 2017, at 2:08 PM, Kyle Huey me@xxxxxxxxxxxx wrote:

> On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> Answering both emails here.
>>
>> Also, welcome Kyle. Kyle, how badly does rseq's proposed
>> event_counter break rr? For that matter, how badly does rseq without
>> an event_counter break rr?
>>
>> (Linus, if you care, I'm proposing that rseq is probably fine for
>> 4.15, but that it should be merged without the event_counter. If
>> there is a bona fide use for event_counter along with a benchmark
>> showing that it makes a meaningful difference, then event_counter can
>> be easily added later.)
>>
>> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer <bmaurer@xxxxxx> wrote:
>>> My understanding is that one of the motivations for the event counter was to
>>> make it possible to express things like "increment a per-cpu value" in C.
>>> Without this abstraction it wasn't possible to ensure that the strict
>>> requirements of the rseq ABI were met (contiguous restartable block, single
>>> instruction for the final write). If you write your entire restartable
>>> sequence in assembly there's no need for the event counter. Comparing all
>>> input values as an alternative way of allowing non-assembly sequences seems
>>> like it might have a risk of ABA-style race conditions.
>>>
>>
>> My opinion after looking at a few examples is that event_counter
>> sounds useful but that it's really more of a crutch that enables
>> mediocre userspace code. With event_counter, you can do roughly this:
>>
>> struct rseq_state rseq = begin();
>>
>> int cpu = rseq_getcpu();
>> Calculate some stuff. Be very, very careful when you chase pointers,
>> because it's easy to do it in a very fragile way, as Mathieu's
>> example below does.
>>
>> commit(&rseq, some pointer you found, some value to write, some
>> double-checks to check);
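>>
>> (To make that shape concrete: a minimal sketch in C, assuming
>> hypothetical rseq_start()/rseq_commit() helpers wrapping the begin and
>> commit asm -- none of these names come from the actual patches:)
>>
>> struct freelist_entry { struct freelist_entry *next; };
>>
>> struct freelist_entry *percpu_pop(struct freelist_entry **heads)
>> {
>>     struct freelist_entry *head, *next;
>>     struct rseq_state start;
>>
>>     do {
>>         start = rseq_start();          /* snapshot event_counter + cpu */
>>         head = heads[start.cpu];
>>         if (!head)
>>             return NULL;
>>         next = head->next;             /* the fragile pointer chase */
>>         /* re-check event_counter, then do the single commit store */
>>     } while (!rseq_commit(&start, (void **)&heads[start.cpu], next));
>>
>>     return head;
>> }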
>>
>> With a real database, this style of programming works well. You
>> probably have a progress guarantee, and the database gives you the
>> semantics you want. With rseq, you don't get any of this. If the C
>> code is too slow or does something (an accidental interaction with
>> userfaultfd, perhaps?), then you lose forward progress. You also can
>> be a bit slow in the C code, in which case you'll get more conflicts
>> detected than actually exist.
>>
>> Without event_counter, you can use a different abstraction. In your
>> percpu data, you have something like:
>>
>> struct rseq_lock lock;
>> struct freelist_entry *head;
>>
>> Then you can do:
>>
>> int cpu = rseq_getcpu();
>> struct my_data *data = ... cpu ...;
>> struct rseq_state rseq = rseq_acquire(data->lock);
>>
>> Calculate some stuff, being careful not to touch anything that isn't
>> protected by the rseq_lock you're using.
>>
>> commit(&rseq, cpu, some pointer you found, some value to write, some
>> double-checks to check);
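>>
>> (The same sketch reshaped for this variant -- again with hypothetical
>> helper names; the point is that everything touched sits behind the
>> per-cpu rseq_lock, and only the short commit asm can be aborted:)
>>
>> struct percpu_freelist {
>>     struct rseq_lock lock;
>>     struct freelist_entry *head;
>> };
>>
>> struct freelist_entry *percpu_pop(struct percpu_freelist *lists)
>> {
>>     for (;;) {
>>         int cpu = rseq_getcpu();
>>         struct percpu_freelist *list = &lists[cpu];
>>         struct rseq_state rseq = rseq_acquire(&list->lock);
>>         struct freelist_entry *head = list->head;
>>
>>         if (!head)
>>             return NULL;
>>         /* double-check head in the commit so a concurrent pop on
>>            this cpu is caught before the store lands */
>>         if (commit(&rseq, cpu, (void **)&list->head, head->next, head))
>>             return head;
>>     }
>> }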
>>
>> This variant is indeed a tiny bit slower *if you measure time by
>> number of instructions* -- there are some very clever optimizations
>> that save some instructions in the event_counter variant. But, in the
>> cases where you actually have C code in here, I suspect that the
>> difference won't matter in real code. And the counter-less variant
>> can have arbitrarily slow C code or even syscalls without being
>> spuriously aborted until the very short asm commit sequence starts.
>>
>> BUT, I think that the event_counter-free version is likely to be
>> faster in real code if the library is done well. The version with
>> event_counter has to load the event_counter from the rseq cacheline
>> right at the beginning, and that value is used, so unless the CPU is
>> aggressive enough to speculate through a branch whose inputs aren't
>> in cache yet, you're likely to stall a bit if the per-thread rseq
>> cacheline isn't in L1.
>>
>> The variant without event_counter can avoid this problem as long as
>> the architecture gives a cheap way to find the CPU number. On x86,
>> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
>> any CPU as long as the kernel supports per-cpu FSBASE. (That's a
>> feature that isn't terribly hard to add and would IMO be quite nifty.)
>> If this is done, then you still have to do:
>>
>> __this_thread_rseq->rseq_cs = &the_rseq_cs_here;
>>
>> but that's a *store* to the rseq cacheline. It'll go in the store
>> buffer and, in the fast case, nothing ever depends on it. Hence it's
>> unlikely to stall. Of course, there's still a load from the
>> rseq_lock, but that shares a cacheline with the data that it protects,
>> so it's basically free.
>>
>> IOW, I think that adding event_counter encourages user code that has a
>> weaker progress guarantee, is slower in a very highly optimized
>> implementation, and allows users to think less about what they're
>> doing, to their own detriment, all for the gain of saving a couple of
>> instructions.
>> I could be wrong, but I think it would be nice to see evidence that
>> I'm wrong before event_counter becomes enshrined in the ABI. Also, I
>> suspect that neat tools like rr will have a much easier time dealing
>> with rseq if event_counter isn't involved.
>>
>>>
>>> RDPID is interesting, but it seems like given the current ABI (which
>>> requires setting the current restartable range) you'd still need some sort
>>> of TLS value. Do you have any benchmarks of RDPID perf vs TLS?
>>
>> That's the wrong comparison. For a good userspace implementation
>> (which is very awkward without libc's help), storing to the rseq
>> pointer is:
>>
>> leaq the_rseq_cs(%rip), %rax
>> movq %rax, %gs:__tls_rseq + offsetof(struct rseq, rseq_cs)
>>
>> loading the cpu number is:
>>
>> movq %gs:__tls_rseq + offsetof(struct rseq, cpu), %rax
>>
>> The latter could be rdpid %rax; movl %eax, %eax (if the ABI gets
>> fixed) instead. This avoids reading from the rseq cacheline.
>>
>>
>> On Sat, Oct 28, 2017 at 2:35 AM, Mathieu Desnoyers
>> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>> ----- On Oct 27, 2017, at 6:00 PM, Andy Lutomirski luto@xxxxxxxxxx wrote:
>>>
>>>> On Thu, Oct 26, 2017 at 9:54 AM, Mathieu Desnoyers
>>>> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>>
>>> Let me step back a bit, and try to understand the fundamental
>>> motivation for not letting user-space detect that it has been
>>> preempted.
>>>
>>> My understanding (please let me know where I am mistaken) is that
>>> the motivation for not letting user-space know it was preempted is
>>> to ensure malware cannot cloak (hide) itself by changing its
>>> behavior when it detects that it is being heavily preempted. This
>>> is effectively a way to detect things like single-stepping, strace,
>>> ltrace, and so on.
>>
>> That's about a third of my objection. The next third is that it seems
>> messy to me to intentionally expose what should be an invisible
>> detail, and the last third is that I still haven't seen convincing evidence
>> that anything genuinely needs an event counter. See below and my
>> other email.
>>
>>>
>>> One can also pretty much trivially use rdtsc (x86), the tb register
>>> (powerpc) or the ARM ccnt register (if user access is enabled) to
>>> detect whether one is running under observation.
>>> (this would be your timing-and-guessing approach)
>>>
>>
>> Mostly irrelevant, but rdtsc is trappable.
>>
>>> I hope that very *few* users will completely open-code their asm
>>> sequences. For instance, if you do the whole list pop in asm, you
>>> need to:
>>>
>>> 1) store current rseq_cs pointer
>>> 2) load CPU number from struct rseq,
>>> 3) pointer-chase the right per-cpu data that you want to act on,
>>> 4) load head
>>> 5) compare head against NULL
>>> 6) conditional branch
>>> 7) load head->next
>>> 8) store head->next to head (commit)
>>>
>>> step 3) is the interesting bit here:
>>> It heavily depends on the application per-cpu data layout. It can be
>>> a large array of per-cpu integers, or an array of per-cpu structures
>>> at a specific offset, or an array of per-cpu struct + offset + pointer
>>> to some other structure (dynamically allocated), and so on. I do not
>>> want to impose any restriction on the data structure layout of
>>> user-space applications. This means the asm helpers will have to be
>>> specialized for each per-cpu data structure layout, which is a pain.
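>>>
>>> (For illustration -- these layouts are hypothetical, not from any
>>> patch -- each of the following would need its own pointer-chase in
>>> step 3:)
>>>
>>> long counters[NR_CPUS];  /* flat array of per-cpu integers */
>>> struct { char pad[48]; struct list *l; } slots[NR_CPUS];  /* struct + offset */
>>> struct shard { struct bucket *buckets; } *shards[NR_CPUS];  /* dynamic indirection */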
>>
>> To my point: I don't think this is a particularly good example. I
>> think your examples are buggy (see below), and I would implement this
>> quite differently if I were trying to use C (or even asm, perhaps),
>> like this:
>>
>> 1) load CPU number.
>> 2) pointer-chase the right per-cpu data, in C if desired.
>> 3) store rseq_cs (which pretty much has to be in asm using the current ABI)
>> 4) load head, compare against null, load head->next
>> 5) commit
>>
>> I don't see how the event counter is at all useful here.
>>
>> FWIW, I can easily imagine alternative ABIs in which abort_ip is
>> treated as a *call* instead of a jump to enable much better C support.
>>
>>> Having the event_counter in place allows users to code the
>>> pointer-chasing in pure C, which we can expect will help adoption
>>> and ensure people use rseq through a utility library header file
>>> rather than open-coding everything (with the associated testing
>>> problems).
>>
>> True, but my variant works just as well, might be faster, and doesn't
>> need event_counter.
>>
>>>
>>> If we look at the list pop alternative using C for pointer-chasing,
>>> with event_counter:
>>>
>>> <C>
>>> 1) 64-bit load of both event_counter and cpu_id
>>> 2) Use cpu_id to pointer-chase the per-cpu data
>>> 3) Load head
>>> 4) compare head with NULL
>>> 5) conditional branch
>>> 6) Load head->next
>>
>> ^^^ Potential segfault here. head might have been freed and unmapped.
>>
>>> <ASM>
>>> 7) store current rseq_cs
>>> 8) Load event counter
>>> 9) compare current event_counter with event_counter loaded in 1)
>>> 10) conditional branch
>>> 11) store new value to @loc
>>>
>>> For a total of 2 stores, 4 loads, one or many loads for pointer-chasing,
>>> 2 compares, and 2 cond. branches. Again, we can expect the load in 3) to
>>> stall due to pointer-chasing.
>>>
>>> And here is the resulting list pop where the algorithm is in C,
>>> without the sequence counter. Let's make it a simple case
>>> where the application is not concurrently changing the layout of the
>>> data structures being pointer-chased from a book-keeping
>>> algorithm (because that would add extra compares in the asm).
>>>
>>> <C>
>>> 1) 32-bit load of cpu_id
>>> 2) Use cpu_id to pointer-chase the per-cpu data
>>> 3) Load head
>>> 4) compare head with NULL
>>> 5) conditional branch
>>> 6) Load head->next
>>
>> ^^^ ditto: potential segfault here
>>
>>> <ASM>
>>> 7) store current rseq_cs
>>> 8) 32-bit load of cpu_id
>>> 9) compare cpu_id with prior cpu_id read in 1)
>>> 10) conditional branch
>>> 11) Load head
>>> 12) compare head against head loaded in 3)
>>> 13) conditional branch
>>> 14) Load head->next
>>> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
>>> 16) conditional branch
>>> 17) store head->next as new head
>>>
>>>
>>> Would it be required to change the kernel-user ABI when switching to
>>> RDPID? Can this co-exist with TLS users?
>>
>> RDPID gives you the CPU number without reading memory, and that's
>> pretty much all it does. The format in which it returns it sucks,
>> unfortunately, but I'm going to try to fix that by the time rseq is
>> merged.
>>
>> --Andy
>
> (+roc)
>
> Thanks for roping me in. I'm not sure I fully understand the proposal
> (are there docs somewhere outside of this email thread?)

The tree with my current implementation is at
https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git/

The commit messages and implementation contain comments describing the
architecture of the rseq proposal.

However, the last bits about event_counter removal were discussed between
Andy and me just last week, so it's a bit soon to have this documented. ;)

> but I don't
> think this will be a huge problem for rr because:
>
> 1) Since rseq is initiated through a syscall we can always make that
> fail and deal with not recording programs that use it. As long as it
> doesn't become widely used in glibc or something that's probably
> bearable.

One goal is to make glibc use it for its memory allocator. However, I
would expect glibc to fall back to another allocator if it detects that
rseq is not available, so making the system call return -ENOSYS should
trigger the fallback behavior.
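
A minimal sketch of that detection (the syscall number, argument list,
and struct rseq definition are assumptions about the current patch set,
not a stable ABI):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include "rseq-abi.h"	/* hypothetical header providing struct rseq */

	static __thread volatile struct rseq __rseq_abi;

	static int rseq_register_current_thread(void)
	{
		if (syscall(__NR_rseq, &__rseq_abi, 0) == 0)
			return 1;	/* rseq available and registered */
		if (errno == ENOSYS)
			return 0;	/* old kernel: use the fallback allocator */
		return 0;		/* any other failure: also fall back */
	}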

> 2) We ought to be able to emulate it if we have to, since everything
> is mediated by the kernel. rr might need to be able to see when a
> thread migrates from one CPU to another (via a perf event or
> something, idk if this functionality already exists) and when a rseq
> restart happens (via ptrace or perf events or something).

rseq users export a __rseq_table section entry for each rseq critical
section, which contains the start_ip, post_commit_offset, and abort_ip
fields. This should allow a debugger/single-stepping emulator to
understand that it should expect the kernel to "jump" to the abort_ip,
and thus set breakpoints at the abort_ip to plan for these "unexpected"
branches. It might become useful for architectures that perform single-stepping
by placing explicit breakpoints on the next instruction (and on jump
targets).
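
Roughly, each __rseq_table entry is a descriptor along these lines
(field names as described above; the exact types and layout are per the
patch set headers, so treat this as a sketch):

	struct rseq_cs {
		uint64_t start_ip;		/* first instruction of the sequence */
		uint64_t post_commit_offset;	/* length of the sequence up to
						   and including the commit store */
		uint64_t abort_ip;		/* where the kernel resumes on abort */
	};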

Another feature provided by rseq is the in-TLS "cpu_id" field: the kernel
stores the current cpu_id value into that TLS field whenever it returns
to user-space. If you care about trapping RDPID or the sched_getcpu()
vDSO call, you will probably care about that field too.

>
> RDPID sounds annoying. My copy of the SDM doesn't seem to indicate
> whether it obeys the RDTSC(P) faulting mechanism. Does anyone know
> what microarchitecture introduced that instruction? We could always
> mask off the CPUID bit indicating it's supported.
>
> A similar concept on ARM (the load-linked/store-conditional pattern of
> the LDREX/STREX instructions) doomed our effort to port rr to ARM a few
> years back. The key problems there for us were:
>
> 1. Hardware interrupts in the critical section trigger a failed store
> because the kernel will grab the per-CPU exclusive lock.
> 2. There's no way to observe when a store fails so as to induce the
> same failures during replay, and
> 3. That pattern is used all over the place for every atomic operation on
> ARM so blacklisting or emulating it wasn't feasible.

I'm with you there. I've looked at how gdb deals with ll/sc stuff, and
god it's ugly. It really expects specific patterns, and anything out of
the ordinary would fool it. That's why I decided to expose the
__rseq_table section so a debugger would not have to "guess" rseq asm
sequence patterns.

>
> If everything is mediated by the kernel and the kernel exposes enough
> information through ptrace/perf events/whatever, I think we'll be ok.

The simple solution would be to -ENOSYS sys_rseq(). A more elaborate
solution would be to add breakpoints at each rseq_table abort_ip target,
so you can catch whenever the kernel aborts a rseq critical section.
You would probably have to tweak the rseq TLS cpu_id value presented to
user-space if you care about that too.
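
For instance, since __rseq_table is a named section, an in-process tool
could enumerate the abort targets with the linker-provided __start/__stop
symbols (set_breakpoint() here is a hypothetical debugger hook; rr would
instead read the section from the traced binary):

	extern struct rseq_cs __start___rseq_table[];
	extern struct rseq_cs __stop___rseq_table[];

	static void breakpoint_all_abort_ips(void)
	{
		struct rseq_cs *cs;

		for (cs = __start___rseq_table; cs < __stop___rseq_table; cs++)
			set_breakpoint((void *)(uintptr_t)cs->abort_ip);
	}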

I'm currently working on removing the "event_counter" from rseq.

Let me know if you need clarifications or more explanation on specific
points.

Thanks,

Mathieu

>
> Thanks,
>
> - Kyle

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com