Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter

From: Pavel Emelyanov
Date: Tue Mar 11 2014 - 14:19:59 EST

Next message: Jon Ringle: "Re: [PATCH] RFC: WIP: sc16is7xx [v0.4]"
Previous message: Bjorn Helgaas: "Re: [PATCH 0/3] amd/pci: Add AMD hostbridge supports for newer AMD systems"
In reply to: Alexei Starovoitov: "Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter"
Next in thread: Eric Dumazet: "Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 03/11/2014 10:03 PM, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2014 at 10:40 AM, Pavel Emelyanov <xemul@xxxxxxxxxxxxx> wrote:
>> On 03/10/2014 02:00 AM, Daniel Borkmann wrote:
>>> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@xxxxxxxxxxxxx> wrote:
>>>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>>>
>>>>>> Extended BPF extends old BPF in the following ways:
>>>>>> - from 2 to 10 registers
>>>>>>   Original BPF has two registers (A and X) and hidden frame pointer.
>>>>>>   Extended BPF has ten registers and read-only frame pointer.
>>>>>> - from 32-bit registers to 64-bit registers
>>>>>>   semantics of old 32-bit ALU operations are preserved via 32-bit
>>>>>>   subregisters
>>>>>> - if (cond) jump_true; else jump_false;
>>>>>>   old BPF insns are replaced with:
>>>>>>   if (cond) jump_true; /* else fallthrough */
>>>>>> - adds signed > and >= insns
>>>>>> - 16 4-byte stack slots for register spill-fill replaced with
>>>>>>   up to 512 bytes of multi-use stack space
>>>>>> - introduces bpf_call insn and register passing convention for zero
>>>>>>   overhead calls from/to other kernel functions (not part of this patch)
>>>>>> - adds arithmetic right shift insn
>>>>>> - adds swab32/swab64 insns
>>>>>> - adds atomic_add insn
>>>>>> - old tax/txa insns are replaced with 'mov dst,src' insn
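
The jump item in the list above (two-target jumps replaced by one target plus
fall-through) can be sketched as follows. This is a minimal illustration, not
the patch's converter: the struct names and convert_jump() are hypothetical,
and it assumes every other instruction maps one to one, so offsets need no
further remapping.

```c
#include <stdint.h>

/* Hypothetical, simplified instruction shapes; not the kernel's structs. */
struct old_jmp { uint8_t jt; uint8_t jf; };      /* classic: two relative targets */
struct ext_insn { int is_cond; int16_t off; };   /* extended: one target, else fall through */

/*
 * Rewrite one classic conditional jump into the fall-through form.
 * Offsets are relative to the instruction that follows the jump.
 * Returns the number of extended instructions emitted (1 or 2).
 */
static int convert_jump(struct old_jmp in, struct ext_insn out[2])
{
	if (in.jf == 0) {
		/* 'false' already means "next insn": one conditional jump is enough */
		out[0] = (struct ext_insn){ .is_cond = 1, .off = in.jt };
		return 1;
	}
	/* conditional jump for 'true' (skips the extra insn), then a plain jump for 'false' */
	out[0] = (struct ext_insn){ .is_cond = 1, .off = (int16_t)(in.jt + 1) };
	out[1] = (struct ext_insn){ .is_cond = 0, .off = in.jf };
	return 2;
}
```

In this sketch a jump whose 'false' target is the next instruction costs a
single extended instruction; only a jump with two distinct far targets needs
the extra unconditional jump.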
>>>>>>
>>>>>> Extended BPF is designed to be JITed with one to one mapping, which
>>>>>> allows GCC/LLVM backends to generate optimized BPF code that performs
>>>>>> almost as fast as natively compiled code
>>>>>>
>>>>>> sk_convert_filter() remaps old style insns into extended:
>>>>>> 'sock_filter' instructions are remapped on the fly to
>>>>>> 'sock_filter_ext' extended instructions when
>>>>>> sysctl net.core.bpf_ext_enable=1
>>>>>>
>>>>>> Old filter comes through sk_attach_filter() or
>>>>>> sk_unattached_filter_create()
>>>>>> if (bpf_ext_enable) {
>>>>>>     convert to new
>>>>>>     sk_chk_filter() - check old bpf
>>>>>>     use sk_run_filter_ext() - new interpreter
>>>>>> } else {
>>>>>>     sk_chk_filter() - check old bpf
>>>>>>     if (bpf_jit_enable)
>>>>>>         use old jit
>>>>>>     else
>>>>>>         use sk_run_filter() - old interpreter
>>>>>> }
>>>>>>
>>>>>> sk_run_filter_ext() interpreter is noticeably faster
>>>>>> than sk_run_filter() for two reasons:
>>>>>>
>>>>>> 1. fall-through jumps
>>>>>>    Old BPF jump instructions are forced to go either 'true' or 'false'
>>>>>>    branch which causes branch-miss penalty.
>>>>>>    Extended BPF jump instructions have one branch and fall-through,
>>>>>>    which fit CPU branch predictor logic better.
>>>>>>    'perf stat' shows drastic difference for branch-misses.
>>>>>>
>>>>>> 2. jump-threaded implementation of interpreter vs switch statement
>>>>>>    Instead of single tablejump at the top of 'switch' statement, GCC will
>>>>>>    generate multiple tablejump instructions, which helps CPU branch
>>>>>>    predictor
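
To make the second point concrete, here is a toy contrast between the two
dispatch styles. It is only a sketch: the three opcodes, struct insn and both
function names are invented for illustration, and the threaded variant relies
on the "labels as values" extension supported by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-opcode toy machine, only to contrast the dispatch styles. */
enum { OP_ADD, OP_MUL, OP_RET };
struct insn { uint8_t op; int32_t imm; };

/* Switch-based: every instruction returns to one shared dispatch point. */
static int64_t run_switch(const struct insn *pc)
{
	int64_t acc = 0;

	for (;; pc++) {
		switch (pc->op) {
		case OP_ADD: acc += pc->imm; break;
		case OP_MUL: acc *= pc->imm; break;
		case OP_RET: return acc;
		default:     return -1;
		}
	}
}

/* Jump-threaded: each handler ends with its own indirect jump. */
static int64_t run_threaded(const struct insn *pc)
{
	static const void *const jump[] = { &&do_add, &&do_mul, &&do_ret };
	int64_t acc = 0;

/* Bounds-check the opcode, then jump straight to its handler. */
#define DISPATCH() do { assert(pc->op <= OP_RET); goto *jump[pc->op]; } while (0)
	DISPATCH();
do_add:
	acc += pc->imm; pc++; DISPATCH();
do_mul:
	acc *= pc->imm; pc++; DISPATCH();
do_ret:
	return acc;
#undef DISPATCH
}
```

Both functions compute the same result for the same program. The difference
is that run_threaded() has one indirect jump per handler, so the predictor
can track each one separately.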
>>>>>>
>>>>>> Performance of two BPF filters generated by libpcap was measured
>>>>>> on x86_64, i386 and arm32.
>>>>>>
>>>>>> fprog #1 is taken from Documentation/networking/filter.txt:
>>>>>> tcpdump -i eth0 port 22 -dd
>>>>>>
>>>>>> fprog #2 is taken from 'man tcpdump':
>>>>>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>>>>> ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>>>>>
>>>>>> Other libpcap programs have similar performance differences.
>>>>>>
>>>>>> Raw performance data from BPF micro-benchmark:
>>>>>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>>>>>> time in nsec per call, smaller is better
>>>>>> --x86_64--
>>>>>>                fprog #1   fprog #1    fprog #2   fprog #2
>>>>>>                cache-hit  cache-miss  cache-hit  cache-miss
>>>>>> old BPF            90        101         192        202
>>>>>> ext BPF            31         71          47         97
>>>>>> old BPF jit        12         34          17         44
>>>>>> ext BPF jit       TBD
>>>>>>
>>>>>> --i386--
>>>>>>                fprog #1   fprog #1    fprog #2   fprog #2
>>>>>>                cache-hit  cache-miss  cache-hit  cache-miss
>>>>>> old BPF           107        136         227        252
>>>>>> ext BPF            40        119          69        172
>>>>>>
>>>>>> --arm32--
>>>>>>                fprog #1   fprog #1    fprog #2   fprog #2
>>>>>>                cache-hit  cache-miss  cache-hit  cache-miss
>>>>>> old BPF           202        300         475        540
>>>>>> ext BPF           180        270         330        470
>>>>>> old BPF jit        26        182          37        202
>>>>>> new BPF jit       TBD
>>>>>>
>>>>>> Tested with trinity BPF fuzzer
>>>>>>
>>>>>> Future work:
>>>>>>
>>>>>> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>>>>>>
>>>>>> 1. add extended BPF JIT for x86_64
>>>>>>
>>>>>> 2. add inband old/new demux and extended BPF verifier, so that new
>>>>>>    programs can be loaded through old sk_attach_filter() and
>>>>>>    sk_unattached_filter_create() interfaces
>>>>>>
>>>>>> 3. tracing filters systemtap-like with extended BPF
>>>>>>
>>>>>> 4. OVS with extended BPF
>>>>>>
>>>>>> 5. nftables with extended BPF
>>>>>>
>>>>>> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
>>>>>> Acked-by: Hagen Paul Pfeifer <hagen@xxxxxxxx>
>>>>>> Reviewed-by: Daniel Borkmann <dborkman@xxxxxxxxxx>
>>>>>
>>>>>
>>>>> One more question or possible issue that came to my mind: When
>>>>> someone attaches a socket filter from user space and bpf_ext_enable=1,
>>>>> the old filter is transparently converted to the new representation.
>>>>> If user space (e.g. through checkpoint/restore) then issues
>>>>> sk_get_filter(), we call sk_decode_filter() on sk->sk_filter and
>>>>> try to decode what we stored in insns_ext[] on the assumption that
>>>>> it is still the old code. Would that actually crash (or leak memory,
>>>>> or just return garbage), as we access the decodes[] array with
>>>>> filt->code? Would be great if you could double-check.
>>>>
>>>> ohh. yes. missed that.
>>>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>>>> This way the user space can see how old bpf filter was converted.
>>>>
>>>> Of course we can allocate extra memory and keep original bpf code there
>>>> just to return it via sk_get_filter(), but that seems overkill.
>>>
>>> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
>>> filter program (v2)").
>>>
>>> I think the issue is that when applications get migrated from one
>>> machine to another whose kernel doesn't support ebpf yet, the filter
>>> cannot be loaded there, since sk_get_filter() is expected to return
>>> what the user loaded. The trade-off, however, is that the original
>>> BPF code needs to be stored as well. :(
>>
>> Sorry if I miss the point, but isn't the original filter kept on socket?
>> The sk_attach_filter() does so, then calls __sk_prepare_filter, which
>> in turn calls bpf_jit_compile(), and the latter two keep the insns in place.
>
> Yes. In the V8/V9 series the original filter is kept on the socket.

Ah, I see :)

> and your crtools/test/zdtm/live/static/socket_filter.c test passes.
> Let me know if there are any other tests I can try.

No, that's the only test we need wrt sk-filter.
Thanks for keeping an eye on it :)
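
For readers following the thread, the arrangement settled on here (keep the
originally attached program next to the converted one, and return the
original on read-back) can be sketched in user-space C as below. This is an
illustration under stated assumptions, not kernel code: struct
attached_filter, filter_attach() and filter_get() are hypothetical names, and
only the (code, jt, jf, k) layout of an old instruction follows classic BPF.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Classic BPF instruction tuple; everything else here is invented for illustration. */
struct old_insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

struct attached_filter {
	struct old_insn *orig;   /* verbatim copy of what user space attached */
	size_t orig_len;         /* number of instructions in orig */
	/* the converted, internal representation would live alongside */
};

/* Attach: keep a private copy of the user's program. Returns 0 or -1. */
static int filter_attach(struct attached_filter *f,
			 const struct old_insn *prog, size_t len)
{
	f->orig = malloc(len * sizeof(*prog));
	if (!f->orig)
		return -1;
	memcpy(f->orig, prog, len * sizeof(*prog));
	f->orig_len = len;
	return 0;
}

/*
 * Read back: hand out the saved original, never the converted form.
 * Returns the instruction count, or -1 if the caller's buffer is too small.
 */
static long filter_get(const struct attached_filter *f,
		       struct old_insn *buf, size_t buflen)
{
	if (buflen < f->orig_len)
		return -1;
	memcpy(buf, f->orig, f->orig_len * sizeof(*buf));
	return (long)f->orig_len;
}
```

The cost is the extra copy that Daniel pointed out. The benefit is that a
checkpointed filter can be restored on a kernel that only understands the old
format.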