Re: [RFC PATCH] bpf: Add helpers to read useful task_struct members

From: Alexei Starovoitov
Date: Sat Nov 04 2017 - 17:11:12 EST


On 11/5/17 2:31 AM, Naveen N. Rao wrote:
Hi Alexei,

Alexei Starovoitov wrote:
On 11/3/17 3:58 PM, Sandipan Das wrote:
For added security, the layout of some structures can be
randomized by enabling CONFIG_GCC_PLUGIN_RANDSTRUCT. One
such structure is task_struct. To build BPF programs, we
use Clang which does not support this feature. So, if we
attempt to read a field of a structure with a randomized
layout within a BPF program, we do not get the expected
value because of incorrect offsets. To observe this, it
is not mandatory to have CONFIG_GCC_PLUGIN_RANDSTRUCT
enabled because the structure annotations/members added
for this purpose are enough to cause this. So, all kernel
builds are affected.

For example, considering samples/bpf/offwaketime_kern.c,
if we try to print the values of pid and comm inside the
task_struct passed to waker() by adding the following
lines of code at the appropriate place

char fmt[] = "waker(): p->pid = %u, p->comm = %s\n";
bpf_trace_printk(fmt, sizeof(fmt), _(p->pid), _(p->comm));

it is seen that upon rebuilding and running this sample
followed by inspecting /sys/kernel/debug/tracing/trace,
the output looks like the following

_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID CPU# |||| TIMESTAMP FUNCTION
| | | |||| | |
<idle>-0 [007] d.s. 1883.443594: 0x00000001: waker():
p->pid = 0, p->comm =
<idle>-0 [018] d.s. 1883.453588: 0x00000001: waker():
p->pid = 0, p->comm =
<idle>-0 [007] d.s. 1883.463584: 0x00000001: waker():
p->pid = 0, p->comm =
<idle>-0 [009] d.s. 1883.483586: 0x00000001: waker():
p->pid = 0, p->comm =
<idle>-0 [005] d.s. 1883.493583: 0x00000001: waker():
p->pid = 0, p->comm =
<idle>-0 [009] d.s. 1883.503583: 0x00000001: waker():
p->pid = 0, p->comm =
<idle>-0 [018] d.s. 1883.513578: 0x00000001: waker():
p->pid = 0, p->comm =
systemd-journal-3140 [003] d... 1883.627660: 0x00000001: waker():
p->pid = 0, p->comm =
systemd-journal-3140 [003] d... 1883.627704: 0x00000001: waker():
p->pid = 0, p->comm =
systemd-journal-3140 [003] d... 1883.627723: 0x00000001: waker():
p->pid = 0, p->comm =

To avoid this, we add new BPF helpers that read the
correct values for some of the important task_struct
members such as pid, tgid, comm and flags which are
extensively used in BPF-based analysis tools such as
bcc. Since these helpers are built with GCC, they use
the correct offsets when referencing a member.

Signed-off-by: Sandipan Das <sandipan@xxxxxxxxxxxxxxxxxx>
...
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f90860d1f897..324508d27bd2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -338,6 +338,16 @@ union bpf_attr {
* @skb: pointer to skb
* Return: classid if != 0
*
+ * u64 bpf_get_task_pid_tgid(struct task_struct *task)
+ * Return: task->tgid << 32 | task->pid
+ *
+ * int bpf_get_task_comm(struct task_struct *task)
+ * Stores task->comm into buf
+ * Return: 0 on success or negative error
+ *
+ * u32 bpf_get_task_flags(struct task_struct *task)
+ * Return: task->flags
+ *

I don't think it's a solution.
Tracing scripts read other fields too.
Making it work for these 3 fields is a drop in a bucket.

Indeed. However...

If randomization is used I think we have to accept
that existing bpf scripts won't be usable.

... the actual issue is that randomization isn't necessary for this to
show up. The annotations added to mark off the structure members results
in some structure members being moved into an anonymous structure, which
would then get padded differently. So, *all* kernels since v4.13 are
affected, afaict.

hmm. why would all 4.13+ be affected?
It's just an anonymous struct inside task_struct.
Are you saying that due to clang not adding this 'struct { };' treatment to task_struct?
I thought such struct shouldn't change layout.
If it is we need to fix include/linux/compiler-clang.h to do that
anon struct as well.

As such, we wanted to propose this as a short term solution, but I do
agree that this doesn't solve the real issue.

Long term solution is to support 'BPF Type Format' or BTF
(which is old C-Type Format) for kernel data structures,
so bcc scripts wouldn't need to use kernel headers and clang.
The proper offsets will be described in BTF.
We were planning to use it initially to describe map key/value,
but it applies for this case as well.
There will be a tool that will take dwarf from vmlinux and
compress it into BTF. Kernel will also be able to verify
that BTF is a valid BTF.

This is the first that I've heard about BTF. Can you share more details
about it, or point me to some place where it has been discussed?

We considered having tools derive the structure offsets from debuginfo,
but debuginfo may not always be present on production systems. So, it
isn't clear if having that dependency is fine. I'm not sure how BTF will
be different.

It was discussed at this year plumbers:
https://lwn.net/Articles/734453/

btw the name BTF is work in progress. We started with CTF, but
it conflicts with all other meanings of this abbreviation.
Likely we will call it something different at the end.

Initial goal was to describe key/map values of bpf maps to
make debugging easier, but now we want to use this compressed
type format for tracing as well, since installing kernel headers
everywhere doesn't scale well while CTF can be embedded in vmlinux

We were also thinking to improve verifier with CTF knowledge too.
Like if CTF describes that map value is two u32, but bpf program
is doing 8-byte access then something is wrong and either warn
or reject such program.