Re: [PATCH bpf-next v7 0/3] Support storing struct task_struct objects as kptrs

From: David Vernet
Date: Fri Nov 18 2022 - 10:08:26 EST


On Thu, Nov 17, 2022 at 10:04:27PM -0800, John Fastabend wrote:

[...]

> > > And last thing I was checking is because KF_SLEEPABLE is not set
> > > this should be blocked from running on sleepable progs which would
> > > break the call_rcu in the destructor. Maybe small nit, not sure
> > > its worth it but might be nice to annotate the helper description
> > > with a note, "will not work on sleepable progs" or something to
> > > that effect.
> >
> > KF_SLEEPABLE is used to indicate whether the kfunc _itself_ may sleep,
> > not whether the calling program can be sleepable. call_rcu() doesn't
> > block, so no need to mark the kfunc as KF_SLEEPABLE. The key is that if
> > a kfunc is sleepable, non-sleepable programs are not able to call it
> > (and this is enforced in the verifier).
>
> OK but should these helpers be allowed in sleepable progs? I think
> not. What stops this, (using your helpers):
>
>   cpu0                                  cpu1
>   ----
>  v = insert_lookup_task(task)
>  kptr = bpf_kptr_xchg(&v->task, NULL);
>  if (!kptr)
>    return 0;
>                                        map_delete_elem()
>                                          put_task()
>                                            rcu_call
>  do_something_might_sleep()
>                                              put_task_struct
>                                                ... free
>  kptr->[free'd memory]
>
> the insert_lookup_task will bump the refcnt on the acquire on map
> insert. But the lookup doesn't do anything to the refcnt and the
> map_delete_elem will delete it. We have a check for spin_lock
> types to stop them from being in sleepable progs. Did I miss a
> similar check for these?

So, in your example above, bpf_kptr_xchg(&v->task, NULL) will atomically
xchg the kptr from the map, and so the map_delete_elem() call would fail
with (something like) -ENOENT. In general, the semantics are similar to
std::unique_ptr::swap() in C++.
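
To make the ownership point concrete, here is a minimal userspace sketch
of the exchange semantics using C11 atomics. It is only an analogy, not
the kernel implementation: struct map_value, slot_xchg() and xchg_demo()
are hypothetical names made up for this illustration, and a plain int
stands in for the task. Whoever exchanges the pointer out first owns it,
and anyone who looks at the slot afterwards sees NULL:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a map value with a single kptr slot. */
struct map_value {
	_Atomic(void *) task;
};

/* Userspace analogue of an xchg on the slot: store new, return old. */
void *slot_xchg(struct map_value *v, void *new)
{
	return atomic_exchange(&v->task, new);
}

/*
 * Two takers exchange NULL into the same slot, one after the other.
 * Only the first gets the pointer; the second finds the slot empty.
 * Returns 0 if that is what happened.
 */
int xchg_demo(void)
{
	int task = 42;		/* placeholder object, not a real task */
	struct map_value v;
	void *first, *second;

	atomic_init(&v.task, &task);

	first = slot_xchg(&v, NULL);
	second = slot_xchg(&v, NULL);

	assert(first == &task);
	assert(second == NULL);
	return (first == &task && second == NULL) ? 0 : -1;
}
```

The sketch runs the two takers sequentially, but since the exchange is a
single atomic operation, the same holds if they race on different CPUs:
exactly one of them can walk away with the non-NULL pointer.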

FWIW, I think KF_KPTR_GET kfuncs are the more complex / racy kfuncs to
reason about. The reason is that we pass a pointer to the map value
containing a kptr directly to the kfunc, which attempts to acquire an
additional reference if a kptr is already present in the map. That's
different from an xchg, which atomically gets us the unique pointer if
nobody else xchgs it out first. So with KF_KPTR_GET, someone else could
come along and delete the kptr from the map while the kfunc is trying
to acquire that additional reference. The race looks something like
this:

   cpu0                                  cpu1
   ----
  v = insert_lookup_task(task)
  kptr = bpf_task_kptr_get(&v->task);
                                        map_delete_elem()
                                          put_task()
                                            rcu_call
                                              put_task_struct
                                                ... free
  if (!kptr)
      /* In this race example, this path will be taken. */
      return 0;

The difference is that here, we're not doing an atomic xchg of the kptr
out of the map. Instead, we're passing a pointer to the map value
containing the kptr directly to bpf_task_kptr_get(), which itself tries
to acquire an additional reference on the task to return to the program
as a kptr. This is still safe, however, as bpf_task_kptr_get() uses RCU
and refcount_inc_not_zero() to ensure that it can't hit a UAF, and that
it won't return a dying task to the caller:

/**
 * bpf_task_kptr_get - Acquire a reference on a struct task_struct kptr. A task
 * kptr acquired by this kfunc which is not subsequently stored in a map, must
 * be released by calling bpf_task_release().
 * @pp: A pointer to a task kptr on which a reference is being acquired.
 */
__used noinline
struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
{
	struct task_struct *p;

	rcu_read_lock();
	p = READ_ONCE(*pp);

	/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
	 * cpu1 could remove the element from the map here, and invoke
	 * put_task_struct_rcu_user(). We're in an RCU read region
	 * though, so the task won't be freed until at the very
	 * earliest, the rcu_read_unlock() below.
	 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	 */

	if (p && !refcount_inc_not_zero(&p->rcu_users))
		/* <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
		 * refcount_inc_not_zero() will return false, as cpu1
		 * deleted the element from the map and dropped its last
		 * refcount. So we just return NULL as the task will be
		 * deleted once an RCU gp has elapsed.
		 * >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
		 */
		p = NULL;
	rcu_read_unlock();

	return p;
}
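
For anyone who wants to poke at the "increment unless zero" half of this
in userspace, here is a rough sketch using C11 atomics. It is a
simplified analogue and not the kernel's refcount_t code (no saturation
handling), and it does not model RCU at all, so it says nothing about
when the memory may actually be freed. struct obj, ref_inc_not_zero(),
ref_put() and inc_not_zero_demo() are hypothetical names for this
example only:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with an rcu_users-style count. */
struct obj {
	atomic_uint refs;
};

/* Take a reference unless the count already hit zero. */
bool ref_inc_not_zero(struct obj *o)
{
	unsigned int old = atomic_load(&o->refs);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
			return true;
	}
	return false;	/* object is dying, do not resurrect it */
}

/* Drop a reference; returns true if this was the last one. */
bool ref_put(struct obj *o)
{
	unsigned int old = atomic_fetch_sub(&o->refs, 1);

	assert(old > 0);	/* underflow would be a caller bug */
	return old == 1;
}

/* Walk through both outcomes of the race; returns 0 on success. */
int inc_not_zero_demo(void)
{
	struct obj o;

	atomic_init(&o.refs, 1);	/* the map holds the only ref */

	if (!ref_inc_not_zero(&o))	/* reader gets in first: 1 -> 2 */
		return -1;
	if (ref_put(&o))		/* reader's put: 2 -> 1 */
		return -1;
	if (!ref_put(&o))		/* map's put: 1 -> 0, last ref */
		return -1;
	if (ref_inc_not_zero(&o))	/* late reader must get nothing */
		return -1;
	return 0;
}
```

The last step in inc_not_zero_demo() corresponds to the race above: once
the final reference is gone, a late acquirer fails and has to treat the
object as absent rather than bumping the count back up from zero.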

Let me know if that makes sense. This stuff is tricky, and I plan to
document it clearly and thoroughly on the kptr docs page once this
patch set lands.