Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in __bpf_prog_put()

From: Teng Qi
Date: Sun Jun 11 2023 - 09:02:29 EST


Hello!
> BTW, please do create a test case, e.g, sockmap test case which
> can show the problem with existing code base.

I added a printk() to bpf_prog_put_deferred() to record the calling context:
static void bpf_prog_put_deferred(struct work_struct *work)
{
	// . . .
	/* Snapshot the execution context before the teardown work runs. */
	int inIrq = in_irq();
	int irqsDisabled = irqs_disabled();
	int preemptBits = preempt_count();
	int inAtomic = in_atomic();
	int rcuHeld = rcu_read_lock_held();
	printk("bpf_prog_put: in_irq() %d, irqs_disabled() %d, preempt_count() %d, "
	       "in_atomic() %d, rcu_read_lock_held() %d\n",
	       inIrq, irqsDisabled, preemptBits, inAtomic, rcuHeld);
	// . . .
}

When running the selftest, I see the following output:
[255340.388339] bpf_prog_put: in_irq() 0, irqs_disabled() 0, preempt_count() 256, in_atomic() 1, rcu_read_lock_held() 1
[255393.237632] bpf_prog_put: in_irq() 0, irqs_disabled() 0, preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
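
For reference, the preempt_count() values above can be split into their bit
fields. The following is only a minimal user-space sketch, assuming the generic
mainline field layout (8 preempt bits, 8 softirq bits, 4 hardirq bits, 4 NMI
bits). The struct and helper names are made up for illustration, and the layout
may differ by kernel version and configuration:

```c
#include <assert.h>

/* ASSUMPTION: generic mainline preempt_count() layout, not taken from this thread. */
#define PREEMPT_BITS	8
#define SOFTIRQ_BITS	8
#define HARDIRQ_BITS	4
#define NMI_BITS	4

#define PREEMPT_SHIFT	0
#define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT	(SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)

/* Extract a 'bits'-wide field that starts at bit 'shift'. */
#define PC_FIELD(pc, shift, bits) (((pc) >> (shift)) & ((1u << (bits)) - 1u))

/* Hypothetical container for the decoded fields. */
struct pc_fields {
	unsigned int preempt;	/* preempt_disable() nesting depth */
	unsigned int softirq;	/* softirq handling / bh-disable count */
	unsigned int hardirq;	/* hard interrupt nesting */
	unsigned int nmi;	/* NMI nesting */
};

/* Hypothetical helper: split a raw preempt_count() value into its fields. */
static struct pc_fields decode_preempt_count(unsigned int pc)
{
	struct pc_fields f;

	f.preempt = PC_FIELD(pc, PREEMPT_SHIFT, PREEMPT_BITS);
	f.softirq = PC_FIELD(pc, SOFTIRQ_SHIFT, SOFTIRQ_BITS);
	f.hardirq = PC_FIELD(pc, HARDIRQ_SHIFT, HARDIRQ_BITS);
	f.nmi = PC_FIELD(pc, NMI_SHIFT, NMI_BITS);
	return f;
}

/* Sanity check against the first value seen in the log above. */
static void check_logged_sample(void)
{
	struct pc_fields f = decode_preempt_count(256);

	assert(f.softirq == 1 && f.preempt == 0 && f.hardirq == 0 && f.nmi == 0);
}
```

Under this assumed layout, decode_preempt_count(256) gives a softirq field of 1
and zero in the other fields, which suggests that the first sample was taken
while a softirq was being handled rather than with preemption disabled by a
plain spin lock. decode_preempt_count(0) gives all zeros.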

Based on this output, bpf_prog_put_deferred() is reached both with a nonzero
preempt_count() and with rcu_read_lock_held() returning 1, so I believe it
should be possible to construct a selftest in which bpf_prog_put_deferred() is
called with preemption disabled or inside an RCU read lock region. However, I
am not sure what the selftest should look like. Should it be a checker that
verifies that the context of bpf_prog_put_deferred() is valid, or a test case
that actually triggers this bug?
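
On the checker idea: one way to think about it is as a predicate over the
values that the printk already records. Below is a minimal user-space model,
not kernel code. ctx_sample and the three helper functions are hypothetical
names, HARDIRQ_MASK assumes the generic mainline preempt_count() layout, and
RCU read lock regions are deliberately left out because, as noted earlier in
the thread, rcu_read_lock_held() is not reliable for this purpose:

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: hardirq bits sit at bits 16..19 (generic mainline layout). */
#define HARDIRQ_MASK	0x000f0000u

/* Hypothetical snapshot of the values recorded by the printk. */
struct ctx_sample {
	unsigned int preempt_count;
	bool irqs_disabled;
};

/* Models the existing gate in __bpf_prog_put(): in_irq() || irqs_disabled(). */
static bool gate_defers_to_workqueue(struct ctx_sample s)
{
	return (s.preempt_count & HARDIRQ_MASK) != 0 || s.irqs_disabled;
}

/*
 * Models in_atomic() || irqs_disabled(): contexts that must not sleep, as far
 * as preempt_count() and the IRQ flag can tell. RCU read lock regions are not
 * visible here.
 */
static bool must_not_sleep(struct ctx_sample s)
{
	return s.preempt_count != 0 || s.irqs_disabled;
}

/*
 * True when the gate would call bpf_prog_put_deferred() inline even though
 * the context must not sleep.
 */
static bool inline_call_in_atomic(struct ctx_sample s)
{
	return !gate_defers_to_workqueue(s) && must_not_sleep(s);
}
```

For the first logged sample (preempt_count() 256, irqs_disabled() 0) the model
reports an inline call from an atomic context. For the second sample
(preempt_count() 0) it does not, so any problem there could only come from the
RCU read lock, which this model cannot see.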

Could you give me some suggestions on how to construct such a selftest? I am
not familiar with writing BPF selftests.

-- Teng Qi

On Thu, May 25, 2023 at 3:34 AM Yonghong Song <yhs@xxxxxxxx> wrote:
>
>
>
> On 5/24/23 5:42 AM, Teng Qi wrote:
> > Thank you.
> >
> >> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >> value rcu_read_lock_held() could be 1 for some configurations regardless
> >> whether rcu_read_lock() is really held or not. In most cases,
> >> rcu_read_lock_held() is used in issuing potential warnings.
> >> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >
> > Sorry. I was not aware of the dependency of configurations of
> > rcu_read_lock_held().
> >
> >> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >> can be !in_interrupt(), so any process-context will go to a workqueue.
> >
> > I agree that using !in_interrupt() as a condition is an acceptable solution.
>
> This should work although it could be conservative.
>
> >
> >> Alternatively, we could have another solution. We could add another
> >> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >> will be done in rcu context.
> >
> > Implementing a new function like bpf_prog_put_rcu() is a solution that involves
> > more significant changes.
>
> Maybe we can change signature of bpf_prog_put instead? Like
> void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
> and inside bpf_prog_put we can add
> WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());
>
> >
> >> So if in_interrupt(), do kvfree, otherwise,
> >> put into a workqueue.
> >
> > Shall we proceed with submitting a patch following this approach?
>
> You could choose either of the above although I think with newer
> bpf_prog_put() is better.
>
> BTW, please do create a test case, e.g, sockmap test case which
> can show the problem with existing code base.
>
> >
> > I would like to mention something unrelated to the possible bug. At this
> > moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
> > but not safe under other atomic contexts.
> > > This disorder challenges the conventional belief that restrictions grow
> > > monotonically through the hierarchy of atomic contexts, i.e. that a
> > > programmer needs to be more and more careful when writing code under
> > > rcu read lock, spin lock, bh disable, interrupt...
> > This disorder can lead to unexpected consequences, such as code being safe
> > under interrupts but not safe under spin locks.
> > The disorder makes kernel programming more complex and may result in more bugs.
> > > Even if we find a way to resolve the possible bug in bpf_prog_put(),
> > > I am saddened by the damage to the kernel's maintainability and by the
> > > disorder in the hierarchy of atomic contexts.
> >
> > -- Teng Qi
> >
> > On Tue, May 23, 2023 at 12:33 PM Yonghong Song <yhs@xxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 5/21/23 6:39 AM, Teng Qi wrote:
> >>> Thank you.
> >>>
> >>> > Your above analysis makes sense if indeed that kvfree cannot appear
> >>> > inside a spin lock region or RCU read lock region. But is it true?
> >>> > I checked a few code paths in kvfree/kfree. It is either guarded
> >>> > with local_irq_save/restore or by
> >>> > spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> >>> > anything? Are you talking about RT kernel here?
> >>>
> >>> To see the sleepable possibility of kvfree, it is important to analyze the
> >>> following calling stack:
> >>> mm/util.c: 645 kvfree()
> >>> mm/vmalloc.c: 2763 vfree()
> >>>
> >>> kvfree() calls vfree() if the pointer addr points to memory allocated by
> >>> vmalloc(); otherwise it calls kfree().
> >>> void kvfree(const void *addr)
> >>> {
> >>> 	if (is_vmalloc_addr(addr))
> >>> 		vfree(addr);
> >>> 	else
> >>> 		kfree(addr);
> >>> }
> >>>
> >>> In vfree(), in_interrupt() and might_sleep() need to be considered.
> >>> void vfree(const void *addr)
> >>> {
> >>> 	// ...
> >>> 	if (unlikely(in_interrupt())) {
> >>> 		vfree_atomic(addr);
> >>> 		return;
> >>> 	}
> >>> 	// ...
> >>> 	might_sleep();
> >>> 	// ...
> >>> }
> >>
> >> Sorry. I didn't check vfree path. So it does look like that
> >> we need to pay special attention to non interrupt part.
> >>
> >>>
> >>> The vfree() may sleep if in_interrupt() == false. The RCU read lock region
> >>> could have in_interrupt() == false and spin lock region which only disables
> >>> preemption also has in_interrupt() == false. So the kvfree() cannot appear
> >>> inside a spin lock region or RCU read lock region if the pointer addr points
> >>> to memory allocated by vmalloc().
> >>>
> >>> > > Therefore, we propose modifying the condition to include in_atomic().
> >>> > > Could we update the condition as follows: "in_irq() ||
> >>> > > irqs_disabled() || in_atomic()"?
> >>> > Thank you! We look forward to your feedback.
> >>>
> >>> We now think that ‘irqs_disabled() || in_atomic() || rcu_read_lock_held()’
> >>> is more appropriate: irqs_disabled() covers the irq flag register,
> >>> in_atomic() covers the preempt count, and rcu_read_lock_held() covers the
> >>> RCU read lock region.
> >>
> >> We cannot use rcu_read_lock_held() in the 'if' statement. The return
> >> value rcu_read_lock_held() could be 1 for some configurations regardless
> >> whether rcu_read_lock() is really held or not. In most cases,
> >> rcu_read_lock_held() is used in issuing potential warnings.
> >> Maybe there are other ways to record whether rcu_read_lock() is held or not?
> >>
> >> I agree with you that 'irqs_disabled() || in_atomic()' makes sense
> >> since it covers process context local_irq_save() and spin_lock() cases.
> >>
> >> If we cannot resolve rcu_read_lock() presence issue, maybe the condition
> >> can be !in_interrupt(), so any process-context will go to a workqueue.
> >>
> >> Alternatively, we could have another solution. We could add another
> >> function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
> >> will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
> >> put into a workqueue.
> >>
> >>
> >>>
> >>> -- Teng Qi
> >>>
> >>> On Sun, May 21, 2023 at 11:45 AM Yonghong Song <yhs@xxxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> On 5/19/23 7:18 AM, Teng Qi wrote:
> >>> > Thank you for your response.
> >>> > > Looks like you only have suspicion here. Could you find a real violation
> >>> > > here where __bpf_prog_put() is called with !in_irq() &&
> >>> > > !irqs_disabled(), but inside spin_lock or rcu read lock? I have not seen
> >>> > > things like that.
> >>> >
> >>> > Because the conditions for calling bpf_prog_put() with a refcnt of 1 are
> >>> > complex, we have been unable to actually trigger this atomic violation
> >>> > with manually constructed test cases. But we found that it is possible to
> >>> > show cases with !in_irq() && !irqs_disabled(), but inside spin_lock or
> >>> > rcu read lock. For example, although it does not reproduce the violation,
> >>> > one of the bpf selftest cases, netns_cookie, calls bpf_sock_map_update()
> >>> > and may indirectly call bpf_prog_put() while only inside rcu read lock.
> >>> > The possible call stack is:
> >>> > net/core/sock_map.c: 615 bpf_sock_map_update()
> >>> > net/core/sock_map.c: 468 sock_map_update_common()
> >>> > net/core/sock_map.c: 217 sock_map_link()
> >>> > kernel/bpf/syscall.c: 2111 bpf_prog_put()
> >>> >
> >>> > The files for netns_cookie are
> >>> > tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> >>> > tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We inserted the
> >>> > following code in
> >>> > ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> >>> > static int sock_map_update_common(..)
> >>> > {
> >>> > 	int inIrq = in_irq();
> >>> > 	int irqsDisabled = irqs_disabled();
> >>> > 	int preemptBits = preempt_count();
> >>> > 	int inAtomic = in_atomic();
> >>> > 	int rcuHeld = rcu_read_lock_held();
> >>> > 	printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d, "
> >>> > 	       "in_atomic() %d, rcu_read_lock_held() %d", inIrq, irqsDisabled,
> >>> > 	       preemptBits, inAtomic, rcuHeld);
> >>> > }
> >>> >
> >>> > The output message is as follows:
> >>> > root@(none):/root/bpf# ./test_progs -t netns_cookie
> >>> > [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0, in_atomic() 0, rcu_read_lock_held() 1
> >>> > #113 netns_cookie:OK
> >>> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> >>> >
> >>> > We notice that there are numerous callers in kernel/, net/ and drivers/,
> >>> > so we highly suggest modifying __bpf_prog_put() to address this gap. The
> >>> > gap exists because __bpf_prog_put() is only safe under in_irq() ||
> >>> > irqs_disabled() but not in_atomic() || rcu_read_lock_held(). The following
> >>> > code snippet may mislead developers into thinking that bpf_prog_put() is
> >>> > safe in all contexts.
> >>> > if (in_irq() || irqs_disabled()) {
> >>> > 	INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>> > 	schedule_work(&aux->work);
> >>> > } else {
> >>> > 	bpf_prog_put_deferred(&aux->work);
> >>> > }
> >>> >
> >>> > Implicit dependency may lead to issues.
> >>> >
> >>> > > Any problem here?
> >>> > We mentioned it to demonstrate the possibility of kvfree() being
> >>> > called by __bpf_prog_put_noref().
> >>> >
> >>> > Thanks.
> >>> > -- Teng Qi
> >>> >
> >>> > On Wed, May 17, 2023 at 1:08 AM Yonghong Song <yhs@xxxxxxxx> wrote:
> >>> >
> >>> >
> >>> >
> >>> > On 5/16/23 4:18 AM, starmiku1207184332@xxxxxxxxx wrote:
> >>> > > From: Teng Qi <starmiku1207184332@xxxxxxxxx>
> >>> > >
> >>> > > Hi, bpf developers,
> >>> > >
> >>> > > We are developing a static tool to check the matching between helpers
> >>> > > and the context of hooks. During our analysis, we have discovered some
> >>> > > important findings that we would like to report.
> >>> > >
> >>> > > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that function
> >>> > > bpf_prog_put_deferred() won't be called directly in the condition of
> >>> > > ‘in_irq() || irqs_disabled()’.
> >>> > > if (in_irq() || irqs_disabled()) {
> >>> > > 	INIT_WORK(&aux->work, bpf_prog_put_deferred);
> >>> > > 	schedule_work(&aux->work);
> >>> > > } else {
> >>> > > 	bpf_prog_put_deferred(&aux->work);
> >>> > > }
> >>> > >
> >>> > > We suspect this condition exists because there might be sleepable
> >>> > > operations in the callees of the bpf_prog_put_deferred() function:
> >>> > > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> >>> > > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> >>> > > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> >>> > > kvfree(prog->aux->jited_linfo);
> >>> > > kvfree(prog->aux->linfo);
> >>> >
> >>> > Looks like you only have suspicion here. Could you find a real violation
> >>> > here where __bpf_prog_put() is called with !in_irq() &&
> >>> > !irqs_disabled(), but inside spin_lock or rcu read lock? I have not seen
> >>> > things like that.
> >>> >
> >>> > >
> >>> > > Additionally, we found that array prog->aux->jited_linfo is initialized
> >>> > > in ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> >>> > > 	prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> >>> > > 			sizeof(*prog->aux->jited_linfo),
> >>> > > 			bpf_memcg_flags(GFP_KERNEL | __GFP_NOWARN));
> >>> >
> >>> > Any problem here?
> >>> >
> >>> > >
> >>> > > Our question is whether the condition '!(in_irq() || irqs_disabled())'
> >>> > > is sufficient for calling 'kvfree'. We are aware that calling 'kvfree'
> >>> > > within the context of a spin lock or an RCU lock is unsafe.
> >>>
> >>> Your above analysis makes sense if indeed that kvfree cannot appear
> >>> inside a spin lock region or RCU read lock region. But is it true?
> >>> I checked a few code paths in kvfree/kfree. It is either guarded
> >>> with local_irq_save/restore or by
> >>> spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
> >>> anything? Are you talking about RT kernel here?
> >>>
> >>>
> >>> > >
> >>> > > Therefore, we propose modifying the condition to include in_atomic().
> >>> > > Could we update the condition as follows: "in_irq() ||
> >>> > > irqs_disabled() || in_atomic()"?
> >>> > >
> >>> > > Thank you! We look forward to your feedback.
> >>> > >
> >>> > > Signed-off-by: Teng Qi <starmiku1207184332@xxxxxxxxx>
> >>> >
> >>>