Re: [Patch 00/12] Hardware Breakpoint Interfaces

From: K.Prasad
Date: Mon May 11 2009 - 13:21:00 EST


On Mon, May 11, 2009 at 10:50:39AM -0400, Alan Stern wrote:
> On Mon, 11 May 2009, K.Prasad wrote:
>
> > > You shouldn't call on_each_cpu() while holding a spinlock. The same
> > > thing happens in unregister_kernel_hw_breakpoint().
> > >
> >
> > First, on_each_cpu() will now be changed to return only after all
> > functions invoked through IPIs have returned (by changing @wait
> > parameter to 1). This is required to prevent side effects of
> > incrementing hbp_kernel_pos after on_each_cpu() in
> > unregister_kernel_hw_breakpoint() [hbp_kernel_pos is still incremented
> > after IPI and I will explain it below].
>
> Good. I thought this was already done; it's lucky you noticed it
> wasn't.
>
> > on_each_cpu() isn't a blocking call (despite @wait being set to 1, which
> > does a busy wait through cpu_relax()) and should be safe to invoke
> > inside a spin_lock() context. I would like to know if you think
> > otherwise.
>
> I guess you're right. Why did you change the spin_lock to
> spin_lock_bh?
>

I realised that flush_thread_hw_breakpoint()-->flush_thread() is invoked
through a softIRQ. Considering that it can cause potential deadlock due
to circular lock dependancy, spin_lock() calls were changed to
spin_lock_bh(). Here's a traceback and message from lockdep that helped
me:

=================================
[ INFO: inconsistent lock state ]
2.6.30-rc4-tip.hbkpt #31
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(hw_breakpoint_lock){+.?...}, at: [<c0498f73>]
flush_thread_hw_breakpoint+0x16/0x75
{SOFTIRQ-ON-W} state was registered at:
[<c0459309>] __lock_acquire+0x2e8/0xb55
[<c0459c11>] lock_acquire+0x9b/0xbe
[<c0701dbd>] _spin_lock+0x23/0x53
[<c0499190>] load_debug_registers+0x1b/0x77
[<c06fcf20>] start_secondary+0x1b9/0x1c6
[<ffffffff>] 0xffffffff
irq event stamp: 1447500
hardirqs last enabled at (1447500): [<c04be9ff>]
kmem_cache_free+0xb2/0xe7
hardirqs last disabled at (1447499): [<c04be996>]
kmem_cache_free+0x49/0xe7
softirqs last enabled at (1447468): [<c043cdda>]
__do_softirq+0x14c/0x154
softirqs last disabled at (1447473): [<c04051b8>] do_softirq+0x6d/0xcd

other info that might help us debug this:
no locks held by swapper/0.

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.30-rc4-tip.hbkpt #31
Call Trace:
[<c06ff8d2>] ? printk+0x14/0x1a
[<c04582e7>] valid_state+0x12a/0x13d
[<c04583e0>] mark_lock+0xe6/0x1e9
[<c0458ae8>] ? check_usage_forwards+0x0/0x3f
[<c0459287>] __lock_acquire+0x266/0xb55
[<c04aa403>] ? page_address+0xe/0xb1
[<c0459c11>] lock_acquire+0x9b/0xbe
[<c0498f73>] ? flush_thread_hw_breakpoint+0x16/0x75
[<c0701dbd>] _spin_lock+0x23/0x53
[<c0498f73>] ? flush_thread_hw_breakpoint+0x16/0x75
[<c0498f73>] flush_thread_hw_breakpoint+0x16/0x75
[<c0409751>] free_thread_xstate+0x40/0x62
[<c0409785>] free_thread_info+0x12/0x1e
[<c043611b>] free_task+0x1e/0x34
[<c0437522>] __put_task_struct+0xa7/0xac
[<c0439084>] delayed_put_task_struct+0x75/0x7c
[<c047bd9c>] __rcu_process_callbacks+0x1a5/0x229
[<c047be44>] rcu_process_callbacks+0x24/0x42
[<c043cd2b>] __do_softirq+0x9d/0x154
[<c043cc8e>] ? __do_softirq+0x0/0x154
<IRQ> [<c043c816>] ? irq_exit+0x3a/0x68
[<c07055d4>] ? smp_apic_timer_interrupt+0x79/0x87
[<c0403d47>] ? apic_timer_interrupt+0x2f/0x34
[<c041eeb4>] ? native_safe_halt+0xa/0xc
[<c0409296>] ? default_idle+0x4f/0x81
[<c048c417>] ? stop_critical_timings+0x25/0x2a
[<c040263b>] ? cpu_idle+0x58/0x79
[<c06f0004>] ? rest_init+0x58/0x5a
[<c08cc805>] ? start_kernel+0x2fc/0x301
[<c08cc06a>] ? __init_begin+0x6a/0x6f

>
> > > What happens if a kernel breakpoint is triggered on another CPU while
> > > this loop is running? Or what happens if the breakpoint being removed
> > > is triggered on another CPU before on_each_cpu() is called below?
> > >
> > > Basically, it's impossible to change the kernel breakpoints
> > > simultaneously on all CPUs. That means you somehow have to keep both
> > > the old set and the new set around until all the CPUs are updated.
> > >
> >
> > I must admit that the code did not handle the above scenario. I am
> > adding a per-cpu instance of 'hbp_kernel[]' called 'this_hbp_kernel[]'.
> > The breakpoint handler would use the per-cpu instance which will remain
> > valid throughout the execution of the handler. The per-cpu instance will
> > be updated with hbp_kernel[] values in arch_update_kernel_hw_breakpoint().
> > [This necessitates hbp_kernel_pos increment to happen after the IPI call
> > in unregister_kernel code].
>
> Yes, okay. I'm a little nervous about making this_hbp_kernel[] be
> per-cpu while hbp_kernel_pos isn't. It means the two values will
> sometimes be out of sync with each other. But it ought to be safe
> since during the out-of-sync periods, hbp_kernel_pos will always point
> to an empty breakpoint slot.
>
> Still, in hw_breakpoint_handler(), perhaps the "if (i >=
> hbp_kernel_pos)" path should also check for !bp. Just to be safe. In
> other words, move the
>
> + if (!bp)
> + continue;
> + rc = NOTIFY_DONE;
>
> lines outside the "if" statement.
>
>

I don't see where the out-of-sync situation between hbp_kernel_pos and
this_hbp_kernel[] can cause problem. Our concern is when an exception
arises in the small time-window starting at the time when hbp_kernel[]
is updated in common code and ends when exceptions are disabled through
the IPI routine arch_update_kernel_hw_breakpoint. Consider these two
cases now:

i)register_kernel_hw_breakpoint() - Although hbp_kernel_pos is
decremented before the IPI, no new exceptions will arise because of the
same on a CPU where this_hbp_kernel[] is not synced because the physical
debug registers are not yet set. So, the 'i' in "if (i >= hbp_kernel_pos)"
cannot belong to the newly registered breakpoint. If it is for the new
breakpoint, then it is on a CPU where this_hbp_kernel[] is synced with
hbp_kernel[].

ii)unregister_kernel_hw_breakpoint() - hbp_kernel_pos is incremented
after all the IPIs have finished, so hbp_kernel_pos still points to the
old value whose hbp_kernel[] is NULL. However in "if (i >= hbp_kernel_pos)"
if 'i' points to the 'pending-delete' breakpoint, then 'bp' is derived
from this_hbp_kernel[] which contains the old value. The callback
bp->triggered() is invoked and the exception returns fine.

Let me know if I am missing any possibility leading to incorrect
behaviour.

> > > Another problem: kdr7 is a global variable, and here you've got every
> > > CPU recomputing it whenever a kernel breakpoint is added or removed.
> > > It should be computed just once, before the on_each_cpu() call.
> > >
> >
> > If kdr7 needs to be updated only once, it has to be kept outside the IPI
> > through the use of a wrapper routine (in arch/x86/kernel/hw_breakpoint.c
> > as it is arch-specific). This would mean one more function call in
> > (un)register_kernel_<> routines taking the code back to one of its previous
> > designs. In a trade-off between code-brevity and efficiency, the present one
> > chose the former keeping in mind some of the comments received during the
> > early stages of this patch.
>
> You don't appreciate the seriousness of this problem. Here's the new code:
>
> + /* Clear all kernel-space breakpoints in kdr7 */
> + kdr7 = 0;
> (A)
> + for (i = hbp_kernel_pos; i < HB_NUM; i++) {
> + per_cpu(this_hbp_kernel[i], cpu) = bp = hbp_kernel[i];
> + if (bp) {
> + kdr7 |= encode_dr7(i, bp->info.len, bp->info.type);
> + set_debugreg(hbp_kernel[i]->info.address, i);
> + }
> + }
> +
> + /* No need to set DR6. Update the debug registers with kernel-space
> + * breakpoint values from kdr7 and user-space requests from the
> + * current process
> + */
> (B)
> + set_debugreg(kdr7 | current->thread.debugreg7, 7);
>
> Suppose you send out the IPIs, and one of the CPUs happens to reach (A)
> at the same time another CPU reaches (B). The second CPU will load an
> incorrect value into DR7.
>
> If you prefer you can replace kdr7 above with a local variable, and
> set kdr7 to the local variable's value at the end of the function.
> Also, in the first set_debugreg() line above, you can replace
> hbp_kernel[i] with bp.
>
>

I failed to realise the potential inconsistency in kdr7 value. The code
will be modified like this:

void arch_update_kernel_hw_breakpoints(void *unused)
{
struct hw_breakpoint *bp;
int i, cpu = get_cpu();
unsigned long temp_kdr7 = 0;

/* Don't allow debug exceptions while we update the registers */
set_debugreg(0UL, 7);

for (i = hbp_kernel_pos; i < HB_NUM; i++) {
per_cpu(this_hbp_kernel[i], cpu) = bp = hbp_kernel[i];
if (bp) {
temp_kdr7 |= encode_dr7(i, bp->info.len, bp->info.type);
set_debugreg(bp->info.address, i);
}
}

/* No need to set DR6. Update the debug registers with kernel-space
* breakpoint values from temp_kdr7 and user-space requests from
* the current process
*/
kdr7 = temp_kdr7;
set_debugreg(kdr7 | current->thread.debugreg7, 7);
put_cpu_no_resched();
}


> > > > + void *temp_mem = kzalloc(sizeof(struct hw_breakpoint), GFP_KERNEL);
> > > > + if (!temp_mem)
> > > > + temp_mem_used = -ENOMEM;
> > >
> > > I don't think this is a good idea...
> > >
> >
> > I agree that it turned out to be wrong. ptrace is now modified to use
> > the (un)register_user_hw_breakpoint() interfaces directly and not the
> > worker routines, thereby avoiding all this complexity. Please find the
> > changes in the new patch.
>
> > > And now if something went wrong, you have already freed the memory
> > > holding the original breakpoint structures. It would be better to
> > > keep them around until you know they aren't going to be needed.
>
> I still think it would be a good idea to avoid freeing any breakpoint
> structures until you know for certain they aren't needed. Otherwise
> you might find that you're unable to allocate memory to re-create a
> structure that was already present.
>
> Alan Stern
>

I agree that we shouldn't free memory temporarily that would be needed
later, as memory may not be available. However such a situation does not
arise in ptrace_write_dr7() after the re-write using
(un)register_user_hw_breakpoint(). kfree(bp) is done when
i) an existing breakpoint is disabled
ii) a breakpoint is requested but register_user_hw_breakpoint() routine
fails for some reasons. Hence the breakpoint structure kzalloc()'ed
afresh is kfree()'ed.
iii) In case of modifications to the type of breakpoint (such as
read<-->write), the breakpoint structure is not de-allocated and this is
where __modify_user_hw_breakpoint() helps us. There is no opportunity
window where the breakpoint can be grabbed by other requests when a
modification is done.

Let me know if you think there's still some discrepancy here.

Thanks for your detailed review and pointing me to the errors. I am
sending the patchset again with the changes discussed above.

Thanks,
K.Prasad


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/