Re: [RFC PATCH] sched_clock: Avoid tearing during read from NMI

From: Daniel Thompson
Date: Wed Jan 21 2015 - 15:20:33 EST


On 21/01/15 17:29, John Stultz wrote:
> On Wed, Jan 21, 2015 at 8:53 AM, Daniel Thompson
> <daniel.thompson@xxxxxxxxxx> wrote:
>> Currently it is possible for an NMI (or FIQ on ARM) to come in and
>> read sched_clock() whilst update_sched_clock() has half updated the
>> state. This results in a bad time value being observed.
>>
>> This patch fixes that problem in a similar manner to Thomas Gleixner's
>> 4396e058c52e ("timekeeping: Provide fast and NMI safe access to
>> CLOCK_MONOTONIC").
>>
>> Note that ripping out the seqcount lock from sched_clock_register() and
>> replacing it with a large comment is not nearly as bad as it looks! The
>> locking here is actually pretty useless since most of the variables
>> modified within the write lock are not covered by the read lock. As a
>> result a big comment and the sequence bump implicit in the call
>> to update_epoch() should work pretty much the same.
>
> It still looks pretty bad, even with the current explanation.

I'm inclined to agree. Although, to be clear, the code I proposed should
be no more broken than the code we have today (and is arguably more honest).

>> - raw_write_seqcount_begin(&cd.seq);
>> + /*
>> + * sched_clock will report a bad value if it executes
>> + * concurrently with the following code. No locking exists to
>> + * prevent this; we rely mostly on this function being called
>> + * early during kernel boot up before we have lots of other
>> + * stuff going on.
>> + */
>> read_sched_clock = read;
>> sched_clock_mask = new_mask;
>> cd.rate = rate;
>> cd.wrap_kt = new_wrap_kt;
>> cd.mult = new_mult;
>> cd.shift = new_shift;
>> - cd.epoch_cyc = new_epoch;
>> - cd.epoch_ns = ns;
>> - raw_write_seqcount_end(&cd.seq);
>> + update_epoch(new_epoch, ns);
>
>
> So looking at this, the sched_clock_register() function may not be
> called super early, so I was looking to see what prevented bad reads
> prior to registration.

Certainly not super early, but, from the WARN_ON() at the top of the
function I thought it was intended to be called before start_kernel()
unmasks interrupts...

> And from quick inspection, it's nothing. I
> suspect the undocumented trick that makes this work is that the mult
> value is initialized to zero, so sched_clock returns 0 until things
> have been registered.
>
> So it does seem like it would be worthwhile to do the initialization
> under the lock, or possibly use the suspend flag to make the first
> initialization safe.

As mentioned, the existing write lock doesn't really do very much at the
moment, because most of the variables it modifies are not covered by the
read lock.
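
On the mult == 0 point above, here is a rough userspace model of why an
unregistered clock reads as zero. This is only a sketch: clock_data_model
and model_sched_clock() are hypothetical names made up for illustration,
the raw counter and mask are passed in as plain arguments, and the
conversion is assumed to be the usual (cyc * mult) >> shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's clock_data. */
struct clock_data_model {
	uint64_t epoch_ns;
	uint64_t epoch_cyc;
	uint32_t mult;
	uint32_t shift;
};

/* Assumed conversion: scale cycles by mult, then shift down. */
static uint64_t cyc_to_ns(uint64_t cyc, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (cyc * mult) >> shift;
}

/* Model of the read path; raw_cyc stands in for the counter read. */
static uint64_t model_sched_clock(const struct clock_data_model *cd,
				  uint64_t raw_cyc, uint64_t mask)
{
	uint64_t cyc = (raw_cyc - cd->epoch_cyc) & mask;

	/* With mult == 0 and epoch_ns == 0 this is always 0. */
	return cd->epoch_ns + cyc_to_ns(cyc, cd->mult, cd->shift);
}
```

With a zero-initialized structure the result is 0 whatever the counter
reads, which matches the behaviour described above. It says nothing about
what happens when registration races with a reader.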

The simple and (I think) strictly correct approach is to duplicate the
whole of the clock_data (minus the seqcount) and make the read lock in
sched_clock cover all accesses to the structure.

This would substantially enlarge the critical section in sched_clock(),
meaning we might loop round the seqcount fractionally more often.
However, if that causes any real problems it would be a sign the epoch
was being updated too frequently.
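
To make the idea concrete, here is a rough userspace sketch of the
two-copy scheme, written with C11 atomics rather than the kernel's
seqcount helpers. It is not the patch itself: the names clock_read_data,
update_clock() and sched_clock_model() are hypothetical, the counter
value is passed in as an argument, the mask is left out, and a single
updater is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Everything the reader needs, duplicated, with no seqcount inside. */
struct clock_read_data {
	_Atomic uint64_t epoch_ns;
	_Atomic uint64_t epoch_cyc;
	_Atomic uint32_t mult;
	_Atomic uint32_t shift;
};

struct clock_data {
	atomic_uint seq;
	struct clock_read_data rd[2];
};

static void store_copy(struct clock_read_data *rd, uint64_t epoch_ns,
		       uint64_t epoch_cyc, uint32_t mult, uint32_t shift)
{
	atomic_store_explicit(&rd->epoch_ns, epoch_ns, memory_order_relaxed);
	atomic_store_explicit(&rd->epoch_cyc, epoch_cyc, memory_order_relaxed);
	atomic_store_explicit(&rd->mult, mult, memory_order_relaxed);
	atomic_store_explicit(&rd->shift, shift, memory_order_relaxed);
}

/* Order the bump against the data stores on either side of it. */
static void bump_seq(struct clock_data *cd)
{
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add_explicit(&cd->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Writer (single updater assumed): never touches the copy in use. */
static void update_clock(struct clock_data *cd, uint64_t epoch_ns,
			 uint64_t epoch_cyc, uint32_t mult, uint32_t shift)
{
	bump_seq(cd);	/* seq is odd: readers use rd[1] */
	store_copy(&cd->rd[0], epoch_ns, epoch_cyc, mult, shift);
	bump_seq(cd);	/* seq is even: readers use rd[0] */
	store_copy(&cd->rd[1], epoch_ns, epoch_cyc, mult, shift);
}

/* Reader: every field is read inside the retry loop. */
static uint64_t sched_clock_model(struct clock_data *cd, uint64_t raw_cyc)
{
	uint64_t epoch_ns, epoch_cyc;
	uint32_t mult, shift;
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&cd->seq, memory_order_acquire);
		struct clock_read_data *rd = &cd->rd[seq & 1];

		epoch_ns = atomic_load_explicit(&rd->epoch_ns, memory_order_relaxed);
		epoch_cyc = atomic_load_explicit(&rd->epoch_cyc, memory_order_relaxed);
		mult = atomic_load_explicit(&rd->mult, memory_order_relaxed);
		shift = atomic_load_explicit(&rd->shift, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load_explicit(&cd->seq, memory_order_relaxed) != seq);

	assert(shift < 64);
	return epoch_ns + (((raw_cyc - epoch_cyc) * mult) >> shift);
}
```

The reader never waits for the sequence to become even. It picks a copy
from the low bit and retries only if the sequence moved, so a reader that
interrupts the updater on the same CPU still makes progress. That is the
property we want for NMI/FIQ.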

Unless I get any objections (or you really want me to look closely at
using suspend) then I'll try this approach in the next day or two.


Daniel.

