Re: [RFC][patch 1/5] move clock source related code toclocksource.c

From: Martin Schwidefsky
Date: Mon Jul 27 2009 - 07:56:25 EST


On Fri, 24 Jul 2009 17:08:47 -0700
john stultz <johnstul@xxxxxxxxxx> wrote:

> On Thu, 2009-07-23 at 12:52 +0200, Martin Schwidefsky wrote:
> > On Wed, 22 Jul 2009 17:28:20 -0700
> > john stultz <johnstul@xxxxxxxxxx> wrote:
> > > 2) Structure names aren't too great right now. Not sure timeclock is
> > > what I want to use, probably system_time or something. Will find/replace
> > > before the next revision is sent out.
> >
> > I've picked the name struct timekeeper.
>
> Nice. I like this name.

Agreed then.

> > > 5) The TSC clocksource uses cycles_last to avoid very slight skew issues
> > > (that otherwise would not be noticed). Not sure how to fix that if we're
> > > pulling cycles_last (which is purely timekeeping state) out of the
> > > clocksource. Will have to think of something.
> >
> > That is an ugly one. A similar thing exists in the s390 backend where I
> > want to reset the timekeeping to precise values after the clocksource
> > switch from jiffies. The proper solution probably is to allow
> > architectures to override the default clocksource. The jiffies
> > clocksource doesn't make any sense on s390.
>
> Yea. Still not sure what a good fix is here. We need some way for the
> TSC to check if its slightly behind another cpu. Alternatively we could
> create a flag (SLIGHTLY_UNSYNCED or something) where we then do the
> quick comparison in the timekeeping code instead of the clocksource read
> implementation? It would really only be useful for the TSC and would
> clutter the code up somewhat, so I'm not sure I really like it, but
> maybe that would work.

For my next version of the patch series I will leave the cycle_last in
the struct clocksource. Small, manageable steps, if we find a way to
move it then we can do that later on as well.

> > > 3) Once all arches are converted to using read_persistent_clock(), then
> > > the arch specific time initialization can be dropped. Removing the
> > > majority of direct xtime structure accesses.
> >
> > Only if the read_persistent_clock allows for a better resolution than
> > seconds. With my cputime accounting hat on: seconds are not good
> > enough..
>
> cputime accounting? read_persistent_clock is only used for time
> initialization. But yea, it could be extended to return a timespec.

Exactly!

> > > 4) Then once the remaining direct wall_to_monotonic and xtime accessors
> > > are moved to timekeeping.c we can make those both static and embed them
> > > into the core timekeeping structure.
> >
> > Both should not be accessed at a rate that makes it necessary to read
> > from the values directly. An accessor should be fine I think.
>
> Yea its just a matter of cleaning things up.

Nod.

> > > But let me know if this patch doesn't achieve most of the cleanup you
> > > wanted to see.
> >
> > We are getting there. I wonder if it is really necessary to pull
> > xtime_cache, raw_time, total_sleep_time and timekeeping_suspended into
> > the struct timeclock. I would prefer the semantics that the struct
> > timekeeper / timeclock contains the internal values of the timekeeping
> > code for the currently selected clock source. xtime is not clock
> > specific.
>
> Its not necessary. Although I do like the idea of pulling all the values
> protected by the xtime_lock into one structure so its clear what we're
> protecting. As it is now, jiffies, ntp state and a whole host of other
> data is covered by it, so it would be a ways off until all that logic
> could be dethreaded.

Yea, I guess that makes sense. Gives us something to do in the near
future. For now I do what is not too complicated to implement.

> The other reason for putting those values into the timekeeping struct is
> that they are actually fairly tightly connected with the clocksource
> state. There is some redundancy: xtime_nsec stores the highres remainder
> of xtime.tv_nsec, xtime_cache is also mostly duplicate, so hopefully we
> can combine those once xtime is made static.
>
> > For reference here is the current stack of patches I have on my disk.
> > The stop_machine conversion to install a new clocksource is currently missing.
> >
> > PRELIMINARY PATCHES, USE AT YOUR OWN RISK.
> > -------------------------------------------------------------------
> >
> > Subject: [PATCH] introduce timekeeping_leap_insert
> >
> > From: john stultz <johnstul@xxxxxxxxxx>
> >
> > ---
> > include/linux/time.h | 1 +
> > kernel/time/ntp.c | 7 ++-----
> > kernel/time/timekeeping.c | 7 +++++++
> > 3 files changed, 10 insertions(+), 5 deletions(-)
> >
> [snip]
>
> I think this one is pretty easy and should be able to be submitted on
> the sooner side.

A good question is who will upstream all this, will you take care of
it ?

> > -------------------------------------------------------------------
> >
> > Subject: [PATCH] remove clocksource inline functions
> >
> > From: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
> >
> > Remove clocksource_read, clocksource_enable and clocksource_disable
> > inline functions. No functional change.
> >
> > Cc: Ingo Molnar <mingo@xxxxxxx>
> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > Cc: john stultz <johnstul@xxxxxxxxxx>
> > Cc: Daniel Walker <dwalker@xxxxxxxxxx>
> > Signed-off-by: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
> > ---
> > include/linux/clocksource.h | 46 --------------------------------------------
> > kernel/time/timekeeping.c | 32 +++++++++++++++++-------------
> > 2 files changed, 18 insertions(+), 60 deletions(-)
>
> [snip]
>
> Again, the outstanding mult_orig fix (I need to pester Thomas or Andrew
> to push that before 2.6.31 comes out) will collide with this, but
> overall it looks ok.

The conflict with mult_orig is a small issue, I'll work around it. In
the end mult_orig will be gone anyway.

> > -------------------------------------------------------------------
> >
> > Subject: [PATCH] cleanup clocksource selection
> >
> > From: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
> >
> > Introduce clocksource_dequeue & clocksource_update and move spinlock
> > calls. clocksource_update does nothing for GENERIC_TIME=n since
> > change_clocksource does nothing as well.
> >
> > Cc: Ingo Molnar <mingo@xxxxxxx>
> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > Cc: john stultz <johnstul@xxxxxxxxxx>
> > Cc: Daniel Walker <dwalker@xxxxxxxxxx>
> > Signed-off-by: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
>
> [snip]
>
> Haven't had enough time to closely review or test this, but it looks
> like an agreeable change.

Don't know yet, the change to clocksource for the stop_machine rework
will be bigger than I though. The clocksource watchdog code changes the
rating of the TSC clock from interrupt context. I need to get out of
the interrupt context if I want to use stop_machine for the clocksource
update. Not nice..

> > +/* Structure holding internal timekeeping values. */
> > +struct timekeeper {
> > + struct clocksource *clock;
> > + cycle_t cycle_interval;
> > + u64 xtime_interval;
> > + u32 raw_interval;
> > + u64 xtime_nsec;
> > + s64 ntp_error;
> > + int xtime_shift;
> > + int ntp_shift;
> > +
> > + /*
> > + * The following is written at each timer interrupt
> > + * Keep it in a different cache line to dirty no
> > + * more than one cache line.
> > + */
> > + cycle_t cycle_last ____cacheline_aligned_in_smp;
> > +};
>
>
> The cacheline aligned bit probably needs to cover almost everything here
> (not clock, not the shift values), as it will all be updated each tick.

cycle_last is back in the clocksource structure for now.

> [snip]
> > @@ -564,8 +628,8 @@ static __always_inline int clocksource_b
> > * Now calculate the error in (1 << look_ahead) ticks, but first
> > * remove the single look ahead already included in the error.
> > */
> > - tick_error = tick_length >> (NTP_SCALE_SHIFT - clock->shift + 1);
> > - tick_error -= clock->xtime_interval >> 1;
> > + tick_error = tick_length >> (timekeeper.ntp_shift + 1);
> > + tick_error -= timekeeper.xtime_interval >> 1;
> > error = ((error - tick_error) >> look_ahead) + tick_error;
> >
> > /* Finally calculate the adjustment shift value. */
> > @@ -590,18 +654,18 @@ static __always_inline int clocksource_b
> > * this is optimized for the most common adjustments of -1,0,1,
> > * for other values we can do a bit more work.
> > */
> > -static void clocksource_adjust(s64 offset)
> > +static void timekeeping_adjust(s64 offset)
> > {
> > - s64 error, interval = clock->cycle_interval;
> > + s64 error, interval = timekeeper.cycle_interval;
> > int adj;
> >
> > - error = clock->error >> (NTP_SCALE_SHIFT - clock->shift - 1);
> > + error = timekeeper.ntp_error >> (timekeeper.ntp_shift - 1);
> > if (error > interval) {
> > error >>= 2;
> > if (likely(error <= interval))
> > adj = 1;
> > else
> > - adj = clocksource_bigadjust(error, &interval, &offset);
> > + adj = timekeeping_bigadjust(error, &interval, &offset);
> > } else if (error < -interval) {
> > error >>= 2;
> > if (likely(error >= -interval)) {
> > @@ -609,15 +673,14 @@ static void clocksource_adjust(s64 offse
> > interval = -interval;
> > offset = -offset;
> > } else
> > - adj = clocksource_bigadjust(error, &interval, &offset);
> > + adj = timekeeping_bigadjust(error, &interval, &offset);
> > } else
> > return;
> >
> > - clock->mult += adj;
> > - clock->xtime_interval += interval;
> > - clock->xtime_nsec -= offset;
> > - clock->error -= (interval - offset) <<
> > - (NTP_SCALE_SHIFT - clock->shift);
> > + timekeeper.clock->mult += adj;
> > + timekeeper.xtime_interval += interval;
> > + timekeeper.xtime_nsec -= offset;
> > + timekeeper.ntp_error -= (interval - offset) << timekeeper.ntp_shift;
> > }
> >
> > /**
> > @@ -627,53 +690,56 @@ static void clocksource_adjust(s64 offse
> > */
> > void update_wall_time(void)
> > {
> > + struct clocksource *clock;
> > cycle_t offset;
> >
> > /* Make sure we're fully resumed: */
> > if (unlikely(timekeeping_suspended))
> > return;
> >
> > + clock = timekeeper.clock;
> > #ifdef CONFIG_GENERIC_TIME
> > - offset = (clock->read(clock) - clock->cycle_last) & clock->mask;
> > + offset = (clock->read(clock) - timekeeper.cycle_last) & clock->mask;
> > #else
> > - offset = clock->cycle_interval;
> > + offset = timekeeper.cycle_interval;
> > #endif
> > - clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift;
> > + timekeeper.xtime_nsec = (s64)xtime.tv_nsec << timekeeper.xtime_shift;
> >
> > /* normally this loop will run just once, however in the
> > * case of lost or late ticks, it will accumulate correctly.
> > */
> > - while (offset >= clock->cycle_interval) {
> > + while (offset >= timekeeper.cycle_interval) {
> > /* accumulate one interval */
> > - offset -= clock->cycle_interval;
> > - clock->cycle_last += clock->cycle_interval;
> > + offset -= timekeeper.cycle_interval;
> > + timekeeper.cycle_last += timekeeper.cycle_interval;
> >
> > - clock->xtime_nsec += clock->xtime_interval;
> > - if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
> > - clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
> > + timekeeper.xtime_nsec += timekeeper.xtime_interval;
> > + if (timekeeper.xtime_nsec >= (u64)NSEC_PER_SEC << timekeeper.xtime_shift) {
> > + timekeeper.xtime_nsec -= (u64)NSEC_PER_SEC << timekeeper.xtime_shift;
> > xtime.tv_sec++;
> > second_overflow();
> > }
> >
> > - clock->raw_time.tv_nsec += clock->raw_interval;
> > - if (clock->raw_time.tv_nsec >= NSEC_PER_SEC) {
> > - clock->raw_time.tv_nsec -= NSEC_PER_SEC;
> > - clock->raw_time.tv_sec++;
> > + raw_time.tv_nsec += timekeeper.raw_interval;
> > + if (raw_time.tv_nsec >= NSEC_PER_SEC) {
> > + raw_time.tv_nsec -= NSEC_PER_SEC;
> > + raw_time.tv_sec++;
> > }
> >
> > /* accumulate error between NTP and clock interval */
> > - clock->error += tick_length;
> > - clock->error -= clock->xtime_interval << (NTP_SCALE_SHIFT - clock->shift);
> > + timekeeper.ntp_error += tick_length;
> > + timekeeper.ntp_error -= timekeeper.xtime_interval <<
> > + timekeeper.ntp_shift;
> > }
>
>
> Just to be super careful, I'd probably separate the introduction of the
> timekeeper code and the ntp_shift changes into separate patches. Just to
> keep the amount of change to very complex code down to a more easily
> review-able amount.

Done, I now have multiple patches for the cleanup. I will resend soon.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/