Re: [PATCH] timerfd_create.2: Included return value 0

From: Michael Kerrisk (man-pages)
Date: Mon Mar 30 2020 - 17:09:30 EST


Hello Thomas,

On 3/30/20 12:50 AM, Thomas Gleixner wrote:
> Micheal,
>
> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>> [Greetings, Thomas; now I recall a conversation we had in Lyon :-) ]
>
> Hehe.
>
>> I think this patch does not really capture the details
>> properly. The immediately preceding paragraph says:
>>
>> If the associated clock is either CLOCK_REALTIME or
>> CLOCK_REALTIME_ALARM, the timer is absolute
>> (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
>> was specified when calling timerfd_settime(), then read(2)
>> fails with the error ECANCELED if the real-time clock
>> undergoes a discontinuous change. (This allows the reading
>> application to discover such discontinuous changes to the
>> clock.)
>>
>> Following on from that, I think we should have a pargraph that says
>> something like:
>>
>> If the associated clock is either CLOCK_REALTIME or
>> CLOCK_REALTIME_ALARM, the timer is absolute
>> (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET
>> was not specified when calling timerfd_settime(), then a
>> discontinuous negative change to the clock
>> (e.g., clock_settime(2)) may cause read(2) to unblock, but
>> return a value of 0 (i.e., no bytes read), if the clock
>> change occurs after the time expired, but before the
>> read(2) on the timerfd file descriptor.
>
> Yes, that's correct. Accurate as always!

Thanks. (It took me a while to nut it out, actually.)

> This is pretty much in line with clock_nanosleep(CLOCK_REALTIME,
> TIMER_ABSTIME) which has a similar problem vs. observability in user
> space.
>
> clock_nanosleep(2) mutters:
>
> "POSIX.1 specifies that after changing the value of the CLOCK_REALTIME
> clock via clock_settime(2), the new clock value shall be used to
> determine the time at which a thread blocked on an absolute
> clock_nanosleep() will wake up; if the new clock value falls past the
> end of the sleep interval, then the clock_nanosleep() call will return
> immediately."
>
> which can be interpreted as guarantee that clock_nanosleep() never
> returns prematurely,

<nod>

> i.e. the assert() in the below code would indicate
> a kernel failure:
>
> ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL);
> if (!ret) {
> clock_gettime(CLOCK_REALTIME, &now);
> assert(now >= expiry);
> }
>
> But that assert can trigger when CLOCK_REALTIME was modified after the
> timer fired and the kernel decided to wake up the task and let it return
> to user space.
>
> clock_nanosleep(..., &expiry)
> arm_timer(expires);
> schedule();
>
> -> timer interrupt
> now = ktime_get_real();
> if (expires <= now)
> -------------------------------- After this point
> wakeup(); clock_settime(2) or
> adjtimex(2) which
> makes CLOCK_REALTIME
> jump back far enough will
> cause the above assert
> to trigger.
>
> ...
> return from syscall (retval == 0)
>
> There is no guarantee against clock_settime() coming after the
> wakeup. Even if we put another check into the return to user path then
> we won't catch a clock_settime() which comes right after that and before
> user space invokes clock_gettime().

<nod>

> POSIX spec Issue 7 (2018 edition) says:
>
> The suspension for the absolute clock_nanosleep() function (that is,
> with the TIMER_ABSTIME flag set) shall be in effect at least until the
> value of the corresponding clock reaches the absolute time specified by
> rqtp.
>
> And that's what the kernel implements for clock_nanosleep() and timerfd
> behaves exactly the same way.
>
> The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2),
> read(2), poll(2), is not happening _before_ the absolute time specified
> is reached.
>
> If clock_settime() happens right before the expiry check, then it does
> the right thing, but any modification to the clock after the wakeup
> cannot be mitigated. At least not in a way which would make the assert()
> in the example code above a reliable indicator for a kernel fail.
>
> That's the reason why I rejected the attempt to mitigate that particular
> 0 tick issue in timerfd as it would just scratch a particular itch but
> still not provide any guarantee. So having the '0' return documented is
> the right way to go.

Thanks for the incredibly detailed follow-up Thomas (which all
landed in my commit message). I've applied a patch almost exactly
the same as the text I showed above (and it's pushed to Git).

All of 2020 is a bust, I expect. Perhaps we see us at a conference
in 2021 ;-).

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/