Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts()

From: Arnd Bergmann
Date: Fri Jun 02 2017 - 05:46:06 EST


On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani <deepa.kernel@xxxxxxxxx> wrote:
>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann <arnd@xxxxxxxx> wrote:
>>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani <deepa.kernel@xxxxxxxxx> wrote:
>>>>>
>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>>>>>> index 517838b..77204da 100644
>>>>>>> --- a/drivers/block/rbd.c
>>>>>>> +++ b/drivers/block/rbd.c
>>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
>>>>>>> {
>>>>>>> struct ceph_osd_request *osd_req = obj_request->osd_req;
>>>>>>>
>>>>>>> - osd_req->r_mtime = CURRENT_TIME;
>>>>>>> + ktime_get_real_ts(&osd_req->r_mtime);
>>>>>>> osd_req->r_data_offset = obj_request->offset;
>>>>>>> }
>>>>>>>
>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>> index c681762..1d3fa90 100644
>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request *
>>>>>>> ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>> {
>>>>>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS);
>>>>>>> + struct timespec ts;
>>>>>>>
>>>>>>> if (!req)
>>>>>>> return ERR_PTR(-ENOMEM);
>>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode)
>>>>>>> init_completion(&req->r_safe_completion);
>>>>>>> INIT_LIST_HEAD(&req->r_unsafe_item);
>>>>>>>
>>>>>>> - req->r_stamp = current_fs_time(mdsc->fsc->sb);
>>>>>>> + ktime_get_real_ts(&ts);
>>>>>>> + req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran);
>>>>>>
>>>>>> This change causes our kernel_untar_tar test case to fail (inode's
>>>>>> ctime goes back). The reason is that there is time drift between the
>>>>>> time stamps got by ktime_get_real_ts() and current_time(). We need to
>>>>>> revert this change until current_time() uses ktime_get_real_ts()
>>>>>> internally.
>>>>>
>>>>> Hmm, the change was not supposed to have a user-visible effect, so
>>>>> something has gone wrong, but I don't immediately see how it
>>>>> relates to what you observe.
>>>>>
>>>>> ktime_get_real_ts() and current_time() use the same time base, there
>>>>> is no drift, but there is a difference in resolution, as the latter uses
>>>>> the time stamp of the last jiffies update, which may be up to one jiffy
>>>>> (10ms) behind the exact time we put in the request stamps here.
>>>>>
>>>>> Do you still see problems if you use current_kernel_time() instead of
>>>>> ktime_get_real_ts()?
>>>>
>>>> The problem disappears after using current_kernel_time().
>>>>
>>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417
>>>
>>> From the commit above:
>>> "It seems there is time drift between ktime_get_real_ts() and
>>> current_kernel_time()"
>>>
>>> Its more of a granularity difference. current_kernel_time() returns
>>> the cached time at the last tick, where as ktime_get_real_ts() reads
>>> the clocksource hardware and returns the immediate time.
>>>
>>> Filesystems usually use the cached time (similar to
>>> CLOCK_REALTIME_COARSE), for performance reasons, as touching the
>>> clocksource takes time.
>>
>> Alternatively, it would be best for this code also to use current_time().
>> I had suggested this in one of the previous versions of the patch.
>> The implementation of current_time() will change when we switch vfs to
>> use 64 bit time. This will prevent such errors from happening again.
>> But, this also means there is more code reordering for these modules
>> to get a reference to inode.
>>
>
> I took a look. it's quite inconvenience to use current_time(). I
> prefer to temporarily use current_kernel_time().

I've looked at the code some more and I think there is another angle to it:
In your test case, 'tar' calls into the utimes syscall (or a member of its
family), which sets the i_ctime field in the inode to the curren time
(using current_time()), then calls __ceph_setattr(), which creates
a mds client request, and ceph_mdsc_create_request() takes another
time stamp and stores it in r_stamp.

We then store the first timestamp (only) in the in-memory inode, and the
second time stamp in the request. Depending on the state of the inode,
we may also set the ctime to a third timestamp we again take using
current_time().

The mtime and atime from user space get passed correctly through
union ceph_mds_request_args->setattr and are kept in sync between
the in-memory inode and the persistent inode data, but the ctime
in the inode never makes it to the lower protocol levels and instead
we use the r_stamp field that got set a little earlier or a little later.

I believe the bug you see is the result of the two timestamps
currently being almost guaranteed to be different in the latest
kernels.
Changing r_stamp to use current_kernel_time() will make it the
same value most of the time (as it was before Deepa's patch),
but when the timer interrupt happens between the timestamps,
the two are still different, it's just much harder to hit.

I think the proper solution should be to change __ceph_setattr()
in a way that has req->r_stamp always synchronized with i_ctime.
If we copy i_ctime to r_stamp, that will also take care of the
future issues with the planned changes to current_time().

The part I don't understand is what else r_stamp (i.e. the time
stamp in ceph_msg_data with type==
CEPH_MSG_CLIENT_REQUEST) is used for, other than setting
ctime in CEPH_MDS_OP_SETATTR.

Will this be used to update the stored i_ctime for other operations
too? If so, we would need to synchronize it with the in-memory
i_ctime for all operations that do this.

Semi-related side note: I see that the granularity for ceph_timespec
is 1000 nanoseconds, so the value is always a multiple of 1000.
If the full 32-bit data gets stored, we could use this it to store the
epoch number in the future:

static inline void ceph_decode_timespec(struct timespec64 *ts,
const struct ceph_timespec *tv)
{
u32 ns_epoch = le32_to_cpu(tv->tv_nsec);
u32 epoch = ns_epoch % 1000;

/* tv_sec is traditionally interpreted as unsigned
* with time ranges 1970-2106, we extend
* it to 1970-138069 */
ts->tv_sec = (u64)le32_to_cpu(tv->tv_sec) +
(u64)epoch << 32;
ts->tv_nsec = ns_epoch - epoch;
}

Arnd