Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader

From: Peter Hurley
Date: Wed Mar 13 2013 - 22:01:17 EST


On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> > Hi Dave,
> >
> > On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > >> - since all readers are woken at once, you might see bursts of direct
> > >> IO operations followed by bursts of truncate operations, instead of
> > >> having them interleaved in smaller groups as before.
> > >
> > > And this will result in the same application IO pattern giving
> > > vastly different results. e.g. a read IO promoted ahead of a
> > > truncate will now return data instead of -1 (beyond EOF).
> >
> > I will reply to this part first, as I think this gives a more concrete
> > anchor to the discussion.
> >
> > The crucial point is that the truncate system call hasn't completed
> > yet - that thread is still queued in the rwsem.
> >
> > You still have the guarantee that the truncate will complete
> > eventually (even if there is a never-ending stream of IOs), and that
> > all IO system calls that start after the truncate system call
> > completes will see the file as truncated.
>
> Sure, but the problem is not about when the syscall completes - the
> problem is that you are changing where in the pipeline of IO the
> truncate is initially executed. i.e. ordering is not defined by
> when an operation completes, but by the order in which the queue is
> processed after a blocking operation completes. That is not when the
> syscall completes, but when the filesystem drops the exclusive lock.
>
> From a single threaded userspace application perspective or
> multithreaded apps with their own userspace locking, truncates
> complete when either the syscall completes or some time after when
> the application drops its lock. Either way, we're looking at
> completion time serialisation, and in that case what XFS or the
> kernel does simply doesn't matter.
>
> However, if we are talking about userspace applications that use
> lockless IO algorithms or kernel side applications (like knfsd) that
> are exposed directly to the filesystem locking semantics,
> serialisation behaviour is actually defined by filesystem's
> submission side locking behaviour. There is no external
> serialisation of IO completion, so serialisation and ordering is
> defined (for XFS) solely by the rwsem behaviour.
>
> > You don't have guarantees about which system call will "appear to run
> > before the other" if the execution times of the two system calls
> > overlap each other, but you actually never had such a guarantee from a
> > correctness perspective (i.e. you could never guarantee which of the
> > two would get queued on the rwsem ahead of the other).
>
> Sure, but as long as the submission side ordering is deterministic,
> it doesn't matter.
>
> > > Ok, so I can see where your latency figure comes from, but it's
> > > still a change of ordering in that W2 is no longer a barrier to the
> > > reads queued after it.
> >
> > My claim is that it's not a visible change from a correctness
> > perspective
>
> I am not arguing that it is incorrect. I'm arguing that the change
> of ordering semantics breaks assumptions a lot of code makes about
> how these locks work.
>
> > > similar to this:
> > >
> > > W1R1W2R2W3R3.....WnRm
> > >
> > > By my reading of the algorithm you are proposing, after W1 is
> > > released, we end up with the queue being treated like this:
> > >
> > > R1R2R3....RmW2W3....Wn
> > >
> > > Right?
> >
> > Yes, if what you are representing is the state of the queue at a given
> > point of time (which implies R1..Rm must be distinct threads, not just
> > the same thread doing repeated requests).
>
> Yup, that's very typical.
>
> > As new requests come in over time, one important point to remember is
> > that each writer will see at most one batch of readers wake up ahead
> > of it (though this batch may include readers that were originally
> > queued both before and after it).
>
> And that's *exactly* the problem with the changes you are proposing
> - rwsems will no longer provide strongly ordered write vs read
> barrier semantics.
>
> > I find the name 'barrier' actually confusing when used to describe
> > synchronous operations. To me a barrier is usually between
> > asynchronous operations, and it is well defined which operations
> > are ahead of or behind the barrier (either because they were
> > queued by the same thread, or they were queued by different
> > threads which may have synchronized together some other way).
>
> When you have hundreds or thousands of threads doing IO to the one
> file, it doesn't matter if the IO is issued synchronously or
> asynchronously by the threads - you simply have a huge amount of
> data IO concurrency and very, very deep pipelines.
>
> Inserting a metadata modification (truncate, preallocation,
> holepunch, etc) into that pipeline currently causes all new
> submissions to queue behind the metadata modification, waits for
> everything submitted before the metadata modification to complete
> and then runs the metadata modification. Once it completes, it then
> allows everything queued up to the next metadata modification to
> run....
>
> That matches your definition of a barrier pretty well, I think.
>
> > But in this case, we have two
> > synchronous operations with overlapping execution times - they don't
> > have a way to know which one is 'supposed to' be ahead of the other.
> > The two threads can't observe their relative ordering since they are
> > blocked during the operation...
>
> We don't care about the ordering between multiple concurrent
> metadata modifications - what matters is whether the ongoing data IO
> around them is ordered correctly.

Dave,

The point that Michel is making is that there never was any ordering
guarantee by rwsem. It's an illusion.

The reason is simple: to even get to the lock, the calling context has
to be able to sleep. So every submission that you believe is ordered is,
by its very nature, __not ordered__, even when issued from kernel code.

Why? Because any thread on its way to claim the lock (reader or writer)
can be preempted in favor of some other task, delaying the submission of
whatever i/o you believed to be ordered.
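
A minimal sketch of what I mean, in kernel-style C (the lock and the
function are hypothetical, not taken from any filesystem):

#include <linux/rwsem.h>

static DECLARE_RWSEM(io_lock);          /* hypothetical lock */

/* "Issued first" by the application... */
static void truncate_path(void)
{
        /*
         * The task can be preempted, service an interrupt, or block
         * on a memory allocation anywhere before the next line, so
         * its place among the waiters is decided here, not at
         * syscall entry.
         */
        down_write(&io_lock);
        /* ... shrink the file ... */
        up_write(&io_lock);
}

A read that was issued "later" may still reach down_read(&io_lock)
first, and neither thread can observe which of the two won.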

So just to reiterate: there is no 'queue' and no 'barrier'. The
guarantees that rwsem makes are:
1. Multiple readers can own the lock.
2. Only a single writer can own the lock.
3. Readers will not starve writers.
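
For reference, those three guarantees in use -- again only a sketch,
with a made-up lock and a made-up field it protects:

#include <linux/rwsem.h>
#include <linux/types.h>

static DECLARE_RWSEM(my_sem);
static loff_t my_size;                  /* data the rwsem protects */

/* Guarantee 1: any number of these can hold the lock at once. */
static loff_t read_size(void)
{
        loff_t ret;

        down_read(&my_sem);
        ret = my_size;
        up_read(&my_sem);
        return ret;
}

/* Guarantee 2: this runs with the lock held exclusively.
 * Guarantee 3: a steady stream of read_size() callers will not
 * keep it waiting forever. */
static void set_size(loff_t new_size)
{
        down_write(&my_sem);
        my_size = new_size;
        up_write(&my_sem);
}

Nothing in that says anything about the order in which waiters are
granted the lock, which is the point.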


> On Tue, 2013-03-12 at 13:36 +1100, Dave Chinner wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > - since all readers are woken at once, you might see bursts of direct
> > IO operations followed by bursts of truncate operations, instead of
> > having them interleaved in smaller groups as before.
>
> And this will result in the same application IO pattern giving
> vastly different results. e.g. a read IO promoted ahead of a
> truncate will now return data instead of -1 (beyond EOF).

The 'same application IO pattern' will give you 'vastly different
results' with the __current__ implementation for any given
machine/day/time/run. The reason is that you cannot control which cpu
submits which IO, or which cpu is delayed because it is busy handling a
network interrupt, etc.

Where lock policy can have a significant impact is on performance. But
predicting that impact is difficult -- it's better just to measure.

It's not my intention to convince you (or anyone else) that there should
only be One True Rwsem, because I don't believe that. But I didn't want
the impression to persist that rwsem does anything more than implement a
fair reader/writer semaphore.

Regards,
Peter Hurley
