Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support

From: Theodore Ts'o
Date: Tue Jun 16 2015 - 23:16:10 EST

In reply to: Tejun Heo: "Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support"
Next in thread: Tejun Heo: "Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support"

On Tue, Jun 16, 2015 at 05:54:36PM -0400, Tejun Heo wrote:
> Hello, Ted.
>
> On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> > So if there is some way we can signal to any cgroup that might be
> > throttling writeback or disk I/O that the jbd/jbd2 process should be
> > considered privileged, that would be good, since it would allow us to
> > avoid a potential priority inversion problem.
>
> I see. In the long term, I think we might need to come up with a way
> to overcharge a slower cgroup to avoid blocking faster ones for cases
> where some IOs are depended upon by more than one cgroup. That'd
> take quite a bit of work from blkcg side. Will think more about it.

Hmm, while we're at it, there's another priority inversion that can be
painful. If a directory block has been pushed out of memory (possibly
because it was initially accessed by a cgroup with a very tiny amount
of memory allocated to it) and a process in a cgroup with tightly
constrained disk time tries to do a lookup in that directory, the read
it issues might take minutes to complete. The problem is that the VFS
has locked the directory's i_mutex *before* calling ext4_lookup().

If a high priority process then tries to read the same directory, or
in fact attempts any VFS operation which requires taking the
directory's i_mutex first, including renaming the directory, the high
priority process will end up blocking until the read is completed,
which can be minutes if the low priority process has a tiny amount of
disk time allocated to it.

There is a related problem where if a read for a particular block is
issued with a very low amount of disk time, and that same block is
required by a high priority process, we can also get hit with a very
similar priority inversion problem.

To date the answer has always been, "Doctor, Doctor it hurts when I do
that...." The only way I can think of fixing the directory mutex
problem is by returning an error code to the VFS layer which instructs
it to unlock the directory, and then have it wait on some wait channel
so it ends up calling the lookup after the directory block has been
read into memory (and we can hope that the block doesn't end up
getting ejected from memory right away due to a tight memory cgroup).
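
To make the unlock-and-retry idea concrete, here is a rough userspace
model in Python. It is only a sketch of the control flow, not kernel
code: RetryAfterRead, fs_lookup(), wait_for_read() and vfs_lookup() are
hypothetical names invented for illustration, the "locked" attribute
stands in for i_mutex, and the set stands in for the page cache.

```python
class RetryAfterRead(Exception):
    """Hypothetical 'drop the lock, wait, then call me again' signal."""

    def __init__(self, block):
        super().__init__(block)
        self.block = block


class Directory:
    def __init__(self, block):
        self.block = block
        self.locked = False  # stand-in for the directory's i_mutex


page_cache = set()  # blocks currently in memory
events = []         # trace of what the VFS side did, in order


def fs_lookup(directory, name):
    """Filesystem lookup: refuses to block on I/O while the lock is held."""
    if directory.block not in page_cache:
        raise RetryAfterRead(directory.block)
    return f"{name}@{directory.block}"


def wait_for_read(directory, block):
    """Stand-in for sleeping on a wait channel until the read completes."""
    assert not directory.locked, "must not wait with the directory locked"
    events.append("wait")
    page_cache.add(block)


def vfs_lookup(directory, name):
    """Lock, try the lookup, and on a retry signal unlock before waiting."""
    while True:
        directory.locked = True
        events.append("lock")
        try:
            return fs_lookup(directory, name)
        except RetryAfterRead as retry:
            pending = retry.block
        finally:
            directory.locked = False
            events.append("unlock")
        # The lock is dropped here, so other users of the directory
        # are not stuck behind the slow read.
        wait_for_read(directory, pending)


if __name__ == "__main__":
    print(vfs_lookup(Directory(block=42), "foo"))
    print(events)
```

The loop also covers the case where the block gets ejected again
between the wait and the retry: the lookup simply signals a retry
once more.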

For another part of the problem, if a high priority process attempts a
read and the I/O is already queued up, but it's at the back of the bus
because it was originally posted by a low priority cgroup, the fix
would be to elevate the priority of said I/O request and then re-sort
the queue.
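
As a toy illustration of "elevate the priority and re-sort", the
following Python sketch models a request queue where a second
submission for an already queued block boosts the existing request
instead of queueing a duplicate. BoostableQueue and its methods are
made-up names, lower numbers mean higher priority by assumption, and
none of this reflects how the real block layer schedulers are
structured.

```python
import heapq
import itertools


class BoostableQueue:
    """Toy I/O queue ordered by priority (lower number is served first)."""

    def __init__(self):
        self._heap = []      # entries are [priority, seq, block]
        self._entries = {}   # block -> its live entry in the heap
        self._seq = itertools.count()  # tie-breaker: submission order

    def submit(self, block, priority):
        """Queue a read for block, or boost the one already queued."""
        entry = self._entries.get(block)
        if entry is not None:
            if priority < entry[0]:
                entry[0] = priority
                heapq.heapify(self._heap)  # the "re-sort the queue" step
            return
        entry = [priority, next(self._seq), block]
        self._entries[block] = entry
        heapq.heappush(self._heap, entry)

    def dispatch(self):
        """Remove and return the block that should be read next."""
        _priority, _seq, block = heapq.heappop(self._heap)
        del self._entries[block]
        return block


if __name__ == "__main__":
    q = BoostableQueue()
    q.submit("other_a", priority=4)
    q.submit("other_b", priority=4)
    q.submit("dir_block", priority=7)   # posted by a low priority cgroup
    q.submit("dir_block", priority=0)   # high priority process wants it too
    print(q.dispatch())
```

Re-heapifying on every boost is O(n), which is fine for a model; a
real queue would want something cheaper.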

As far as the filemap_fdatawait() call is concerned, if it's being
called by fsync() run by a low priority process, or from the writeback
thread, then it can certainly take place at a low priority. But if the
filemap_fdatawait() is being done by a high priority process, such as
a jbd/jbd2 thread, then there needs to be a way that we can set a flag
in the wbc structure indicating that the writes should be submitted as
if they were issued from the kernel thread, and not based on who
originally dirtied the page.
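
The decision such a flag would drive can be sketched in a few lines of
Python. This is a model only: the submit_as_kernel field is a
hypothetical flag that does not exist in the kernel's wbc structure,
and KERNEL_CONTEXT, Page and charge_target() are names invented here
to show the intended choice.

```python
from dataclasses import dataclass

KERNEL_CONTEXT = "root"  # hypothetical label for the unthrottled kernel context


@dataclass
class WritebackControl:
    """Toy stand-in for the wbc structure."""
    submit_as_kernel: bool = False  # hypothetical flag, not an existing field


@dataclass
class Page:
    index: int
    dirtied_by: str  # cgroup that originally dirtied the page


def charge_target(page, wbc):
    """Return the context whose I/O budget the write is submitted under."""
    if wbc.submit_as_kernel:
        # Caller is privileged (for example a journal thread), so the
        # write must not inherit the dirtier's throttling.
        return KERNEL_CONTEXT
    return page.dirtied_by


if __name__ == "__main__":
    page = Page(index=0, dirtied_by="lowprio")
    print(charge_target(page, WritebackControl()))
    print(charge_target(page, WritebackControl(submit_as_kernel=True)))
```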

It's going to be a number of point solutions, which is a bit ugly, but
I think that is much more likely to be successful than trying to
implement, say, a generalized priority inheritance scheme for block
I/O requests and related locks. :-)

- Ted