Re: [PATCH] cfq-iosched: Add some more documentation about idling

From: Shaohua Li
Date: Mon Aug 01 2011 - 20:41:19 EST


On Mon, 2011-08-01 at 23:55 +0800, Vivek Goyal wrote:
> There are always questions about why CFQ idles under various conditions.
> The most recent is Christoph asking again why we idle on REQ_NOIDLE. His
> assertion is that XFS is relying more and more on workqueues and is
> concerned that CFQ idling on IO from every workqueue will impact
> XFS badly.
>
> So he suggested that I add some more documentation about CFQ idling,
> which can provide more clarity on the topic and also gives an
> opportunity to poke holes in the theory and lead to improvements.
>
> So here is my attempt at that. Any comments are welcome.
> Signed-off-by: Vivek Goyal <vgoyal@xxxxxxxxxx>
> ---
> Documentation/block/cfq-iosched.txt | 70 +++++++++++++++++++++++++++++++++++
> 1 files changed, 70 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
> index e578fee..7ce81b8 100644
> --- a/Documentation/block/cfq-iosched.txt
> +++ b/Documentation/block/cfq-iosched.txt
> @@ -43,3 +43,73 @@ If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
> to IOPS mode and starts providing fairness in terms of number of requests
> dispatched. Note that this mode switching takes effect only for group
> scheduling. For non-cgroup users nothing should change.
> +
> +CFQ IO scheduler Idling Theory
> +==============================
> +Idling on a queue is primarily about waiting for the next request to arrive on
> +the same queue after completion of a request. In this process CFQ will not
> +dispatch requests from other cfq queues even if requests are pending
> +there.
> +
> +The rationale behind idling is that it can cut down on the number of seeks
> +on rotational media. For example, if a process is doing dependent
> +sequential reads (the next read is issued only after completion of the previous
> +one), then not dispatching requests from other queues should help, as we
> +do not move the disk head and keep dispatching sequential IO from
> +one queue.
> +
> +CFQ does not do idling on all the queues. It primarily tries to do idling
> +on queues which are doing synchronous sequential IO. The synchronous
> +queues which are not doing sequential IO are put on a separate service
> +tree (called sync-noidle tree) where we do not idle on individual
> +cfq queue, but idle on the whole tree or IOW, idle on a group of cfq
> +queues.
> +
> +CFQ has the following three service trees, and various queues are put on
> +these trees.
> +
> + sync-idle sync-noidle async
> +
> +All cfq queues doing synchronous sequential IO go on to sync-idle tree.
> +On this tree we idle on each queue individually.
> +
> +All synchronous non-sequential queues go on sync-noidle tree. Also, any
> +requests which are marked with REQ_NOIDLE go on this service tree.
> +
> +All async writes go on async service tree. There is no idling on async
> +queues.
Maybe also mention that CFQ doesn't idle on SSDs.
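
To make the classification above concrete, here is a minimal Python sketch of how I read the three service trees. It only models the text of the patch plus my SSD remark; `CfqQueue`, `service_tree`, and `idle_scope` are hypothetical names, not anything from cfq-iosched.c, and the real conditions in the kernel may be more involved.

```python
from dataclasses import dataclass


@dataclass
class CfqQueue:
    """Toy model of a cfq queue; fields are illustrative, not kernel fields."""
    sync: bool             # queue is doing synchronous IO
    sequential: bool       # IO pattern looks sequential
    noidle: bool = False   # requests are marked REQ_NOIDLE


def service_tree(q: CfqQueue) -> str:
    """Return the service tree the documentation text assigns to this queue."""
    if not q.sync:
        return "async"
    if q.noidle or not q.sequential:
        return "sync-noidle"
    return "sync-idle"


def idle_scope(tree: str, on_ssd: bool = False) -> str:
    """Return what CFQ idles on: each 'queue', the whole 'tree', or 'none'."""
    # Assumption from the review comment: no idling at all on SSDs.
    if on_ssd or tree == "async":
        return "none"
    return "queue" if tree == "sync-idle" else "tree"
```

For instance, a sequential reader lands on sync-idle and is idled on individually, while an fsync-style queue with REQ_NOIDLE lands on sync-noidle and only gets tree-wide idling.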

> +FAQ
> +===
> +Q1. Why idle at all on queues marked with REQ_NOIDLE?
> +
> +A1. We only do group idle on queues marked with REQ_NOIDLE. This helps in
On "group idle": tree or group? I suppose you are talking about the tree,
as the example below doesn't mention groups.
The sentence is a little confusing. We also do tree/group idle for a queue
doing random sync I/O, even without REQ_NOIDLE, if the queue is the last one
in the tree/group.

> + providing isolation from all the sync-idle queues. Otherwise, in the presence
> + of many sequential readers, other synchronous IO might not get a fair
> + share of the disk.
> +
> + For example, suppose there are 10 sequential readers doing IO and they get
> + 100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
> + roughly after 1 second. If after completion of the REQ_NOIDLE request we
> + do not idle, and after a couple of milliseconds another REQ_NOIDLE
> + request comes in, again it will be scheduled after 1 second. Repeat it
> + and notice how a workload can lose its disk share and suffer due to
> + multiple sequential readers.
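
The arithmetic in this example could be spelled out roughly as below. This is only a back-of-the-envelope sketch: `round_wait_ms` and `disk_share` are hypothetical helpers, and the per-request service time is an assumed parameter that the patch does not give.

```python
def round_wait_ms(num_seq_readers: int, slice_ms: float) -> float:
    """Time a REQ_NOIDLE request waits if every sequential reader runs a full slice."""
    return num_seq_readers * slice_ms


def disk_share(num_seq_readers: int, slice_ms: float, service_ms: float) -> float:
    """Fraction of disk time the REQ_NOIDLE workload gets when CFQ does not idle.

    Assumes one request is served per round and then the workload waits for
    all sequential readers again; service_ms is an illustrative guess.
    """
    wait = round_wait_ms(num_seq_readers, slice_ms)
    return service_ms / (wait + service_ms)


# 10 readers at 100ms each: the request waits about 1000 ms per round.
wait = round_wait_ms(10, 100)
# With an assumed 10 ms of service per round, the share is about 1%.
share = disk_share(10, 100, 10)
```

That is the point of the paragraph: without idling, each dependent request pays the full round of sequential slices again.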
> +
> + fsync can generate dependent IO where a bunch of data is written in the
> + context of fsync, and later some journaling data is written. Journaling
> + data comes in only after fsync has finished its IO (at least for ext4
> + that seemed to be the case). Now if one decides not to idle on the fsync
> + thread due to REQ_NOIDLE, then the next journaling write will not get
> + scheduled for another second. A process doing small fsyncs will suffer
> + badly in the presence of multiple sequential readers.
> +
> + Hence doing group idling on threads using REQ_NOIDLE flag on requests
Same question here about "group idling": tree or group?
Thanks,
Shaohua
