Re: CFQ timer precision

From: Jan Kara
Date: Thu Nov 19 2015 - 09:55:58 EST


Hi Jeff,

On Mon 16-11-15 12:23:05, Jeff Moyer wrote:
> Jens Axboe <axboe@xxxxxxxxx> writes:
> > On 11/16/2015 08:11 AM, Jan Kara wrote:
> >> Hello,
> >>
> >> lately I was looking into a big performance hit we take when blkio
> >> controller is enabled and jbd2 thread ends up in a different cgroup than
> >> user process. E.g. dbench4 throughput drops from ~140 MB/s to ~20 MB/s.
> >> However artificial dbench4 may be, this kind of drop will likely be
> >> clearly visible in real-life workloads as well. With the unified cgroup
> >> hierarchy the above cgroup split between jbd2 and user processes is
> >> unavoidable once you enable the blkio controller, so IMO we should
> >> accommodate that better.
>
> Is group idle enabled? What happens if you set that to 0? What's the
> storage?

So both slice_idle and group_idle are at their defaults - i.e., 8 ms.
Storage is a SATA drive. The regression actually happens because of
slice_idle: the cfq queue belonging to jbd2 is the only queue in its
service tree, so cfq_should_idle() returns true and we start idling once we
have committed a transaction. dbench4 cannot preempt the jbd2 thread because
it is in a different cgroup (that preemption is what makes the performance
good when no cgroups are involved), so it can submit its IO only after the
idle timer fires.

In the other direction, when dbench4 ends up waiting for jbd2 to commit a
transaction, the situation is the same - jbd2 has to wait for the idle timer
on the dbench4 queue to fire before it can start the transaction commit.
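
To make the blocked preemption concrete: the bail-out that keeps dbench4
from preempting the idling jbd2 queue is the cross-group check in
cfq_should_preempt(). Roughly - this is a paraphrased sketch from memory,
not the verbatim cfq-iosched.c code, with most of the surrounding checks
elided:

/* Paraphrased sketch of cfq_should_preempt(); not the verbatim kernel code. */
static bool cfq_should_preempt_sketch(struct cfq_data *cfqd,
                                      struct cfq_queue *new_cfqq,
                                      struct request *rq)
{
        struct cfq_queue *cfqq = cfqd->active_queue;

        if (!cfqq)
                return false;

        /*
         * Preemption never crosses cfq groups, so with jbd2 and dbench4 in
         * different blkio cgroups the new queue has to sit out the whole
         * slice_idle window even though the active queue has nothing left
         * to submit.
         */
        if (new_cfqq->cfqg != cfqq->cfqg)
                return false;

        /* ... the remaining sync / class / think-time checks decide ... */
        return true;
}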

BTW: I've seen the same issue on a different machine with group_idle as
well. There we had more partitions (so the test partition and the root
partition were on the same disk), the test processes were therefore not the
only ones in their service trees, cfq_should_idle() returned false, and we
fell back to group idling.

Basically, my thought is that idling for 8 ms to guarantee QoS for a service
tree - that's the reason for the condition

    if (st->count == 1 && cfq_cfqq_sync(cfqq) &&
        !cfq_io_thinktime_big(cfqd, &st->ttime, false))

in cfq_should_idle() - or to guarantee QoS for a cgroup - that's the reason
for group_idle - is just too much. I'd like to experiment with idling only
for, say, min(2*think_time, slice_idle) or something like that, but for this
to work we need higher timer precision than jiffies.
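
As a very rough sketch of what I mean - a hypothetical helper, not existing
CFQ code, with everything in microseconds:

#include <linux/kernel.h>	/* min_t() */
#include <linux/types.h>

/*
 * Hypothetical helper, not existing CFQ code: size the idle window from the
 * queue's observed think time instead of always using the full slice_idle.
 * Working in microseconds is the point - at HZ=250 one jiffy is already
 * 4 ms, so jiffies cannot express a window like this.
 */
static inline u64 cfq_adaptive_idle_us(u64 think_time_us, u64 slice_idle_us)
{
	return min_t(u64, 2 * think_time_us, slice_idle_us);
}

With slice_idle at 8000 us and an observed think time of, say, 500 us, this
would idle for only 1 ms instead of the full 8 ms.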

> >> I have a couple of CFQ idling improvements / fixes which I'll post later
> >> this week once I complete a round of benchmarking. They improve the
> >> throughput to ~40 MB/s, which helps, but clearly there's still plenty of
> >> room for improvement. The reason for the performance drop is essentially
> >> the idling we do to avoid starvation of CFQ queues. Now, when idling in
> >> this context, the current default 8 ms idle window is far too large - we
> >> start the timer after the final request is completed and thus we
> >> effectively give the process 8 ms of CPU time to submit the next IO
> >> request, which I think is usually far too much.
>
> I think I'd need to see blktrace data to make any real sense of this
> problem statement. It's just too vague.

Currently the machine is running other tests, so I cannot give you the
blktrace data, but I definitely want to include some blktrace excerpts when
submitting my CFQ idling changes.

> >> The problem is that more fine-grained idling is
> >> difficult because e.g. SUSE distro kernels have HZ=250 and thus
> >> 1 jiffy is 4 ms. Hence my proposal: Do you think it would be OK to convert
> >> CFQ to use highres timers and do all the accounting in microseconds?
> >> Then we could tune the idle time to, say, 1 ms, or even autotune it based
> >> on the process' think time, both of which I expect would get us much
> >> closer to the original throughput (a 4 ms idle window gets us to ~70 MB/s
> >> with my patches, disabling idling gets us to the original throughput as
> >> expected).
> >
> > Converting to a non-jiffies timer base should be quite fine. We didn't
> > have hrtimers when CFQ was written :-)
>
> I actually have a patch that allows setting slices at the usec
> granularity, but didn't go so far as to use hrtimers. I have no idea
> why I wrote it, or whether it works. ;-) I've attached it in case it's
> useful to you, Jan.

Thanks! It looks like you already did a big chunk of the work ;-)
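
For reference, the rough shape I have in mind for the idle timer once it
moves off jiffies is something like the following - a minimal hrtimer sketch
only; the struct and function names are made up for illustration and none of
this is the actual cfq-iosched.c code:

#include <linux/hrtimer.h>
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/time.h>

/* Illustrative only - names do not match cfq-iosched.c. */
struct cfq_idle {
	struct hrtimer timer;
	u64 idle_us;			/* idle window in microseconds */
};

static enum hrtimer_restart cfq_idle_expired(struct hrtimer *t)
{
	struct cfq_idle *idle = container_of(t, struct cfq_idle, timer);

	/*
	 * Idle window is over: this is where we would expire the active
	 * queue and kick the dispatcher, as the jiffies timer does today.
	 */
	(void)idle;
	return HRTIMER_NORESTART;
}

static void cfq_idle_init(struct cfq_idle *idle)
{
	hrtimer_init(&idle->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	idle->timer.function = cfq_idle_expired;
}

static void cfq_idle_arm(struct cfq_idle *idle)
{
	/* Arm for idle_us microseconds instead of whole jiffies. */
	hrtimer_start(&idle->timer,
		      ns_to_ktime(idle->idle_us * NSEC_PER_USEC),
		      HRTIMER_MODE_REL);
}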

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR