Re: CONFIG_PREEMPT and server workloads

From: Andrew Morton
Date: Thu Mar 18 2004 - 17:32:18 EST


Takashi Iwai <tiwai@xxxxxxx> wrote:
>
> --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> @@ -290,6 +290,9 @@ write_out_data_locked:
> commit_transaction->t_sync_datalist = jh;
> break;
> }
> +
> + if (need_resched())
> + break;
> } while (jh != last_jh);
>
> if (bufs || need_resched()) {

Yup, this has problems.

What we're doing here is walking (under spinlock) a ring of buffers
searching for unlocked dirty ones to write out.

Suppose the ring has thousands of locked buffers and there is a RT task
scheduling at 1kHz. Each time need_resched() comes true we break out of
the search, schedule away and then restart the search from the beginning.

So if the time to walk that ring of buffers exceeds one millisecond the
kernel will make no progress. Eventually things will fix themselves up as
the I/O completes against those buffers but in the meanwhile we chew 100%
of CPU.

A few lines higher up we advance the searching start point so the next
search will start at the current position. But that's when it is known
that we made some forward progress. In the above patch we are vulnerable
to the situation where _all_ of the buffers on that ring are locked.

The below might suit, although I haven't tested it. It needs careful
testing too - it will impact I/O scheduling efficiency.



--- 25/fs/jbd/commit.c~a Thu Mar 18 14:23:40 2004
+++ 25-akpm/fs/jbd/commit.c Thu Mar 18 14:28:46 2004
@@ -280,6 +280,22 @@ write_out_data_locked:
goto write_out_data;
}
}
+ } else if (need_resched()) {
+ /*
+ * A locked buffer, and we're being a CPU pig. We
+ * cannot just schedule away because it may be the case
+ * that *all* the buffers are locked. So let's just
+ * wait until this buffer has been written.
+ *
+ * If we already have some buffers in wbuf[] then fine,
+ * we're making forward progress.
+ */
+ if (bufs)
+ break;
+ commit_transaction->t_sync_datalist = jh;
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ goto write_out_data;
}
if (bufs == ARRAY_SIZE(wbuf)) {
/*

_

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/