Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4

From: Andrew Morton
Date: Thu May 15 2014 - 17:24:20 EST

Next message: Geoff Levand: "[RFC 1/1] vmlinux.lds.h: Add new SECTION() macro"
Previous message: Andi Kleen: "Re: [PATCH v2]: fix Haswell precise store data source encoding"
In reply to: Oleg Nesterov: "Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4"
Next in thread: Peter Zijlstra: "Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 15 May 2014 11:48:09 +0100 Mel Gorman <mgorman@xxxxxxx> wrote:

> Changelog since v3
> o Correct handling of exclusive waits
>
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
>
> This adds a few branches to the fast path but avoids bouncing a dirty
> cache line between CPUs. 32-bit machines always take the slow path but the
> primary motivation for this patch is large machines so I do not think that
> is a concern.
>
> The test case used to evaulate this is a simple dd of a large file done
> multiple times with the file deleted on each iterations. The size of
> the file is 1/10th physical memory to avoid dirty page balancing. In the
> async case it will be possible that the workload completes without even
> hitting the disk and will have variable results but highlight the impact
> of mark_page_accessed for async IO. The sync results are expected to be
> more stable. The exception is tmpfs where the normal case is for the "IO"
> to not hit the disk.
>
> The test machine was single socket and UMA to avoid any scheduling or
> NUMA artifacts. Throughput and wall times are presented for sync IO, only
> wall times are shown for async as the granularity reported by dd and the
> variability is unsuitable for comparison. As async results were variable
> do to writback timings, I'm only reporting the maximum figures. The sync
> results were stable enough to make the mean and stddev uninteresting.
>
> The performance results are reported based on a run with no profiling.
> Profile data is based on a separate run with oprofile running. The
> kernels being compared are "accessed-v2" which is the patch series up
> to this patch where as lockpage-v2 includes this patch.
>
> async dd
> 3.15.0-rc3 3.15.0-rc3
> accessed-v3 lockpage-v3
> ext3 Max elapsed 11.5900 ( 0.00%) 11.0000 ( 5.09%)
> ext4 Max elapsed 13.3400 ( 0.00%) 13.4300 ( -0.67%)
> tmpfs Max elapsed 0.4900 ( 0.00%) 0.4800 ( 2.04%)
> btrfs Max elapsed 12.7800 ( 0.00%) 13.8200 ( -8.14%)
> xfs Max elapsed 2.0900 ( 0.00%) 2.1100 ( -0.96%)

So ext3 got 5% faster and btrfs got 8% slower?

>
> ...
>

The numbers look pretty marginal from here and the patch is, umm, not a
thing of beauty or simplicity.

I'd be inclined to go find something else to work on, frankly.

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -241,15 +241,22 @@ void delete_from_page_cache(struct page *page)
> }
> EXPORT_SYMBOL(delete_from_page_cache);
>
> -static int sleep_on_page(void *word)
> +static int sleep_on_page(struct page *page)
> {
> - io_schedule();
> + /*
> + * A racing unlock can miss that the waitqueue is active and clear the
> + * waiters again. Only sleep if PageWaiters is still set and timeout
> + * to recheck as races can still occur.
> + */
> + if (PageWaiters(page))
> + io_schedule_timeout(HZ);

ew.

> return 0;
> }
>
> ...
>
> +/* Returns true if the page is locked */

Comment is inaccurate.

> +static inline bool prepare_wait_bit(struct page *page, wait_queue_head_t *wqh,
> + wait_queue_t *wq, int state, int bit_nr, bool exclusive)
> +{
> +
> + /* Set PG_waiters so a racing unlock_page will check the waitiqueue */
> + if (!PageWaiters(page))
> + SetPageWaiters(page);
> +
> + if (exclusive)
> + prepare_to_wait_exclusive(wqh, wq, state);
> + else
> + prepare_to_wait(wqh, wq, state);
> + return test_bit(bit_nr, &page->flags);
> }
>
> ...
>
> int wait_on_page_bit_killable(struct page *page, int bit_nr)
> {
> + wait_queue_head_t *wqh;
> DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> + int ret = 0;
>
> if (!test_bit(bit_nr, &page->flags))
> return 0;
> + wqh = page_waitqueue(page);
> +
> + do {
> + if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
> + ret = sleep_on_page_killable(page);
> + } while (!ret && test_bit(bit_nr, &page->flags));
> + finish_wait(wqh, &wait.wait);
>
> - return __wait_on_bit(page_waitqueue(page), &wait,
> - sleep_on_page_killable, TASK_KILLABLE);
> + return ret;
> }

Please find a way to test all this nicely when there are signals pending?

> /**
>
> ...
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Geoff Levand: "[RFC 1/1] vmlinux.lds.h: Add new SECTION() macro"
Previous message: Andi Kleen: "Re: [PATCH v2]: fix Haswell precise store data source encoding"
In reply to: Oleg Nesterov: "Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4"
Next in thread: Peter Zijlstra: "Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]