Re: [patch 1/3] mm: memcontrol: lockless page counters

From: Vladimir Davydov
Date: Fri Oct 17 2014 - 03:47:43 EST


On Mon, Oct 13, 2014 at 09:46:01PM -0400, Johannes Weiner wrote:
> Memory is internally accounted in bytes, using spinlock-protected
> 64-bit counters, even though the smallest accounting delta is a page.
> The counter interface is also convoluted and does too many things.
>
> Introduce a new lockless word-sized page counter API, then change all
> memory accounting over to it. The translation from and to bytes then
> only happens when interfacing with userspace.
>
> The removed locking overhead is noticeable when scaling beyond the
> per-cpu charge caches - on a 4-socket machine with 144 threads, the
> following test shows the performance differences of 288 memcgs
> concurrently running a page fault benchmark:
>
> vanilla:
>
> 18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
> 1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
> 24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
> 1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
> 50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
> <not supported> stalled-cycles-frontend
> <not supported> stalled-cycles-backend
> 8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
> 1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
> 1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
>
> 132.474343877 seconds time elapsed ( +- 0.21% )
>
> lockless:
>
> 12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
> 832,850 context-switches # 0.068 K/sec ( +- 0.54% )
> 15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
> 1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
> 32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
> <not supported> stalled-cycles-frontend
> <not supported> stalled-cycles-backend
> 9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
> 2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
> 1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
>
> 91.369330729 seconds time elapsed ( +- 0.45% )
>
> On top of improved scalability, this also gets rid of the icky long
> long types in the very heart of memcg, which is great for 32 bit and
> also makes the code a lot more readable.
>
> Notable differences between the old and new API:
>
> - res_counter_charge() and res_counter_charge_nofail() become
> page_counter_try_charge() and page_counter_charge() resp. to match
> the more common kernel naming scheme of try_do()/do()
>
> - res_counter_uncharge_until() is only ever used to cancel a local
> counter and never to uncharge bigger segments of a hierarchy, so
> it's replaced by the simpler page_counter_cancel()
>
> - res_counter_set_limit() is replaced by page_counter_limit(), which
> expects its callers to serialize against themselves
>
> - res_counter_memparse_write_strategy() is replaced by
> page_counter_limit(), which rounds down to the nearest page size -
> rather than up. This is more reasonable for explicitly requested
> hard upper limits.
>
> - to keep charging light-weight, page_counter_try_charge() charges
> speculatively, only to roll back if the result exceeds the limit.
> Because of this, a failing bigger charge can temporarily lock out
> smaller charges that would otherwise succeed. The error is bounded
> to the difference between the smallest and the biggest possible
> charge size, so for memcg, this means that a failing THP charge can
> send base page charges into reclaim up to 2MB (4MB) before the limit
> would have been reached. This should be acceptable.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>

Definitely better than it was.
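
For anyone following along, the speculative charging described in the last
bullet of the changelog can be modelled in user space roughly as below. This
is only a sketch under my own assumptions, not the code from the patch:
struct counter, counter_try_charge() and counter_cancel() are hypothetical
names, and C11 atomics stand in for the kernel's atomic_long_t helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model, not the kernel's struct page_counter. */
struct counter {
	atomic_long count;	/* pages currently charged */
	long limit;		/* hard limit, in pages */
	struct counter *parent;	/* NULL at the root of the hierarchy */
};

/* Undo a charge on one counter only, without touching its ancestors. */
static void counter_cancel(struct counter *c, long nr_pages)
{
	atomic_fetch_sub(&c->count, nr_pages);
}

/*
 * Charge nr_pages to c and every ancestor, without taking a lock.
 * Returns true on success. On failure nothing stays charged and *fail
 * points at the counter whose limit was hit.
 */
static bool counter_try_charge(struct counter *c, long nr_pages,
			       struct counter **fail)
{
	struct counter *pos, *undo;

	for (pos = c; pos; pos = pos->parent) {
		/* Speculative: add first, compare against the limit after. */
		long total = atomic_fetch_add(&pos->count, nr_pages) + nr_pages;

		if (total > pos->limit) {
			/*
			 * Between the add above and this cancel, concurrent
			 * smaller charges may see an inflated count and fail
			 * even though they would have fit.
			 */
			counter_cancel(pos, nr_pages);
			*fail = pos;
			goto rollback;
		}
	}
	return true;

rollback:
	/* Drop what was already charged below the failing level. */
	for (undo = c; undo != pos; undo = undo->parent)
		counter_cancel(undo, nr_pages);
	return false;
}
```

The window between the add and the cancel is where the bounded error from
the changelog comes from: the count is transiently too high by at most the
size of the failing charge.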

Acked-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>