Re: mm: unnecessary COW phenomenon

From: Nadav Amit
Date: Wed Nov 10 2021 - 05:47:38 EST

Next message: Michael S. Tsirkin: "Re: [RFC] hypercall-vsock: add a new vsock transport"
Previous message: David Laight: "RE: [PATCH 20/22] x86,word-at-a-time: Remove .fixup usage"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> On Oct 13, 2021, at 10:10 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
>> Andrea, Peter, others,
>
> Hi, Nadav,
>
>>
>> I encountered many unnecessary COW operations on my development kernel
>> (based on Linux 5.13), which I did not see a report about and I am not
>> sure how to solve. Any advice would be appreciated.
>>
>> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the reuse of
>> a page on write-protect fault if page_count(page) != 1. In that case,
>> wp_page_reuse() is not used and instead the page is COW'd by
>> wp_page_copy(). wp_page_copy() is obviously much more expensive, not only
>> because of the copying, but also because it requires a TLB flush and
>> potentially a TLB shootdown.
>>
>> The scenario I encountered happens when I use userfaultfd, but presumably it
>> might happen regardless of userfaultfd (perhaps swap device with
>> SWP_SYNCHRONOUS_IO). It involves two page faults: one that maps a new
>> anonymous page as read-only and a second write-protect fault that happens
>> shortly after on the same page. In this case the page count is almost always
>> elevated and therefore a COW is needed.
>>

[ snip ]

>>
>> It turns out that the elevated page count is due to the caching of the page in
>> the local LRU cache (by lru_cache_add() which is called by
>> lru_cache_add_inactive_or_unevictable() in the case of userfaultfd). Since the
>> first fault happened shortly before the second write-protect fault, the LRU
>> cache had not yet been drained, so the page count was still elevated and a
>> COW was needed.
>>
>> Calling lru_add_drain() during this flow resolves the issue most of the time.
>> Obviously, it needs to be called on the core that allocated (i.e., faulted
>> in) the page initially to work. It is possible to do it conditionally only if
>> the page-count is greater than 1.
>
> I agree with your analysis. I didn't even notice that lru_cache_add() is very
> likely to trigger the COW in your uffd use case (and also for swap), but
> that's indeed something that could happen with the current page reuse logic in
> do_wp_page(), afaiu.

Just an update for the record, based on offline correspondence with Andrea
and Peter, who were very helpful (thanks!).

I could not come up with a non-hacky solution just for this problem. While it
is possible to drain the LRU conditionally, it is admittedly a hack with some
downsides.

The aforementioned issue - an unnecessary TLB flush (or even shootdown) on COW
operations - is not limited to userfaultfd and not even to
SWP_SYNCHRONOUS_IO. It seems that whenever swap is set up on a very
low-latency device (e.g., pmem, zram), the unnecessary COW might happen and
hurt performance.
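
To make the mechanism concrete, here is a toy userspace model of the reuse
decision described in the quoted text. It is only a sketch and not kernel code:
struct toy_page and all toy_*() names are made up for illustration, and the
only thing modelled is the reference count (one reference for the mapping, one
held by the per-CPU LRU cache until it is drained).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count matters. */
struct toy_page {
	int refcount;
};

/* First fault: the page gets mapped and is also queued on the LRU cache. */
static void toy_fault_in(struct toy_page *pg)
{
	pg->refcount = 1;	/* reference held by the mapping */
	pg->refcount++;		/* reference held by the per-CPU LRU cache */
}

/* Models the effect of lru_add_drain(): the cache drops its reference. */
static void toy_lru_drain(struct toy_page *pg)
{
	assert(pg->refcount > 1);
	pg->refcount--;
}

/* Models the check after commit 09854ba94c6aa: reuse only if count == 1. */
static bool toy_wp_fault_needs_copy(const struct toy_page *pg)
{
	return pg->refcount != 1;
}
```

In this model a write-protect fault that arrives right after toy_fault_in()
needs a copy, while one that arrives after toy_lru_drain() can reuse the page.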

I created a small test to verify the impact of the phenomenon (the test code
is below). Swap is set up on an emulated pmem device and the test is run with:

./forceswap 2 100000 1

The benchmark runs 100k rounds in which a page is accessed first for read,
then for write, and is then paged out using MADV_PAGEOUT. Each of the two
accesses causes a page fault. The test only measures the time of the second
access, which should include the write-protect page fault. I also measured
the delta in "nr_tlb_remote_flush" from /proc/vmstat.
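
For reference, the counter delta can be collected with a small helper along
these lines. This is a sketch, not the code used for the numbers in this mail:
read_vmstat_counter() is a hypothetical name, and it assumes the usual
"name value" line format of /proc/vmstat. The caller opens the file, reads the
counter before and after the run, and subtracts.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan "name value" pairs from an already-open stream (e.g. /proc/vmstat)
 * and return the value recorded for `name`, or -1 if it is not present
 * (nr_tlb_remote_flush is missing without CONFIG_DEBUG_TLBFLUSH).
 */
static long long read_vmstat_counter(FILE *f, const char *name)
{
	char key[64];
	long long val;

	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (strcmp(key, name) == 0)
			return val;
	}
	return -1;
}
```

Taking a FILE * instead of a path keeps the parsing independent of procfs, so
it can be checked against a saved copy of the file.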

The results are:

                          cycles/op    nr_tlb_remote_flush
-------------------------------------------------------------------
v5.8     bcf876870b95     1606         300000
mainline cb690f5238d7     10534        399935

As shown, the write-protect fault in mainline takes ~6.5x as long
(10534 vs. 1606 cycles). This is explained by the COW operation,
which also shows up as extra TLB shootdowns (nr_tlb_remote_flush),
roughly one more per round. On bare metal this overhead should be
lower, yet with a higher number of threads the overhead would
increase.
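
The arithmetic behind these two statements, spelled out as a sketch. The
constants are copied from the results table and the command line above; the
macro and function names are made up for illustration.

```c
/* Numbers copied from the results table; nothing here is measured anew. */
#define ROUNDS			100000L	/* second argument of ./forceswap */
#define V58_CYCLES		1606.0
#define MAINLINE_CYCLES		10534.0
#define V58_FLUSHES		300000L
#define MAINLINE_FLUSHES	399935L

/* How many times longer the measured write access takes on mainline. */
static double wp_fault_slowdown(void)
{
	return MAINLINE_CYCLES / V58_CYCLES;
}

/* Additional remote TLB flushes per benchmark round on mainline. */
static double extra_remote_flushes_per_round(void)
{
	return (double)(MAINLINE_FLUSHES - V58_FLUSHES) / ROUNDS;
}
```

This gives a slowdown of about 6.56x and 0.99935 extra remote flushes per
round, i.e. very nearly one additional shootdown for every write-protect fault.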

I also tried to collect the number of I/Os, but for some reason
they do not show up in /sys/dev/block/X/stat for pmem.

[ Some config details:
  KVM VM running on Haswell.
  host: max-freq; kvm_intel's ple_gap=0; 2MB pages.
  VM: mitigations=off idle=poll. Kernel compiled with
  CONFIG_DEBUG_TLBFLUSH=y and CONFIG_BLK_DEV_PMEM=y. ]

-- >8 --

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE (4096)
#define MAX_THREADS (50)

volatile int stop = 0;
unsigned long nops;

/* Spinner threads: they only keep the address space in use on other CPUs. */
void *thread_start(void *arg)
{
	while (!stop) {
		asm volatile ("pause" ::: "memory");
	}

	return NULL;
}

/* Read the time-stamp counter; the aux value in %ecx is ignored. */
static inline uint64_t rdtscp(void)
{
	uint64_t rax, rdx, aux;

	asm volatile ("rdtscp\n" : "=a" (rax), "=d" (rdx), "=c" (aux) : : );
	return (rdx << 32) + rax;
}

int main(int argc, char *argv[])
{
	int r, nthreads, npages, j;
	unsigned long i;
	pthread_attr_t attr;
	pthread_t thread_ids[MAX_THREADS];
	void *res;
	volatile char *p, c;
	uint64_t time = 0;

	if (argc < 4) {
		fprintf(stderr, "usage: %s [nthreads] [nops] [npages]\n", argv[0]);
		exit(-1);
	}

	r = pthread_attr_init(&attr);
	if (r != 0) {
		fprintf(stderr, "error setting attributes %d\n", r);
		exit(-1);
	}

	nthreads = atoi(argv[1]);
	nops = strtoull(argv[2], NULL, 0);
	npages = atoi(argv[3]);

	/* Reject values that would overflow thread_ids[] or divide by zero. */
	if (nthreads < 1 || nthreads - 1 > MAX_THREADS || nops == 0 || npages < 1) {
		fprintf(stderr, "invalid arguments\n");
		exit(-1);
	}

	/* The main thread counts as one of the nthreads. */
	for (i = 0; i < (unsigned long)(nthreads - 1); i++) {
		r = pthread_create(&thread_ids[i], &attr, &thread_start, NULL);
		if (r != 0) {
			fprintf(stderr, "error creating thread %d\n", r);
			exit(-1);
		}
	}

	p = (volatile char *)mmap(0, (size_t)PAGE_SIZE * npages, PROT_READ|PROT_WRITE,
				  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		exit(-1);
	}

	for (i = 0; i < nops; i++) {
		/* Push the pages out to swap so that the next accesses fault. */
		if (madvise((void *)p, (size_t)PAGE_SIZE * npages, MADV_PAGEOUT)) {
			perror("madvise");
			exit(-1);
		}

		for (j = 0; j < npages; j++) {
			/* Read access: first fault, not timed. */
			c = p[j * PAGE_SIZE];
			c++;
			/* Write access: only this second access is timed. */
			time -= rdtscp();
			p[j * PAGE_SIZE] = c;
			time += rdtscp();
		}
	}
	stop = 1;
	for (i = 0; i < (unsigned long)(nthreads - 1); i++) {
		r = pthread_join(thread_ids[i], &res);
		if (r != 0) {
			fprintf(stderr, "error join\n");
			exit(-1);
		}
	}
	/* Accumulated cycles divided by the number of rounds (nops). */
	printf("time: %llu\n", (unsigned long long)(time / nops));
	return 0;
}
