Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

From: Nikita Yushchenko
Date: Sat Dec 18 2021 - 09:31:58 EST

Next message: Kirill A. Shutemov: "Re: [PATCH v1 04/11] mm: thp: simlify total_mapcount()"
Previous message: Kirill A. Shutemov: "Re: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount()"
In reply to: Dave Hansen: "Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()"
Next in thread: Dave Hansen: "Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()"

This allows archs to optimize it, by
freeing multiple tables in a single release_pages() call. This is
faster than individual put_page() calls, especially with memcg
accounting enabled.

Could we quantify "faster"? There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold-hard numbers.

I currently don't have numbers for this patch taken alone. This patch originates from work done some years ago to reduce the cost of memory accounting, and an x86-only version of this patch has been in the virtuozzo/openvz kernel since then. Other patches from that work have been upstreamed, but this one was missed.

Still, it's obvious that release_pages() should be faster than a loop calling put_page() - isn't that exactly the reason why release_pages() exists and is different from a loop calling put_page()?
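
To make the batching argument concrete, here is a toy user-space model, not kernel code. All names (toy_page, toy_put_page(), toy_release_pages(), the counters) are hypothetical, and the "lock" is only a counter. It assumes the cost being amortized is a per-call acquisition of some shared lock; it does not measure anything in the kernel and is no substitute for the numbers requested above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page: only a reference count. */
struct toy_page {
	int refcount;
};

/* Counts how often the imaginary shared lock was taken. */
static unsigned long toy_lock_acquisitions;
static unsigned long toy_pages_freed;

/* Per-item release: pays the lock cost for every page that gets freed. */
static void toy_put_page(struct toy_page *page)
{
	if (--page->refcount == 0) {
		toy_lock_acquisitions++;	/* lock ... unlock, once per page */
		toy_pages_freed++;
	}
}

/* Batched release: drops all references first, then takes the lock once. */
static void toy_release_pages(struct toy_page **pages, size_t nr)
{
	size_t i, to_free = 0;

	for (i = 0; i < nr; i++)
		if (--pages[i]->refcount == 0)
			to_free++;

	if (to_free) {
		toy_lock_acquisitions++;	/* lock ... unlock, once per batch */
		toy_pages_freed += to_free;
	}
}
```

In this model, freeing N pages one by one takes the lock N times, while the batched call takes it once.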

 static void __tlb_remove_table_free(struct mmu_table_batch *batch)
 {
-	int i;
-
-	for (i = 0; i < batch->nr; i++)
-		__tlb_remove_table(batch->tables[i]);
-
+	__tlb_remove_tables(batch->tables, batch->nr);
 	free_page((unsigned long)batch);
 }

This leaves a single call-site for __tlb_remove_table():

static void tlb_remove_table_one(void *table)
{
	tlb_remove_table_sync_one();
	__tlb_remove_table(table);
}

Is that worth it, or could it just be:

__tlb_remove_tables(&table, 1);

I considered that while preparing the patch, but it resulted in an even larger change in the archs, due to the removal of the non-batched call, so I decided not to go that way.
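
For reference, the single-element variant asked about above can be modelled in user space as follows. This is only a sketch: the toy_* names and counters are hypothetical stand-ins, and the real arch-level cost of dropping the non-batched call is not captured here.

```c
#include <assert.h>

static int toy_tables_freed;
static int toy_sync_calls;

/* Hypothetical batched primitive: "frees" nr tables in one call. */
static void toy_remove_tables(void **tables, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		if (tables[i])
			toy_tables_freed++;
}

/* Stand-in for the synchronization step done before a single free. */
static void toy_remove_table_sync_one(void)
{
	toy_sync_calls++;
}

/* The one-table path expressed as a batch of size 1. */
static void toy_remove_table_one(void *table)
{
	toy_remove_table_sync_one();
	toy_remove_tables(&table, 1);
}
```

With this shape only the batched primitive has to be provided, at the price of every implementation having to switch to the array-based signature.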

And Peter's suggestion to integrate the free_page_and_swap_cache()-based implementation of __tlb_remove_table() into mm/mmu_gather.c under an ifdef, and then do the optimization locally in mm/mmu_gather.c, looks better.

+void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
+{
+	__free_pages_and_swap_cache(pages, nr, false);
+}

This went unmentioned in the changelog. But, it seems like there's a
specific optimization here. In the existing code,
free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
LRU. It doesn't need the lru_add_drain().

This is a somewhat different topic.

In the scope of this patch, the _nolru version was added because there was no LRU draining in the looped call to __tlb_remove_table(). Adding it to the batched version, although it won't break things, does add overhead that was not there before, which is in direct conflict with the original goal.

If the version that drains the LRU is indeed not needed, it can be cleaned up in a separate patchset.

	if (!do_lru)
		VM_WARN_ON_ONCE_PAGE(PageLRU(pagep[i]),
				     pagep[i]);
	free_swap_cache(...);

This looks like a good safety measure, will add it.
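
The relationship between the draining and non-draining variants, together with the suggested warning, can be sketched in user space as below. This is a toy model under stated assumptions: the toy_* names, the on_lru flag and the warning counter are hypothetical stand-ins for lru_add_drain(), PageLRU() and VM_WARN_ON_ONCE_PAGE(), and no real page freeing is done.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page used as a page table. */
struct toy_pt_page {
	bool on_lru;
	bool freed;
};

static int toy_drain_calls;
static int toy_lru_warnings;

/* Stand-in for the drain step that the _nolru variant skips. */
static void toy_lru_add_drain(void)
{
	toy_drain_calls++;
}

/* Common helper: the do_lru flag selects whether the drain happens. */
static void toy_free_pages_common(struct toy_pt_page **pages, int nr,
				  bool do_lru)
{
	int i;

	if (do_lru)
		toy_lru_add_drain();

	for (i = 0; i < nr; i++) {
		/* Safety check: a no-drain caller should never pass LRU pages. */
		if (!do_lru && pages[i]->on_lru)
			toy_lru_warnings++;
		pages[i]->freed = true;
	}
}

static void toy_free_pages(struct toy_pt_page **pages, int nr)
{
	toy_free_pages_common(pages, nr, true);
}

static void toy_free_pages_nolru(struct toy_pt_page **pages, int nr)
{
	toy_free_pages_common(pages, nr, false);
}
```

The no-drain path avoids the extra call entirely, and the check only records a warning when a caller violates the assumption that none of its pages are on the LRU.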

But, even more than that, do all the architectures even need the
free_swap_cache()?

I was under the impression that process page tables are a valid target for swapping out, although I may be wrong here.

Nikita
