Re: [PATCH] xen: speed up grant-table reclaim

From: Juergen Gross
Date: Tue Jun 13 2023 - 02:46:22 EST

Next message: Stephen Rothwell: "linux-next: build failure after merge of the net-next tree"
Previous message: kernel test robot: "[tip:master] BUILD SUCCESS 7728822622f78a493273454af746309f852c4c0e"
In reply to: Demi Marie Obenour: "Re: [PATCH] xen: speed up grant-table reclaim"
Next in thread: Demi Marie Obenour: "Re: [PATCH] xen: speed up grant-table reclaim"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12.06.23 22:09, Demi Marie Obenour wrote:

On Mon, Jun 12, 2023 at 08:27:59AM +0200, Juergen Gross wrote:

On 10.06.23 17:32, Demi Marie Obenour wrote:

When a grant entry is still in use by the remote domain, Linux must put
it on a deferred list.

This lacks quite some context.

The main problem is related to the grant not having been unmapped after
the end of a request, but the side granting the access is assuming this
should be the case.

The GUI agent has relied on deferred grant reclaim for as long as it has
existed. One could argue that doing so means that the agent is misusing
gntalloc, but this is not documented anywhere. A better fix would be to
use IOCTL_GNTDEV_SET_UNMAP_NOTIFY in the GUI daemon.

I don't think this is a gntalloc specific problem. You could create the same
problem with a kernel implementation.

And yes, using IOCTL_GNTDEV_SET_UNMAP_NOTIFY might be a good idea.

In general this means that the two sides implementing the protocol don't
agree how it should work, or that the protocol itself has a flaw.

What would a better solution be? This is going to be particularly
tricky with Wayland, as the wl_shm protocol makes absolutely no
guarantees that compositors will promptly release the mapping and
provides no way whatsoever for Wayland clients to know when this has
happened. Relying on an LD_PRELOAD hack is not sustainable.

Normally, this list is very short, because
the PV network and block protocols expect the backend to unmap the grant
first.

Normally the list is just empty. Only in very rare cases like premature
PV frontend module unloading it is expected to see cases of deferred
grant reclaims.

In the case of a system using only properly-designed PV protocols
implemented in kernel mode I agree. However, both libxenvchan and the
Qubes GUI protocol are implemented in user mode and this means that if
the frontend process (the one that uses gntalloc) crashes, deferred
grant reclaims will occur.

This would be the user space variant of frontend driver unloading.

Worse, it is possible for the domain to use
the grant in a PV protocol. If the PV backend driver then maps and
unmaps the grant and then tells the frontend driver to reclaim it, but
the backend userspace process (the one using gntdev) maps it before the
frontend does reclaim it, the frontend will think the backend is trying
to exploit XSA-396 and freeze the connection.

Let me sum up how I understand above statement:

1. Frontend F in DomA is granting DomB access via grant X for PV-device Y.
2. DomB backend B for PV-device Y is mapping the page using grant X and uses it.
3. DomB backend B unmaps grant X and signals end of usage to DomA frontend F.
4. DomB userspace process maps grant X
5. DomA frontend F tries to reclaim grant X and fails due to DomB userspace
mapping

So why would DomB userspace process map grant X? This would imply a malicious
process in DomB. This might be possible, but I don't see how such an attack
would relate to your problem. It could happen with any malicious userspace
program with root access in a domain running a PV backend.

However, Qubes OS's GUI protocol is subject to the constraints
of the X Window System, and as such winds up with the frontend unmapping
the window first. As a result, the list can grow very large, resulting
in a massive memory leak and eventual VM freeze.

I do understand that it is difficult to change the protocol and/or
behavior after the fact, or that performance considerations are in the
way of doing so.

Would the correct fix be to use IOCTL_GNTDEV_SET_UNMAP_NOTIFY? That
would require that the agent either create a new event channel for each
window or maintain a pool of event channels, but that should be doable.

I think this sounds like a promising idea.

This still does not solve the problem of the frontend exiting
unexpectedly, though.

Such cases need to be handled via deferred reclaim. You could add a flag
to struct deferred_entry when called through gntalloc_vma_close(),
indicating that this is an expected deferred reclaim not leading to
error messages.

To partially solve this problem, make the number of entries that the VM
will attempt to free at each iteration tunable. The default is still
10, but it can be overridden at compile-time (via Kconfig), boot-time
(via a kernel command-line option), or runtime (via sysfs).

Is there really a need to have another Kconfig option for this? AFAICS
only QubesOS is affected by the problem you are trying to solve. I don't
see why you can't use the command-line option or sysfs node to set the
higher reclaim batch size.

Fair. In practice, Qubes OS will need to use the sysfs node, since
the other two do not work with in-VM kernels.

Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic")

I don't think this "Fixes:" tag is appropriate. The mentioned commit didn't
have a bug. You are adding new functionality on top of it.

I’ll drop the "Fixes:" tag, but I will keep the "Cc: stable@xxxxxxxxxxxxxxx"
as I believe this patch meets the following criterion for stable
backport (from Documentation/process/stable-kernel-rules.rst):

Serious issues as reported by a user of a distribution kernel may also
be considered if they fix a notable performance or interactivity issue.

Sure, this seems appropriate.

Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Demi Marie Obenour <demi@xxxxxxxxxxxxxxxxxxxxxx>
---
drivers/xen/Kconfig | 12 ++++++++++++
drivers/xen/grant-table.c | 40 ++++++++++++++++++++++++++++-----------
2 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index d5d7c402b65112b8592ba10bd3fd1732c26b771e..8f96e1359eb102d6420775b66e7805004a4ce9fe 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -65,6 +65,18 @@ config XEN_MEMORY_HOTPLUG_LIMIT
This value is used to allocate enough space in internal
tables needed for physical memory administration.
+config XEN_GRANTS_RECLAIM_PER_ITERATION
+ int "Default number of grant entries to reclaim per iteration"
+ default 10
+ range 10 4294967295
+ help
+ This sets the default value for the grant_table.free_per_iteration
+ kernel command line option, which sets the number of grants that
+ Linux will try to reclaim at once. The default is 10, but
+ workloads that make heavy use of gntalloc will likely want to
+ increase this. The current value can be accessed and/or modified
+ via /sys/module/grant_table/parameters/free_per_iteration.
+

Apart from the fact that I don't like adding a new Kconfig option, the help
text is not telling the true story. Heavy use of gntalloc isn't to blame, but
the fact that your PV-device implementation is based on the reclaim
functionality. TBH, someone not familiar with the grant reclaim will have no
chance to select a sensible value based on your help text.

That’s a good point. Qubes OS will need to use the sysfs value anyway
in order to support in-VM kernels, so the Kconfig option is not really
useful.

Nice.

config XEN_SCRUB_PAGES_DEFAULT
bool "Scrub pages before returning them to system by default"
depends on XEN_BALLOON
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index e1ec725c2819d4d5dede063eb00d86a6d52944c0..fa666aa6abc3e786dddc94f895641505ec0b23d8 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -498,14 +498,20 @@ static LIST_HEAD(deferred_list);
static void gnttab_handle_deferred(struct timer_list *);
static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred);
+static atomic64_t deferred_count;
+static atomic64_t leaked_count;
+static unsigned int free_per_iteration = CONFIG_XEN_GRANTS_RECLAIM_PER_ITERATION;
+
static void gnttab_handle_deferred(struct timer_list *unused)
{
- unsigned int nr = 10;
+ unsigned int nr = READ_ONCE(free_per_iteration);
+ const bool ignore_limit = nr == 0;
struct deferred_entry *first = NULL;
unsigned long flags;
+ size_t freed = 0;
spin_lock_irqsave(&gnttab_list_lock, flags);
- while (nr--) {
+ while ((ignore_limit || nr--) && !list_empty(&deferred_list)) {
struct deferred_entry *entry
= list_first_entry(&deferred_list,
struct deferred_entry, list);
@@ -515,10 +521,13 @@ static void gnttab_handle_deferred(struct timer_list *unused)
list_del(&entry->list);
spin_unlock_irqrestore(&gnttab_list_lock, flags);
if (_gnttab_end_foreign_access_ref(entry->ref)) {
+ uint64_t ret = atomic64_sub_return(1, &deferred_count);
put_free_entry(entry->ref);
- pr_debug("freeing g.e. %#x (pfn %#lx)\n",
- entry->ref, page_to_pfn(entry->page));
+ pr_debug("freeing g.e. %#x (pfn %#lx), %llu remaining\n",
+ entry->ref, page_to_pfn(entry->page),
+ (unsigned long long)ret);

I'd prefer not to issue lots of prints (being it debug or info ones) if the
reclaim is expected to happen with a specific PV-device.

My preferred solution would be a per-device flag, but at least you should
switch to pr_*_ratelimited(). Same applies further down.

What do you mean by “per-device flag”? These grants are allocated by
userspace using gntalloc, so there is no PV device on which the flag
could be set.

In this case the flag should relate to the file pointer used for
gntalloc_open().

Juergen

Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature
Description: OpenPGP digital signature

Next message: Stephen Rothwell: "linux-next: build failure after merge of the net-next tree"
Previous message: kernel test robot: "[tip:master] BUILD SUCCESS 7728822622f78a493273454af746309f852c4c0e"
In reply to: Demi Marie Obenour: "Re: [PATCH] xen: speed up grant-table reclaim"
Next in thread: Demi Marie Obenour: "Re: [PATCH] xen: speed up grant-table reclaim"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]