Re: [PATCH v5] x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race

From: Haitao Huang
Date: Thu Jul 27 2023 - 10:16:45 EST


On Wed, 26 Jul 2023 21:50:02 -0500, Huang, Kai <kai.huang@xxxxxxxxx> wrote:

On Wed, 2023-07-26 at 18:02 -0700, Haitao Huang wrote:
Under heavy load, the SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC

If I read correctly, Dave suggested to not use "high" (heavy in this sentence)
or "low" pressure:

https://lore.kernel.org/lkml/op.179a4xs0wjvjmi@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#m9120eac6a4a94daa7c9fcc47709f241cd181e5dc

And I agree. For instance, consider this happens to one extremely "small"
enclave, while there's a new "big" enclave starts to run. I don't think we
should say this is "under heavy load". Just stick to the fact that the
reclaimer may reclaim the SECS page.

Mybe I have some confusion here but I did not think Dave had issues with 'heavy load'. When this happens, the last page causing #PF (page A below) should be the the "youngest" in PTE and it got paged out together with the SECS before the #PF is even handled. Based on that the ksgxd moves 'young' pages to the back of the queue for reclaiming, for that to happen, almost all EPC pages must be paged out for all enclaves at that time, so it means heavy load to me. And that's also consistent with my tests.

page for an enclave and set encl->secs.epc_page to NULL. But the SECS
EPC page is used for EAUG in the SGX page fault handler without checking
for NULL and reloading.

Fix this by checking if SECS is loaded before EAUG and loading it if it
was reclaimed.

The SECS page holds global enclave metadata. It can only be reclaimed
when there are no other enclave pages remaining. At that point,
virtually nothing can be done with the enclave until the SECS page is
paged back in.
...
But it is still possible for a #PF for a non-SECS page to race
with paging out the SECS page: when the last resident non-SECS page A
triggers a #PF in a non-resident page B, and then page A and the SECS
both are paged out before the #PF on B is handled.

Hitting this bug requires that race triggered with a #PF for EAUG.

The above race can happen for the normal ELDU path too, thus I suppose it will
be better to mention why the normal ELDU path doesn't have this issue: it
already does what this fix does.

Should we focus on the bug and fix itself instead of explaining a non-bug case?
And the simple changes in this patch clearly show that too if people look for that.

Thanks
Haitao