On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai <kai.huang@xxxxxxxxx> wrote:
On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:I'll add some comments. Was assuming this is just following the old design as ksgxd.
[...]
>
> Here the @nr_to_scan is reduced by the number of pages that are
> isolated, but
> not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks you want to make this function do "each cycle" as what you
> mentioned
> in the v8 [1]:
>
> I tested with that approach and found we can only target number of
> pages
> attempted to reclaim not pages actually reclaimed due to the
> uncertainty
> of how long it takes to reclaim pages. Besides targeting number of
> scanned pages for each cycle is also what the ksgxd does.
>
> If we target actual number of pages, sometimes it just takes too long.
> I
> saw more timeouts with the default time limit when running parallel
> selftests.
>
> I am not sure what does "sometimes it just takes too long" mean, but
> what I am
> thinking is you are trying to do some perfect but yet complicated code
> here.
I think what I observed was that the try_charge() would block too long
before getting chance of schedule() to yield, causing more timeouts than
necessary.
I'll do some re-test to be sure.
Looks this is a valid information that can be used to justify whatever you are
implementing in the EPC cgroup reclaiming function(s).
There were some comments at the beginning of sgx_epc_cgrooup_reclaim_page().
/*
* Attempting to reclaim only a few pages will often fail and is
* inefficient, while reclaiming a huge number of pages can result in
* soft lockups due to holding various locks for an extended duration.
*/
unsigned int nr_to_scan = SGX_NR_TO_SCAN;
I think it can be improved to emphasize we only "attempt" to finish scanning fixed number of pages for reclamation, not enforce number of pages successfully reclaimed.