Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

From: Huang, Kai
Date: Thu Feb 22 2024 - 17:32:00 EST

Next message: Kent Overstreet: "Re: stable-kernel-rules was Re: fs/bcachefs/"
Previous message: Eyal Birger: "Re: [PATCH net-next 2/2] net: geneve: enable local address bind for geneve sockets"
In reply to: Haitao Huang: "Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup"
Next in thread: Haitao Huang: "Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 23/02/2024 6:20 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai <kai.huang@xxxxxxxxx> wrote:

On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:

[...]
>
> Here the @nr_to_scan is reduced by the number of pages that are
> isolated, but
> not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks you want to make this function do "each cycle" as what you
> mentioned
> in the v8 [1]:
>
>     I tested with that approach and found we can only target number of
> pages
>     attempted to reclaim not pages actually reclaimed due to the
> uncertainty
>     of how long it takes to reclaim pages. Besides targeting number of
>     scanned pages for each cycle is also what the ksgxd does.
>
>     If we target actual number of pages, sometimes it just takes too long.
> I
>     saw more timeouts with the default time limit when running parallel
>     selftests.
>
> I am not sure what does "sometimes it just takes too long" mean, but
> what I am
> thinking is you are trying to do some perfect but yet complicated code
> here.

I think what I observed was that the try_charge() would block too long
before getting chance of schedule() to yield, causing more timeouts than
necessary.
I'll do some re-test to be sure.

Looks this is a valid information that can be used to justify whatever you are
implementing in the EPC cgroup reclaiming function(s).

I'll add some comments. Was assuming this is just following the old design as ksgxd.
There were some comments at the beginning of sgx_epc_cgrooup_reclaim_page().
        /*
         * Attempting to reclaim only a few pages will often fail and is
         * inefficient, while reclaiming a huge number of pages can result in
         * soft lockups due to holding various locks for an extended duration.
         */
        unsigned int nr_to_scan = SGX_NR_TO_SCAN;

I think it can be improved to emphasize we only "attempt" to finish scanning fixed number of pages for reclamation, not enforce number of pages successfully reclaimed.

Not sure need to be this comment, but at somewhere just state you are trying to follow the ksgxd() (the current sgx_reclaim_pages()), but trying to do it "_across_ given cgroup and all the descendants".

That's the reason you made @nr_to_scan as a pointer.

And also some text to explain why to follow ksgxd() -- not wanting to block longer due to loop over descendants etc -- so we can focus on discussing whether such justification is reasonable.

Next message: Kent Overstreet: "Re: stable-kernel-rules was Re: fs/bcachefs/"
Previous message: Eyal Birger: "Re: [PATCH net-next 2/2] net: geneve: enable local address bind for geneve sockets"
In reply to: Haitao Huang: "Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup"
Next in thread: Haitao Huang: "Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]