Re: [PATCH Part2 v6 14/49] crypto: ccp: Handle the legacy TMR allocation when SNP is enabled

From: Kalra, Ashish
Date: Thu Nov 17 2022 - 15:57:09 EST


Hello Boris,


+        if (ret)
+            goto cleanup;
+
+        ret = rmp_make_shared(pfn, PG_LEVEL_4K);
+        if (ret)
+            goto cleanup;
+
+        pfn++;
+        n++;
+    }
+
+    return 0;
+
+cleanup:
+    /*
+     * If failed to reclaim the page then page is no longer safe to
+     * be released, leak it.
+     */
+    snp_leak_pages(pfn, npages - n);

So this looks real weird: we go and reclaim pages, we hit an error
during reclaiming a page X somewhere in-between and then we go and mark
the *remaining* pages as not to be used?!

Why?

Why not only that *one* page which failed and then we continue with the
rest?!


I had a re-look at this while fixing the memory_failure() handling and realized that we can't do a *partial* recovery here if we fail to reclaim a single page, i.e., if we hit an error during reclaiming a single page we need to mark the remaining pages as not usable.

This is because this page could be a part of larger buffer which would have been transitioned to firmware state and need to be restored back in *full* to HV/shared state, any access to a partially transitioned buffer will still cause failures, basically the callers won't be able to do any kind of a recovery/access on a partially restored/reclaimed buffer and now potentially fragmented buffer, which anyway means failure due to data loss on non reclaimed page(s).

So we need to be able to reclaim all the pages or none.

Also this failure won't be happening regularly/frequently, it is a *rare* error case and if there is a reclamation failure on a single page, there is a high probability that there will be reclamation failures on subsequent pages.

Thanks,
Ashish