Re: SGX vs LSM (Re: [PATCH v20 00/28] Intel SGX1 support)

From: Stephen Smalley
Date: Fri May 24 2019 - 11:44:47 EST


On 5/24/19 3:24 AM, Xing, Cedric wrote:
Hi Andy,

From: Andy Lutomirski [mailto:luto@xxxxxxxxxx]
Sent: Thursday, May 23, 2019 6:18 PM

On Thu, May 23, 2019 at 4:40 PM Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
wrote:

On Thu, May 23, 2019 at 08:38:17AM -0700, Andy Lutomirski wrote:
On Thu, May 23, 2019 at 7:17 AM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:

On Thu, May 23, 2019 at 01:26:28PM +0300, Jarkko Sakkinen wrote:
On Wed, May 22, 2019 at 07:35:17PM -0700, Sean Christopherson wrote:
But actually, there's no need to disallow mmap() after ECREATE
since the LSM checks also apply to mmap(), e.g. FILE__EXECUTE
would be needed to mmap() any enclave pages PROT_EXEC. I guess my past self
thought mmap() bypassed LSM checks? The real problem is that
mmap()'ng an existing enclave would require FILE__WRITE and
FILE__EXECUTE, which puts us back at square one.

I'm lost with the constraints we want to set.

As is today, SELinux policies would require enclave loaders to
have FILE__WRITE and FILE__EXECUTE permissions on
/dev/sgx/enclave. Presumably other LSMs have similar
requirements. Requiring all processes to have
FILE__{WRITE,EXECUTE} permissions means the permissions don't add
much value, e.g. they can't be used to distinguish between an
enclave that is being loaded from an unmodified file and an enclave that is being
generated on the fly, e.g. Graphene.

Looking back at Andy's mail, he was talking about requiring
FILE__EXECUTE to run an enclave, so perhaps it's only FILE__WRITE
that we're trying to special case.


I thought about this some more, and I have a new proposal that helps
address the ELRANGE alignment issue and the permission issue at the
cost of some extra verbosity. Maybe you all can poke holes in it :)
The basic idea is to make everything more explicit from a user's
perspective. Here's how it works:

Opening /dev/sgx/enclave gives an enclave_fd that, by design,
doesn't give EXECUTE or WRITE. mmap() on the enclave_fd only works
if you pass PROT_NONE and gives the correct alignment. The
resulting VMA cannot be mprotected or mremapped. It can't be
mmapped at all until

I assume you're thinking of clearing all VM_MAY* flags in sgx_mmap()?

after ECREATE because the alignment isn't known before that.

I don't follow. The alignment is known because userspace knows the
size of its enclave. The initial unknown is the address, but that
becomes known once the initial mmap() completes.

[...]

I think I made the mistake of getting too carried away with implementation details rather
than just getting to the point. And I misremembered the ECREATE flow -- oops. Let me try
again. First, here are some problems with some earlier proposals (mine, yours,
Cedric's):

- Having the EADD operation always work but have different effects depending on the
source memory permissions is, at the very least, confusing.

Inheriting permissions from source pages IMHO is the easiest way to validate the EPC permissions without any changes to LSM. And the argument about its security is also easy to make.

I understand that it may take some effort to document it properly but otherwise don't see any practical issues with it.


- If we want to encourage user programs to be well-behaved, we want to make it easy to
map the RX parts of an enclave RX, the RW parts RW, the RO parts R, etc. But this
interacts poorly with the sgx_mmap() alignment magic, as you've pointed out.

- We don't want to couple LSMs with SGX too tightly.

So here's how a nice interface might work:

int enclave_fd = open("/dev/sgx/enclave", O_RDWR);

/* enclave_fd points to a totally blank enclave. Before ECREATE, we need to decide on an
address. */

void *addr = mmap(NULL, size, PROT_NONE, MAP_SHARED, enclave_fd, 0);

/* we have an address! */

ioctl(enclave_fd, ECREATE, ...);

/* now add some data to the enclave. We want the RWX addition to fail
immediately unless we have the relevant LSM permission. Similarly, we
want the RX addition to fail immediately unless the source VMA is appropriate. */

ioctl(enclave_fd, EADD, rx_source_1, MAXPERM=RX, ...); [the ... includes SECINFO, which the kernel doesn't really care about]
ioctl(enclave_fd, EADD, ro_source_1, MAXPERM=RX ...);
ioctl(enclave_fd, EADD, rw_source_1, MAXPERM=RW ...);
ioctl(enclave_fd, EADD, rwx_source_1, MAXPERM=RWX ...);

If MAXPERM is taken from ioctl parameters, the real question here is how to validate MAXPERM. Guess we shouldn't allow arbitrary MAXPERM to be specified by user code, and the only logical source I can think of is from the source pages (or from the enclave source file, but memory mapping is preferred because it offers more flexibility).

ioctl(enclave_fd, EINIT, ...); /* presumably pass sigstruct_fd here, too. */

/* at this point, all is well except that the enclave is mapped PROT_NONE. There are a
couple ways I can imagine to fix this. */

We could use mmap:

mmap(baseaddr+offset, len, PROT_READ, MAP_SHARED | MAP_FIXED, enclave_fd, 0); /* only
succeeds if MAXPERM & R == R */

But this has some annoying implications with regard to sgx_get_unmapped_area(). We could
use an ioctl:

There's an easy fix. Just let sgx_get_unmapped_area() do the natural alignment only if MAP_FIXED is *not* set, otherwise, honor both address and len.

But mmap() is subject to LSM check (probably against /dev/sgx/enclave?). How to do mmap(RX) if FILE__EXECUTE is *not* granted for /dev/sgx/enclave, even if MAXPERM=RX?


ioctl(enclave_fd, SGX_IOC_MPROTECT, offset, len, PROT_READ);

which has the potentially nice property that we can completely bypass the LSM hooks,
because the LSM has *already* vetted everything when the EADD calls were allowed. Or we
could maybe even just use mprotect() itself:

mprotect(baseaddr + offset, len, PROT_READ);

How to bypass LSM hooks in this mprotect()?


Or, for the really evil option, we could use a bit of magic in .fault and do nothing here.
Instead we'd make the initial mapping PROT_READ|PROT_WRITE|PROT_EXEC and have .fault
actually instantiate the PTEs with the intersection of the VMA permissions and MAXPERM. I
don't think I like this alternative, since it feels more magical than needed and it will
be harder to debug. I like the fact that /proc/self/maps shows the actual permissions in
all the other variants.

Agreed.
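
The MAXPERM rule that runs through the options above can be sketched in a few lines of C. This is only an illustration under stated assumptions: perm_within_max() and effective_perm() are hypothetical names, not part of any proposed driver, and the PROT_* bits from <sys/mman.h> stand in for whatever encoding MAXPERM would actually use.

```c
#include <assert.h>     /* static_assert */
#include <stdbool.h>
#include <sys/mman.h>   /* PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC */

/* PROT_NONE is the empty permission set, so it always passes the check. */
static_assert(PROT_NONE == 0, "PROT_NONE is expected to be the empty set");

/*
 * Hypothetical helper: true if `requested` asks for nothing beyond `maxperm`.
 * The parentheses matter: in C, `maxperm & r == r` parses as
 * `maxperm & (r == r)`, which is not the intended subset test.
 */
static bool perm_within_max(int requested, int maxperm)
{
    return (requested & maxperm) == requested;
}

/*
 * Hypothetical model of the ".fault" variant: the PTE would be installed
 * with the intersection of the VMA permissions and MAXPERM.
 */
static int effective_perm(int vma_prot, int maxperm)
{
    return vma_prot & maxperm;
}
```

With MAXPERM=RX, a PROT_READ request passes, a PROT_READ|PROT_WRITE request is refused, and an RWX VMA in the .fault variant would end up with RX PTEs.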


All of the rest of the crud in my earlier email was just implementation details. The
point I was trying to make was that I think it's possible to implement this without making
too much of a mess internally. I think I favor the mprotect() approach since it makes the
behavior fairly obvious.

I don't think any of this needs to change for SGX2. We'd have an
ioctl() that does EAUG and specifies MAXPERM. Trying to mprotect() a page that hasn't
been added yet with any permission other than PROT_NONE would fail. I suppose we might
end up needing a way to let the EAUG operation *change* MAXPERM, and this operation would
have to do some more LSM checks and walk all the existing mappings to make sure they're
consistent with the new MAXPERM.

An EAUG ioctl could be a solution, but it isn't optimal. What we've done is #PF based. Specifically, an SGX2 enclave will have its heap mapped as RW, but without any pages populated before EINIT. Then when the enclave needs a new page in its heap, it issues EACCEPT, which will cause a #PF, and the driver will respond by EAUG'ing a new EPC page. The enclave is then resumed and the faulted EACCEPT is retried (and succeeds).


As an aside, I wonder if Linus et al. would be okay with a new MAP_FULLY_ALIGNED mmap()
flag that allocated memory aligned to the requested size. Then we could get rid of yet
another bit of magic.
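
For comparison, here is roughly what userspace has to do today to get a size-aligned reservation without such a flag: over-allocate PROT_NONE address space and trim it. This is a sketch, not anything from the SGX patches; reserve_naturally_aligned() is a hypothetical name, and it uses an anonymous mapping rather than the enclave fd.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `size` bytes of PROT_NONE address space whose base is aligned to
 * `size`. `size` must be a power of two and at least one page.
 * Returns MAP_FAILED on error.
 */
static void *reserve_naturally_aligned(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || size < (size_t)page ||
        (size & (size - 1)) != 0 || size > SIZE_MAX / 2)
        return MAP_FAILED;

    /* Over-allocate so that an aligned window of `size` bytes must exist. */
    size_t span = size * 2;
    void *raw = mmap(NULL, span, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    uintptr_t start = (uintptr_t)raw;
    uintptr_t end = start + span;
    uintptr_t base = (start + size - 1) & ~((uintptr_t)size - 1);
    assert(base + size <= end);

    /* Trim the unaligned head and the leftover tail. */
    if (base > start)
        munmap(raw, base - start);
    if (end > base + size)
        munmap((void *)(base + size), end - (base + size));

    return (void *)base;
}
```

The cost is a temporary reservation of twice the requested size and up to two extra munmap() calls, which is the kind of magic a dedicated flag would hide.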

--Andy

I've also had a chance to think more about it lately.

When we talk about EPC page permissions with SGX2 in mind, I think we should distinguish between initial permissions and runtime permissions. Initial permissions refer to the page permissions set at EADD. They are technically set by "untrusted" code so should go by policies similar to those applicable to regular shared objects. Runtime permissions refer to the permissions granted by EMODPE, EAUG and EACCEPTCOPY. They result from the inherent behavior of the enclave, which in theory is determined by the enclave's measurements (MRENCLAVE and/or MRSIGNER).

And we have 2 distinct files to work with - the enclave file and /dev/sgx/enclave. And I consider the enclave file a logical source for initial permissions while /dev/sgx/enclave is a means to control runtime permissions. Then we can have a simpler approach like the pseudo code below.

/**
* Summary:
* - The enclave file resembles a shared object that contains RO/RX/RW segments
* - FILE__* are assigned to /dev/sgx/enclave, to determine acceptable permissions to mmap()/mprotect(), valid combinations are
* + FILE__READ - Allow SGX1 enclaves only
* + FILE__READ|FILE__WRITE - Allow SGX2 enclaves to expand data segments (e.g. heaps, stacks, etc.)
* + FILE__READ|FILE__WRITE|FILE__EXECUTE - Allow SGX2 enclaves to expand both data and code segments. This is necessary to support dynamically linked enclaves (e.g. Graphene)
* + FILE__READ|FILE__EXECUTE - Allow RW->RX changes for SGX1 enclaves - necessary to support dynamically linked enclaves (e.g. Graphene) on SGX1. EXECMEM is also required for this to work

I think EXECMOD would fit better than EXECMEM for this case; the former is applied for RW->RX changes for private file mappings while the latter is applied for WX private file mappings.

* + <None> - Disallow the calling process to launch any enclaves
*/
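
The summary above amounts to a small lookup from the FILE__* set granted on /dev/sgx/enclave to what the process may do. A sketch of that table in C follows. The bit values and enum names are made up for illustration (SELinux's real FILE__* constants are generated elsewhere and differ), and any combination the summary does not list is reported as unlisted rather than guessed.

```c
#include <assert.h>

/* Hypothetical bit values, for illustration only. */
enum { F_READ = 1u << 0, F_WRITE = 1u << 1, F_EXECUTE = 1u << 2 };

/* Hypothetical names for the cases listed in the summary. */
enum enclave_policy {
    POLICY_UNLISTED = -1,      /* combination not covered by the summary */
    POLICY_NO_ENCLAVES,        /* <None> */
    POLICY_SGX1_ONLY,          /* FILE__READ */
    POLICY_SGX2_DATA,          /* FILE__READ|FILE__WRITE */
    POLICY_SGX2_DATA_AND_CODE, /* FILE__READ|FILE__WRITE|FILE__EXECUTE */
    POLICY_SGX1_RW_TO_RX,      /* FILE__READ|FILE__EXECUTE */
};

static enum enclave_policy classify_enclave_policy(unsigned perms)
{
    switch (perms) {
    case 0:                            return POLICY_NO_ENCLAVES;
    case F_READ:                       return POLICY_SGX1_ONLY;
    case F_READ | F_WRITE:             return POLICY_SGX2_DATA;
    case F_READ | F_WRITE | F_EXECUTE: return POLICY_SGX2_DATA_AND_CODE;
    case F_READ | F_EXECUTE:           return POLICY_SGX1_RW_TO_RX;
    default:                           return POLICY_UNLISTED;
    }
}
```

For instance, assert(classify_enclave_policy(F_READ) == POLICY_SGX1_ONLY) holds, while F_WRITE alone comes back as POLICY_UNLISTED.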

/* Step 1: mmap() the enclave file according to the segment attributes (similar to what dlopen() would do for regular shared objects) */
int image_fd = open("/path/to/enclave/file", O_RDONLY);

FILE__READ checked to enclave file upon open().

foreach phdr in loadable segments /* phdr->p_type == PT_LOAD */ {
/* <segment permission> below is subject to LSM checks */
loadable_segments[i] = mmap(NULL, phdr->p_memsz, <segment permission>, MAP_PRIVATE, image_fd, phdr->p_offset);

FILE__READ revalidated and FILE__EXECUTE checked to enclave file upon mmap() for PROT_READ and PROT_EXEC respectively. FILE__WRITE not checked even for PROT_WRITE mappings since it is a private file mapping and writes do not reach the file. EXECMEM checked if any segment permission has both W and X simultaneously. EXECMOD checked on any subsequent mprotect() RW->RX changes (if modified).

}
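
The mmap()/mprotect() checks described for the enclave file can be modeled as a pure function. This is a simplified model of that description only, not the kernel's hook code: the CHK_* bits and both function names are hypothetical, and anonymous mappings are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE, PROT_EXEC */

/* Hypothetical bit values, for illustration only. */
enum {
    CHK_FILE_READ    = 1u << 0,
    CHK_FILE_WRITE   = 1u << 1,
    CHK_FILE_EXECUTE = 1u << 2,
    CHK_EXECMEM      = 1u << 3,
};

/* Checks triggered by mmap() of a file with `prot`, shared or private. */
static unsigned file_mmap_checks(int prot, bool shared)
{
    unsigned checks = 0;

    if (prot & PROT_READ)
        checks |= CHK_FILE_READ;
    /* Writes to a private mapping never reach the file: no FILE__WRITE. */
    if (shared && (prot & PROT_WRITE))
        checks |= CHK_FILE_WRITE;
    if (prot & PROT_EXEC)
        checks |= CHK_FILE_EXECUTE;
    /* A private file mapping that is W and X at the same time. */
    if (!shared && (prot & PROT_WRITE) && (prot & PROT_EXEC))
        checks |= CHK_EXECMEM;

    return checks;
}

/* EXECMOD applies when a modified private file mapping gains PROT_EXEC. */
static bool mprotect_needs_execmod(bool shared, bool modified,
                                   int old_prot, int new_prot)
{
    return !shared && modified &&
           !(old_prot & PROT_EXEC) && (new_prot & PROT_EXEC);
}
```

Under this model a private PROT_READ|PROT_WRITE segment needs only FILE__READ, and a private RW segment that was written and then flipped to RX is the EXECMOD case.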

/* Step 2: Create enclave */
int enclave_fd = open("/dev/sgx/enclave", O_RDONLY /* or O_RDWR for SGX2 enclaves */);

FILE__READ checked (SGX1) or both FILE__READ and FILE__WRITE checked (SGX2) to /dev/sgx/enclave upon open(). Assuming that we are returning an open file referencing the /dev/sgx/enclave inode and not an anon inode, else we lose all subsequent FILE__* checking on mmap/mprotect and trigger EXECMEM on any mmap/mprotect PROT_EXEC.

void *enclave_base = mmap(NULL, <enclave size>, PROT_READ, MAP_SHARED, enclave_fd, 0); /* Only FILE__READ is required here */

FILE__READ revalidated to /dev/sgx/enclave upon mmap().

ioctl(enclave_fd, IOC_ECREATE, ...);

/* Step 3: EADD and map initial EPC pages */
foreach s in loadable_segments {
/* IOC_EADD_AND_MAP_SEGMENT will make sure s->perm is a subset of VMA permissions of the source pages, and use that as *both* EPCM and VMA permissions.
* Given enclave_fd may have FILE__READ only, LSM has to be bypassed so the "mmap" part has to be done inside the driver.
* Initial EPC pages will be mapped only once, so no inode is needed to remember the initial permissions. mmap/mprotect afterwards are subject to FILE__* on /dev/sgx/enclave
* The key point here is: permissions of source pages govern initial permissions of EADD'ed pages, regardless of FILE__* on /dev/sgx/enclave
*/
ioctl(enclave_fd, IOC_EADD_AND_MAP_SEGMENT, s->base, s->size, s->perm...);
}
/* EADD other enclave components, e.g. TCS, stacks, heaps, etc. */
ioctl(enclave_fd, IOC_EADD_AND_MAP_SEGMENT, tcs, 0x1000, RW | PT_TCS...);
ioctl(enclave_fd, IOC_EADD_AND_MAP_SEGMENT, <zero page>, <stack size>, RW...);
...
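
The "s->perm is a subset of the VMA permissions of the source pages" rule in Step 3 would be enforced inside the driver, but the same question can be asked from userspace for illustration by reading /proc/self/maps. The sketch below assumes Linux with /proc mounted; vma_prot_of() and segment_perm_ok() are hypothetical helpers, not part of the proposed ioctl.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* PROT_* bits of the mapping containing `addr`, or -1 if none is found. */
static int vma_prot_of(const void *addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[8192];
    int prot = -1;

    if (!f)
        return -1;

    while (fgets(line, sizeof line, f)) {
        unsigned long lo, hi;
        char r, w, x;

        /* Each line starts with "start-end rwxp ...". */
        if (sscanf(line, "%lx-%lx %c%c%c", &lo, &hi, &r, &w, &x) != 5)
            continue;
        if ((uintptr_t)addr >= lo && (uintptr_t)addr < hi) {
            prot = (r == 'r' ? PROT_READ : 0) |
                   (w == 'w' ? PROT_WRITE : 0) |
                   (x == 'x' ? PROT_EXEC : 0);
            break;
        }
    }
    fclose(f);
    return prot;
}

/* True if `perm` requests nothing beyond the source VMA's permissions. */
static bool segment_perm_ok(const void *src, int perm)
{
    int prot = vma_prot_of(src);

    return prot >= 0 && (perm & ~prot) == 0;
}
```

For a source page mapped PROT_READ, segment_perm_ok(p, PROT_READ) holds and segment_perm_ok(p, PROT_READ | PROT_EXEC) does not, which is the behavior the ioctl comment asks for.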

/* Step 4 (SGX2 only): Reserve ranges for additional heaps, stacks, etc. */
/* FILE__WRITE required to allow expansion of data segments at runtime */
/* Key point here is: permissions, if needed to change at runtime, are subject to FILE__* on /dev/sgx/enclave */
mprotect(<heap address>, <heap size>, PROT_READ | PROT_WRITE);

FILE__READ and FILE__WRITE revalidated to /dev/sgx/enclave upon mprotect().


/* Step 5: EINIT */
ioctl(enclave_fd, IOC_EINIT, <sigstruct>...);

/* Step 6 (SGX2 only): Set RX for dynamically loaded code pages (e.g. Graphene, encrypted enclaves, etc.) as needed, at runtime */
/* FILE__EXECUTE required */
mprotect(<RX address>, <RX size>, PROT_READ | PROT_EXEC);

FILE__READ revalidated and FILE__EXECUTE checked to /dev/sgx/enclave upon mprotect(). Cumulative set of checks at this point is FILE__READ|FILE__WRITE|FILE__EXECUTE.

What would the step be for a SGX1 RW->RX change? How would that trigger EXECMOD? Do we really need to distinguish it from the SGX2 dynamically loaded code case?


-Cedric