Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

From: H. Peter Anvin
Date: Fri Apr 11 2014 - 17:25:21 EST


On 04/11/2014 02:16 PM, Andy Lutomirski wrote:
> On 04/11/2014 11:29 AM, H. Peter Anvin wrote:
>> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>>> Is this bug really still present in modern CPUs? This change breaks
>>> running 16-bit apps in Wine. I have a few really old games I like to
>>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>>
>> It is not a bug, per se, but an architectural definition issue, and it
>> is present in all x86 processors from all vendors.
>>
>> Yes, it does break running 16-bit apps in Wine, although Wine could be
>> modified to put 16-bit apps in a container. However, this is at best a
>> marginal use case.
>
> I wonder if there's an easy-ish good-enough fix:
>
> Allocate some percpu space in the fixmap. (OK, this is ugly, but
> kvmclock already does it, so it's possible.) To return to 16-bit
> userspace, make sure interrupts are off, copy the whole iret descriptor
> to the current cpu's fixmap space, change rsp to point to that space,
> and then do the iret.
>
> This won't restore the correct value to the high bits of [er]sp, but it
> will at least stop leaking anything interesting to userspace.
>

This would fix the infoleak, at the cost of allocating a chunk of memory
for each CPU. It doesn't fix the functionality problem.

If we're going to do a workaround I would prefer to do something that
fixes both, but it is highly nontrivial.

This is a writeup I did to a select audience before this was public:

> Hello,
>
> It appears we have an information leak on x86-64 by which at least bits
> [31:16] of the kernel stack address leaks to user space (some silicon
> including the 64-bit Pentium 4 leaks [63:16]). This is due to the the
> behavior of IRET when returning to a 16-bit segment: IRET restores only
> the bottom 16 bits of the stack pointer.
>
> This is known on 32 bits and we, in fact, have a workaround for it
> ("espfix") there. We do not, however, have the equivalent on 64 bits,
> nor does it seem that it is very easy to construct a workaround (see below.)
>
> This is both a functionality problem (16-bit code gets the upper bits of
> %esp corrupted when the kernel is invoked) and an information leak. The
> 32-bit workaround was labeled as a fix for the functionality problem,
> but it of course also addresses the leak.
>
> On 64 bits, the easiest mitigation seems to be to make modify_ldt()
> refuse to install a 16-bit segment when running on a 64-bit kernel.
> 16-bit support is already somewhat crippled on 64 bits since there is no
> V86 support; obviously, for "full service" support we can always set up
> a virtual machine -- most (but sadly, not all) 64-bit parts are also
> virtualization capable.
>
> I would have suggested rejecting modify_ldt() entirely, to reduce attack
> surface, except that some early versions of 32-bit NPTL glibc use
> modify_ldt() to exclusion of all other methods of establishing the
> thread pointer, so in order to stay compatible with those we would need
> to allow 32-bit segments via modify_ldt() still.
>
> However, there is no doubt this will break some legitimate users of
> 16-bit segments, e.g. Wine for 16-bit Windows apps (which don't work on
> 64-bit Windows either, for what it is worth.)
>
> We may very well have other infoleaks that dwarf this, but the kernel
> stack address is a relatively high value item for exploits.
>
> Some workarounds I have considered:
>
> a. Using paging in a similar way to the 32-bit segment base workaround
>
> This one requires a very large swath of virtual user space (depending on
> allocation policy, as much as 4 GiB per CPU.) The "per CPU" requirement
> comes in as locking is not feasible -- as we return to user space there
> is nowhere to release the lock.
>
> b. Return to user space via compatibility mode
>
> As the kernel lives above the 4 GiB virtual mark, a transition through
> compatibility mode is not practical. This would require the kernel to
> reserve virtual address space below the 4 GiB mark, which may interfere
> with the application, especially an application launched as a 64-bit
> application.
>
> c. Trampoline in kernel space
>
> A trampoline in kernel space is not feasible since all ring transition
> instructions capable of returning to 16-bit mode require the use of the
> stack.
>
> d. Trampoline in user space
>
> A return to the vdso with values set up in registers r8-r15 would enable
> a trampoline in user space. Unfortunately there is no way
> to do a far JMP entirely with register state so this would require
> touching user space memory, possibly in an unsafe manner.
>
> The most likely variant is to use the address of the 16-bit user stack
> and simply hope that this is a safe thing to do.
>
> This appears to be the most feasible workaround if a workaround is
> deemed necessary.
>
> e. Transparently run 16-bit code segments inside a lightweight VMM
>
> The complexity of this solution versus the realized value is staggering.
> It also doesn't work on non-virtualization-capable hardware (including
> running on top of a VMM which doesn't support nested virtualization.)
>
> -hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/