Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode

From: Andy Lutomirski
Date: Thu Apr 07 2016 - 10:40:19 EST


On Apr 7, 2016 5:12 AM, "Dmitry Safonov" <dsafonov@xxxxxxxxxxxxx> wrote:
>
> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
>>
>> [cc Dave Hansen for MPX]
>>
>> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@xxxxxxxxxxxxx> wrote:
>>>
>>> Now each process that runs natively on x86_64 may execute 32-bit code
>>> by properly setting its CS selector: either from the LDT or by reusing
>>> Linux's USER32_CS. The reverse also holds: running 64-bit code in a
>>> compatible task is also possible by choosing USER_CS.
>>> So we may switch between 32 and 64 bit code execution in any process.
>>> Linux will choose the right syscall numbers in entries for those
>>> processes. But it will still consider them native/compat according to
>>> the personality that the ELF loader set on launch. This affects, e.g.,
>>> the ptrace syscall on those tasks: PTRACE_GETREGSET will return the
>>> 64/32-bit regset according to the process's mode (that's how strace
>>> detects a task's personality since version 4.8).
>>>
>>> This patch adds arch_prctl calls for x86 that make it possible to tell
>>> the Linux kernel which mode the application is currently running in.
>>> Mainly, this is needed for CRIU: restoring both compatible and native
>>> applications from a 64-bit restorer. For that reason I wrapped all
>>> the code in CONFIG_CHECKPOINT_RESTORE.
>>> This patch also solves a problem with running 64-bit code in a 32-bit
>>> ELF (and the reverse): you have only the 32-bit ELF vdso for fast
>>> syscalls. When switching between native <-> compat mode by arch_prctl,
>>> it will remap the vdso binary blob needed for the target mode.
>>
>> General comments first:
>
> Thanks for your comments.
>>
>> You forgot about x32.
>
> Will add x32 support for v2.
>
>> I think that you should separate vdso remapping from "personality".
>> vdso remapping should be available even on native 32-bit builds, which
>> means that either you can't use arch_prctl for it or you'll have to
>> wire up arch_prctl as a 32-bit syscall.
>
> I can't say I got your point. By vdso remapping, do you mean
> mremap for the vdso/vvar pages? I think it should work now.

For 32-bit, the vdso *must* exist in memory at the address that the
kernel thinks it's at. Even if you had a pure 32-bit restore stub,
you would still need vdso remap, because there's a chance the vdso
could land at an unusable address, say one page off from where you
want it. You couldn't map a wrapper because there wouldn't be any
space for it without moving the real vdso out of the way.

Remember, you *cannot* mremap() the 32-bit vdso because you will
crash. It works by luck for 64-bit, but it's plausible that we'd want
to change that some day. (I have awful patches that speed a bunch of
things up at the cost of a vdso trampoline for 64-bit code and a bunch
of other hacks. Those patches will never go in for real, but
something else might want the ability to use 64-bit vdso trampolines.)
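
To make "the address that the kernel thinks it's at" concrete, here is a minimal userspace sketch, assuming glibc's getauxval(). It reads AT_SYSINFO_EHDR, the vdso address the kernel handed to the process at exec time, and checks that an ELF image starts there. The helper names are made up for illustration. Note that the auxv value is not updated if the mapping is moved later, which is part of why a plain mremap() is not enough.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Address the kernel advertised for the vdso at exec time, or 0 if none. */
static unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Check that an ELF image really starts at that address. */
static int vdso_has_elf_magic(unsigned long base)
{
    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

A restorer could compare vdso_base() against the address recorded in the checkpoint to decide whether the vdso needs to be moved at all.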

> I did the remapping because the vdso blob for a native x86_64 task
> differs from the one for a compatible task. So it's just changing blobs;
> the address value is there for convenience - I may omit it and just map
> the different vdso blob at the same place where the previous vdso was.
> I'm not sure why we need the ability to map a 64-bit vdso blob
> on native 32-bit builds?

That would fail, but I think the API should exist. But a native
32-bit program should be able to remap the 32-bit vdso.

IOW, I think you should be able to do, roughly:

map_new_vdso(VDSO_32BIT, addr);

on any kernel.
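
As a rough model of the semantics I have in mind (not an implementation): the call takes a blob type and an address, and fails cleanly when the running kernel has no such blob. Everything below is hypothetical, including the enum values other than VDSO_32BIT, the kernel_caps struct, the page size constant, and the choice of -EINVAL. It only checks arguments and maps nothing.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical blob types; only VDSO_32BIT is named in the text above. */
enum vdso_type { VDSO_32BIT, VDSO_X32, VDSO_64BIT };

/* Hypothetical description of which vdso blobs the kernel was built with. */
struct kernel_caps {
    bool has_32bit_vdso;
    bool has_x32_vdso;
    bool has_64bit_vdso;
};

#define SKETCH_PAGE_SIZE 4096UL /* assumed page size for the sketch */

/*
 * Model of the argument checking a map_new_vdso() call might do.
 * Returns 0 if the request would be accepted, or a negative errno.
 */
static int map_new_vdso_check(const struct kernel_caps *k,
                              enum vdso_type type, uintptr_t addr)
{
    bool supported;

    if (addr & (SKETCH_PAGE_SIZE - 1))
        return -EINVAL; /* the vdso must start on a page boundary */

    switch (type) {
    case VDSO_32BIT: supported = k->has_32bit_vdso; break;
    case VDSO_X32:   supported = k->has_x32_vdso;   break;
    case VDSO_64BIT: supported = k->has_64bit_vdso; break;
    default:         return -EINVAL;
    }
    return supported ? 0 : -EINVAL;
}
```

With caps describing a native 32-bit kernel, a VDSO_32BIT request at a page-aligned address passes and a VDSO_64BIT request is rejected, which is the "that would fail, but the API should exist" case.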

Am I making sense?

>
>> For "personality", someone needs to enumerate all of the various things
>> that try to track bitness and see how many of them even make sense.
>> On brief inspection:
>>
>> - TIF_IA32: affects signal format and does something to ptrace. I
>> suspect that whatever it does to ptrace is nonsensical, and I don't
>> know whether we're stuck with it.
>>
>> - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
>> isn't even done in a sensible way).
>>
>> - is_64bit_mm affects MPX and uprobes.
>>
>> On even more brief inspection:
>>
>> - uprobes using is_64bit_mm is buggy.
>>
>> - I doubt that having TASK_SIZE vary serves any purpose. Does anyone
>> know why TASK_SIZE is different for different tasks? It would save
>> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
>> - Using TIF_IA32 for signal processing is IMO suboptimal. Instead,
>> we should record which syscall installed the signal handler and use
>> the corresponding frame format.
>
> Oh, I like it, will do.
>
>> - Using TIF_IA32 of the *target* for ptrace is nonsense. Having
>> strace figure out syscall type using that is actively buggy, and I ran
>> into that bug a few days ago and cursed at it. strace should inspect
>> TS_COMPAT (I don't know how, but that's what should happen). We may
>> be stuck with this for ABI reasons.
>
> ptrace may check seg_32bit for code selector, what do you think?

Not sure. I have never fully wrapped my head around ptrace.
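
For reference, here is a small sketch of what a CS selector value encodes, assuming the standard x86 selector layout. The struct and function names are made up for illustration. On x86_64 Linux the default user selectors are 0x33 (__USER_CS) and 0x23 (__USER32_CS). The selector alone only says which descriptor is in use. Whether that descriptor is 32-bit or 64-bit code is a property of the descriptor itself, so for LDT entries the selector value is not enough to decide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields packed into a 16-bit x86 segment selector. */
struct selector {
    unsigned index; /* descriptor table index, bits 15..3 */
    bool     ldt;   /* table indicator, bit 2: true means LDT, false GDT */
    unsigned rpl;   /* requested privilege level, bits 1..0 */
};

static struct selector decode_selector(uint16_t sel)
{
    struct selector s;

    s.index = sel >> 3;
    s.ldt = (sel >> 2) & 1;
    s.rpl = sel & 3;
    return s;
}
```

Decoding 0x33 gives GDT index 6 at RPL 3, and 0x23 gives GDT index 4 at RPL 3.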