Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

From: Andy Lutomirski
Date: Wed Feb 25 2015 - 15:11:17 EST


On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin <avagin@xxxxxxxxx> wrote:
> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko <dvlasenk@xxxxxxxxxx>:
>> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko <dvlasenk@xxxxxxxxxx>:
>>>> 64-bit code was using six stack slots less by not saving/restoring
>>>> registers which are callee-preserved according to C ABI,
>>>> and not allocating space for them.
>>>> Only when syscall needed a complete "struct pt_regs",
>>>> the complete area was allocated and filled in.
>>>> As an additional twist, on interrupt entry a "slightly less truncated pt_regs"
>>>> trick is used, to make nested interrupt stacks easier to unwind.
>>>>
>>>> This proved to be a source of significant obfuscation and subtle bugs.
>>>> For example, stub_fork had to pop the return address,
>>>> extend the struct, save registers, and push return address back. Ugly.
>>>> ia32_ptregs_common pops return address and "returns" via jmp insn,
>>>> throwing a wrench into CPU return stack cache.
>>>>
>>>> This patch changes code to always allocate a complete "struct pt_regs".
>>>> The saving of registers is still done lazily.
>>>>
>>>> "Partial pt_regs" trick on interrupt stack is retained.
>>>>
>>>> Macros which manipulate "struct pt_regs" on stack are reworked:
>>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
>>>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
>>>> SAVE_EXTRA_REGS saves to it all other registers.
>>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>>>>
>>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
>>>> return pointer.
>>>>
>>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
>>>> instead of magic numbers.
>>>>
>>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
>>>> instead of having it open-coded yet again.
>>>>
>>>> Patch was run-tested: 64-bit executables, 32-bit executables,
>>>> strace works.
>>>> Timing tests did not show measurable difference in 32-bit
>>>> and 64-bit syscalls.
>>>
>>> Hello Denys,
>>>
>>> My test vm doesn't boot with this patch. Could you help to investigate
>>> this issue?
>>
>> I think I found it. This part of my patch is possibly wrong:
>>
>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \
>> TRACE_IRQS_ON; \
>> sti; \
>> - SAVE_REST; \
>> + SAVE_EXTRA_REGS; \
>> LOCKDEP_SYS_EXIT; \
>> - RESTORE_REST; \
>> + RESTORE_EXTRA_REGS; \
>> cli; \
>> TRACE_IRQS_OFF;
>>
>> The "SAVE_REST" here is intended to really *push* extra regs on stack,
>> but the patch changed it so that they are written to existing stack
>> slots above.
>>
>> From code inspection it should work in almost all cases, but some
>> locations where it is used are really obscure.
>>
>> If there are places where *pushing* regs is really necessary,
>> this can corrupt rbp,rbx,r12-15 registers.
>>
>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
>> was here.
>> Please find updated patch attached. Can you try it?
>
> It doesn't work
>
> [ 2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1
> [ 2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1
> [ 2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1
> [ 2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1
> [ 2.292417] microcode: Microcode Update Driver: v2.00
> <tigran@xxxxxxxxxxxxxxxxxxxx>, Peter Oruba
> [ 2.392592] piix4_smbus 0000:00:01.3: SMBus Host Controller at
> 0xb100, revision 0
> [ 2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files,
> 166911/512000 blocks
> [ 2.521463] Adding 4128764k swap on /dev/sda2. Priority:-1
> extents:1 across:4128764k FS
> [ 2.573597] EXT4-fs (sda1): mounted filesystem with ordered data
> mode. Opts: (null)
> [ 2.597771] systemd-journald[283]: Received request to flush
> runtime journal from PID 1
> [ 2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369
> old=0 auid=4294967295 ses=4294967295 res=1
> [ 2.819660] traps: systemd-cgroups[380] general protection
> ip:7fb227939028 sp:7fff6bac59c8 error:0 in
> ld-2.18.so[7fb227921000+20000]
> [ 3.016262] traps: systemd-cgroups[390] general protection
> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
> ld-2.18.so[7f456f79e000+20000]

The change to stub_\func looks wrong to me. It saves and restores
regs, but those regs might already have been saved if we're on the
slow path. (Yes, all that code is quite buggy even without all these
patches.) So is execve.

This means that, for example, execve called in the slow path will
save/restore regs twice. If the values in the regs after the first
save and before the second save are different, then we corrupt user
state.

I think that the changes to all the stub-like things should be reverted.

--Andy

>
>>
>> --
>> vda
>>



--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/