Re: x86-64 bad pmds in 2.6.11.6

From: Andi Kleen
Date: Thu Apr 14 2005 - 13:12:39 EST


On Thu, Apr 14, 2005 at 06:34:58PM +0100, Hugh Dickins wrote:
> On Thu, 14 Apr 2005, Andi Kleen wrote:
> >
> > Thanks for the analysis. However I doubt the load_cr3 patch can fix
> > it. All it does is to stop the CPU from prefetching mappings (which
> > can cause different problems).
>
> I thought that the leave_mm code (before your patch) flushes the TLB, but
> restores cr3 to the mm, while removing that cpu from the mm's cpu_vm_mask.
>
> So any speculation, not just prefetching, on that cpu is in danger of
> bringing address translations according to that mm back into the TLB.
>
> But when the mm is torn down in exit_mmap, there's no longer any record
> that the TLB on that cpu needs flushing, so stale translations remain.
>
> As a rule, we always flush TLB _after_ invalidating, not just before,
> for this kind of reason.

Yes this is all true. In fact I have several bug fixes for problems
in this area.
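
For readers following along, the sequence Hugh describes can be modelled
as a toy in user space. This is only a sketch under stated assumptions:
every toy_* name is hypothetical, the "TLB" is a single flag, and
speculation is reduced to one function call. It is not the kernel's
leave_mm or exit_mmap code, only the bookkeeping argument from the
quoted text.

```c
#include <stdbool.h>

/* Toy user-space model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_mm {
    unsigned long cpu_vm_mask;  /* CPUs that still get TLB flushes for this mm */
    bool torn_down;             /* set once the page tables have been freed */
};

struct toy_cpu {
    int id;
    const struct toy_mm *cr3;   /* page tables this CPU currently walks */
    bool tlb_has_mm_entries;    /* TLB caches translations of a user mm */
};

static struct toy_mm toy_init_mm;   /* stands in for init_mm: no user mappings */

static void toy_flush_tlb(struct toy_cpu *cpu)
{
    cpu->tlb_has_mm_entries = false;
}

/* Assumption: speculation may refill the TLB from whatever cr3 points at. */
static void toy_speculate(struct toy_cpu *cpu)
{
    if (cpu->cr3 != &toy_init_mm)
        cpu->tlb_has_mm_entries = true;
}

/* Flush and drop the CPU from the mask; optionally also move cr3 away. */
static void toy_leave_mm(struct toy_cpu *cpu, struct toy_mm *mm, bool switch_cr3)
{
    toy_flush_tlb(cpu);
    mm->cpu_vm_mask &= ~(1UL << cpu->id);
    if (switch_cr3)
        cpu->cr3 = &toy_init_mm;
}

/* Teardown only flushes CPUs still recorded in the mask. */
static void toy_exit_mmap(struct toy_mm *mm, struct toy_cpu *cpus, int ncpus)
{
    mm->torn_down = true;
    for (int i = 0; i < ncpus; i++)
        if (mm->cpu_vm_mask & (1UL << cpus[i].id))
            toy_flush_tlb(&cpus[i]);
}

/* Returns true if the CPU is left with stale translations after teardown. */
static bool toy_run(bool switch_cr3)
{
    struct toy_mm mm = { .cpu_vm_mask = 1UL << 0, .torn_down = false };
    struct toy_cpu cpu = { .id = 0, .cr3 = &mm, .tlb_has_mm_entries = true };

    toy_leave_mm(&cpu, &mm, switch_cr3);
    toy_speculate(&cpu);            /* happens after the flush */
    toy_exit_mmap(&mm, &cpu, 1);
    return mm.torn_down && cpu.tlb_has_mm_entries;
}
```

In this model toy_run(false) leaves stale entries behind, because the
flush happens before the refill and nothing records that another flush
is needed. toy_run(true) does not, because cr3 no longer points at the
mm when the speculation happens.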

But none of this can explain corruption coming from the kernel;
with stale mappings you tend to see problems only when the CPU
prefetches something.

Note that with the cr3 reload you end up with init_mm, which
is not any useful mm. So even if there was a store from the kernel
into a stale mapping it would cause -EFAULT now. But that is
not happening.

>
> My paranoia of speculation may be excessive: I _think_ what I outline
> above is a real possibility on Intel, but you and others know AMD much
> better than I (and the reports I've seen are on AMD64, not EM64T).

It is not excessive, on either Intel or AMD :) These CPUs do a lot of
prefetching behind your back, and any stale mapping left in the TLB at
any time will eventually cause problems. But different problems from
this one.


> Sure, the "mm/memory.c:97: bad pmd" messages are coming from
> clear_pmd_range, when the corrupted task exits later (but probably
> not much later, since its user stack is oddly distributed across
> two different pages: some mentioned SIGSEGVs I think).
>
> The pmd really is bad, but it got to be bad because it had stack data
> written into it by create_elf_tables, when the TLB mistakenly thought
> it already knew what physical page 0x00007ffffffff000 was mapped to
> (prior kernel accesses to that user stack are not by user address).

What I meant is that the overwriting must be from Linux code
acting in the direct mapping, not due to stale TLBs for addresses < __PAGE_OFFSET.
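
To make the distinction concrete, here is a small user-space sketch that
classifies an address as lying below or inside the direct mapping. The
TOY_PAGE_OFFSET value is an assumption about x86-64 kernels of this
period and should be checked against include/asm-x86_64/page.h in the
tree under test; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * ASSUMPTION: base of the kernel direct mapping (__PAGE_OFFSET) on x86-64
 * for kernels of this period. Verify against the actual tree.
 */
#define TOY_PAGE_OFFSET 0xffff810000000000ULL

/* True for addresses below the direct mapping, e.g. user-space addresses. */
static bool toy_below_page_offset(uint64_t addr)
{
    return addr < TOY_PAGE_OFFSET;
}

/* Direct-mapping virtual address of a physical address, in the style of __va(). */
static uint64_t toy_phys_to_direct(uint64_t phys)
{
    return phys + TOY_PAGE_OFFSET;
}
```

With these definitions the user stack page 0x00007ffffffff000 quoted
above is below the offset, while toy_phys_to_direct() of any physical
page is not. A write through the second kind of address reaches the
page without going through any user translation at all.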

I will take a closer look at the rc1/rc2 patches later this evening
and see if I can spot something. Can only report back tomorrow though.

-Andi