Re: pipe/page fault oddness.

From: Linus Torvalds
Date: Tue Sep 30 2014 - 14:58:10 EST


On Tue, Sep 30, 2014 at 11:20 AM, Dave Jones <davej@xxxxxxxxxx> wrote:
>
> page_fault_kernel: address=__per_cpu_end ip=copy_page_to_iter error_code=0x2

Interesting. "error_code" in particular. The value "2" means that the
CPU thinks that the page is not present (bit zero is clear).

(That "address" is useless - it has tried to turn a user address into a
kernel symbol, and the percpu symbols are zero-based, so it picks the
last of them. The "ip" is useless too, since it doesn't give the
offset.)

So the CPU thinks it's a write to a not-present page, which means that
the _PAGE_PRESENT bit is clear.
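
For anyone who wants to check the decoding, here is a minimal sketch of
how the low bits of the x86 page fault error code read. The bit meanings
are the architectural ones; the helper names (fault_is_write() and
friends, describe_fault()) are made up for illustration and are not
kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Low bits of the x86 page fault error code. */
#define PF_PROT  (1u << 0) /* clear: page not present; set: protection fault */
#define PF_WRITE (1u << 1) /* clear: read access;      set: write access */
#define PF_USER  (1u << 2) /* clear: kernel mode;      set: user mode */

/* Hypothetical helpers, named for this sketch only. */
static inline bool fault_is_write(unsigned int ec)    { return (ec & PF_WRITE) != 0; }
static inline bool fault_saw_present(unsigned int ec) { return (ec & PF_PROT) != 0; }
static inline bool fault_from_user(unsigned int ec)   { return (ec & PF_USER) != 0; }

/* Format a one-line description of the fault into buf. */
static inline int describe_fault(unsigned int ec, char *buf, size_t len)
{
    return snprintf(buf, len, "%s-mode %s to a %s page",
                    fault_from_user(ec) ? "user" : "kernel",
                    fault_is_write(ec) ? "write" : "read",
                    fault_saw_present(ec) ? "present" : "not-present");
}
```

Feeding it error_code=0x2 describes a kernel-mode write to a
not-present page, which is the reading used above.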

Now the *kernel* thinks a page is present not just if _PAGE_PRESENT is
set, but also if _PAGE_PROTNONE or _PAGE_NUMA are set. Sadly, your
trace is not very useful, because inlining has caused pretty much all
the cases to be in "handle_mm_fault()", so the trace doesn't really
tell which path this all takes.

But we can still do *some* analysis on the trace: do_wp_page()
shouldn't have been inlined, so it would have shown up in the trace if
it had been called. So I think we can be pretty confident that the
ptep_set_access_flags() we see is the one from handle_pte_fault().

And if that is the case, then we know that "pte_present()" is indeed
true as far as the kernel is concerned. So with _PAGE_PRESENT not being
set (based on the error code), we know that _PAGE_PROTNONE must be
set, otherwise we'd have triggered the pte_numa() check and exited
through do_numa_page().
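
The elimination above can be written down as a small userspace sketch.
Big caveat: only the idea that the kernel's notion of "present" covers
three bits comes from the code being discussed. The SKETCH_* bit
positions for PROTNONE and NUMA are placeholders I picked for
illustration (the real ones are in the x86 pgtable headers and depend on
kernel version), and classify_pte() is a hypothetical helper, not a
kernel function.

```c
#include <stdint.h>

/* Bit 0 is the hardware present bit; the other two positions are
 * ASSUMED for this sketch and may not match any real kernel. */
#define SKETCH_PAGE_PRESENT  (UINT64_C(1) << 0)
#define SKETCH_PAGE_PROTNONE (UINT64_C(1) << 8)
#define SKETCH_PAGE_NUMA     (UINT64_C(1) << 9)

enum pte_kind {
    PTE_NOT_PRESENT, /* kernel and CPU agree: nothing mapped */
    PTE_HW_PRESENT,  /* CPU can use the entry directly */
    PTE_NUMA_HINT,   /* kernel-present, would go through the NUMA path */
    PTE_PROTNONE     /* kernel-present, but the CPU always faults on it */
};

/* Hypothetical helper mirroring the reasoning, not the kernel code. */
static inline enum pte_kind classify_pte(uint64_t pte)
{
    const uint64_t kernel_present = SKETCH_PAGE_PRESENT |
                                    SKETCH_PAGE_PROTNONE |
                                    SKETCH_PAGE_NUMA;

    if (!(pte & kernel_present))
        return PTE_NOT_PRESENT;
    if (pte & SKETCH_PAGE_PRESENT)
        return PTE_HW_PRESENT;
    if (pte & SKETCH_PAGE_NUMA)
        return PTE_NUMA_HINT;
    return PTE_PROTNONE; /* only remaining possibility */
}
```

With the hardware present bit clear and the NUMA path not taken, the
only case left is PTE_PROTNONE.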

So it smells like we have a PROT_NONE VM area (at least the page table
entries imply that). But "access_error()" should have flagged that (it
checks "vma->vm_flags & VM_WRITE"). How do we have a page table entry
marked _PAGE_PROTNONE, but VM_WRITE set in the vma?
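
To make the contradiction concrete, here is a simplified userspace
model of the kind of check "access_error()" does. It is a sketch from
memory, not the real function (that lives in arch/x86/mm/fault.c and
takes a vma pointer); the name sketch_access_error() and the plain
vm_flags argument are assumptions made so that it compiles standalone.

```c
/* Error-code bits and vm_flags values as used by this sketch. */
#define PF_PROT  (1u << 0)
#define PF_WRITE (1u << 1)
#define VM_READ  0x1ul
#define VM_WRITE 0x2ul
#define VM_EXEC  0x4ul

/* Returns 1 if the fault must be rejected, 0 if the vma allows it. */
static inline int sketch_access_error(unsigned int error_code,
                                      unsigned long vm_flags)
{
    /* Write fault, present or not: the vma must be writable. */
    if (error_code & PF_WRITE)
        return !(vm_flags & VM_WRITE);

    /* Read fault on a present page: a protection violation. */
    if (error_code & PF_PROT)
        return 1;

    /* Read fault on a not-present page: need some access right. */
    return !(vm_flags & (VM_READ | VM_EXEC | VM_WRITE));
}
```

For error_code=0x2 a PROT_NONE area (no VM_READ/VM_WRITE/VM_EXEC) gets
rejected here, so reaching handle_mm_fault() at all implies VM_WRITE was
set in the vma.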

Or, possibly, we have some confusion about the page tables themselves
(corruption, wrong %cr3 value, whatever), explaining why the CPU
thinks one thing, but our software page table walker thinks another.

I'm not seeing how this all happens. But I'm adding Kirill to the cc,
since he might see something I missed, and he touched some of this
code last ("tag, you're it").

Kirill: the thread is on lkml, but basically it boils down to the
second byte write in fault_in_pages_writeable() faulting forever,
despite handle_mm_fault() apparently thinking that everything is fine.

Also adding Hugh Dickins, just because the more people who know this
code that are involved, the better.

Linus