Re: [PATCH] 2.6 workaround for Athlon/Opteron prefetch errata

From: dada1
Date: Thu Sep 11 2003 - 01:00:54 EST



>
> I don't have any. But it would be very similar to the in kernel checking
> code (see the is_prefetch function in my patches). Just you feed it
> the fields from sigcontext in the signal handler and replace __get_user
> with a normal memory access.

OK will try... but how test it... sounds not easy.

> > I do use preftechnta instructions on my programs, and this errata could
> > explain some strange crashes.
>
> The bogus faults are very easy to diagnose. When you have a core dump
> and disassemble the faulting instruction (in gdb x/i $eip) and it is
> a prefetch (prefetch/prefetchw/prefetchnt*) then it could be that.

Well, the program is using more than 2Go ram... the core is not written to
disk as the machine hangs just *after*

And this is a remote machine in a Datacenter.

>
> If it is a different instruction it is unrelated.
>
> It would also only happen when you prefetch ever on unmapped addresses.

NULL for example ?

Typical example of code ;

T_cell *ptr, *next ;
for (ptr = list.head ; ptr != NULL ; ptr = next) {
next = ptr->next ;
prefetch(next) ;
some_work(ptr) ;
}

I may replace NULL by &FakeMappedData (allways present in memory)

>
> That sounds like an unrelated issue.
>
> When user space crashes on this the kernel is unaffected.

This is not a kernel crash. But total freeze as all memory is used by
network buffers, in no more than 10 seconds.
This application receive smalls TCP messages (about 30 bytes), but the
network stacks allocates 4KB buffers to store this little messages.

No oops, but what can we do, if the freeze lasts eternity ?

I posted a test application some days ago about this problem and got no
answers/feedback.

>
> In case the 2.6 kernel crashes on this (2.4 does not trigger it)
> then you can also run the oops through ksymoops and check if the
> faulting instruction is prefetch. If it isn't then it is something else.
>
> Network buffers using up all of low mem and then crash
> is likely some OOM handling problem. If you're on 2.4 try an -aa kernel,
> they handle this much better than the marcelo tree. If it's 2.6 then
> I would recommend posting oopses on this list, maybe someone can fix
> it. I suspect 2.6's OOM handling could be still improved.
>
> -Andi
> -

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/