Re: Somewhat OT: gcc, x86, -ffast-math, and Linux

From: Andy Isaacson
Date: Fri Mar 26 2004 - 16:46:43 EST


linux-kernel isn't the right forum for this, but I'll take a stab
anyway.

On Fri, Mar 26, 2004 at 02:54:31PM -0600, Daniel Forrest wrote:
[snip: 180 dual Xeon boxes]
> I am running one of our applications that has been compiled using gcc
> with the -ffast-math option. I am finding that the identical program
> using the same input data files is producing different results on
> different machines. However, the differences are all less than the
> precision of a single-precision floating point number. By this I mean
> that if the results (which are written to 15 digits of precision) are
> only compared to 7 digits then the results are the same. Also, most
> of the time the 15 digit values are the same.
>
> My question is this: Why aren't the results always the same? What is
> the -ffast-math option doing? How are the excess bits of precision
> dealt with during context switches? Shouldn't the same binary with
> the same inputs produce the same output on identical hardware?

The kernel should be doing the right thing to preserve FPU state during
context switches. That doesn't prevent the app from doing things wrong
and thus getting the wrong answer (perhaps only under certain
circumstances). And of course the kernel might have bugs (though it's
unlikely to be as simple as "doesn't preserve FPU state correctly"; a
lot of people depend on that codepath being right).
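As for how the excess bits are dealt with: the state the kernel saves
and restores on a context switch includes the full 80-bit x87
registers and the x87 control word, so a switch by itself doesn't
round anything away. -ffast-math turns on a pile of flags
(-funsafe-math-optimizations and friends) that drop strict IEEE
conformance and let gcc reassociate floating-point operations;
combined with the x87 habit of keeping intermediates at 80-bit
precision until they get spilled to memory, small differences in the
trailing digits are not surprising, and reassociation in an
ill-conditioned computation can push them well above the last bit.
One thing worth checking is whether anything in the process (a
library, a language runtime) changes the x87 precision mode, since an
otherwise identical binary can then round differently. A minimal
sketch, assuming x86 Linux with glibc (which provides
<fpu_control.h>), for inspecting and optionally pinning the precision
mode from inside your app:

#include <stdio.h>
#include <fpu_control.h>

int main(void)
{
    fpu_control_t cw;

    /* Read the current x87 control word. */
    _FPU_GETCW(cw);
    printf("x87 control word: %#06x, precision field: %#x\n",
           (unsigned int)cw, (unsigned int)(cw & _FPU_EXTENDED));

    /* Optionally force round-to-double for intermediates, as some
     * runtimes do; this narrows (but does not remove) the gap between
     * values kept in registers and values spilled to memory. */
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);

    return 0;
}

Calling the _FPU_GETCW check from inside your application, right
before it writes its results, would tell you whether the precision
mode differs between the machines that agree and the ones that don't.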

Likely there is some difference in one of the following areas:
- hardware problems
- kernel
- libraries
- CPU microcode

Or, you have a bug in your program that is triggered by some
environmental factor. (For example, an inter-thread race condition
whose outcome depends on I/O interrupt timing.)
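
To make the race-condition possibility concrete, here is a minimal
sketch (the thread and iteration counts are made up) of an
unsynchronized floating-point accumulation. On an SMP box like your
dual Xeons the printed total typically varies from run to run, because
the read-modify-write of the shared double is not atomic:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

/* volatile only so the compiler doesn't keep the accumulator in a
 * register and hide the race; there is still no locking. */
static volatile double shared_sum;

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < NITERS; i++)
        shared_sum += 1e-6;           /* racy read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    /* Without synchronization, this value usually differs between runs. */
    printf("%.15g\n", shared_sum);
    return 0;
}

Build with "gcc -O2 -pthread"; if your application's numbers wander
the same way from run to run, suspect its own synchronization before
the kernel.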

To rule these out:
- first, run memtest86 or a similar program to verify that you are not
simply the victim of a bad memory stick.
- next, check that the kernel, libc, and libm are identical across the
machines that display the problem.
- next, check /proc/cpuinfo and dmesg(1) output to verify that your
CPUs are of the same stepping and are running the same microcode; a
small sketch for reading the stepping from inside a program follows
below. (The likelihood that this is the problem is so small as to be
almost not worth mentioning.)
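
If it is easier to collect the CPU identification from inside a
program than to gather /proc/cpuinfo from all the boxes, a small
sketch (assuming x86 and a gcc recent enough to ship <cpuid.h>, whose
__get_cpuid helper wraps the CPUID instruction) would be:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1: EAX holds stepping in bits 0-3, model in 4-7,
     * family in 8-11. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "cpuid leaf 1 not supported\n");
        return 1;
    }

    printf("family %u model %u stepping %u\n",
           (eax >> 8) & 0xf, (eax >> 4) & 0xf, eax & 0xf);
    return 0;
}

Diffing that output (or simply /proc/cpuinfo itself) across the
machines is enough to confirm the steppings match; the microcode
revision is easiest to read from dmesg.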

> I have run the same test with the program compiled without -ffast-math
> enabled and the results are always identical.

You don't say how many different results you've gotten. Is there just
one correct and one incorrect result? Or do different runs give
different incorrect results? What is the software environment?
(language, libraries, threading, etc.)

Basically, at this point you haven't provided enough information for
us to point a finger at the kernel. It's certainly possible that
there's a bug, but it's pretty unlikely (IMO). I'd look at hardware
and at threading problems in the app first.

-andy