Inside the Pentium II Math Bug
By Robert R. Collins
Just two days before its biggest processor announcement in years, Intel was hit by reports of a math bug in its Pentium Pro and (soon-to-be-announced) Pentium II processors. The bad timing prompted reports that the bug disclosure was deliberately timed to coincide with the Pentium II announcement, thereby maximizing the embarrassment to Intel. Another early rumor put AMD behind the bug report. Yet another industry rumor said that the Pentium II used for the tests was illegally obtained. As intriguing as these theories may be, none of them is true. How do I know? Because I wrote the bug report. The bug was known as the Dan-0411 bug by the news media and Internet community. Intel had its own name for it: the Flag Erratum.
The Facts
I received e-mail from "Dan," who asked if I could reproduce what he thought was a bug in the Pentium Pro. After contemplating my involvement for ten days, I finally decided to help out (see the accompanying text box). I wrote an assembly-language program that checked into the problem. I ran the test on Pentium Pro, Pentium II, Pentium classic (P54C), Pentium MMX (P55C), and AMD K6 processors. (I had purchased the Pentium II over the counter at Fry's Electronics in Sunnyvale, California, six weeks before its official introduction. There was nothing illegal about the acquisition of the Pentium II processor.) After running the test on these various processors, I came to the conclusion that a bug did exist in the Pentium Pro and Pentium II.
Why Dan-0411?
These days, astronomers name new stars and comets by combining the discoverer's name and some number. Why should microprocessor bugs be different? In this case, "Dan" is the discoverer of the bug, and 04-11 (1997) is the date on which I got my first e-mail about it. So I've named the bug "Dan-0411" after its discoverer and the date he first reported it to me. (Please refer to http://www.rcollins.org/secrets/Dan0411.html for the text of the original bug announcement.)
What is the Bug and What Does it Affect?
The bug relates to operations that convert floating-point numbers into integer numbers. All floating-point numbers are stored inside the microprocessor in an 80-bit format. Even though the external representation of a number may not be an 80-bit format, once the number is loaded into the microprocessor, it is converted to an 80-bit format. Integer numbers are stored externally in two different sizes. A short integer is stored in 16 bits, and a long integer is stored in 32 bits. It is often desirable to store the floating-point numbers as integer numbers. On occasion, the converted numbers won't fit into the smaller integer format. This is when the bug occurs.
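The "won't fit" condition can be made concrete with a short C sketch. It is not taken from the FISTBUG source; the helper names fits_int16 and fits_int32 are hypothetical, and the check is deliberately conservative, accepting only values that fit under every rounding mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers (not from FISTBUG): report whether a floating-point
   value can be stored as a 16-bit or 32-bit integer without overflowing.
   The bounds are conservative: a value inside them fits regardless of the
   rounding mode in effect. A NaN fails every comparison, so it is
   reported as not fitting. */
static bool fits_int16(double x)
{
    return x >= (double)INT16_MIN && x <= (double)INT16_MAX;
}

static bool fits_int32(double x)
{
    return x >= (double)INT32_MIN && x <= (double)INT32_MAX;
}
```

For example, fits_int16(-32769.0) is false while fits_int32(-32769.0) is true: the same value overflows the 16-bit format but not the 32-bit one.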
The host software is supposed to be warned by the microprocessor when such a floating-point conversion error occurs; a specific error flag is supposed to be set in a floating-point status register. If the microprocessor fails to set this flag, it does not comply with the IEEE floating-point standard (IEEE 754), which mandates such behavior. For the Dan-0411 bug, the Pentium II and Pentium Pro fail to set this error flag in many cases.
It is interesting to note that a launch failure of the Ariane 5 rocket, which happened less than a minute into the launch, was traced to behavior around an overflow condition. In this case, it was a software bug, not a microprocessor bug, that caused the problem. One of the computers on board had a floating-point to integer conversion that had overflowed. The overflow was not expected and, therefore, not detected by the computer software. As a result, the computer did a dump of its memory. Unfortunately, this memory dump was interpreted by the rocket as instructions to its rocket nozzles. Result: Boom!
The case of the Ariane rocket is a sensational example of the drastic consequences of an unhandled float-to-integer overflow. Pentium Pro and Pentium II users, on the other hand, are most likely to see the results of this bug in their graphics displays or in heavy-duty numerical analysis programs. Intel says ordinary users might see a temporary screen glitch on some games when this bug occurs.
The Nature of the Bug
The Dan-0411 bug occurs when a large negative floating-point number is stored to memory in an integer format. Under normal operation, the largest negative integer (MAXNEG) is stored in memory when a floating-point number is too large to fit in the integer format. The FPU Status Word is supposed to indicate that an Invalid Operand Exception (#IE) occurred (FSW.IE = 1). Floating-point numbers that overflow the "real number" format are supposed to behave differently than floating-point numbers that overflow the "integer number" format.
Float-to-real overflows are supposed to set the overflow flag (FSW.OE = 1); float-to-integer overflows are supposed to set the Invalid Operand Exception flag (FSW.IE = 1). Section 7.8.4 of the Pentium Pro Family Developer's Manual, Volume 2 makes this difference quite clear:
Float-to-real overflows:
Float-to-integer overflows:
When the Dan-0411 bug occurs, the Invalid Operand Exception (FSW.IE) bit is not set; only the precision exception (FSW.PE) bit is set. The precision-exception flag indicates that the result of an operation (in this case, the float-to-integer store) can't be represented exactly. In most cases, this bit is ignored by programmers. Therefore, when the conditions are met for the Dan-0411 bug to occur, programmers may never know that an error occurred. If that isn't bad enough, it gets worse. The Dan-0411 bug occurs in three of the four rounding modes, whether exceptions are masked or unmasked. In the case of masked exceptions, the correct value is stored to memory; only the Floating-Point Status Word (FSW) is incorrectly set. For unmasked exceptions, the errant behavior is more serious.
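The flag behavior described above amounts to a check on the status word read back after an integer store that is known to overflow. The C sketch below shows that check. It is not the method used in FISTBUG; the helper names overflow_reported and matches_dan0411 are hypothetical, and the status word is assumed to have been captured already (for example, with an FNSTSW instruction).

```c
#include <stdbool.h>
#include <stdint.h>

/* x87 FPU Status Word exception flags discussed in the text. */
#define FSW_IE 0x0001u  /* FSW.IE, bit 0: the flag an integer-store overflow should set */
#define FSW_OE 0x0008u  /* FSW.OE, bit 3: overflow of a real-number format */
#define FSW_PE 0x0020u  /* FSW.PE, bit 5: precision (inexact result) */

/* Hypothetical helper: for a store known to overflow the integer format,
   the overflow counts as reported only if FSW.IE is set. */
static bool overflow_reported(uint16_t fsw)
{
    return (fsw & FSW_IE) != 0;
}

/* Hypothetical helper: the signature described for Dan-0411 is that the
   precision flag is raised while FSW.IE stays clear. */
static bool matches_dan0411(uint16_t fsw)
{
    return (fsw & FSW_PE) != 0 && (fsw & FSW_IE) == 0;
}
```

A status word of 0x0020 after an overflowing store would match the bug's signature, while 0x0001 or 0x0021 would indicate that the overflow was flagged as expected.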
I'm not sure why this bug wasn't detected sooner, but there are clues that could help provide an explanation. Professor William Kahan of the University of California, Berkeley, has written a suite of floating-point test programs in FORTRAN (see http://http.cs.berkeley.edu/~wkahan/). These programs are commonly used to test the Float-to-Integer Store instructions (FIST and FISTP). Dan ported Dr. Kahan's FORTRAN programs to C and ran the tests against the Pentium Pro; this is when the bug came to light. So in the end, Intel either failed to run Dr. Kahan's tests on the Pentium Pro or misconfigured the program, or a FORTRAN compiler hid the bug in the chip.
Source code and two executable programs are available for download. Both programs are built from the same stand-alone assembly-language source code. The first program, FISTBUG.EXE, demonstrates the bug in a straightforward manner. When you run the program, all that appears on the screen is one of two simple messages: "*** Dan-0411 bug found. ***" or "Dan-0411 not found." The second program, FISTBUGV.EXE, runs exactly the same tests as the first, but is much more verbose. This program shows the microprocessor stepping information and itemized results. Each operand under test is printed to the screen, along with pass/fail status for four different testing methods.
View results of FISTBUG
http://www.rcollins.org/ftp/source/fistbug/fistbug.res
Source Code Availability
View source code for FISTBUG.EXE and FISTBUGV.EXE
http://www.rcollins.org/ftp/source/fistbug/fistbug.asm
http://www.rcollins.org/ftp/source/fistbug/makefile
Executable Programs
Download FISTBUG.EXE and FISTBUGV.EXE binary executables.
http://www.rcollins.org/ftp/source/fistbug/fistbug.exe
http://www.rcollins.org/ftp/source/fistbug/fistbugv.exe
http://www.rcollins.org/secrets/Dan0411.zip
The Entire FISTBUG Archive
Download fistbug.zip archive. Archive contains source code,
binary executables, and my results.
http://www.rcollins.org/ftp/dloads/fistbug.zip