The Intel Pentium F00F Bug
Description and Workarounds
By Robert R. Collins
When any x86 processor from the 80186 onward encounters an invalid instruction, the processor is supposed to generate an invalid opcode exception. The invalid opcode exception is known as a #UD in Intel vernacular. The #UD handler usually signals an error condition and terminates the errant program. When this mechanism works, the errant program can't harm the computer system. However, should this mechanism fail, the errant program can bring down the entire computer. If the computer is a network server, or belongs to an Internet Service Provider (ISP), then the errant program can bring down the entire network.
That's what could happen when the Pentium Processor encounters the F00F bug. The F00F bug received its name from its instruction encoding F0 0F C7 C8. This instruction encoding maps to a LOCK CMPXCHG8B EAX instruction. CMPXCHG8B compares 64-bit memory contents with the contents in EDX and EAX. One of the operands must be memory, and the other (implied) operand is EDX:EAX. It is possible to construct an instruction encoding that doesn't map to a memory operand. Since the non-memory form of this instruction is invalid, a compiler or assembler will not generate this code. Instead, the assembly language programmer must construct it by hand.

Such an illegal encoding should generate the requisite #UD. As one would expect, a CMPXCHG8B EAX instruction does generate a #UD. However, when this illegal encoding is prepended with a LOCK prefix, the processor fails to work correctly. Using the LOCK prefix on this form of CMPXCHG8B is illegal in and of itself. LOCK prefixes are only allowed on memory-based read-modify-write instructions. Hence a LOCK prefix on the register-based CMPXCHG8B EAX instruction should also generate an invalid opcode exception. Instead, the Pentium processor locks up and freezes the entire computer when it encounters this instruction.

This bug is especially nasty, because any user can construct a program with this instruction, and upload it to a network computer, or incorporate it within an Active-X applet. Once the program is run on the network, the network server crashes. The only possible recovery comes by hitting the big red switch. Suppose you download an Active-X applet which contains this code. As soon as the code executes, YOUR computer will freeze up.

The serious nature of this bug has prompted Intel to give it their highest attention. Within one week, they announced a software workaround, which can be incorporated into virtually any operating system (except real-mode operating systems, like DOS).

Here's how the bug works

When the processor encounters this instruction (F0 0F C7 C8, or anything from F0 0F C7 C8..CF), the F00F bug occurs. The processor recognizes that an invalid opcode has occurred and tries to dispatch the #UD handler. Because of the LOCK prefix, the processor is confused. When the processor issues the bus reads to get the #UD handler vector address, the processor erroneously asserts the LOCK# signal. The LOCK# signal can only be asserted for read-modify-write instructions which modify memory.
When the bus is locked, a locked memory read must be followed by a locked memory write, or unpredictable results may occur. But in this case, the LOCK# signal remains asserted for the two consecutive memory reads required to retrieve the #UD vector address. The processor never issues any intervening locked write, and then hangs itself. This behavior is shown in the logic analyzer trace in Listing 1. As you can see, the Pentium tries to retrieve the #UD vector with two locked reads. After that point, all processor activity stops.

Sequence  Address   Data      Mnemonic              Timestamp
--------------------------------------------------------------
T 524285  000000B2  ----7E--  ( IO WRITE )          300 ns
  524286  00000018  ----E14C  ( LOCKED MEM READ )   440 ns
  524287  0000001A  F000----  ( LOCKED MEM READ )   100 ns

Listing 1 -- F00F Bug Example

The Various Workarounds

There are various possible workarounds to this bug. Not all of them are good. In fact, a few of them are outright kludgey. Intel has proposed two workarounds. One of the workarounds actually takes advantage of the bug behavior to do the right thing. Their other workaround is ingenious, though it's a horrible kludge. The first two alternate workarounds presented below are given for academic purposes only. Even though the workarounds have demonstrated their ability to obscure the bug behavior, they are not entirely reliable.

Intel's First Workaround
This is an ingenious solution to a horrible problem. Unfortunately, the solution is just as bad as the problem. When the processor receives any of the first seven exceptions (Divide by Zero through Invalid Opcode), the processor generates a page fault instead of the appropriate exception. The page fault handler gets mucked up with all kinds of code to check privilege levels and whether or not the fault was caused by another exception. If I had my druthers, I'd stay as far away from this solution as possible.

Intel's Second Workaround
This workaround is really quite clever. It takes advantage of the bug as a means to provide a fix to the problem. When any of the first six exceptions occur, they are handled as they normally would be. Divide by Zero through BOUND exceptions vector to their normal exception handlers without any intervening code in the page fault handler. However, when the F00F bug occurs, the page fault handler is invoked instead of the #UD handler.

Why? CR0.WP=1 instructs the microprocessor to generate a page fault when an attempt is made by the supervisor to modify a write-protected memory page. The processor doesn't actually attempt to modify the Interrupt Descriptor Table page (the IDT page holding the #UD vector address) when the F00F opcode is encountered. But the bug actually makes the processor think it's modifying the IDT page with the #UD vector. The locked memory cycle somehow convinces the internal state of the Pentium that a write cycle is going to occur. Since the transition to the #UD handler is considered a supervisor task, the processor thinks it's going to write to this page. Thus when CR0.WP=1, a page fault occurs. Even though this is a very clever fix, there are two things I don't like about it:
If I were forced to choose between Intel's two "blessed" solutions, I'd choose this one. However, because Intel set a precedent in documenting a solution that actually takes advantage of the bug behavior, this could give rise to much more elegant solutions that also take advantage of the bug behavior.
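To make the second workaround concrete, here is a minimal Python model of the decision a page fault handler would have to make. It is a sketch under stated assumptions, not Intel's published code: it assumes the handler compares the faulting linear address (the value the processor leaves in CR2) against the #UD gate descriptor in the IDT, and the function names are hypothetical. The example IDT base is chosen so that the #UD descriptor lands at 00060FF8, the address of the first locked read in Listings 2 through 4; treating that address as the #UD descriptor is an inference from those traces.

```python
# Illustrative model only; a real handler would be kernel code, not Python.
IDT_ENTRY_SIZE = 8   # bytes per gate descriptor in 32-bit protected mode
UD_VECTOR = 6        # vector number of the invalid opcode exception (#UD)


def ud_descriptor_range(idt_base):
    """Linear addresses occupied by the #UD gate descriptor."""
    start = idt_base + UD_VECTOR * IDT_ENTRY_SIZE
    return range(start, start + IDT_ENTRY_SIZE)


def classify_page_fault(fault_address, idt_base):
    """Decide how a page fault should be routed (hypothetical helper).

    Assumption: a fault whose address falls inside the #UD descriptor
    was caused by the F00F sequence and should be handled as #UD.
    """
    if fault_address in ud_descriptor_range(idt_base):
        return "invalid-opcode"
    return "page-fault"


# Assumed IDT base that places vector 6 at 0x00060FF8.
EXAMPLE_IDT_BASE = 0x00060FC8
```

With this layout, `classify_page_fault(0x60FF8, EXAMPLE_IDT_BASE)` routes to the invalid opcode path, while a fault on any other address is treated as an ordinary page fault.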
Why Intel failed to find the F00F Bug

On November 6, 1997 a message was posted anonymously on comp.os.linux.advocacy -- one of the thousands of Internet newsgroups. The message warned users of a bug in the Pentium and Pentium MMX processors. The bug could completely lock up the computer from any operating mode in any operating system. At first glance, the ordinary news reader might say "so what!" After all, many users are accustomed to their Windows 3.1 and Windows 95 systems locking up with regular frequency. But the placement of the bug on comp.os.linux.advocacy was calculated and intentional. The readers on that newsgroup knew exactly what the bug meant. To them, the bug meant that an ordinary user or saboteur could unleash a program to bring down their network servers. These servers form the backbone of our modern Internet community.

Internet Service Providers, Web hosts, Government agencies, and Computer departments at universities were petrified of the potential damage that could be caused by this bug. Internet Service Providers (ISPs) could be attacked with a program that would bring down their web servers. Saboteurs could incorporate the code for this bug into an Active-X control that could be downloaded and used to crash YOUR computer. Students who didn't want to do their homework could attack universities. Think of it: the bug could be the latest variation of "my dog ate my homework." Therefore the news of the bug had a huge impact in this small community.

But the history of this bug didn't start on November 6th. The real story starts months, maybe even years before this date. A couple of months before this bug anonymously appeared on the Internet, I received a phone call from an industry colleague. This person was elated about this bug and strongly urged me to publish it at my web site (http://www.x86.org). The whole story didn't sit well with me, especially their desire for me to make the public disclosure, instead of them. I felt like I was being used. But that wasn't my only reason for refusing to disclose the bug. Quite simply, I didn't want to be responsible for telling a potential saboteur the computing equivalent of how to build an atom bomb, or to take on the legal liability associated with the lawsuits that might accompany such a disclosure.

Within a day or so, I called another colleague who used to work at one of Intel's competitors. I told him about the bug. To which he responded: "yeah, I found that bug about a year and a half ago." If this bug was so easy for one or more of Intel's competitors to find, why hadn't Intel ever discovered it, disclosed it, or quietly fixed it? The answers lie in the nature of Intel's competition, and Intel's design verification methodology.

It is quite natural for Intel's competition to find bugs like this. They are in the microprocessor cloning business. Therefore, they must write programs that scan the entire opcode space in search of hidden and undocumented instructions. Invalid opcodes come in many forms. The most obvious form is an undefined instruction -- an opcode that doesn't map to any instruction. These opcodes are easy to scan programmatically. Other invalid opcodes are actually invalid encodings of valid instructions. These invalid encodings are oftentimes overlooked when scanning for undefined instructions. Intel's competition must find these instructions. Therefore they write programs that scan the entire invalid opcode space. A well-designed program will test all of the invalid opcodes and invalid encodings of valid opcodes. This is how my colleague found this bug 18 months ago. Even though he knew about the bug, his knowledge was the intellectual property of his company. Therefore he would have been prohibited from warning the Internet community, even if he wanted to warn them.

Intel doesn't need to write programs like this, or do they? Intel defined the opcode space. If anybody knows what's in it, they do. Therefore they don't really have a need to hire low-level assembly language programmers to write such programs, or do they? In fact, Intel's design verification department has turned down some of the best x86 experts in the industry, citing Intel's lack of need for their low-level assembly language skills. Instead, Intel has relied on test generation programs, primarily written in C. These random test generation programs (RTPGs) randomly create programs, which run on the target microprocessor. RTPGs are great for finding certain types of microprocessor bugs. But the RTPG is only as good as its design rules allow. Most likely, Intel's RTPG would not test for these invalid opcode encodings. Without augmenting their RTPG with low-level assembly programmers to write such programs, Intel's design verification methodology ultimately failed to find the F00F bug. Maybe Intel's Design Verification Department will think twice next time when presented with an x86 assembly language expert.
Alternate Solution #1
This solution is quite easy.
All exceptions vector to their appropriate interrupt handler. The page fault handler doesn't need to be mucked up with any extra code. All of the exception handling code may remain unmodified. When the F00F bug occurs, the processor issues the two consecutive locked reads. However, the processor doesn't lock up because the page is non-cacheable. Listing 2 shows the logic analyzer trace of the microprocessor recovering from the F00F bug.
Sequence  Address   Data      Mnemonic              Timestamp
--------------------------------------------------------------
428677    00060FF8  0008049A  ( LOCKED MEM READ )   330 ns
          00060FFC  00008E00  ( LOCKED MEM READ )
428678    0001E8B8  BFF0FFFF  ( LOCKED MEM READ )   170 ns
          0001E8BC  00009B01  ( LOCKED MEM READ )
428679    0001EE9C  00000008  ( LOCKED MEM WRITE )  130 ns
428680    0001EE98  000003F2  ( MEM WRITE )         60 ns
Listing 2 -- F00F bug on non-cacheable page
Alternate Solution #2
This is by far the best solution. This solution maintains all of the benefits of having the page cacheable. However, because the page is considered write-through, the processor is tricked into recovering from the bus lock-up condition. Listing 3 shows the results of encountering the F00F bug when the page is cacheable, but marked as write-through.
Sequence  Address   Data      Mnemonic              Timestamp
--------------------------------------------------------------
429135    00060FF8  0008049A  ( LOCKED MEM READ )   330 ns
          00060FFC  00008E00  ( LOCKED MEM READ )
429136    0001E8B8  BFF0FFFF  ( LOCKED MEM READ )   170 ns
          0001E8BC  00009B01  ( LOCKED MEM READ )
429137    0001EE9C  00000008  ( LOCKED MEM WRITE )  140 ns
429138    0001EE98  000003F2  ( MEM WRITE )         50 ns
Listing 3 -- F00F bug on page write-through
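Alternate Solutions #1 and #2 differ only in the caching attributes of one page. On the Pentium, those attributes are set per page by the PCD (page-level cache disable, bit 4) and PWT (page-level write-through, bit 3) bits of the page table entry. The following Python sketch shows the bit manipulation involved. It assumes the page in question is the one holding the #UD descriptor, which the text above implies but does not state, and the helper names and the sample entry value are made up for illustration.

```python
# Illustrative bit manipulation on a 32-bit x86 page table entry (PTE).
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_PWT = 1 << 3   # page-level write-through
PTE_PCD = 1 << 4   # page-level cache disable


def make_noncacheable(pte):
    """Alternate Solution #1 (as assumed here): mark the page non-cacheable."""
    return pte | PTE_PCD


def make_write_through(pte):
    """Alternate Solution #2 (as assumed here): cacheable, but write-through."""
    return (pte | PTE_PWT) & ~PTE_PCD


# Hypothetical entry mapping physical page 0x60000, present and writable.
SAMPLE_PTE = 0x00060000 | PTE_PRESENT | PTE_WRITABLE
```

In a real operating system the modified entry would be written back to the page table and the stale TLB entry flushed; both steps are omitted from this sketch.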
Alternate Solution #3 (for DOS users)
This isn't really a viable solution for most people. Turning off the microprocessor cache can have a dramatically negative performance impact on your computer.
Sequence  Address   Data      Mnemonic              Timestamp
--------------------------------------------------------------
426333    00060FF8  00080496  ( LOCKED MEM READ )   60 ns
          00060FFC  00008E00  ( LOCKED MEM READ )
426334    0001E8B8  BFF0FFFF  ( LOCKED MEM READ )   170 ns
          0001E8BC  00009B01  ( LOCKED MEM READ )
426335    0001EEA0  00010002  ( LOCKED MEM WRITE )  120 ns
426336    0001EE9C  00000008  ( MEM WRITE )         60 ns
426337    0001EE98  000003EE  ( MEM WRITE )         50 ns

Listing 4 -- F00F bug with cache disabled
Conclusion
The F00F bug occurs when a LOCK prefix is prepended to an invalid encoding of the CMPXCHG8B instruction. The CMPXCHG8B EAX instruction is already an invalid encoding and generates an invalid opcode exception. When the instruction is prefixed with a LOCK, the microprocessor gets confused and locks up.
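The encoding rule is easy to check mechanically. CMPXCHG8B is encoded as 0F C7 /1, so the ModRM byte that follows the opcode must carry 1 in its reg field, and a mod field of 11 selects a register operand rather than memory. The Python sketch below is an illustration only (the function names are hypothetical, and it ignores any other prefixes that might appear in a real instruction stream). It flags a byte sequence that begins with LOCK followed by the register form of CMPXCHG8B.

```python
# Illustrative recognizer for the F00F byte sequence; not a full x86 decoder.
LOCK_PREFIX = 0xF0
CMPXCHG8B_OPCODE = (0x0F, 0xC7)
CMPXCHG8B_REG_FIELD = 1   # CMPXCHG8B is encoded as 0F C7 /1


def split_modrm(modrm):
    """Return the (mod, reg, rm) fields of a ModRM byte."""
    return (modrm >> 6) & 0b11, (modrm >> 3) & 0b111, modrm & 0b111


def is_f00f_sequence(code):
    """True if `code` starts with LOCK + CMPXCHG8B using a register operand."""
    if len(code) < 4:
        return False
    if code[0] != LOCK_PREFIX or tuple(code[1:3]) != CMPXCHG8B_OPCODE:
        return False
    mod, reg, _rm = split_modrm(code[3])
    # mod == 0b11 means the operand is a register, which is the invalid form.
    return reg == CMPXCHG8B_REG_FIELD and mod == 0b11


def f00f_modrm_bytes():
    """Every ModRM value that completes F0 0F C7 into the offending form."""
    return [m for m in range(256)
            if is_f00f_sequence(bytes([0xF0, 0x0F, 0xC7, m]))]
```

Enumerating all 256 ModRM values this way yields exactly the eight bytes C8 through CF, which matches the range F0 0F C7 C8..CF given earlier.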
The ultimate solution is obtained by disabling the cache. This demonstrates that some interaction exists between the cache and this bug. Instead of taking advantage of the cache interaction, Intel's second solution takes advantage of interaction between the bug and the page fault mechanism. Now that they've set the precedent of using the bug behavior as a workaround, nobody should be concerned by the two more elegant solutions provided herein. My second alternate solution is by far the best. The exception handlers don't need to be mucked up with extra code, and the processor performance isn't impacted in the slightest manner. Unfortunately, neither of these two alternate workarounds has proven reliable in production code.
Source Code Availability
To demonstrate this bug and the various workarounds, I've written two programs to be distributed with this article. The first program, F00FBUG.EXE, demonstrates all five workarounds for the bug. The second program, F00FBUG2.EXE, is a simple program which demonstrates the most elegant workaround -- Alternate Solution #2.
Source Code Archive: http://www.rcollins.org/ftp/dloads/f00fbug.zip