Branch prediction in the Pentium familyHow the branch prediction mechanism in the Pentium has been uncovered with all its quirks, and the incredibly more effective branch prediction in the later versions.By Agner FogWhat is branch prediction? Imagine a simple microprocessor where all instructions are handled in two steps: decoding and execution. The microprocessor can save time by decoding one instruction while the preceding instruction is executing. This assembly line-principle is called pipelining. In advanced microprocessors, the pipeline may have many steps so that many consecutive instructions are underway in the assembly line at the same time, one at each stage in the pipeline. The problem now occurs when we meet a branch instruction. A branch instruction is the implementation of an if-then-else construct. If a condition is true then jump to some other location; if false then continue with the next instruction. This gives a break in the flow of instructions through the pipeline because the processor doesn't know which instruction comes next until it has finished executing the branch instruction. The longer the pipeline, the longer time it will have to wait until it knows which instruction to feed next into the pipeline. As modern microprocessors tend to have longer and longer pipelines, there has been a growing need for doing something about this problem. The solution is branch prediction. The microprocessor tries to predict whether the branch instruction will jump or not, based on a record of what this branch has done previously. If it has jumped the last four times then chances are high that it will also jump this time. The microprocessor decides which instruction to load next into the pipeline based on this prediction, before it knows for sure. This is called speculative execution. If the prediction turns out to be wrong, then it has to flush the pipeline and discard all calculations that were based on this prediction. But if the prediction was correct, then it has saved a lot of time. The detective work Intel manuals have never been very specific about how the branch prediction works. However, since mispredictions are expensive in terms of execution time, I found it important to know how the prediction works in order to optimize my programs. I started to do a lot of experiments together with some clever persons I had met on the Internet, most importantly Karki J. Bahadur at the university of Colorado, and Terje Mathisen in Norway, the guy who reverse engineered system software to find out how to get access to the performance monitor counters on the Pentium chip. Well, my first finding was that the Pentium predicts a branch instruction to jump if it has jumped any of the last two times. This fitted all my experiments, but Karki pointed out that a branch which jumps every third time is predicted one time out of six, where, according to my first model, it should never be predicted correctly. Then followed a series of new experiments until Karki and I independently came out with the same state diagram, shown in fig. 1a. While we agreed on this mechanism, we disagreed on the interpretation and in particular on why it was asymmetric. In the meantime, another guy had found an old article in Microprocessor Report claiming that the mechanism was a symmetric one as illustrated in fig 1b. My opinion was that the designers had actually intended the mechanism to be as in fig. 1b and that the asymmetry was a bug. But Karki and Terje maintained that there had to be an intention behind this asymmetry. It didn't convince them that I demonstrated how the symmetric mechanism was superior to the asymmetric one in almost all cases. |
|||
Now I
discovered a powerful tool to dig deeper into this
mechanism. The Pentium has a set of test registers that
make it possible to read or write directly into the area
that holds the history information for all branches, the
branch target buffer (BTB). I had found this information
on the home page of another hacker, Christian Ludloff.
His page was shut down (rumors say that this was due to
pressure from Intel) but fortunately I had downloaded his
page before it was too late. Having direct access to the
BTB, I was able to see exactly what happened: When a
branch does not have an entry in the BTB it is predicted
to not jump. The first time it jumps it gets an entry in
the BTB and immediately goes to state 3. The complication
is that the designers have equated state 0 with 'vacant
BTB entry'. This makes sense because state 0 is predicted
to not jump anyway. But since it cannot distinguish
between state 0 and a vacant BTB entry it will go to
state 3 next time the branch jumps rather than to state
1. This is where the quirk comes from. Apparently,
somebody at the design labs has done a lot of research to
find a good branch prediction scheme, and then somebody
else has messed it all up by letting state 0 mean vacant
BTB entry without realizing the consequence. And the
consequence is that a branch which seldom jumps will have
three times as many mispredictions as it would with the
symmetric design. Karki and Terje were still not convinced that this design was a blunder. The convincing proof came when I discovered that tight loops behave differently. In a small loop the microprocessor doesn't have enough time to update the BTB entry for a branch instruction before it meets the same branch instruction again. In order to avoid a delay, it bypasses the BTB and reads the branch prediction state directly from the pipeline. And in this case it is actually able to go from state 0 to state 1, as in fig. 1b. |
a. asymmetric design in the Pentium: The state follows the +arrows when the branch instruction jumps, and the -arrows when not jumping. The branch instruction is predicted to jump next time if in state 2 or 3, and to not jump when in state 0 or 1. b. symmetric design: This is how the branch prediction should work. The state is incremented when jumping (+arrows) and decremented when not jumping (-arrows). |
||
More quirks We soon found that there were more strange things about the Pentium's branch prediction. We couldn't make sense of what happened when more branch instructions came close after one another. This time Karki and Terje came with the 'wild' ideas that led to the solution, while I played the role of the sceptic. After a hectic period where we exchanged results by E-mail every day, we found that the BTB information may actually be stored several instructions ahead of the branch it refers to. If there happens to be another branch in between then the BTB information is likely to be misapplied to somewhere in the wrong branch. This can lead to many funny phenomena: a branch instruction can have more than one BTB entry; two branches can share the same BTB entry so that one branch is predicted to go to the target of the other one; an unconditional jump instruction can be predicted to not jump; and a non-jumping instruction can be predicted to jump. I will not go into detail with all these quirks here, but you can find it all on my homepage. None of these quirks are fatal, because all mispredictions eventually get corrected. A much more powerful mechanism The later processors in the Pentium family: the Pentium MMX, Pentium Pro, Pentium II, Celeron, and Xeon, all have a much more advanced branch prediction mechanism. I will refrain from more detective stories here and go right to the mechanism. |
|||
This
mechanism is based on the same fundamental idea of the
state diagram in fig 1b. This is simply a two-bit counter
with saturation. The counter is incremented when jumping
and decremented when not jumping. The branch instruction
is predicted to jump next time if the counter is in state
2 or 3, and to not jump if in state 0 or 1. This
mechanism makes sure that the branch has to deviate twice
from what it does most of the time before the prediction
changes. The improvement in the later processors comes from the so-called two-level branch prediction. The first level is a shift register that stores the history of the last four events for any branch instruction. This gives sixteen possible bit patterns. You get a pattern of 0000 if the branch did not jump the last four times, and a pattern of 1111 after four times of jumping. The second level in the branch prediction mechanism is constituted of sixteen 2-bit counters of the type in fig. 1b. It uses the 4-bit pattern in the first level to choose which of the sixteen counters to use in the second level. See fig. 2. |
Level two consists of 16 two-bit counters of the type in fig. 1b. Level one is a four bit shift register storing the history of the last four events. This four bit pattern is used to select which of the 16 two-bit counters to use for the next prediction. |
||
The
advantage of this mechanism is that it can learn to
recognize repetitive patterns. Imagine a branch that
jumps every second time. You can write this pattern as
01010101 where 0 means no-jump and 1 means jump. After
0101 always comes an 0. Every times this happens, the
counter with the binary number [0101] will be decremented
until it reaches its lowest state. It has now learned
that after 0101 comes an 0 and will therefore make this
prediction correctly the next time. Similarly, counter
number [1010] will be incremented until state three so
that it will always predict a 1 after 1010. The remaining
fourteen counters for this branch are never used as long
as the pattern is the same. This mechanism is quite powerful as it can handle complex repetitive patterns like 00101-00101-00101. In fact, it can handle any repetitive pattern with a period of up to five, most patterns of period six and seven, and even some patterns with periods as high as sixteen. To see if a pattern of period n can be handled without misprediction, write down the n 4-bit sub-sequences in the pattern. If they are all different, then you will have no mispredictions after an initial learning time of two periods. But the two-level mechanism is more powerful than that. It is also extremely good at handling deviations from a regular pattern. If a branch instruction has an almost regular pattern with occasional deviations, then the processor will soon learn what the deviations look like, so that it can handle almost any kind of recurrent deviation with only one misprediction. Furthermore, it can handle a situation where you alternate between two different repetitive patterns. Assume that you have given the processor one repetitive pattern until it has learned to handle it without mispredictions. Then another pattern. And then return to the first pattern. If the two patterns do not have any 4-bit subsequences in common, then they do not use the same counters, so the processor doesn't have to re-learn the first pattern. Therefore, it can handle the transitions back and forth between the two patterns with a minimum of mispredictions. Conclusion The first microprocessor in the Pentium family introduced a simple one-level branch prediction mechanism with many ludicrous quirks. The later versions, Pentium MMX, Pentium Pro, Pentium II, etc. have longer pipelines and therefore a higher need for effective branch prediction. This need has been met by the incredibly powerful two-level mechanism with its ability to learn and recognize repetitive patterns and even deviations from the regular patterns. This mechanism is also quite economical in terms of chip area as the history of a branch can be stored in only 32 bits. The most important shortcoming of the two-level branch prediction is that it is not very good at predicting the branch pattern of a loop control. If, for example, you have a program with a loop that always repeats ten times, then the control instruction at the bottom of the loop will branch back nine times and fall through the tenth time at the cost of one misprediction. For the Pentium Pro and Pentium II, where branch misprediction costs a lot of time, it may acually be advantageous to replace a loop that executes ten times with two nested loops that execute five and two times, in order to avoid mispredictions. |
Technical details can be found at my homepage: www.announce.com/agner/assem.