OskarS 1 days ago [-]
Hmm, that's interesting. The code as written only has one branch, the if statement (well, two, the while loop exit clause as well). My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.
But that's clearly not right, because apparently the specific data it's branching off matters too? Like, "test memory location X, and branch at location Y", and it remembers both the specific memory location and which specific branch branches off of it? That's really impressive, I didn't think branch predictors worked like that.
Or does it learn the exact pattern? "After the pattern ...0101101011000 (each 0/1 representing the branch not taken/taken), it's probably 1 next time"?
rayiner 1 days ago [-]
Your mental model is close. Predictors generally work by having some sort of table of predictions and indexing into that table (usually using some sort of hashing) to obtain the predictions.
The simplest thing to do is use the address of the branch instruction as the index into the table. That way, each branch instruction maps onto a (not necessarily unique) entry in the table. Those entries will usually be a two-bit saturating counter that predicts either taken, not taken, or unknown.
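(To make that concrete, here is a minimal illustrative sketch of a two-bit saturating counter; the 0-3 state encoding is one common convention, not any specific CPU's implementation.)

```python
# Minimal sketch of a 2-bit saturating counter predictor (illustrative only).
# States 0-1 predict "not taken", states 2-3 predict "taken". Each outcome
# nudges the counter one step, saturating at 0 and 3, so a single anomalous
# outcome cannot flip a strongly established prediction.

def predict(counter: int) -> bool:
    return counter >= 2  # True means "predict taken"

def update(counter: int, taken: bool) -> int:
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)

# A branch that is almost always taken: one stray "not taken" outcome
# only moves a saturated counter from 3 to 2, so the prediction stays "taken".
c = 0
for taken in [True, True, True, False, True]:
    c = update(c, taken)
assert predict(c)
```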
But you can add additional information to the key. For example, a gselect predictor maintains a shift register with the outcome of the last M branches. Then it combines that shift register along with the address of the branch instruction to index into the table: https://people.cs.pitt.edu/~childers/CS2410/slides/lect-bran... (page 9). That means that the same branch instruction will map to multiple entries of the table, depending on the pattern of branches in the shift register. So you can get different predictions for the same branch depending on what else has happened.
That, for example, lets you predict small-iteration loops. Say you have a loop inside a loop, where the inner loop iterates 4 times. So you’ll have a taken branch (back to the loop header) three times but then a not-taken branch on the fourth. If you track that in the branch history shift register, you might get something like this (with 1s being taken branches):
11101110
If you use this to index into a large enough branch table, the table entries corresponding to the shift register ending in “0111” will have a prediction that the branch will be not taken (i.e. the next outcome will be a 0) while the table entries corresponding to the shift register ending in say “1110” will have a prediction that the next branch will be taken.
So the basic principle of having a big table of branch predictions can be extended in many ways by using various information to index into the table.
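As a rough sketch of the gselect idea described above (the table size, history length, and branch address below are arbitrary illustrative choices, not any real design):

```python
# Sketch of a gselect-style predictor: index a table of 2-bit counters by
# combining the branch PC with a shift register of recent global outcomes.
HIST_BITS = 4
TABLE_SIZE = 1 << 10

class GSelect:
    def __init__(self):
        self.table = [1] * TABLE_SIZE  # all counters start weakly "not taken"
        self.history = 0               # last HIST_BITS outcomes, newest in bit 0

    def _index(self, pc: int) -> int:
        # Concatenate low PC bits with the global history register.
        return ((pc << HIST_BITS) | self.history) % TABLE_SIZE

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

# The inner-loop branch of a 4-iteration loop produces 1,1,1,0 forever.
# Each history context ("0111" vs "1110" etc.) maps to its own counter,
# so after a short warm-up the periodic pattern is predicted perfectly.
p = GSelect()
pc = 0x12  # hypothetical branch address
pattern = [True, True, True, False] * 8
hits = 0
for t, taken in enumerate(pattern):
    if t >= 8:  # ignore the first two periods of training
        hits += int(p.predict(pc) == taken)
    p.update(pc, taken)
assert hits == len(pattern) - 8  # perfect prediction once trained
```

A plain per-branch counter could never get this pattern right, since the same branch is taken 75% of the time; the history register is what splits it into four separately predictable contexts.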
OskarS 10 hours ago [-]
Yeah, the "two-bit saturating counter" thing is pretty much exactly how I thought it worked (which would be terrible for the example in the article), but I had no idea about the fact that it also kept track of the branch history thing, and how that maps to different branch predictor entries. Thanks for the link, that's really fascinating!
fc417fc802 16 hours ago [-]
It seems like that would struggle with detecting how many layers of branching to pay attention to. Imagine the two nested loops surrounded by a randomized one. Wouldn't that implementation keep hitting patterns it hadn't seen before?
Obviously that must be a solved problem; I'd be curious to know what the solution is.
PunchyHamster 11 hours ago [-]
Might be, but what real code does that?
ahartmetz 9 hours ago [-]
It even gets much more sophisticated than that. Even the first Ryzen had something perceptron-based (yes, a neural network!), and there are several predictors plus logic to pick the best one for a given branch.
jcalvinowens 1 days ago [-]
Check out [1]: it has the most thorough description of branch prediction I've ever seen (chapter 3), across a lot of historical and current CPUs. It is mostly empirical, so you do have to take it with a grain of salt sometimes (the author acknowledges this).
Supposedly the branch prediction on modern AMD CPUs is far more sophisticated, based on [2] (a citation pulled from [1]).
> My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.
I always figured the algorithm was much simpler: just predict the same outcome as the last execution. That should work fairly well.
Didn’t realize it used the input value as well, which to me makes no sense — the whole point is to avoid having to inspect the value. This article raises more questions than answers, I’m intrigued now.
Tuna-Fish 4 hours ago [-]
It does not and cannot use the input value. The branch predictor runs a dozen or so cycles before execution; it generally cannot see values in registers. It also runs 3-4 cycles before decode, so it cannot even use any bits from the instruction being executed.⁰ Branch predictors strictly use branch history, but this is more complex than just looking at the probability of a single branch being taken: there are tables that maintain all branches over the past tens of thousands of cycles and try to cover common patterns.
0: This is why the first prediction is always "don't branch", because the first time executing code the predictor has literally no information at all. Every now and then people ask for hint bits on branches, but, er, how are you planning to do that when the instruction with the branch hasn't arrived from L1 when the prediction is due?
nkurz 8 hours ago [-]
> I always figured the algorithm was much simpler, it would just use the same branch as last execution — should work fairly well.
Sure, that would work significantly better than no predictor at all. But you'd agree that a better predictor would work better, right? The missing detail might be how expensive mispredicted branches are compared to other costs. If you can go from 50% accuracy to 90% accuracy, it wouldn't be surprising to more than double your performance.
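A quick back-of-envelope model makes the claim plausible (the penalty and instruction mix below are made-up round numbers for illustration, not measurements of any CPU):

```python
# Toy cost model: each correctly predicted branch costs 1 cycle, each
# misprediction costs a flat `penalty`, and every branch is accompanied
# by `other_work` single-cycle instructions. All numbers are hypothetical.

def cycles_per_branch(accuracy: float, penalty: int = 15, other_work: int = 4) -> float:
    miss = 1.0 - accuracy
    return other_work + 1 + miss * penalty

slow = cycles_per_branch(0.50)  # 4 + 1 + 0.5 * 15 = 12.5 cycles
fast = cycles_per_branch(0.90)  # 4 + 1 + 0.1 * 15 = 6.5 cycles
assert slow / fast > 1.9        # 50% -> 90% accuracy nearly halves the time
```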
> Didn’t realize it used the input value as well, which to me makes no sense — the whole point is to avoid having to inspect the value.
It doesn't, and can't, for the reasons you hint at. The reason branch prediction is necessary is that the value often isn't available yet when the branch is encountered. Was there something in the article that implied the opposite?
--
I wonder if Daniel's tricksy approach using a random number generator to simulate a complex pattern is misleading people here.
One of the main benefits of branch prediction is predicting the end of a loop, particularly, a loop within a loop. In assembly, a loop is just a comparison at the end and a branch back to the beginning. Assume you had a loop that always executes 8 times, or some other small fixed value. Also assume there is some reason you can't unroll that loop, and that loop is inside another loop that executes millions of times. It's a real boost to performance if you can consistently predict the end of the inner loop.
If you predicted just on the last time the loop closing branch was taken, you'd always miss the ending. But if you can remember a pattern that is longer than 8, you can always get it right. This is obviously valuable. The bigger question is how much more valuable it is to predict a loop (where "loop" might actually be a complex execution pattern across multiple branches) that is thousands long rather than just 8. But quantifying how long this pattern can be on different processors is part of the groundwork for analyzing this.
pyrolistical 9 hours ago [-]
Agreed. I wonder if this silicon is designed for this benchmark, and if not, how useful it is with real code.
I would be surprised if this silicon area could not be better utilized for something else
Tuna-Fish 4 hours ago [-]
No, branch predictors are really important and even small improvements in them are extremely valuable on real loads. Improving branch prediction is both a power and a performance optimization.
throawayonthe 3 hours ago [-]
relevant video: https://youtu.be/nczJ58WvtYo How Branch Prediction Works in CPUs - Matt Godbolt with Computerphile
eigenform 12 hours ago [-]
The idea here is about maintaining a "path history"!
When looking up a register that tracks the "local" history of outcomes for a particular branch, you want to have a hash function that captures enough context to distinguish the different situations where that branch might be encountered.
Apart from folding a long "global history" of recent outcomes and mixing in the current program counter, I think many modern machines also mix in the target addresses of recently-taken branches.
LPisGood 1 days ago [-]
There are many branch prediction algorithms out there. They range from fun architecture papers that try to use machine learning to static predictors that don’t even adapt to the prior outcomes at all.
gpderetta 1 days ago [-]
Typical branch predictors can both learn patterns (even very long ones) and use branch history (the probability of a branch being taken depends on the path taken to reach it). They don't normally look at data other than branch addresses (and targets, for indirect branches).
jeffbee 1 days ago [-]
They can't. The data that would be needed isn't available at the time the prediction is made.
1718627440 1 days ago [-]
Yeah, otherwise you wouldn't need to predict anything.
intrasight 17 hours ago [-]
I was self-taught in high school on computer architecture by reading books. I didn't own a computer, understand, but these books served the same purpose in terms of learning CPU architectures and machine-language programming. The 6502 was the CPU I studied.
In 1985 as an EE student, I took a course in modern CPU architectures. I still recall having my mind blown when learning about branch prediction and speculative execution. It was a humbling moment, as was pretty much all of my studies at CMU.
stephencanon 1 days ago [-]
Enlarging a branch predictor requires area and timing tradeoffs. CPU designers have to balance branch predictor improvements against other improvements they could make with the same area and timing resources. What this tells you is that either Intel is more constrained for one reason or another, or Intel's designers think that they net larger wins by deploying those resources elsewhere in the CPU (which might be because they have identified larger opportunities for improvement, or because they are basing their decision making on a different sample of software, or both).
pbsd 1 days ago [-]
I mean, he's comparing the 2024 Zen 5 and M4 against Intel Raptor Lake, from 2022 and two generations behind. Lion Cove should be roughly on par with the M4 on this test.
stephencanon 23 hours ago [-]
That would fall under "more constrained", due to process limits.
bee_rider 1 days ago [-]
I guess the generate_random_value function uses the same seed every time, so the expectation is that the branch predictor should be able to memorize it with perfect accuracy.
But the memorization capacity of the branch predictor must be a trade-off, right? I guess this generate_random_value function is impossible to predict using heuristics, so I guess the question is how often we encounter 30k long branch patterns like that.
Which isn’t to say I have evidence to the contrary. I just have no idea how useful this capacity actually is, haha.
bluGill 1 days ago [-]
30k-long patterns are likely rare. However, in the real world there is a lot of code with 30k different branches that we execute several times, so the same ability to memorize/predict 30k branches is useful; even though this particular example isn't realistic, it still looks good.
Of course, we can't generalize this to "Intel bad." This pattern seems unrealistic (at least at a glance; real experts should have real data/statistics on what real code does, not just my semi-educated guess), so perhaps Intel has better prediction algorithms for the real world that merely miss this example. Not being an expert in the branches real-world code takes, I can't comment.
bee_rider 1 days ago [-]
Yeah, I’m also not an expert in this. Just had enough architecture classes to know that all three companies are using cleverer branch predictors than I could come up with, haha.
Another possibility is that the memorization capacity of the branch predictors is a bottleneck, but a bottleneck that they aren’t often hitting. As the design is enhanced, that bottleneck might show up. AMD might just have most recently widened that bottleneck.
Super hand-wavey, but to your point about data, without data we can really only hand-wave anyway.
AMD CPUs have been killing it lately, but this benchmark feels quite artificial.
It's a tiny, trivial example with 1 branch that behaves in a pseudo-random way (random, but fixed seed). I'm not sure that's a really good example of real world branching.
How would the various branch predictors perform when the branch taken varies from 0% likely to 100% likely, in say, 5% increments?
How would they perform when the contents of both paths are very heavy, which involves a lot of pipeline/SE flushing?
How would they perform when many different branches all occur in sequence?
How costly are their branch mispredictions, relative to one another?
Without info like that, this feels a little pointless.
bee_rider 1 days ago [-]
It is a tiny example, but it measures something. It doesn’t handle the other performance characteristics you mention, but it has the advantage of being a basically pure measurement of the memorization ability of the branch predictors.
The blog post is not very long—not much longer than some of the comments we’ve written here about it. So, I think it is reasonable to expect the reader to be able to hold the whole thing in their head, and understand it, and understand that it is extremely targeted at a specific metric.
jeffbee 1 days ago [-]
He isn't trying to determine how well it works. He's trying to determine how large it is.
Night_Thastus 1 days ago [-]
Their post gives the impression that clearly AMD's branch prediction is better, because this one number is bigger. "Once more I am disappointed by Intel"
While it could very well be true that the AMD branch predictor is straight-up better, the data they provided is insufficient for that conclusion.
vlovich123 24 hours ago [-]
You may want to look up who Daniel Lemire is and the work he's done. What he's basically saying is "in the totality of things I've examined where Intel has come up short, this is another data point that is in line with their performance across the board". It's not "this one benchmark proves Intel sucks hurr hurr" - it's saying it's yet another data point supporting the perception that Intel is struggling against the competition.
Paul_Clayton 19 hours ago [-]
By only testing one static branch, it is possible that the performance of the Intel Emerald Rapids predictor is not representative of a more realistic workload. If path information is used to index the predictor in addition to global (taken/not-taken) branch history, without XORing with the global history (or fully mingling these different data), or if the branch address is similarly not fully scrambled with the global history, using only one branch might result in predictor storage being unused (never indexed). Either mechanism might be useful for reducing tag overhead while maintaining fewer aliases. Another possibility is that the associativity of the tables does not allow tags for the same static branch to differ.
(Tags could be made to differ by, e.g., XORing a limited amount of global history with the hash of the address.)
It is also possible that the AMD Zen 5 and Apple M4 have similar unused predictor capacity and simply have much larger predictors.
I did not think even TAGE predictors used 5k branch history, so there may be some compression of the data (which is only pseudorandom).
It might be interesting to unroll the loop (with sufficient spacing between branches to ensure different indexing) to see if that measurably affected the results.
Of course, since "write to buffer" is just a store and increment, and the compiler should be able to guarantee no buffer overflow (buffer size allocated for the worst case) and that the memory store has no side effects, the branch could be predicated by selecting either the new value or the old value to be stored, and always storing. This would be a little extra work and might have store-queue issues (if not all store queue entries can have the same address but different version numbers), so it might not be a safe optimization.
ralferoo 11 hours ago [-]
I use a similar conditional write paradigm on the GPU and it's usually easiest to do an unconditional write and update the address using a branchless conditional, assuming you are using a system with strict write ordering. Usually the unnecessary writes won't make it out of L1 cache.
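A sketch of that conditional-write paradigm (hypothetical helper, not the article's actual code): the store happens unconditionally, and only the write position is advanced, using arithmetic rather than a branch:

```python
# Always store; advance the write position only when the element should be
# kept. The data-dependent branch disappears: the "decision" is arithmetic,
# and discarded values are simply overwritten by the next iteration.

def keep_odds(values):
    out = [0] * len(values)  # worst-case sized buffer
    pos = 0
    for v in values:
        out[pos] = v         # unconditional write
        pos += v & 1         # branchless: advance only for odd values
    return out[:pos]

assert keep_odds([4, 7, 2, 9, 5, 6]) == [7, 9, 5]
```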
ibobev 9 hours ago [-]
I'm wondering why my submission, made 22 hours ago, is marked as a duplicate, but this submission, made just 12 hours ago, isn't.
I'm also wondering why the same URL for both submissions isn't recognized as the same, and why the new submission was allowed.
stevefan1999 15 hours ago [-]
I still remember learning about TAGE and perceptron predictors, and how machine learning and neural networks have long been, in some form, used in CPU architecture design.
The simplest binary saturating counter, a la the bimodal predictor, already achieves more than a 90% success rate. What comes next is just extension around that: use multiple bimodal predictors and build a forest of them. But the core idea, treating branch prediction as a Bayesian problem, never fades.
It is a combined effort between hardware design and software compiler, though.
withinboredom 1 days ago [-]
Before switching a hot code path to branchless code, I was seeing strangely lower performance on Intel vs. AMD under load. Realizing the branch predictor was the most likely cause was a little surprising.
barbegal 19 hours ago [-]
This is good work. I wish branch predictors were better reverse-engineered so CPU simulation could be improved. It would be much better to be able to accurately predict how software will behave on other processors in software simulation, rather than having to go out and buy hardware to test on (which is the way we still have to do things in 2026).
infinitewars 17 hours ago [-]
By the no-free-lunch theorem, and the fact that this 30k random branch pattern is so atypical of the real world, it would follow that the loser here (Intel) is more likely to have the best branch predictor on actual workloads.
At least that's my prediction.
fc417fc802 15 hours ago [-]
The atypical benchmark here is a manufactured worst-case scenario for the purpose of quantifying the hardware's capabilities. A deeper predictor means accommodating more complex program branching patterns. Obviously you'd expect to see diminishing returns versus silicon area at some point, but I see no reason to assume that AMD would have made a poor allocation decision here.
rsmtjohn 13 hours ago [-]
The Rust borrow checker has indirectly made me more aware of branch patterns -- it sometimes forces code restructuring that changes what the predictor actually sees.
The clearest wins I've found: replacing conditional returns in hot loops with branchless arithmetic. The predictor loves it when you stop giving it choices. Lookup tables for small bounded ranges are another one that consistently surprises me with how much headroom there still is.
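For illustration, here are minimal versions of both patterns (hypothetical functions, shown in Python only for the shape; the actual win happens in compiled code where each `if` is a real branch):

```python
# 1. Conditional return replaced by arithmetic: sign of an integer.
def sign_branchy(x: int) -> int:
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0

def sign_branchless(x: int) -> int:
    return (x > 0) - (x < 0)  # booleans are 0/1, so no data-dependent branch

# 2. A chain of ifs over a small bounded range replaced by a lookup table.
GRADE = ["F", "F", "F", "F", "F", "F", "D", "C", "B", "A", "A"]  # index 0..10

def grade(score: int) -> str:
    return GRADE[score]  # score assumed to be in 0..10

for x in (-5, 0, 3):
    assert sign_branchy(x) == sign_branchless(x)
assert grade(9) == "A" and grade(6) == "D"
```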
piinbinary 1 days ago [-]
How does the benchmark tell how many branches were mispredicted? Is that something the processor exposes?
ErikCorry 20 hours ago [-]
Yeah performance counters
cloudbonsai 14 hours ago [-]
To fill in the details, here is the code used for the measurement: it fetches the number of mispredicted instructions from Linux's perf subsystem, which in turn gathers the metrics from the CPU's PMU (Performance Monitoring Unit) interface.
ww520 16 hours ago [-]
Branch prediction works really well on loops. The looping condition is mostly true, except for the very last time, and the loop body is always predicted to run. If you structure the loop body to have no data dependence between iterations, multiple iterations of the loop can run in parallel, greatly improving performance.
atq2119 15 hours ago [-]
I find it interesting that the S-curve is much steeper for AMD than it is for the others. AMD maintains perfect prediction for much larger sizes than the others, but it also reaches essentially random behaviour earlier.
Are they really keeping a branch history that's 30k deep? Or is there some kind of hashing going on, and AMD's hash just happens to be more attuned to the PRNG used here?
Would be interesting to see how robust these results are against the choice of PRNG and seed.
stinkbeetle 14 hours ago [-]
> Are they really keeping a branch history that's 30k deep? Or is there some kind of hashing going on, and AMD's hash just happens to be more attuned to the PRNG used here?
No, you don't need much branch history to get a vanishingly small probability that any two branches would collide. ~40 bits maybe. The limit will be running out of prediction table capacity I would say. It's possible the better ones are able to cleverly fit competing entries in different TAGE tables whereas the worse ones might start thrashing just one or some of the tables since the test is so regular. It's also possible the better ones just have more prediction resources available (or fewer, bigger tables, or ...).
> Would be interesting to see how robust these results are against the choice of PRNG and seed.
Provided it is somewhat random, I would say very robustly since all those CPUs likely have far more than enough history to uniquely fingerprint every branch even if the PRNG was not a great one.
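A quick birthday-bound estimate supports the "~40 bits" figure, under the assumption that branch outcomes look random:

```python
# Rough birthday bound: with n distinct branch instances and h bits of
# history, the chance that any two share the same history window is
# approximately n^2 / 2^(h+1), assuming outcomes are effectively random.

def collision_prob(n: int, h: int) -> float:
    return n * n / 2 ** (h + 1)

# ~30k pattern positions, 40 bits of history:
p = collision_prob(30_000, 40)
assert p < 0.001  # well under 0.1%: collisions are indeed vanishingly rare
```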
user070223 1 days ago [-]
Do any JIT/AOT/hot-code optimization techniques, compilers, or runtimes take into account whether the branch predictor is saturated and try to recompile to go branchless?
BoardsOfCanada 1 days ago [-]
In general, branchless is better for branches that can't be predicted 99.something percent of the time; saturating the branch predictor like this benchmark does isn't a concern. The big concern is mispredicting a branch, executing 300 instructions, and then having to throw them all away once the branch is actually resolved.
rayiner 1 days ago [-]
Using random values defeats the purpose of the branch predictor. The best branch predictor for this test would be one that always predicts the branch taken or not taken.
gpderetta 1 days ago [-]
The author is running the benchmark multiple times with the same random seed to discover how long a pattern the predictor can learn.
dundarious 1 days ago [-]
There will be runs of even and runs of odd outputs from the rng. This benchmark tests how well the branch predictor "retrains" to the current run. It is a good test of this adaptability of the predictor.
The benchmark is still narrow in focus, and the results don't unequivocally mean AMD's predictor is overall "the best".
tonetegeatinst 17 hours ago [-]
Intel is currently looking into replacing their branch prediction with a system based on astrology, tarot cards and crystal balls.
Should be titled: How I Learned to Stop Worrying and Love the Branch Predictor
fc417fc802 15 hours ago [-]
It's odd. The section header says "Density matters", but then the experiment shows that, actually, density does not matter; only the total number of branches encountered across the linear piece of code does. (In the chart, "block size" is what determines density.)
Also note in that chart what's being tested is unconditional jumps placed at unique addresses in a linear piece of code, not conditional jumps at the same address in a loop that fits entirely in L1i.
[1] https://www.agner.org/optimize/microarchitecture.pdf
[2] https://www.cs.utexas.edu/%7Elin/papers/hpca01.pdf
https://news.ycombinator.com/item?id=47438490
https://github.com/lemire/counters/blob/main/include/counter...
https://www.cs.utexas.edu/~lin/papers/tocs02.pdf
https://blog.cloudflare.com/branch-predictor/