
Secrets of Steamroller: Digging deep into AMD's next-gen core

AMD's Steamroller core, found in Kaveri APUs, is an interesting duck. It's the first chip to ship with HSA support, the first CPU core that can be paired with a GCN-based graphics core, and it promised significant improvements in performance-per-clock that were then largely offset by a lower clock speed. Our investigation shows that the situation isn't that simple, however -- there are some areas, particularly on the APU side of the equation, where Steamroller is far better than its predecessor. This article is a low-level investigation into where the chip improved, where it fell short, and what AMD might do going forward.
By Joel Hruska
Kaveri

Since it launched in 2011, Bulldozer's perennial problem has been the disconnect between the sorts of gains AMD seemed to be promising and the real-world gains that actually materialized. AMD's slides, if you recall, promised enormous gains:

Steamroller projections

A 30% improvement in total operations delivered per cycle, a 20% reduction in mispredicted branches, and a 30% reduction in instruction cache misses. The branch prediction improvement is critically important in a chip with a long pipeline; mispredicts kill CPU performance. But the truly puzzling thing about Steamroller is this: Despite a great many improvements, the core still seems handicapped, held back from its full potential.

Let's try and find out why.

Test setup

All of our tests were conducted with an Asus A88X Pro motherboard, 8GB of DDR3-2133, a Samsung 840 Pro SSD, and Windows 8.1. We tested the A10-7850K against the Piledriver/Richland-based A10-6800K. AMD's previous APU is also a quad-core design, but has higher clock speeds balanced against an older GPU core and second-generation CPU core.

Turbo Mode was disabled for these tests; the A10-7850K and A10-6800K were both locked to a 4.2GHz clock speed. This is a significant overclock compared to the A10-7850K's Turbo speed of 4.0GHz, but a downclock for the A10-6800K, which normally runs at 4.4GHz Turbo. Both integrated memory controllers and northbridges were run at their default clock speed of 1.8GHz (Sandra reports that this results in a memory controller clock speed of 3.6GHz).

We'll start our investigation with a look at the cache performance of the two chips. This is one area where AMD has historically struggled, so let's see if Steamroller improves the equation. We'd like to thank CPU analyst and programmer Agner Fog for his assistance in testing and identifying ongoing bottlenecks in the Steamroller microarchitecture. Agner has written multiple guides to optimizing x86 processors, built an open-source set of benchmarks for testing specific characteristics (available on his website), and compiled detailed latency charts and edge-case information for both Intel and AMD chips.

Next page: Benchmarking the cache structure and throughput

Cache structure, throughput

Steamroller made some significant changes to the cache structure. The L1 instruction cache is 50% larger and is three-way associative rather than two-way, which increases the chance that instructions will already be resident in the L1. There are still significant differences in L1 structure between AMD and Intel, however. Each Intel core has a 32KB instruction cache that's eight-way associative, whereas Steamroller's two cores share a 96KB cache that's just three-way associative. Cache conflicts remain a significant problem -- when two different threads are running in the same module, they can evict each other's code.
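To see why low associativity bites, here's a toy model of set-indexed cache mapping. This is illustrative only -- it is not AMD's actual indexing scheme -- and the 64-byte line size is an assumption; the cache sizes and associativities come from the figures above.

```python
# Toy model of set-indexed cache mapping (illustrative; not AMD's
# actual indexing scheme). The 64-byte line size is an assumption.
LINE = 64  # assumed cache line size in bytes

def set_index(addr, size, ways, line=LINE):
    # A size/(line*ways)-set cache indexed by address bits
    sets = size // (line * ways)
    return (addr // line) % sets

# Steamroller's 96KB, 3-way shared L1I works out to 512 sets, so code
# addresses that are 32KB apart (512 sets * 64B) alias to one set.
sr_sets = 96 * 1024 // (LINE * 3)
stride = sr_sets * LINE  # 32KB aliasing stride

# Four hot code lines at that stride need 4 ways, but only 3 exist --
# one line must be evicted every time all four are touched.
idxs = [set_index(a, 96 * 1024, 3) for a in range(0, 4 * stride, stride)]
print(sr_sets, idxs)
```

With two threads sharing those three ways, the effective associativity per thread is lower still, which is exactly the conflict scenario described above.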

Steamroller Cache Chart

Steamroller's L2 cache remains very slow compared to Intel, but it's marginally faster than Piledriver. The big news, however, is the improved L2 write throughput. AIDA64 4.2 underscores just how significantly the new chip boosts performance.

L1 cache performance

The L1 write and copy bandwidth is significantly better for Steamroller. L2 bandwidth is better across the board, with 25-45% better read, write, and copy performance. The latency reduction for L2 writes impacts the performance of the entire design, because Steamroller uses a write-through cache system, in which data written to the L1 is also written out to the L2.
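The write-through effect can be sketched with a toy cost model. The cycle counts below are hypothetical placeholders, not measured figures; the point is structural -- in a write-through design, L2 write cost sits on every store, so reducing it helps all write traffic.

```python
# Toy write-through store-cost model (illustrative; cycle counts are
# hypothetical, not measurements of either chip).
L1_WRITE = 1       # assumed L1 store cost, cycles
L2_WRITE_OLD = 12  # hypothetical Piledriver-era L2 write cost
L2_WRITE_NEW = 8   # hypothetical reduced Steamroller L2 write cost

def store_cost(n_stores, l2_write):
    # Write-through: every store pays the L1 cost AND the L2 cost
    return n_stores * (L1_WRITE + l2_write)

old = store_cost(1000, L2_WRITE_OLD)  # 13000 cycles
new = store_cost(1000, L2_WRITE_NEW)  # 9000 cycles
print(old, new)
```

In a write-back design, by contrast, the L2 cost would be paid only on eviction, which is why L2 write latency matters so much more to this architecture than to Intel's.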

L2 cache

The improved L1 caches come with an oddity, however. According to Agner's test, L1 cache throughput goes down when two threads run on two different modules at the same time. Note that this should never be the case -- the L1 caches don't share any data when running one thread per module -- but throughput still takes a whack. AMD is still investigating these findings.

We can see that cache performance improved significantly between the two processors, making bandwidth less of an issue than it was in the past. What else could be tripping up AMD's third-generation Bulldozer core?

Next page: Fetch, decode, branch prediction, and FPU performance

Fetch, decode, and branch prediction

Ever since Bulldozer, conventional wisdom has held that AMD's decision to share decode stages may have crippled the core's performance. Bulldozer and Piledriver could only dispatch four instructions per clock cycle, and both did so in round-robin fashion -- meaning each core was served every other cycle. Steamroller, in contrast, can dispatch up to four instructions per cycle to each of its two cores, or eight per module. This doubled dispatch capability improved multi-threaded scaling by roughly 10%. Single-threaded performance between Steamroller and Piledriver only improved about 7%, clock-for-clock, implying that the core's bottlenecks aren't caused by decoder hardware. We know that shared units can work -- Intel's Hyper-Threaded cores share the front end -- so whatever issues are clogging up Steamroller, they're not intrinsic to the concept.

There are three prime candidates for a bottleneck: fetch, branch prediction, and ALU design. AMD claimed that branch prediction had dramatically improved in Steamroller, and there's some evidence this is true. We tested the chess program DIEP, an application that calculates all of the real-world chess moves possible within a game out to an increasing search depth. We measure out to 14 ply, where a ply is a half-move. The 14th ply, in other words, covers the total number of chess moves within the first seven turns of a game. We use DIEP over a program like AIDA64's Queens benchmark because DIEP is tuned to scale in real-world chess matchups and has been used in high-level tournament play.
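For readers unfamiliar with chess-engine terminology, the ply-to-turn conversion is just:

```python
# A "ply" is a half-move: one move by one player. Two plies make up
# one full turn, so a 14-ply search covers the first seven turns.
def plies_to_turns(ply):
    return ply // 2

print(plies_to_turns(14))  # 7
```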

DIEP

The good news is that AMD delivers on this front. Single-threaded performance is 7% better, clock-for-clock. Multi-threaded performance, meanwhile, is a full 14.4% better. Performance against Haswell, however, remains low -- at a maximum clock of 3.8GHz, the Core i7-4770K is a full 46% faster than the AMD processor.

Improved branch prediction, however, doesn't seem to deliver the performance gains we would've liked to see. Instruction fetch is the next area to consider -- though we can't benchmark it directly. Agner's tests, however, may shed some light on the problem. According to his work, the fetch units on Bulldozer, Piledriver, and Steamroller, despite being theoretically capable of handling up to 32 bytes per clock (16 bytes per core), top out in real-world tests at 21 bytes per clock. This implies that doubling the decode units couldn't help much -- not if the problem is farther up the line. Steamroller does implement some features, like a small loop buffer, that take pressure off the decode stages by storing previously decoded short loops (up to 40 micro-instructions), but the fact that doubling up on decoder stages only modestly improved overall performance implies that significant bottlenecks still exist.
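A quick back-of-the-envelope calculation shows how that fetch ceiling caps the decoders. The 4-byte average instruction length below is an assumption for illustration (real x86 code varies); the 21- and 32-byte figures are from Agner's measurements cited above.

```python
# How fetch bandwidth caps decode throughput (rough arithmetic).
# 32 B/clock theoretical and 21 B/clock measured are from Agner Fog's
# tests; the average instruction length is an assumed illustration.
AVG_INSN_BYTES = 4.0  # hypothetical average x86 instruction length

fetch_theoretical = 32 / AVG_INSN_BYTES  # 8.0 instructions/clock/module
fetch_measured = 21 / AVG_INSN_BYTES     # 5.25 instructions/clock/module
decode_capacity = 8                      # two 4-wide decoders per module

print(fetch_measured, decode_capacity)
```

Under these assumptions, real-world fetch feeds only about 5 instructions per clock to decoders that can consume 8 -- the decoders sit partially idle no matter how wide they are.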

ALU Performance

According to Agner, "Two of the pipes have all the integer execution units while the other two pipes are used only for memory read instructions and address generation (not LEA), and on some models for simple register moves. This means that the processor can execute only two integer ALU instructions per clock cycle, where previous models can execute three. This is a serious bottleneck for pure integer code. The single-core throughput for integer code can actually be doubled by doing half of the instructions in vector registers, even if only one element of each vector is used."

This has been the case since Bulldozer debuted -- but issues here could explain why integer performance on Steamroller is so low compared to other cores. This is where things become frustratingly opaque -- each of the areas we've identified could be the principal bottleneck, or it's possible that the bottleneck is a combination of multiple factors (long pipelines, low fetch bandwidth, cache collisions, and low integer throughput).
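Agner's vector-register trick can be sketched as a port-scheduling model. This is a simplified abstraction, not real scheduling logic, and the assumption that two vector pipes can absorb the offloaded scalar work is taken from his quote above.

```python
# Toy port-scheduling model of Agner's observation (illustrative).
# Only 2 of Steamroller's 4 integer pipes execute ALU ops, but simple
# integer work can also run in the vector unit's pipes, so splitting
# the instruction stream across both domains raises throughput.
def cycles_needed(n_ops, alu_ports, vec_ports=0):
    # Greedy model: each cycle retires up to (alu_ports + vec_ports)
    # independent ops; -(-a // b) is ceiling division.
    return -(-n_ops // (alu_ports + vec_ports))

scalar_only = cycles_needed(1000, alu_ports=2)              # 500 cycles
split = cycles_needed(1000, alu_ports=2, vec_ports=2)       # 250 cycles
print(scalar_only, split)
```

Halving the cycle count by moving half the work into vector registers matches the "throughput can actually be doubled" claim -- and underlines how odd it is that a four-pipe integer core behaves like a two-pipe one.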

FPU Performance

One positive area we want to call out is Steamroller's improved FPU performance. The chip uses a three-pipe FPU, down from the four-pipe architecture of Piledriver and Bulldozer. Despite the reduction, this seems to have provided a substantial benefit in some tests, possibly due to low-level efficiency and utilization improvements.

AMD FPU performance

AIDA64's FPU tests show a gain of 2% - 13% at the same clock speed. Considering that the FPU is now smaller and better optimized, that's a significant set of gains over the old design.

That takes care of the CPU -- but the CPU isn't where the real excitement is.

Next page: Investigating the APU fabric

The APU

Next, I want to change gears and talk about the APU fabric -- the interconnect system that connects the CPU and the GPU together. What we've found suggests that the bulk of AMD's work on Kaveri went to improving these areas of the chip. Let's take a look at why. Note that none of these tests are HSA enabled -- what we're measuring here is performance in standard OpenCL.

APU Bandwidth

Intra-APU bandwidth is 1.5x that of Richland. The chip hits internal copy rates that surpass even Intel's, and that's no small feat for this architecture. In the past, the ring bus that connects Intel's GPU to its CPU and a dedicated 8MB L3 cache has given Chipzilla an advantage in these tests.

SiSoft Sandra: APU Copy Bandwidth

The time-to-copy, read, and write metrics are also hugely improved for Kaveri as compared to Richland. This goes along with the improved bandwidth, of course, but the fact is, this core is significantly faster than even Intel's Core i7-4770K. What we've measured here is the time it takes for the GPU to perform these operations -- the combined CPU+GPU test also shows significant improvement for the A10-7850K.

Finally, there are the memory latencies. We measured access time to shared memory, sequential access to main memory (typically the fastest access), and random memory accesses within the same page.
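Latency benchmarks of this kind typically use a dependent pointer chase, so each access must wait for the previous one to complete. The sketch below shows the general technique in miniature -- it is not Sandra's code, and Python-level timings are dominated by interpreter overhead, so treat it as a shape, not a measurement.

```python
# Sketch of the pointer-chase technique latency benchmarks use
# (illustrative; not SiSoft Sandra's implementation).
import random
import time

def chase(order, reps=1):
    # Link the elements into a cycle in the given order, then walk it.
    # Each step depends on the previous load, so the walk measures
    # access latency rather than bandwidth.
    n = len(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    t0 = time.perf_counter()
    i = 0
    for _ in range(n * reps):
        i = nxt[i]
    return time.perf_counter() - t0

n = 1 << 16
seq = list(range(n))               # sequential walk: prefetcher-friendly
rnd = seq[:]
random.shuffle(rnd)                # random walk: defeats prefetching
t_seq, t_rnd = chase(seq), chase(rnd)
print(t_seq, t_rnd)
```

On real hardware the random walk runs far slower than the sequential one, which is exactly the gap the sequential-versus-random numbers above are probing.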

SiSoft Sandra GPU Latencies

Again, Kaveri significantly outperforms Richland in these metrics. Better OpenCL performance is every bit as important as HSA itself. One of the goals of Kaveri is to nudge programmers towards OpenCL in general, even if they aren't ready to implement HSA specifically. Intel wasn't included in these tests because the benchmarks returned extremely low results -- so low as to indicate that they weren't being performed properly.

Next page: Putting it all together, and conclusion

Putting it all together

The Steamroller performance story is more nuanced than we first believed. Compare the performance gains that the chip offers in OpenCL and low-level GPGPU tests against the more modest gains in CPU-centric workloads, and it's obvious that AMD chose to spend much of its effort improving the heterogeneous compute side of the equation. This is separate from HSA -- our tests didn't use HSA, but they show enormous gains for the Kaveri APU against the older Richland core.

One of the differences between AMD and Intel is that AMD has been forced to innovate on both the CPU and GPU sides of its APU equation every single cycle. Intel, in contrast, has typically alternated -- saving major GPU performance jumps for die shrinks when the CPU architecture hasn't changed much, then doing the CPU updates the following cycle. Given how many other products AMD has ramped in the same time period, it's reasonable to conclude that the Steamroller CPU core received less total developer attention than it would have back when AMD only had one CPU architecture as opposed to three different CPUs, two console design wins, and the challenge of integrating GCN into an APU.

At the same time, however, it's also clear that dual decoders weren't the fix that many AMD enthusiasts were hoping they would be. L1 cache contention remains problematic, as does the low set associativity. Integer throughput is poor partly because only two of Steamroller's four integer pipelines are practically useful for most work. The long pipeline ensures that branch mispredictions will always hit the chip hard. The chip's L2 latency remains much higher than its Intel counterpart's, and its memory controller is much slower.

The question of whether next year's Carrizo can "fix" the Bulldozer architecture depends entirely on which design attributes are holding the core back. The only thing we know for certain about the core at this point is that Excavator includes support for AVX2. If Steamroller's low performance is primarily caused by the shared fetch unit, then decoupling that system and adding 256-bit registers for AVX2 could significantly improve the core's integer performance. If, on the other hand, the chip's low performance is directly related to its long pipeline and high cache contention in the L1, it's going to be much harder to solve.

A glimmer of hope

If there's one reason to be positive about Excavator, it's this: For the first time since 2011, AMD's next-generation APU won't be struggling to make the transition to a new GPU technology, add support for HSA, or jump for a new process node. Carrizo retains Kaveri's graphics engine and XDMA capability. AMD may further improve its HSA implementation or adjust the interconnect fabric, but HSA is fully supported in Kaveri already. That leaves more resources available for bringing the CPU core up to speed.

I still stand by what I said regarding Jaguar versus Steamroller -- the smaller core remains a better fit for AMD's long-term plans and the industry's goals, in my opinion -- but that doesn't mean Excavator couldn't significantly improve on Kaveri's performance per watt. Meanwhile, our investigation demonstrates that AMD did a great deal of low-level work on the APU that will pay benefits down the road -- benefits that weren't immediately apparent when we wrote our initial coverage.
