
Beyond DDR4: The differences between Wide I/O, HBM, and Hybrid Memory Cube

There are multiple high-speed memory interfaces coming in the future to every aspect of the computing market. Here, we break down the new standards for each segment, their differences, and their respective strengths.
By Joel Hruska

There are major changes coming in the memory interface world, and recent interest in AMD and Nvidia's plans to adopt the new High Bandwidth Memory standard makes this a good time to explain the three new standards: Wide I/O, HBM, and HMC. Let's kick things off with a basic question -- why do we need new memory standards in the first place?

DDR4 and LPDDR4 are both incremental, evolutionary improvements to existing DRAM designs. As we'll explore in this story, both standards improve power consumption and performance relative to DDR3/LPDDR3, but they're not a huge leap forward. Many of the underlying technologies baked into these standards were set a decade or more ago, when total system bandwidth was a fraction of current levels and CPUs were all single-core.

While the standard has evolved considerably from where it began, it's worth remembering that the first modern SDRAM DIMMs debuted on a 66MHz interface and provided 533MB/s of bandwidth. DDR4-3200, in contrast, is clocked at up to 1600MHz and offers up to 25.6GB/s of memory bandwidth. That's an increase of 48x over nearly 20 years, but it also means that we've pushed the standard a very long way. While there's been debate over whether or not to define a traditional DDR5, the broad industry consensus is that new solutions are necessary.
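Those figures fall straight out of the basic peak-bandwidth formula -- bus width times transfer rate. Here's the arithmetic as a quick Python sketch, using the standard 64-bit DIMM data bus:

```python
# Peak theoretical bandwidth = bus width (bytes) * transfers per second.
DIMM_BYTES = 64 // 8  # standard 64-bit DIMM data bus

sdram_bw = DIMM_BYTES * 66.6e6     # PC66 SDRAM, one transfer per clock: ~533 MB/s
ddr4_bw = DIMM_BYTES * 1600e6 * 2  # DDR4-3200, two transfers per 1600MHz clock: 25.6 GB/s

print(f"PC66:      {sdram_bw / 1e6:.0f} MB/s")
print(f"DDR4-3200: {ddr4_bw / 1e9:.1f} GB/s")
print(f"Increase:  {ddr4_bw / sdram_bw:.0f}x")
```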

Samsung's Wide I/O: Ultra low-power bandwidth

Wide I/O and Wide I/O 2 have been backed by companies like Samsung and are designed to give mobile SoCs maximum bandwidth at the lowest possible power consumption. The technology is most interesting to companies building smartphones and embedded systems, where high-resolution displays have put enormous pressure on bandwidth and low power draw is critical to battery life.

[Image: Samsung's Wide I/O 2 architecture]

Wide I/O is designed specifically to stack on top of SoCs and use vertical interconnects to minimize electrical interference and die footprint. This optimizes the package's size, but also imposes certain thermal limitations, since heat radiated from the SoC has to pass through the entire memory die. Operating frequencies are lower, but a large number of I/O pins increases bandwidth by using a memory bus that's up to 1024 bits wide.
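The trade-off is easy to quantify. As a rough sketch (the 200MHz single-data-rate clock here is purely illustrative, not a spec figure), a 1024-bit Wide I/O-style bus matches a DDR4-3200 DIMM's peak bandwidth while running at a fraction of the clock:

```python
def peak_bw_gbs(bus_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    """Peak theoretical bandwidth in GB/s for a parallel DRAM bus."""
    return bus_bits / 8 * clock_mhz * 1e6 * transfers_per_clock / 1e9

print(peak_bw_gbs(1024, 200))    # 25.6 GB/s: wide, slow, and low-power
print(peak_bw_gbs(64, 1600, 2))  # 25.6 GB/s: DDR4-3200's narrow, fast bus
```

Lower clocks mean simpler, lower-power I/O drivers, which is exactly the trade mobile SoCs want to make.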

Wide I/O is the first version of the standard, but it's Wide I/O 2 that's expected to actually reach the mass market -- though some have argued that true adoption won't come until Wide I/O 3, which should finally open a meaningful performance gap over LPDDR4. The standard was ratified by JEDEC, but it's often associated with Samsung due to that company's extensive work on bringing it to market. Timing is unclear, but no major devices are expected to ship with Wide I/O in the first half of 2015. We may see some limited pickup in the back half of the year, possibly from Samsung's own foundries.

Wide I/O is explicitly designed as a 3D interface, but 2.5D interposer designs are also possible. Since one of the major challenges of a 3D Wide I/O structure is cooling the SoC underneath the DRAM, the first chips may well opt for a 2.5D interposer instead.

Intel and Micron: Hybrid Memory Cube

In Corner #2, we have Hybrid Memory Cube, the joint Intel-Micron standard. HMC is designed to emphasize massive amounts of bandwidth at higher power consumption and cost than Wide I/O 2. Intel and Micron have claimed that up to 400GB/s of bandwidth may be possible via HMC, with production expected in 2016 and commercial availability in 2017.
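For a sense of how HMC reaches numbers like that: rather than a wide parallel bus, it moves data over narrow, heavily serialized links. A sketch of the link math (the 16-lane, four-link configuration matches the published HMC spec; the 30Gbps lane rate is an assumption for illustration):

```python
# Aggregate HMC bandwidth = links * lanes per direction * 2 directions * lane rate.
def hmc_bw_gbs(links: int, lanes_per_dir: int, gbps_per_lane: float) -> float:
    return links * lanes_per_dir * 2 * gbps_per_lane / 8

print(hmc_bw_gbs(4, 16, 15.0))  # 240 GB/s: four full-width links at 15 Gbps/lane
print(hmc_bw_gbs(4, 16, 30.0))  # 480 GB/s: a doubled lane rate lands in the ~400GB/s class
```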

HMC is not a JEDEC standard but has multiple development partners, including Samsung, Micron, Microsoft, Altera, ARM, Intel, HP, and Xilinx. One of the major goals of HMC is to strip out the duplicative control logic of modern DIMMs, simplify the design, connect the entire stack in a 3D configuration, and then use a single control logic layer to handle all read/write traffic.

[Image: HMC slide]

The promise of Hybrid Memory Cube is an architecture that's explicitly designed to respond to multi-core scenarios and deliver data with much higher bandwidth and lower overall latency. HMC is extremely forward-looking, and it solves a number of problems related to exascale computing, but it's also dependent on a number of profound improvements to semiconductor manufacturing. It's the most expensive new standard, and the only one not ratified by JEDEC.

[Image: Power consumption comparison]

The slide above is from 2011, but the projections appear to still be accurate. At huge scale, memory power consumption from DDR3 and DDR4 is simply too high to allow for efficient scaling. Slashing memory power consumption by two-thirds would have a huge impact on supercomputing in the 2020 timeframe.
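To put a two-thirds cut in perspective, memory power at scale is usually reckoned in picojoules per bit moved. A back-of-the-envelope sketch, using energy-per-bit figures that are rough assumptions (~65 pJ/bit for DDR3-class DIMMs, ~20 pJ/bit for an HMC-class stack) rather than spec values:

```python
# Wall power needed to sustain 1 TB/s of memory traffic at a given energy cost per bit.
DDR3_PJ_PER_BIT = 65.0  # assumed: roughly the figure often cited for DDR3 DIMMs
HMC_PJ_PER_BIT = 20.0   # assumed: roughly a two-thirds reduction

traffic_bits_per_s = 1e12 * 8  # 1 TB/s

ddr3_watts = traffic_bits_per_s * DDR3_PJ_PER_BIT * 1e-12  # ~520 W
hmc_watts = traffic_bits_per_s * HMC_PJ_PER_BIT * 1e-12    # ~160 W
print(f"Per TB/s of traffic -- DDR3-class: {ddr3_watts:.0f} W, HMC-class: {hmc_watts:.0f} W")
```

Multiply that by the hundreds of TB/s of aggregate bandwidth an exascale machine needs, and the difference is measured in megawatts.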


AMD, Nvidia, and Hynix: High Bandwidth Memory

Finally, we have High Bandwidth Memory, as planned by Hynix, AMD, and Nvidia. HBM is explicitly designed for graphics, but it's a specialized application of Wide I/O 2. Both AMD and Nvidia are adopting it for next-generation GPUs -- Nvidia has stated they'll use it for Pascal in 2016, while AMD is working on the tech but hasn't yet publicly stated which GPUs will support it.

HBM uses 128-bit wide channels, up to eight of them per stack, for a 1024-bit interface. Total bandwidth is in the 128GB/s to 256GB/s range with stacks of four to eight DRAM dies, and each channel is independently timed and controlled. Future GPUs built with HBM might reach 512GB/s to 1TB/s of main memory bandwidth, as compared to 336GB/s on the current Titan Black.
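Those numbers fall straight out of the channel math; per-pin data rates of 1-2Gbps are the ballpark figures attached to first-generation HBM:

```python
# HBM stack bandwidth = channels * channel width * per-pin data rate.
CHANNEL_BITS = 128

def hbm_stack_bw_gbs(channels: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a single HBM stack."""
    return channels * CHANNEL_BITS * gbps_per_pin / 8

print(hbm_stack_bw_gbs(8, 1.0))      # 128 GB/s: full 1024-bit stack at 1 Gbps/pin
print(hbm_stack_bw_gbs(8, 2.0))      # 256 GB/s at 2 Gbps/pin
print(4 * hbm_stack_bw_gbs(8, 2.0))  # 1024 GB/s: four stacks reach the ~1TB/s projection
```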

[Image: HBM memory]

HBM is the "middle" option in cost and bandwidth -- it's neither as cheap nor as power-efficient as Wide I/O, but it's explicitly designed for high-performance GPU environments and should still be cheaper than HMC.

HBM may also be the first of these technologies to reach the mass market, depending on Samsung's rollout schedule for Wide I/O 2.

Putting it all together: Comparing and contrasting the three standards

Finally, I want to put the features and capabilities of these memory standards side by side. The following slides are drawn from reports by TSMC, several other vendors, and analyst firms, and represent the best-known details and current state of affairs.

First, here's the comprehensive sheet for all three memory standards plus current DDR4 and LPDDR4.

[Image: DRAM comparison]

This next chart presents all three solutions by bandwidth, power efficiency, and overall cost alongside the current DDR4 and LPDDR4 standards. Again, these are TSMC's charts, but we can assume that GlobalFoundries and Samsung are using similar metrics.

[Image: Next-generation memory standards]

On the left, you have power and bandwidth; on the right, price and bandwidth. What these graphs collectively show is that conventional DDR is relatively power-inefficient but well priced, while upcoming standards like Wide I/O 2 offer excellent power and bandwidth characteristics, but at significantly higher cost. HBM and HMC push bandwidth and power efficiency further still, but at higher prices yet, which explains why they're primarily intended for enterprise and graphics applications rather than mobile, commodity hardware.

[Image: TSMC's roadmap]

Finally, here's TSMC's roadmap for introducing each of the major new technologies. HBM ramps begin this year, HMC in 2016, and Wide I/O in 2016 or 2017. In this case, it's possible Samsung will bring Wide I/O 2 to market before TSMC -- or we might not see a widespread ramp on that technology until after LPDDR4 has run its course and is in need of a replacement. TSVs have been the Next Big Thing for several years now, and while I absolutely believe they're coming in the long term, the time frame has shifted outwards on more than one occasion.

Conclusion

When these new technologies ship, they could collectively revolutionize memory access speeds and overall performance. Many of the new standards explicitly allow for multi-threading and simultaneous accesses to different banks of memory, which could drastically cut latency on common operations. Meanwhile, in mobile, tablet GPUs should see a profound performance kick. One reason tablet games continue to lag behind their desktop counterparts is that mobile parts simply lack the memory bandwidth for operations desktop GPUs can handle.

These collective improvements could also eliminate the critical bandwidth shortage that has hurt the GPU performance of AMD's APUs and, to a lesser extent, Intel's CPUs as well. Both companies have designed solutions that make maximum use of a scarce resource, but a dual-channel memory architecture puts severe limits on how much GPU horsepower can effectively fit into a CPU socket. HMC and HBM could blow the doors off that problem, far beyond what AMD's hypothetical GDDR5-equipped Kaveri might have offered.
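For a sense of the gap, compare the ceiling a dual-channel DDR3 interface imposes on an APU with what a contemporary discrete card enjoys (DDR3-2133 is a typical high-end Kaveri configuration; the Titan Black figures are its published specs):

```python
# Peak bandwidth = bus width (bytes) * effective transfer rate.
apu_bw = 2 * 64 / 8 * 2133e6 / 1e9  # dual-channel DDR3-2133: ~34.1 GB/s
gpu_bw = 384 / 8 * 7000e6 / 1e9     # Titan Black, 384-bit GDDR5 at 7 Gbps: 336 GB/s
print(f"APU: {apu_bw:.1f} GB/s vs. discrete GPU: {gpu_bw:.0f} GB/s ({gpu_bw / apu_bw:.0f}x)")
```

Roughly a 10x deficit, no matter how clever the integrated GPU's caching and compression get.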

AMD has talked about building an APU with HBM, but it's not clear whether we'll see that chip in 2016 or at a later date. When it does come, the advantage could be profound. While AMD can't pack a 300W GPU into a CPU socket for thermal reasons, the company could improve integrated GPU performance by 40-50% over today's figures -- and in the process, finally offer an integrated GPU that hits "good enough" for the low end of the enthusiast market.

