

IBM Has An AI Training Advantage


This article focuses on the “under the hood” machine learning (ML) work IBM announced last week at Think 2018, work that will soon accelerate Watson and PowerAI training performance even further. Not coincidentally, it also highlights IBM’s partnership with NVIDIA and IBM’s access to NVIDIA’s NVLink interconnect technology for GPUs.


Last year, I discussed an IBM paper that described how to train an ML image classification model in under an hour at 95% scaling efficiency and 75% accuracy, using the same data sets that Facebook used for training. IBM ran that training benchmark in the first half of 2017 using 64 POWER8-based Power System S822LC for High Performance Computing systems. Each of those systems had four NVIDIA Tesla P100 SXM2-connected GPUs and used IBM’s PowerAI software platform with Distributed Deep Learning (DDL).

IBM’s new paper, “Snap Machine Learning,” describes a new IBM ML library that more effectively uses available network, memory, and heterogeneous compute resources for ML training tasks. It also runs on a new platform, the IBM Power Systems AC922 server. The AC922 has four SXM2-connected NVIDIA Tesla V100 GPUs attached to dual POWER9 processors via NVIDIA’s latest NVLink 2.0 interface.


I talked with Hillery Hunter, an IBM Fellow and Director of Accelerated Cognitive Infrastructure at IBM Research. Ms. Hunter described the breakthroughs that contribute to Snap ML’s performance improvements as:

  • More effective mapping of ML training algorithms to the massively parallel GPU microarchitecture
  • More efficient scaling from a single server chassis to a cluster of servers (see the sketch after this list)
  • Improvements in memory management by minimizing communications between heterogeneous processing nodes (classic processors and GPUs) with a dynamic memory scheduler that speculatively moves data from processor to GPU memory (and vice versa)
  • IBM’s integration of NVIDIA’s NVLink interconnect technology for much faster IBM POWER9 to NVIDIA Tesla V100 communications, which is currently shipping in the IBM Power Systems AC922
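
Snap ML’s internals are not published in this article, but the second item above describes a well-known pattern: data-parallel training, in which each worker computes a gradient on its local shard of the data and the results are averaged across nodes. Below is a minimal sketch of that pattern using PyTorch’s torch.distributed primitives; the library choice and helper names are my illustration, not IBM’s implementation.

```python
import torch
import torch.distributed as dist

def train_step(model, loss_fn, x, y, lr=0.01):
    """One data-parallel step: compute a local gradient, then average
    it across the cluster. Assumes dist.init_process_group() has been
    called and that every worker starts from identical model weights."""
    loss = loss_fn(model(x), y)
    model.zero_grad()
    loss.backward()
    world = dist.get_world_size()
    for p in model.parameters():
        # The all-reduce is the communication step whose cost is set by
        # the interconnect: NVLink within a chassis, the network across
        # chassis. Faster links shrink this synchronization overhead.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world
        with torch.no_grad():
            p -= lr * p.grad
    return loss.item()
```

The efficiency of that synchronization step is what determines how close a cluster comes to the 95% scaling efficiency IBM reported for its earlier DDL work.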

The result is that training tasks which once took hours can now be done in seconds with no loss in accuracy, because Snap distributes and accelerates those tasks more efficiently. Snap will accelerate many kinds of logistic and linear regression analysis, as well as deep learning (DL) tasks.
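
To make the workload concrete, the core computation Snap ML accelerates for generalized linear models looks like the following bare-bones logistic regression trained by batch gradient descent. This is my plain-NumPy illustration of the underlying math, not Snap ML’s actual solver.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=100):
    """Batch gradient descent on the mean logistic loss.
    X: (n_samples, n_features) array, y: 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)             # predicted probabilities
        grad = X.T @ (p - y) / len(y)  # gradient of the mean log-loss
        w -= lr * grad
    return w

# Toy usage: 1,000 examples, two informative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)
w = fit_logistic(X, y)
```

At Criteo scale, hundreds of millions of examples, every matrix product above becomes a large data-movement and compute problem, which is where the GPU mapping and memory scheduling described earlier pay off.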


IBM claims a 46x speedup over the previously published record for ML training using Criteo Labs’ online advertising training dataset, with no loss in training accuracy. That earlier result was published over a year ago, but more importantly, Google used processor-only cloud-based virtual machine (VM) instances for that result. Google can easily assemble the 60 worker VM instances and 29 parameter VM instances (that’s 89 total cloud processor instances) for a training run.

However, even a year ago, Google’s results were a brute-force demonstration of scaling processor resources and a poster child for asking “why would you train using only processor cycles?” Google had not yet published its initial tensor processing unit (TPU) disclosure at the time it published results for the Criteo Labs training dataset. Perhaps that is why Google did not use TPUs to generate the (at the time) record results.

A year later, IBM pummeled Google’s record using only four Power System AC922 servers, each containing two POWER9 processors and four NVIDIA Tesla V100 GPUs. The comparison pits 89 cloud-based VM instances against a total of 24 compute elements housed in four server chassis (8 processors and 16 GPUs), with the IBM cluster performing 46x faster.

Buried in the same Snap paper, IBM also directly compared the AC922 server to a mainstream Intel-based server using a subset of Criteo’s Terabyte Click Logs (the first 200 million training examples, a reasonably sized subset). The systems tested were:

  • Dual-socket Power System AC922 server with POWER9 processors connected to four NVIDIA Tesla V100 GPUs via NVLink 2.0, but only using one of the GPUs for the comparison
  • Dual-socket server with Intel Xeon Gold 6150 processors connected via PCIe 3.0 to a single NVIDIA Tesla V100 GPU

IBM measured an effective bandwidth of 68.1 GB/s over NVLink 2.0 in the AC922 system and 11.8 GB/s over PCIe 3.0 in the Intel-based system. That is a raw 5.8x processor-to-GPU interconnect advantage for the AC922 system using NVLink 2.0.
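
IBM’s exact measurement methodology isn’t spelled out here, but a rough way to estimate effective host-to-GPU bandwidth is to time a large pinned-memory copy, as in this PyTorch sketch. Transfer size, warm-up, and synchronization all materially affect the number, so treat it as an approximation.

```python
import time
import torch

def host_to_gpu_bandwidth(size_mb=1024, repeats=10):
    """Estimate effective host->GPU copy bandwidth in GB/s."""
    n = size_mb * 1024 * 1024 // 4                  # float32 elements
    src = torch.empty(n, dtype=torch.float32).pin_memory()
    dst = torch.empty(n, dtype=torch.float32, device="cuda")
    dst.copy_(src)                                  # warm-up transfer
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repeats):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()                        # wait for all copies
    elapsed = time.perf_counter() - t0
    return repeats * size_mb / 1024 / elapsed       # GB/s

# IBM's numbers: 68.1 GB/s (NVLink 2.0) / 11.8 GB/s (PCIe 3.0) ≈ 5.8x
```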

Snap ML carefully manages data movement between processors and GPUs. Because data transfer speed is almost six times faster using NVLink, Snap can hide most of the data copy time between processor and GPU behind the time it takes the processors and GPU to process the data.
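
The article doesn’t describe Snap ML’s scheduler in detail, but the general technique for hiding copy time is double-buffering: while the GPU computes on one chunk of data, the next chunk streams in over a separate CUDA stream. Here is a minimal PyTorch sketch; the chunking scheme and the process callback are placeholders of mine.

```python
import torch

def pipelined_pass(chunks, process):
    """Overlap host->GPU copies with GPU compute using a side stream.
    `chunks` are pinned CPU tensors; `process` does the per-chunk work."""
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        on_gpu = chunks[0].to("cuda", non_blocking=True)  # first transfer
    for i in range(len(chunks)):
        # Make the compute stream wait until the in-flight copy finishes.
        torch.cuda.current_stream().wait_stream(copy_stream)
        current = on_gpu
        if i + 1 < len(chunks):
            # Prefetch the next chunk while `current` is being processed.
            with torch.cuda.stream(copy_stream):
                on_gpu = chunks[i + 1].to("cuda", non_blocking=True)
        process(current)  # compute overlaps the prefetch copy
    torch.cuda.synchronize()
```

The faster the link, the shorter each prefetch, and the easier it is to keep the copy time entirely inside the compute time, which is the effect Snap ML exploits on NVLink.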

The result was a 3.5x speedup in actual, measured training time for the Power Systems AC922, server to server, with only one GPU active per system.

Data movement severely impacts ML performance, and it does not automatically follow that measuring a single GPU across NVLink versus a single GPU across PCIe will scale to comparisons of multiple GPUs using those connections. However, this test is a good indication that there should still be a measurable advantage to using IBM POWER9 processors connected by NVLink to a cluster of four or more NVIDIA Tesla GPUs. It also highlights the need for standard ML benchmarks that enable direct comparisons between servers using real-world applications.

A 3.5x speedup in training time is just as important a metric as claiming performance leadership on an overall benchmark.

No other processor manufacturer has integrated NVIDIA’s NVLink interconnect directly into the processor package—all the competing server ecosystems depend on PCIe interconnects. Direct access to NVLink and Snap ML’s software architecture both contribute to the measured training time speedup.

Also last week at Think 2018, IBM and Apple announced IBM Watson Services for Core ML. This latest phase of the nearly four-year-old Apple/IBM partnership gives Apple iOS software developers access to IBM’s leading-edge AI and ML development environments and cloud training support. It reciprocally extends IBM Watson’s reach into running ML inference tasks on Apple’s highly successful consumer device ecosystem. Training ML models faster translates into delivering more up-to-date models for inferencing on edge devices, such as iOS-based smartphones.

Snap ML will be available as part of IBM’s PowerAI Tech Preview portfolio later this year.


-- The author and members of the TIRIAS Research staff do not hold equity positions in any of the companies mentioned. TIRIAS Research tracks and consults for companies throughout the electronics ecosystem from semiconductors to systems and sensors to the cloud.
