
Why the Cloud Needs a Network Decongestant


(Image: a gigabit HP ProCurve network switch in a nest of Cat5 cables. Photo credit: Wikipedia)

Every part of the world of business computing is getting cloudier. By that I mean that computing resources are becoming virtual. Instead of running on a physical machine, our applications run on virtual machines, ones created by software and controlled by APIs. This is happening in data centers that deploy VMware and other virtualization technologies to increase utilization and flexibility, and in public and private clouds created by the likes of Amazon and Rackspace that provide on-demand computing resources.

(A note to IT purists: I am aware that not all virtualization counts as cloud computing according to the strict definition put forth by NIST. I am conflating clouds and virtualized environments because most virtualized environments face the network optimization challenges explained in this article. As a practical matter, most large data centers that are not virtualized face the same challenges as well.)

In the face of all of this hardware becoming software, it is easy to think that we can forget about the physical world. But this is not true, especially when it comes to the network layer. I recently started studying a company called Gnodal, which creates a new kind of Ethernet-based, ASIC-accelerated switch that can dynamically optimize the flow of network traffic. Learning about this technology has convinced me that more and more companies are going to face a large bump in the road as their infrastructure becomes more virtual.

All Computing Is Eventually Physical

Remember, no matter what, at some point, all computing becomes physical. For anything to happen, at some point, a CPU has to load instructions and process data. That data has to arrive at the computer containing the CPU, be processed, and then be sent somewhere, often to another computer. A physical network that connects many computers together must come into play. It is in the switching layer that the bump in the road will appear as the computing world becomes more virtual.

Here’s how it will play out:

  • As computing becomes virtualized, it will be harder to predict which workload is running on which computer. The point of much of this virtualization is to allow workloads (that is, applications) to be moved around to increase utilization or for load balancing.
  • As a result, the traffic through the physical switching layer will also vary widely. This traffic varies both as workloads move around and as the amount of work being done by each application changes.
  • In addition, as more and more applications become high performance and serve up immediate responses, the network layer cannot be a bottleneck.
  • In such a dynamic environment, it will be impractical to plan a switch configuration that is optimized for every possible traffic pattern. Depending on how workloads move around, bottlenecks will appear (and potentially disappear) and dramatically affect performance.

Therefore, as virtualization increases and more applications enter the realm of high performance computing, an increasing number of companies will hit a bump in the road: dramatic variability in application performance due to network congestion. When they hit that bump, they will have to find a way of adaptively optimizing the flow of network traffic across potentially complex switch hierarchies.
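The planning problem described above can be made concrete with a toy simulation. The Python below is purely illustrative (the link counts, flow counts, and VM names are assumptions, and it does not model any particular vendor's switch): it shows how statically hashing flows onto links, which is effectively what an up-front switch plan does, creates hotspots as soon as a few heavy workloads land on unlucky hash values.

```python
import random
import zlib
from collections import Counter

# Toy model: four uplinks between two switch tiers, with flows pinned to
# links by a static hash of (source, destination). All numbers and names
# here are illustrative assumptions, not measurements of a real network.
NUM_LINKS = 4

def static_link(src, dst):
    # Deterministic hash, so the assignment never adapts to actual load.
    return zlib.crc32(f"{src}->{dst}".encode()) % NUM_LINKS

random.seed(42)
# A dozen heavy flows between virtual machines that have migrated around.
flows = [(f"vm{random.randrange(20)}", f"vm{random.randrange(20)}")
         for _ in range(12)]

load = Counter(static_link(s, d) for s, d in flows)
print("flows per link:", dict(load))
# With only a few large flows, static hashing piles several onto one
# link while others sit idle -- a hotspot no up-front plan can avoid.
```

With twelve flows over four links, at least one link always ends up carrying three or more flows while others may carry none; the imbalance gets worse as workloads migrate and the hash inputs change.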

Amazon as a Case in Point

For proof that we are all headed for such a bump, we only have to look at the performance characteristics of Amazon Web Services Elastic Compute Cloud (aka EC2). Various studies have documented the variability of performance of EC2 instances. Some practitioners allocate 20 or 30 EC2 instances, test them all to see which are running fast, and then only use those and shut down the rest.

There are two reasons that performance is variable. First, it is not possible to predict what load the other applications running on the same physical device have. Second, it is extremely difficult to predict network performance and to optimize it adaptively. Companies like Virtustream are working on the first problem. Gnodal is working on the second.

How An Adaptive Switch Acts as a Network Decongestant

Modern data centers consist of racks upon racks of servers and storage, typically known as devices. These devices require interconnectivity to reach centralized storage and, as discussed above, increasingly need to communicate with each other, both to support the virtualization properties already described and to deliver performance in an acceptable time frame, often on the order of microseconds.

The aggregation of such devices requires switches, typically placed at the Top of Rack. These switches are then connected together in some hierarchical manner to provide global communication across the data center floor. The ubiquitous standard for such communication is Ethernet. However, this standard imposes a number of checks and balances to ensure that communication is delivered reliably, and this has traditionally limited performance and required expensive core routers, constraining the design and causing congestion.

A single Ethernet Top-of-Rack switch can often deliver effective point-to-point communication among the devices connected directly to it. However, when such switches are interconnected, the underlying communication protocols can disturb local traffic and, more importantly, fall short of scaling this performance to devices connected multiple hops away. The problem Gnodal seeks to solve is therefore simple to state: how can numerous computers, all connected by multiple Ethernet switches, communicate with high performance, avoiding delays caused by congestion, while delivering the high throughput and low latency modern applications require? In essence, the goal is to provide the properties of a single physical Ethernet switch using a fabric formed from multiple Ethernet switches.

An example of the problem can be described as follows. If two devices attached to different switches across a data center wish to communicate, traffic between them can proceed undisturbed. If another two devices connected to the same switches now wish to do the same, that traffic must take a different route to deliver the same performance. Scale these communication patterns up and add traffic that requires every device to talk to every other device across the whole switch network (which is often the case), and multiple routes within the network are required to avoid congestion. These routes change rapidly, reflecting often short-lived global communication patterns as well as longer-term storage traffic. The ability to dynamically reroute traffic and exploit all the paths in the network is therefore critical to delivering an equitable share of all possible resources in such a complex environment while avoiding congestion. (For more on why networks get congested see: “What Causes Enterprise Networks to Slow Down? Seven Deadly Sins of Network Congestion”)
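The rerouting idea described above can be sketched in a few lines of illustrative Python. This is a hypothetical model of adaptive path selection, not Gnodal's actual algorithm: each flow is steered onto the least-loaded of the available paths, and a flow can be moved mid-conversation when a clearly less busy path appears.

```python
# Toy sketch of adaptive path selection across a multi-path fabric.
# The class, flow names, and byte counts are invented for illustration.

class AdaptiveFabric:
    def __init__(self, num_paths):
        self.load = [0] * num_paths   # bytes in flight per path
        self.route = {}               # flow id -> current path index

    def send(self, flow, nbytes):
        path = self.route.get(flow)
        best = min(range(len(self.load)), key=self.load.__getitem__)
        # Reroute mid-conversation if another path is less congested.
        if path is None or self.load[path] > self.load[best]:
            path = best
            self.route[flow] = path
        self.load[path] += nbytes
        return path

fabric = AdaptiveFabric(num_paths=4)
paths = [fabric.send("flowA", 1000), fabric.send("flowB", 1000),
         fabric.send("flowC", 1000), fabric.send("flowD", 1000)]
print(paths)  # each flow lands on its own path, so no shared hotspot
```

Contrast this with the static hashing case: because the decision consults current load rather than a fixed hash, four concurrent flows spread across all four paths instead of colliding.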

Gnodal’s technology is based on the premise that you cannot solve this problem without smarter switches and intelligence baked into hardware. It is the only vendor in the industry today offering adaptive load-balancing technology that allows on-the-fly redetermination of paths to work around network congestion. (See “How Gnodal Solves the High Performance Switching Problem Inherent in the Cloud” for the details of how the technology works.) In a typical Gnodal configuration, you connect many computers to however many Gnodal switches you need. Gnodal acts as what is called a switch fabric, essentially an aggregator of other Layer 2 switches in the network. Then, because Gnodal builds intelligence into the switches through its proprietary ASIC (application-specific integrated circuit), which acts as a central nervous system sensing traffic flows, network traffic can be rerouted in real time during a conversation to avoid congestion and preserve application availability during periods of peak traffic.

While Gnodal switches solve problems in cloud computing and high performance environments, they can also help improve the performance of most data centers as those data centers grow more complex. More servers are being connected to each other and to internal resources such as databases. It is becoming difficult to do what came easily in the past: plan how the components will talk to each other and optimize the communication through switches. Even when that strategy worked, it by nature overcompensated, leaving excess capacity sitting around underutilized, waiting for a spike. Using Gnodal’s approach, a data center can support more traffic with better performance using fewer switches and no time spent on optimization and reconfiguration, essentially leveraging nearly 100% of the fabric’s available bandwidth.

Can Software Defined Networks Become an Adaptive Switch?

One question that immediately occurred to me after looking into Gnodal was: Can software defined networks (SDNs) do the same sort of optimization that Gnodal does?

After talking to Fred Homewood, CTO of Gnodal, I’m convinced the answer is no, not unless they imitate Gnodal’s approach of putting the intelligence in hardware. Here’s the problem: Gnodal does the rerouting immediately when congestion is recognized. As soon as one packet is delayed, the next packet is rerouted within microseconds. With a software-defined approach, you cannot affect specific conversations; you can only change the topology of the network to some extent, within the bounds of the physical connections. This reconfiguration may help you stop the next traffic jam, but not the one you just saw happening. And the reconfiguration would take place 1,000 times slower. In addition, it is not likely to use network capacity as efficiently as Gnodal’s approach.
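The timescale argument can be checked with simple back-of-envelope arithmetic. The figures below are round illustrative assumptions, not measured numbers: at 10 Gb/s, hundreds of full-size packets traverse a link in the time a millisecond-scale control-plane update takes, which is why a software-driven reroute always arrives after the congestion it was reacting to.

```python
# Back-of-envelope comparison of reaction timescales.
# LINK_GBPS and the 1 ms control-loop figure are assumed round numbers.
LINK_GBPS = 10
PACKET_BYTES = 1500

# Time to transmit one full-size Ethernet frame, in microseconds.
packet_time_us = PACKET_BYTES * 8 / (LINK_GBPS * 1000)

# Packets that stream past during a ~1 ms controller-driven update.
sdn_reaction_us = 1000
packets_missed = sdn_reaction_us / packet_time_us

print(f"one 1500-byte packet at {LINK_GBPS} Gb/s: {packet_time_us:.2f} us")
print(f"packets passing during a 1 ms control-plane update: {packets_missed:.0f}")
```

Under these assumptions a single 1500-byte packet takes 1.2 microseconds on the wire, so roughly 800 packets sail through a congested link before a millisecond-scale reconfiguration can take effect, whereas an in-ASIC decision can act on the very next packet.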

“We’re anticipating an elongated transition to commodity switches to fully enable SDN, so it will require a few hurdles before enterprises realize the real solution to their virtual problems was the underlying physical infrastructure all along,” said Fred Homewood, CTO of Gnodal. “Meanwhile, Gnodal can continue to innovate around its ASIC and OS to the degree that ultra low latency and high throughput demands converge over the same network connection.”

It will be interesting to see how many bumps in the road it takes before companies recognize the physical solution to their virtual problems.


Dan Woods is CTO and editor of CITO Research, a publication that seeks to advance the craft of technology leadership. For more stories like this one visit www.CITOResearch.com.