Botched Software Update To Networking Gear Caused One Of GitHub’s All-Time Worst Outages

A botched software update to networking gear caused one of GitHub’s all-time worst outages last weekend, the second major disruption that customers of the popular social coding platform have suffered through in the past several weeks.

In a blog post, GitHub’s Mark Imbriaco explained that the December 22 outage came during a software update to its aggregation switches that was recommended by GitHub’s unnamed network vendor. The update was supposed to fix the problem that led to the company’s last major outage, in late November.

The problem was compounded by its subsequent impact on the fileservers, which, combined with the other problems, forced the company to put the service into maintenance mode. Service disruptions lasted about 19 hours. GitHub was not down the entire time, but latency and loss of access persisted over an extended period.

Data center networks can have a variety of switches to keep the data flowing. GitHub uses access switches to connect to the servers. Those in turn connect to a pair of aggregation switches that “use a feature called MLAG to appear as a single switch to the access switches for the purposes of link aggregation, spanning tree, and other layer 2 protocols that expect to have a single master device.”

Having a pair leaves room for error: if one switch fails, the other should act as a backup. But during the upgrade, a timing error caused confusion between the two switches. That led to a lot of churn and the disruption of the service.

Imbriaco explained it this way:

This is where things began going poorly. When the agent on one of the switches is terminated, the peer has a 5 second timeout period where it waits to hear from it again. If it does not hear from the peer, but still sees active links between them, it assumes that the other switch is still running but in an inconsistent state. In this situation it is not able to safely takeover the shared resources so it defaults back to behaving as a standalone switch for purposes of link aggregation, spanning-tree, and other layer two protocols.
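The decision Imbriaco describes can be sketched in a few lines of Python. This is only an illustration of the logic in the quote, not the vendor's implementation: the names `Role` and `decide_role` are hypothetical, and the takeover branch (peer silent and links down) is an assumption inferred from the quote, which spells out only the standalone fallback.

```python
from enum import Enum

# Timeout quoted in the post: the peer waits 5 seconds to hear from the
# terminated agent again.
PEER_TIMEOUT_SECONDS = 5.0


class Role(Enum):
    MLAG_PAIR = "mlag_pair"    # peer heard from recently; the pair acts as one switch
    STANDALONE = "standalone"  # peer silent but links up; fall back to standalone
    TAKEOVER = "takeover"      # assumed: peer silent and links down; take over


def decide_role(seconds_since_peer_heard: float, peer_links_up: bool) -> Role:
    """Pick a role for one switch based on what it can observe of its peer."""
    if seconds_since_peer_heard < PEER_TIMEOUT_SECONDS:
        # Still within the timeout window: keep behaving as half of the pair.
        return Role.MLAG_PAIR
    if peer_links_up:
        # The case described in the quote: the peer is presumed running but
        # inconsistent, so taking over the shared resources is not safe.
        return Role.STANDALONE
    # Assumption, not stated in the post: with the links down as well, the
    # peer is presumed gone and the shared resources can be taken over.
    return Role.TAKEOVER
```

Under these assumptions, `decide_role(6.0, True)` returns `Role.STANDALONE`, the state the quote describes: the peer's agent has been silent past the timeout while the links between the switches are still active.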

The fileservers also suffered from the confusion between the network switches. What followed is a pattern often seen in outages involving major services: amid the confusion, machines attempt to gain control, which leads to more problems and persistent issues for customers.

The GitHub outage is a lesson for growing startups and enterprise shops exploring ways to keep a scaled-out service stable. It also shows what inevitably happens during major outages: systems start relaying incorrect information, which sets off a cascading effect that impacts customers and their ability to use the service.