by Peter Sayer

Inside Cisco’s IT monitoring makeover

Feature
Dec 09, 2019

When Cisco’s IT team set out to replace a homegrown monitoring system, changing minds was as much of a challenge as changing code.

Cisco Systems offers its customers an array of options for monitoring their networks and IT systems — but how does Cisco’s internal IT team monitor the infrastructure used by its 75,000 employees around the world?

The answer to that is changing as the company moves from a homegrown monitoring system to one built on top of commercial products, says Radhika Chagarlamudi, senior director for business collaboration and software platforms with Cisco’s IT team.

From a core of on-premises systems, Cisco’s IT infrastructure — like that of other enterprises — has grown to encompass cloud and SaaS systems.

“We have a mixture, a hybrid ecosystem,” says Chagarlamudi. “We also have a lot more dynamic infrastructure than we’ve had in the past.”

That complexity means that, if problems do arise, they can be hard to fix. “We had to figure out how to get the mean time to resolution faster,” she says.

Another complicating factor for Cisco has been that every team and every domain has tried to solve its problem individually, she adds.

“What we wanted to do was simplify,” she says, by developing an overall architecture, a blueprint for how Cisco’s internal software ecosystem should work.

Start at the beginning

The key to a successful transformation, says Chagarlamudi, is not to zero in on the technology but to first draw up the blueprint for the capabilities that you need: “Don’t focus on the tool; don’t focus on the platform.”

For Cisco’s IT monitoring system, this meant focusing on a foundation of specific monitoring capabilities with a consistent data model across applications, whether they were running on-premises, in the cloud or in SaaS mode.

“If you don’t have your foundation and your data architecture aligned, the ability to correlate events or add any intelligence is going to be even more difficult,” she says.
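To make the idea of a consistent data model concrete, here is a minimal Python sketch. It is not Cisco's data model: the `Event` fields, the two raw payload shapes and the 60-second window are all assumptions chosen for illustration. It shows why normalizing events first makes correlation across on-premises and cloud sources straightforward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical common record shared by every monitored domain."""
    source_type: str  # "on_prem" or "cloud" in this sketch
    resource: str
    status: str       # "up" or "down"
    timestamp: float  # seconds


def from_on_prem(raw):
    # Assumed on-premises payload: {"host": ..., "state": "UP"/"DOWN", "ts": ...}
    return Event("on_prem", raw["host"], raw["state"].lower(), float(raw["ts"]))


def from_cloud(raw):
    # Assumed cloud payload: {"instance_id": ..., "healthy": bool, "time": ...}
    status = "up" if raw["healthy"] else "down"
    return Event("cloud", raw["instance_id"], status, float(raw["time"]))


def correlate_down(events, window=60.0):
    """Group 'down' events that start within `window` seconds of the
    first event in the current group, regardless of where they came from."""
    groups = []
    down = sorted((e for e in events if e.status == "down"),
                  key=lambda e: e.timestamp)
    for event in down:
        if groups and event.timestamp - groups[-1][0].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups
```

Because both adapters emit the same record, the correlation step never needs to know which environment an event came from.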

Once you’ve determined how you want to modernize operations, and what capabilities you need, you can then map the capabilities of your existing software ecosystem to find overlaps and gaps, and to see what you can retire, what you can reuse, and what you must replace.
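The mapping exercise described above can be sketched with a few set operations. The capability names and tool inventory below are hypothetical, not Cisco's. The sketch finds required capabilities that no tool covers (gaps), capabilities covered by more than one tool (overlaps), and tools that cover nothing required (candidates to retire).

```python
from collections import defaultdict

# Hypothetical blueprint and inventory, for illustration only.
REQUIRED = {"up_down", "event_correlation", "api_integration", "cloud_metrics"}

TOOLS = {
    "legacy_monitor": {"up_down", "log_search"},
    "vendor_platform": {"up_down", "api_integration", "cloud_metrics"},
    "old_dashboard": {"log_search"},
}


def map_capabilities(required, tools):
    """Return (gaps, overlaps, retire_candidates) for a capability blueprint."""
    providers = defaultdict(set)
    for tool, capabilities in tools.items():
        for capability in capabilities & required:
            providers[capability].add(tool)

    # Required capabilities that no existing tool provides.
    gaps = required - set(providers)
    # Required capabilities that two or more tools provide.
    overlaps = {cap: names for cap, names in providers.items() if len(names) > 1}
    # Tools that contribute nothing to the blueprint.
    retire_candidates = {
        tool for tool, capabilities in tools.items() if not capabilities & required
    }
    return gaps, overlaps, retire_candidates
```

In practice the inputs would come from an inventory or spreadsheet, and deciding between overlapping tools remains a judgment call rather than something the code can settle.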

Selling the change

Chagarlamudi had to convince two groups that changing the monitoring system was necessary: her bosses, and her team.

For the former, the arguments were in large part financial: not just the direct savings in support costs from having a more modern system, but also the savings from reduced downtime.

“It’s back to that mean time to resolution,” she says. “If we think about our mission-critical applications, even cisco.com where a significant amount of our annual revenue is transacted, … if there are any quality, performance or scale issues our ability to identify and resolve them is critical, because we could be impacting costs or revenue generation.”

Replacing the monitoring system would also bring improvements in scalability and extensibility, she argued. Together, the financial and technical arguments sold her bosses on the idea of shifting away from the company’s homegrown approach.

Selling the idea to IT team members was a different matter, however. Not only were they used to working with the old system, some of them had helped build it.

“We spent a lot of time doing a lot of homegrown things within Cisco,” she says. “We built solutions because solutions weren’t out there when we were enabling certain types of capabilities.”

Not only that, but with a general move towards software-defined infrastructure in the industry, Cisco had been hiring a lot more people well versed in software. “And if you’re a software engineer, your tendency is, ‘Hey, I want to build something cool.’”

But the market has moved on. “Maintaining those systems and those applications wasn’t core to us. There are now platforms and solutions out there in the industry that we could bring in that would be more cost effective,” she says.

The trick, then, was in persuading those programmers not to reinvent the wheel, but to take existing ones and build a racing car around them.

“You don’t want to build something that’s already out there: That’s not a good use of our time. But if we can build on top of what’s already out there, the things that we need to augment, now we’ve got a value proposition,” she says.

Chagarlamudi says Cisco evaluated monitoring systems from half a dozen vendors, looking at how their capabilities and product roadmaps aligned with the company’s needs: Could they support the multiple domains Cisco needed to monitor in database, compute, storage, containers, cloud and collaboration? Did they provide the APIs necessary to extend and automate the platform and interconnect it with other tools Cisco uses, such as ServiceNow? And could they be scaled and localized for use by operational teams at data centers around the globe?

“I don’t believe there is a single product out there that solves all the problems,” she says, but the process did allow Cisco to identify a platform from ScienceLogic that could then be extended through the development of custom applications or integration with others.

A progressive roll-out

The new platform was deployed region by region and domain by domain. “We started with basic up-down monitoring, and then we added capabilities as we got stability,” she says.

There were initial problems with monitoring specific pieces of infrastructure, and with the visibility of the information across the globe: “Some of these clusters we had to log in to individually,” she says. “That’s not really sustainable for our operations team.” And, inevitably, tuning was required to filter out unnecessary alerts.
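As an illustration of what such tuning can involve, the following Python sketch drops low-severity alerts and suppresses repeats of the same alert within a time window. The article does not say how Cisco or ScienceLogic filter alerts, so the `Alert` fields, the severity threshold and the five-minute window are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record."""
    timestamp: float  # seconds
    source: str       # device or service that raised the alert
    check: str        # name of the failing check
    severity: int     # higher means more urgent


def filter_alerts(alerts, min_severity=2, dedup_window=300.0):
    """Keep alerts at or above `min_severity`, and drop repeats of the same
    (source, check) pair seen within `dedup_window` seconds of the last
    alert that was kept for that pair."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.severity < min_severity:
            continue
        key = (alert.source, alert.check)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < dedup_window:
            continue
        last_kept[key] = alert.timestamp
        kept.append(alert)
    return kept
```

The threshold and window are the knobs an operations team would adjust over time: too strict and real incidents are hidden, too loose and the noise returns.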

Chagarlamudi admits that she and her team probably underestimated the difficulty of integrating some of the third-party platforms they needed to work with ScienceLogic. Another area that proved more complicated than expected was deciding which tools to enable for a given function when two tools in her software ecosystem offered overlapping capabilities.

And then there’s growth, both in the scale of the business and in the number of domains that the monitoring system needs to cover. For example, should they continue to use the same platform for collaboration infrastructure and cloud-registered video- and voice-calling endpoints? Or should they bring in another third-party solution and integrate it?

“It’s a continuing challenge for us. It’s that ongoing continuous evaluation of the architecture and then making sure that we’re making the right investment decision longer term,” she says.