On Thursday, networking company Cisco ironically suffered network issues lasting several hours, with the index of their company blog displaying the WordPress setup interface for some time, as first noticed by Cybereason researcher Amit Serper, and reported by TechCrunch.
Five hours following the first acknowledgement of the outage, Cisco indicated that the issue was due to an “internal system change,” providing no further details of the issue.
The outage, which persisted for at least five hours, prevented users from using Cisco’s single sign-on, accessing Cisco’s learning portal and security advisories, and downloading software. It also caused problems with creating support tickets and with response times to them. Fortunately, Webex was unaffected by the outage, leaving the issues visible mostly to engineers.
The situation led to a fair bit of schadenfreude on Slashdot, Reddit, and other social networking services, with widespread (sarcastic) speculation that Cisco’s production systems failed to contact Cisco’s smart licensing server, or someone plugged an Ethernet patch cable into Port 1, or that the PG&E blackout affecting nearly a million people in California caused a partial network outage.
Reddit commenters indicated that the parts of San Jose affected by power outages were “only the parts near the foothills,” with Cisco’s facilities “on the other side of two freeways from the power outages,” likely ruling out the possibility that the blackout affected Cisco directly.
Lessons to learn from the Cisco network outage
For a company that claims a need for three layers of redundancy on network topology designs, a network outage is embarrassing. That said, having a bad deploy bring down as much of Cisco’s network as it did may be a compelling case for distributing systems, which is essentially a polite way of saying “maybe use the cloud.”
Given Cisco’s market position, the volume of data that needs to be stored and pushed around as system administrators update firmware is not negligible, and it is something that AWS, Azure, or Google Cloud Platform assuredly can do at a lower cost than handling this internally. The same is likely true of the company blog: hosting it offsite, or completely separated from production systems, provides a way to communicate network status during outages, and is standard practice among network operators.
This isn’t to say that Cisco should run their own network on Aruba switches, but there is a level of dogfooding that is not advisable, particularly when the production systems used to service customers run on the same hardware those customers have. This problem can easily be magnified when using phone-home license activation.
Problems with Cisco kit make headlines roughly every two weeks, with a 9.9/10-severity flaw affecting Cisco IOS-powered routers last month, and a 10/10-severity flaw discovered the month before, as well as other authentication bypass, remote code execution, and command injection flaws discovered in Cisco hardware weeks prior.