Some popular websites experienced degraded performance or stopped working altogether the week of Sept. 1, 2014, but the reason behind the outages may be more complex than the official responses suggest.
Users of eBay, PayPal, Facebook, Tumblr, LinkedIn, and the Apple iTunes Store, Mac App Store, and App Store vented their frustrations on social media when issues plagued the sites this week, mostly on September 3, 2014.
eBay and PayPal
eBay and its subsidiary PayPal experienced a protracted and somewhat puzzling outage on September 3, 2014. eBay users in the US, the UK, and India reported that they were unable to sign in to their accounts; they were told either that their password was incorrect or that their account did not exist. PayPal displayed similar error messages. Sellers with affected accounts were unable to sell items through their active listings, and prospective buyers who tried to view those items were redirected to search results for similar items.
Although some users reported no issues signing in to their accounts, others (mostly in the UK) were unable to access the website at all. The outage lasted approximately six hours.
This outage comes on the heels of another major outage on August 12, 2014, which was the 10th outage eBay has experienced this year. eBay also suffered an attack in May 2014 in which email addresses, encrypted passwords, physical mailing addresses, and other personal information (excluding financial information) were compromised. The website was defaced temporarily.
Facebook
Facebook experienced an outage of approximately 15 minutes on September 3, 2014. Representatives from Facebook stated: "Earlier today we encountered an error while making an infrastructure configuration change that briefly made it difficult for people to access Facebook. We immediately discovered the issue and fixed it, and everyone should now be able to connect."
Tumblr
Tumblr's staff responded to Facebook's outage with a statement of their own: "We thought Facebook's outage today was pretty cool, so we wanted to have one too. They did 15 minutes, and we topped them with a full 20."
According to the Tumblr staff, the cause was as follows: "Our primary data center's connection to the internet was interrupted during routine maintenance, leaving Tumblr and all blogs briefly inaccessible. The issue was corrected, and our engineering team will be thoroughly reviewing these procedures."
LinkedIn
Bloomberg reports that on September 3, 2014, LinkedIn faced unspecified issues with its website that, in the company's words, "Our team is working hard to resolve." This follows a protracted issue that left LinkedIn unavailable between August 15 and August 18, 2014, in several countries and on several ISPs, with failures too inconsistent to pinpoint.
Apple iTunes Store, Mac App Store, and App Store
From approximately 5:30 PM to 11:00 PM on September 2, 2014, some users of the iTunes Store were unable to make purchases. Reports on Twitter indicate that some users of the Mac App Store were also unable to make purchases and were given the error message "This item is temporarily unavailable." Issues were reported around the world. On September 4, 2014, some users were unable to access the App Store between approximately 3:30 and 6:45 PM, according to Apple's system status page.
The official explanations from these companies, which attribute the failures to botched "routine maintenance," sound like a standard, non-technical catchall excuse to be trotted out whenever anything goes wrong.
The seemingly likely technical culprit behind these outages, considering the timeframe in which they happened and the inconsistent results reported by users throughout the world, is a failure at some level of Border Gateway Protocol (BGP). The outages that occurred on August 12, 2014, were widely attributed to BGP, specifically to an overflow of entries in the BGP table.
As a cursory overview: on a great deal of (slightly aged) networking hardware, the default size of the BGP table (i.e., the list of routes that tells routers where to send traffic) is 512,000 entries. When the number of routes advertised across the internet exceeds that limit, affected routers can no longer hold the full table. Depending on the hardware deployed, the fix can range from applying a patch and restarting a critical piece of networking hardware to buying new hardware and swapping it in. Because hardware and system configurations differ around the world, not all systems fail simultaneously. Unlike other issues such as IPv4 depletion, however, this one has not been publicized widely enough to prompt many preemptive fixes.
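To make the overflow scenario concrete, here is a minimal Python sketch of why routers with different table limits would not all fail at the same moment. It is a toy model, not a description of how any vendor's hardware behaves: the router names, the two larger capacities, and the table-growth figures are hypothetical, and only the 512,000 limit comes from the discussion above.

```python
# Toy model: a router "overflows" once the global table has more routes
# than the router's table can hold. All names and figures other than the
# 512,000 limit are hypothetical and chosen only for illustration.

def first_overflow(capacity, table_sizes):
    """Return the first observed table size that exceeds capacity, or None."""
    for size in table_sizes:
        if size > capacity:
            return size
    return None


def uninstalled_routes(capacity, table_size):
    """Return how many routes do not fit in a table of the given capacity."""
    return max(0, table_size - capacity)


# Hypothetical routers: one left at the older default, two with more room.
routers = {
    "older_default": 512_000,
    "retuned": 768_000,
    "newer": 1_000_000,
}

# Hypothetical growth of the global table, sampled in steps of 5,000 routes.
growth = range(500_000, 520_001, 5_000)

report = {name: first_overflow(cap, growth) for name, cap in routers.items()}

for name, size in report.items():
    if size is None:
        print(f"{name}: no overflow in this range")
    else:
        missing = uninstalled_routes(routers[name], size)
        print(f"{name}: overflows at {size} routes ({missing} do not fit)")
```

In this sketch only the router left at the older limit overflows, which mirrors the point that failures would be scattered across networks rather than simultaneous. What a real device does with the routes that do not fit depends on the hardware and its configuration.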
I am not saying with absolute certainty that BGP is the culprit in these outages, but the pattern of multiple, independent partial failures in these cases and the recurring nature of this type of failure over the past several weeks point toward this conclusion.
Can you hear me now?
Have you been personally inconvenienced by these outages? Have you dealt with fixing these or similar outages firsthand? Let us know your experiences in the comments.