How innovative design allowed one cloud company to withstand Amazon's recent outage

The criticism of the recent Amazon outage was nothing short of abundant. One cloud solution company, however, managed to avoid any interruption from the Amazon failure. IT pro Rick Vanover explains how ShareFile's robust design allowed it to route around the outage.

Cloud solutions are available as an alternative for almost every area of IT. While GigaOM’s recent list of the top 50 cloud innovators did not include ShareFile, I believe the company has a great story about how its design avoided interruption during the recent outage of select Amazon Web Services offerings. I have introduced ShareFile a few times in my blogs at TechRepublic, once providing an overview and another time showing its mobile device support.

In the course of understanding how ShareFile works, I asked the standard questions, such as, “Where does the data that users put in ShareFile actually reside at rest?” The answer I got at the time was that ShareFile uses Amazon Web Services (AWS) to deliver the application as well as provide the storage. At the time, I didn’t pursue this much further.

After the recent AWS service outage, I observed that ShareFile was not among the impacted web-based applications featured in news and media reports. Based on that observation, I contacted ShareFile to ask whether and how they were affected by the AWS outage. ShareFile leverages S3, EBS, and EC2 in AWS. The response I got was more thorough than I expected. As it turns out, ShareFile has fully embraced AWS with a built-in design that accommodates failures not only in an availability zone but also in an entire AWS region.

Given that robust design, I inquired about some specifics of the configuration. Jesse Lipson, CEO of ShareFile, addressed my questions quite thoroughly, outlining the architecture of ShareFile. Lipson offered this statement when I asked which AWS availability zones ShareFile leverages:

“ShareFile is spread across multiple availability zones on Amazon’s EC2 data center and uses all five of their major data centers in Northern Virginia, California, Ireland, Singapore and Japan. In addition, we have a whole farm of servers, spread across availability zones, that handle our customers’ uploads and downloads, and the servers are more or less interchangeable so that if one, or a handful of servers go down, our customers are not affected by any downtime.”

If an availability zone incurs an issue, ShareFile's monitoring system, which constantly checks each server for a bi-directional data transfer heartbeat, detects it. Should a server become unavailable, it is dropped from the aggregated server farm automatically. I asked specifically about the sequence of events in the US-East region and how ShareFile accommodated the outage, to which Lipson responded:

"When Amazon experienced it’s outage in one of the availability zones on the East Coast, the affected servers were automatically dropped from ShareFile’s server farm without any human intervention and the upload/download success rates were normal. The next day our team added some extra server capacity on the West Coast as a precautionary measure in case the issue got worse on the East Coast, but our customers didn’t experience any downtime. Since we are focused on businesses that share large and sensitive files externally and internally, there’s an expectation that these files reach the right people at the right time and we’ve been pretty conscious, since ShareFile’s inception, to provide continuous service for our customers.”

All in all, the design is quite impressive: the outage had no impact on ShareFile customers. Details like these are difficult for the everyday cloud consumer to obtain, but when made available, they add serious credibility to solutions that leverage cloud computing.

What do you think of ShareFile’s approach to leveraging AWS in this fashion? Share your comments below.