Thoran Rodrigues describes a recent experience that involved spinning up 70 cloud servers for an intensive data-processing project. Here is what he learned from this real-world experiment.
Scalability is probably the greatest promise of cloud computing. At the infrastructure level, it translates to being able to quickly deploy new virtual servers from existing machines and then drop these servers when they aren't needed anymore. Not only that, but it should be simple and easy to scale each individual server up and down as needed, adding or removing processors, RAM, and storage space. If we look at the whole cloud stack, scalability at this level is more than a promise: it is a necessity.
Without scalability at the infrastructure level, there can be no auto-scaling cloud platforms that transparently increase available resources to accommodate application needs. Nor can applications whose number of users varies widely over time run without provisioning for peak load.
Over the course of the past month, I had the opportunity to test the limits of infrastructure-as-a-service scalability by running a compute- and network-intensive process on several servers. I'd like to share the key points of this experience and the lessons learned throughout.
This experiment wasn't really a test, but rather a process I was running for a client. This actually makes it more interesting, because it's a real production environment, rather than a simple or controlled test. This means it was under all the traditional pressures and requirements of a production environment, such as availability, redundancy, and so on. The process consisted of running several web searches (on both search engines and regular websites), followed by heavy HTML, XML, and JSON processing, string matching, file format adjustments, and so on.
I estimated that running this process on a single server with 1 CPU and 1GB of RAM would take more than a year, but the client wanted the results in less than a month. The only way to deliver was to break the process down into smaller blocks that could be run on separate servers at the same time: enter parallelization and cloud scalability. By saving a basic machine image and replicating it dozens of times, then processing each small block on a separate server, I'd be able to finish everything much faster.
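To make the block-splitting idea concrete, here is a minimal Python sketch. It is not the code used in the project: the function names, the sample `queries` list, and the 365-day figure (a lower bound taken from "more than a year") are assumptions for illustration only.

```python
def split_into_blocks(items, n_blocks):
    """Split items into n_blocks contiguous blocks whose sizes differ by at most one."""
    if n_blocks <= 0:
        raise ValueError("n_blocks must be positive")
    base, extra = divmod(len(items), n_blocks)
    blocks = []
    start = 0
    for i in range(n_blocks):
        # The first `extra` blocks each take one additional item.
        size = base + (1 if i < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


def estimated_days(single_server_days, n_servers):
    """Ideal wall-clock time, assuming the work divides evenly with no shared bottleneck."""
    return single_server_days / n_servers


# Hypothetical input: 1,000 search queries spread over 70 servers.
queries = [f"query-{i}" for i in range(1000)]
blocks = split_into_blocks(queries, 70)

# Lower-bound estimate: a job of exactly 365 single-server days on 70 servers.
ideal_days = estimated_days(365, 70)
```

Under these assumptions the ideal runtime comes out at a little over five days. Real runs take longer, because blocks rarely cost exactly the same and some servers will need restarting.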
In the end, my input data was broken down into 70 different blocks, so I set out to deploy 70 cloud servers using a standard cloud provider (Rackspace in my case) as fast as I could. I opted to deploy the cloud servers through the control panel, instead of via the API, just to see what would happen. The first thing I did was create the simplest possible Windows server (1 processor, 1GB RAM, 40GB disk), prepare the image and save it, so that I could later quickly create new servers from this image.
For those who are new to this virtual server thing: creating the image correctly can save you a lot of time. If your servers are all going to have the same directory structures with the same installed programs and so on, preparing the first image properly means that you don't have to worry about it with any of the others. And if, like me, you are going to run a process on several files but have excess disk space, copy everything to the first server. Since the images are full disk images, all files get copied, and you can actually save that setup time.
Facing the unexpected
So I had my machine image created and started deploying new servers. The first 37 went up without a hitch, in less than an hour. That's more than one new server every two minutes, an impressive rate. Upon trying to spin up server number 38, however, I got an interesting surprise: the Rackspace console started returning a failure upon creating the server with the following message: "Account has exceeded update limit. Try again at [YYYY-MM-DD HH:MM]. Please call [X-XXX-XXX-XXXX] if you have any questions." I got in touch with their excellent customer service, who quickly replied that all accounts come with a built-in limit of 50GB of RAM usage. While it was easy enough to increase that limit (just open up a support ticket), this limitation should be more visible. In fact, Rackspace support informed me that the only way to get at the current limit was through their API, which makes no sense.
This limitation is not exclusive to Rackspace. All cloud providers impose some sort of limit on the maximum number of servers anyone can create, to try to restrict potential abuses of their services. My only concern is that they don't make this more transparent to end users. It can be quite frustrating to suddenly come up against this limit without any clear error message, and without any clear restrictions in the contract.
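One way to avoid hitting such a limit halfway through a deployment is a quick pre-flight calculation. The sketch below is only arithmetic and does not call any provider API. The function name and parameters are hypothetical, and the limit and current usage have to be looked up and supplied by hand.

```python
def servers_that_fit(ram_limit_gb, ram_in_use_gb, ram_per_server_gb, requested):
    """Return (servers that fit under the RAM quota, servers that do not)."""
    if ram_per_server_gb <= 0:
        raise ValueError("ram_per_server_gb must be positive")
    # RAM still available under the quota; never negative.
    headroom = max(ram_limit_gb - ram_in_use_gb, 0)
    fit = min(int(headroom // ram_per_server_gb), requested)
    return fit, requested - fit


# Illustrative values: a 50GB quota, an otherwise empty account (assumed),
# and 70 requested servers with 1GB of RAM each.
fit, blocked = servers_that_fit(50, 0, 1, 70)
```

If `blocked` is greater than zero, it is worth opening the support ticket for a higher limit before starting the deployment rather than in the middle of it.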
Results and conclusions
A few more points of caution: of the 70 servers that I left running for a little over a week, two had issues and were restarted. While that may not seem like a large number, if you aren't ready to deal with problems on one of your cloud servers, the process might break or you might end up with fewer results than you expected. Finally, spend a little time on automating as many of the steps of your process as possible. You really don't want to have to manually go into each server to start a process or copy files unless you have a lot of free time and nothing better to do.
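One simple way to prepare for a server restart is to have each worker record what it has already finished, so that a rerun skips completed items. The following Python sketch illustrates the idea and is not the process used in this project. The function names, the checkpoint file name, and the simulated failure are all assumptions, and items are assumed to be strings without line breaks.

```python
import os
import tempfile


def load_done(path):
    """Read the set of item IDs already recorded in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def run_block(items, checkpoint_path, process):
    """Process items not yet checkpointed; return how many were processed in this run."""
    done = load_done(checkpoint_path)
    processed = 0
    with open(checkpoint_path, "a", encoding="utf-8") as fh:
        for item in items:
            if item in done:
                continue  # finished in an earlier run
            process(item)
            fh.write(item + "\n")
            fh.flush()  # push the record out before moving to the next item
            processed += 1
    return processed


# Demo with a simulated restart: the first attempt at item "c" fails.
checkpoint = os.path.join(tempfile.mkdtemp(), "block-01.done")
log = []


def flaky(item):
    if item == "c" and "c-failed" not in log:
        log.append("c-failed")
        raise RuntimeError("simulated server restart")
    log.append(item)


try:
    run_block(["a", "b", "c", "d"], checkpoint, flaky)
except RuntimeError:
    pass  # in practice the server would reboot and the job would be relaunched

resumed = run_block(["a", "b", "c", "d"], checkpoint, flaky)
```

The second call only processes the two items that were not finished before the failure. The checkpoint file has to live somewhere that survives the restart, such as the server's own disk.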
Killing the servers was actually much easier than creating them, since Rackspace's control panel allows you to select and delete multiple instances at once. So, while bringing the servers online took me about an hour, I took them all down in less than five minutes.
I think this process really highlights the advantages of cloud scalability. If I were running a standard in-house setup, it would be very unlikely that I could simply spin up 70 servers in an hour unless I had a lot of excess capacity. While not everyone has such elastic resource needs, tasks such as these are an excellent use case for the cloud. You can actually get something that traditional IT infrastructure isn't going to be able to deliver easily, and since the servers are going to be used only during a short period of time, the security concerns are much smaller.