Everyone is talking about big data and cloud computing, and about how the cloud lets us not only access a much larger number of data sources, but also process and extract information from far more data. The cheap and simple on-demand availability of large-scale computing resources is a huge step forward for heavy, data-intensive computing projects, as evidenced by the recent announcement by CERN and other initiatives. Amazon, one of the leading vendors in the infrastructure-as-a-service (IaaS) space, already offers special server instances aimed at large-scale, data-intensive tasks, such as high-memory, high-CPU, cluster, or even GPU-cluster instances. And several vendors already offer special packages for handling big data, such as pre-installed Hadoop software or other MapReduce packages.
With this on-demand availability, companies, universities, or anyone else interested in processing big data can avoid spending a lot of money on a supercomputer or on a large data center that would be underutilized most of the time and that runs the risk of becoming obsolete. It is also possible to easily scale up processing power, so that running intensive tasks becomes restricted only by the available budget, not the available machines. The cloud, then, is an excellent platform for running big data tasks. What people often forget, however, is that it can also be a producer of big data.
In whatever layer of the cloud you may find yourself - infrastructure, platform, or software - either as a provider or as a customer, the level of monitoring and control that is employed today goes beyond anything that existed with on-premises hardware and software. Unlike what happens with on-premises solutions, on the cloud it is possible to monitor not only the activities of a single user, but of all users across all servers. As the scale grows, so does the data that is generated.
Let's take cloud software as an example. As is the case with any web-based software, it is possible to monitor every aspect of how users interact with it, from the configuration of their machines to how long they spend on each screen. With traditional software, collecting this information was possible only through user opt-in, which meant depending on users accepting the monitoring and burdening the client with additional software. With cloud-based software, the extra processing load falls on the server, and collection can happen transparently for the end user. It is also much easier to do, and companies don't have to rely on user opt-in to collect this data.
But data collection goes even further on the cloud. For cloud platforms and infrastructure, API calls and resource usage can be closely monitored and measured, and the resulting data can later be used to develop new service offerings or improve existing ones. A platform provider can today measure, in real time, which calls are being made and from where, and use this information to improve the APIs themselves by introducing new functionality, improving documentation, or removing unused features.
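To make the idea concrete, here is a minimal sketch in Python of how a platform provider might summarize API call logs to see which calls are made, from where, and which published features are never used. The log records, endpoint paths, region names, and the `summarize_calls` helper are all hypothetical and invented for illustration; a real platform would read this data from its own logging pipeline.

```python
from collections import Counter

# Hypothetical log records: (endpoint, caller_region).
CALL_LOG = [
    ("/v1/orders", "sa-east"),
    ("/v1/orders", "us-east"),
    ("/v1/users", "sa-east"),
    ("/v1/orders", "sa-east"),
]

# Hypothetical list of every endpoint the platform exposes.
PUBLISHED_ENDPOINTS = ["/v1/orders", "/v1/users", "/v1/legacy-export"]


def summarize_calls(call_log, published_endpoints):
    """Count calls per endpoint and per region, and list endpoints never called."""
    by_endpoint = Counter(endpoint for endpoint, _ in call_log)
    by_region = Counter(region for _, region in call_log)
    # A Counter returns 0 for keys it has never seen.
    unused = [e for e in published_endpoints if by_endpoint[e] == 0]
    return by_endpoint, by_region, unused


by_endpoint, by_region, unused = summarize_calls(CALL_LOG, PUBLISHED_ENDPOINTS)
print("Calls per endpoint:", by_endpoint.most_common())
print("Calls per region:", by_region.most_common())
print("Never called:", unused)
```

An endpoint that shows up in the "never called" list over a long enough period is a candidate for removal, while a heavily used one may deserve better documentation or new functionality.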
From a customer perspective, this is not necessarily a bad thing. This constant monitoring can lead to important service improvements that can make us more productive in the long run. Consider an operating system such as Windows or Linux: it can have a lot of old, unused functionality that is kept in for the sake of compatibility. Whoever is developing the system has no real way of knowing which internal functions people are using at any given time, nor how many people use each one. On the cloud, this knowledge is instant, and systems can be adapted accordingly.
The power of data
All this data doesn't come for free. It brings with it the need for more processing power and more storage, so that it can be captured and stored. It also brings the need for better tools to extract useful information from it. Fortunately, the cloud itself offers these tools. The ability to collect data can also bring with it the expectation from customers that it will be put to good use, and won't be abused. One of the greatest pitfalls regarding this kind of data collection is how you are going to employ it. If users feel like companies are infringing on their privacy, problems will ensue.
The possibilities, however, are limitless. Applications can be adjusted to better fit the usage patterns of different users; servers can be upgraded or downgraded dynamically, as the need arises and without user intervention (as an opt-in service, for instance); data storage can be optimized according to access patterns; platform APIs can be improved according to the interactions made against them; and so on. It is even possible to share usage data between users, so that they can benchmark their patterns against others, or to present global usage statistics.
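As one illustration of the dynamic upgrade and downgrade idea, the following Python sketch recommends a server size from recent CPU utilization samples. The size names, the thresholds, and the `recommend_size` function are assumptions made for this example and do not correspond to any particular provider's API.

```python
def recommend_size(cpu_samples, sizes, current, low=0.25, high=0.75):
    """Suggest a server size from recent CPU utilization samples (0.0 to 1.0).

    `sizes` is ordered from smallest to largest. The `low` and `high`
    thresholds are illustrative defaults, not recommended values.
    """
    if not cpu_samples:
        return current  # No data, so make no change.
    average = sum(cpu_samples) / len(cpu_samples)
    index = sizes.index(current)
    if average > high and index < len(sizes) - 1:
        return sizes[index + 1]  # Sustained high load: move one size up.
    if average < low and index > 0:
        return sizes[index - 1]  # Sustained low load: move one size down.
    return current


# Hypothetical size names, ordered from smallest to largest.
SIZES = ["small", "medium", "large"]

print(recommend_size([0.90, 0.85, 0.80], SIZES, "medium"))
print(recommend_size([0.10, 0.20, 0.15], SIZES, "medium"))
print(recommend_size([0.50, 0.40], SIZES, "medium"))
```

Moving only one size at a time and leaving the server alone when there is no data keeps the behavior conservative, which matters if the resizing is applied automatically on the customer's behalf.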
There are a lot of benefits in using all this cloud-generated data, and a few things to keep in mind. If you are involved in a cloud project, ensure that monitoring and tracking are an integral part of it. Take into account not only what you would normally monitor, but also anything else that may allow you to make improvements in the future. If you are a cloud customer, understand that you will be tracked and monitored, and that the information will be used to improve the service you are using. More than that, make this constant improvement an expectation. Demand that service providers make proper use of all the available information, so that the true power of cloud data can be realized.
After working for a database company for 8 years, Thoran Rodrigues took the opportunity to open a cloud services company. For two years his company has been providing services for several of the largest e-commerce companies in Brazil, and over this time he has had the opportunity to work on large-scale projects ranging from data retrieval to high-availability critical services.