Analytics is what helps the social gaming firm stay afloat: while more than 60 million users are registered to play its games via Facebook or on their mobiles, Mats-Olov Eriksson, director of data warehouse at the London-headquartered firm, says that fan base can quickly slip away.

“For a player there will always be apps from three other companies that are just one click away. You’re not selling subscriptions for one year or two years: the user always has another option,” he said.

To help keep players interested, and recommending games to their friends, the firm measures nearly everything – from how each game is played to the success of its marketing.

“Our competitors are doing the same thing with general analytics. If we start to become less efficient in marketing, virality or creating a good and frictionless user experience, then somebody else will.”

The difficulty is that the amount of data produced from its web logs has grown with its user numbers, and by last year it reached a point where the firm needed to find a way to reduce the complexity of analysing such large datasets.

In 2012 the company began using the big data platform Hadoop to help it deal with the rise in data and the consequent growth in complexity.

What is Hadoop?

Hadoop is a set of software tools designed to allow analysis of large datasets – running up to petabytes in size – using clusters of commodity servers. It breaks up data storage and processing so it can be parcelled up and distributed between multiple servers. This separation is handled by the Hadoop Distributed File System (HDFS), which splits the data between available servers, and MapReduce, which divides up processing jobs to be carried out in parallel in a fault-tolerant manner.
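The division of work described above can be sketched in a few lines of Python. This is a toy illustration of the map, shuffle and reduce stages – it mimics how MapReduce parcels work out across servers, not Hadoop's actual Java API, and the log lines are invented.

```python
from collections import defaultdict

# Toy illustration of the MapReduce split: each "server" maps its own
# chunk of log lines, intermediate pairs are shuffled by key, and a
# reduce stage sums them. Mimics the division of work, not Hadoop's API.

def map_phase(chunk):
    # Emit (word, 1) pairs for one server's slice of the data.
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(mapped):
    # Group intermediate pairs by key, as the framework does
    # between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer sums the counts for its keys, independently of
    # the others, which is what makes the stage parallelisable.
    return {key: sum(values) for key, values in groups.items()}

chunks = [["level start", "level complete"], ["level start"]]  # two "servers"
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'level': 3, 'start': 2, 'complete': 1}
```

Because each reducer only ever sees its own keys, a failed server's work can be re-run elsewhere without touching the rest of the job – the fault tolerance the framework provides.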

Hadoop clusters can scale up as the size of datasets grows, as handling more data simply requires adding more servers to the cluster. Problems with latency when moving large amounts of data over networks can also be minimised by setting up HDFS to allow an application to process data on a node near to where it is stored. Another advantage of Hadoop over traditional relational data stores is its ability to accommodate both structured and unstructured data – data that doesn’t easily fit into traditional business data models or relational tables, such as text in an email body or audio or video files.

How the firm uses Hadoop

The company uses Hadoop, provided by Cloudera, to ingest daily currency exchange rates from the European Central Bank; multiple metadata feeds; and game, advertising and platform servers’ log files on an hourly basis.

The social gaming company uses the Hadoop Distributed File System to store the data. The Hive database engine, running on MapReduce, is then used to structure and link the data so it can be queried using the SQL-like language HQL (Hive Query Language). The firm also uses InfiniteDB on top of Hive to provide low-latency querying from the fast column-oriented relational database.
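The sort of question HQL expresses can be sketched as follows. The table and column names are invented for illustration – the article does not describe the firm's actual schema – and the plain-Python loop below mirrors what the GROUP BY aggregation computes over a few toy rows.

```python
# Hypothetical HQL for a revenue-by-country question. The table and
# column names (game_events, country, revenue) are invented, not the
# firm's actual schema.
HQL = """
SELECT country, SUM(revenue) AS total_revenue
FROM game_events
GROUP BY country
"""

# Pure-Python equivalent of what the GROUP BY computes, over toy rows.
rows = [
    {"country": "UK", "revenue": 10.0},
    {"country": "SE", "revenue": 4.0},
    {"country": "UK", "revenue": 6.0},
]
totals = {}
for row in rows:
    totals[row["country"]] = totals.get(row["country"], 0.0) + row["revenue"]
print(totals)  # {'UK': 16.0, 'SE': 4.0}
```

On the cluster, Hive compiles a statement like this into MapReduce jobs, so analysts who know SQL can query the raw logs without writing distributed code.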

Prior to adopting Hadoop, the company had relied primarily on data spread between various MySQL database shards, with a rolling archiver function. Sharding databases is necessary in many data platforms to overcome performance and storage issues once a database reaches a certain size. In contrast, Cloudera claims Hadoop is the only commercially available platform to have reached 100PB in size.
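Sharding of the kind the firm relied on before Hadoop can be sketched as a routing function: each user's rows live on exactly one of N MySQL instances. The modulo scheme below is a generic illustration – the article does not describe the firm's actual sharding function.

```python
# Generic hash-based shard routing: each user's rows live on one of
# NUM_SHARDS MySQL instances. The modulo scheme is illustrative only;
# the firm's actual sharding function isn't described in the article.
NUM_SHARDS = 4

def shard_for(user_id: int) -> str:
    """Return the name of the shard holding this user's rows."""
    return f"mysql_shard_{user_id % NUM_SHARDS}"

print(shard_for(12345))  # mysql_shard_1
```

The pain point this creates is that any query spanning users – revenue per country, say – has to be fanned out across every shard and merged by hand, which is exactly the work HDFS and MapReduce absorb.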

Staff outside of the data warehouse use the QlikView BI dashboard to view the results of queries made against data in the warehouse, allowing them to answer questions like ‘What’s the revenue from this country?’ or ‘What’s the player experience of this particular game?’, or to show the ROI of an investment in a particular marketing channel.

“We use QlikView as a reporting tool. We would categorise all our users as data driven; they’re very analytical, and we need to bring as much information as possible to the end user, as they have all the domain expertise and are the best ones to make a decision based on data,” said Eriksson.

“It’s a good way to make sure analytics works for the company and has some sort of business value, rather than a bunch of smart guys with a beard and sandals sitting and deciding for the business what they should know.

“I don’t see a conflict between traditional discovery tools, such as QlikView, and big data – they work very well together.”

The company relies on analytics to gain insight into game usage patterns and preferences, gaming behaviours such as when players advance or get stuck in a game level, as well as more advanced gaming analytics.

This information can help tweak the game to make playing it more compelling, with the aim of retaining users, increasing revenue raised from players spending money in-game and increasing the number of new players attracted to the game by recommendations from their friends.

“We can ask ‘Do players need to be incentivised to spend more money?’ Or, having built the relationship between players, we can look at how to incentivise users so we can get them to want to invite their friends. The effect could be doubling the users you get from players themselves. The virality feature of our social gaming is crucial,” he said.
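The doubling effect Eriksson describes is the standard viral-growth (“k-factor”) arithmetic: k is invites sent per player times the conversion rate of those invites, and at k = 1 every acquired player brings in one more. All numbers below are invented for illustration.

```python
# Standard viral-growth ("k-factor") arithmetic. k = invites sent per
# player * conversion rate of those invites. At k = 1, every acquired
# player brings in one more, doubling the users you get from players
# themselves. All figures here are invented for illustration.

def k_factor(invites_per_player: float, conversion_rate: float) -> float:
    return invites_per_player * conversion_rate

def users_after_invites(paid_installs: int, k: float) -> float:
    # One round of invitations on top of the paid installs.
    return paid_installs * (1 + k)

k = k_factor(invites_per_player=5, conversion_rate=0.2)  # k = 1.0
print(users_after_invites(1000, k))  # 2000.0
```

Analytics feeds both inputs: the invite rate is something the game can be tweaked to raise, and the conversion rate is measurable per channel.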

Doing away with the guesswork

All of these queries could be carried out without the aid of a big data platform like Hadoop. The difference, says Eriksson, is that the data stored in its Hadoop cluster never needs to be archived or thrown away – removing the need to guess up front what questions should be asked of the data.

“Imagine if for six months we run a report that showed the number of attempts before a player succeeds on a level, and someone says ‘Wouldn’t it be a good thing to see their score when they complete the level?’,” he said.

“If we’d thrown away each game and the data points we have, we wouldn’t be able to recreate this report and add the dimension of score.

“That’s one of the many examples of premature aggregation. You’re never smart enough to know beforehand what the end user would like to know. You can never be an expert on each domain.

“My intention is to never throw away anything. Maybe that data will [turn out to] be a very valuable asset.”
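Eriksson's point about premature aggregation can be sketched concretely: if only the aggregated report (attempts per level) is kept, the score dimension is unrecoverable, whereas raw events allow new questions to be answered retroactively. The event fields below are invented for illustration.

```python
# Sketch of the "premature aggregation" trap. If only the aggregated
# report (attempts per level) had been kept, the score dimension would
# be gone; keeping raw events lets a new question be answered months
# later. Event field names are invented for illustration.
raw_events = [
    {"player": "a", "level": 1, "completed": False, "score": 0},
    {"player": "a", "level": 1, "completed": True,  "score": 870},
    {"player": "b", "level": 1, "completed": True,  "score": 640},
]

# Original report: number of attempts on level 1.
attempts = sum(1 for e in raw_events if e["level"] == 1)

# New question, asked six months later: average score on completion.
scores = [e["score"] for e in raw_events if e["completed"]]
avg_score = sum(scores) / len(scores)

print(attempts, avg_score)  # 3 755.0
```

Had only `attempts` been archived, `avg_score` could never be computed – which is why cheap HDFS storage makes "never throw anything away" a workable policy.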

By being able to scale its data across multiple commodity servers, Eriksson said the firm is able to keep the price low relative to using a distributed relational database management system, such as Oracle 11g. Costs will vary, but figures online estimate the price of a Hadoop cluster at below $500 per TB.


When implementing Hadoop, the challenge has been how to create a single set of metadata that describes the relationships between the data extracted from its log files.

“We’ve spent a lot of time working with a unified metadata system that everybody needs to relate to and make sure they implement the tracking of everything in a way we can use it,” said Eriksson.

There is also a shortage of people with the skill to know what data is worth collecting and the knowledge of how to extract that data in a usable form, he said.

“Good data architects are always a scarce resource. The data architect needs to understand the underlying data – for instance, when we set up the metadata to decide ‘What should we track from the game?’, having the visionary mindset to know that if we add this datapoint to the tracking, we can do this over there. They need to have the full picture of all the data we have, architecting the ETL (Extract, Transform, Load) required to process this data and making sense of it from a reporting perspective,” he said.
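The ETL work Eriksson describes can be sketched as a minimal extract/transform/load pipeline over toy log lines. The comma-separated log format and field names here are invented – the firm's actual log schema isn't described in the article.

```python
# Minimal ETL sketch: extract raw log lines, transform them into
# structured records, load them into an in-memory "table". The
# comma-separated log format is invented for illustration.

def extract(log_lines):
    # Pull non-empty lines out of the raw log feed.
    return [line.strip() for line in log_lines if line.strip()]

def transform(lines):
    # Parse each line into a structured record.
    records = []
    for line in lines:
        ts, event, player = (part.strip() for part in line.split(","))
        records.append({"ts": ts, "event": event, "player": player})
    return records

def load(records, table):
    # Append the structured records to the target "table".
    table.extend(records)
    return table

table = []
logs = ["2012-06-01T10:00, level_start, player42\n"]
load(transform(extract(logs)), table)
print(table[0]["event"])  # level_start
```

In production the transform stage is where the unified metadata earns its keep: every game and platform server has to emit events the parser – and every downstream report – can rely on.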

When it comes to querying the data, he said, the decision to use HQL to structure queries reduces the amount of training needed for staff.

“It’s like SQL, and a lot of people know SQL. There are other [query] languages, like Pig, but I am a bit reluctant to bring in more languages. It’s hard to find people already; requiring Pig skills makes it even harder,” he said.