Big Data

Migrating from a relational to a NoSQL cloud database

Matt Asay describes the approach one needs to take in migrating a standard relational database to an open source NoSQL cloud database.

By Matt Asay

The industry is on the cusp of tectonic changes in how and where data are stored and processed. For over 30 years, the venerable relational database management system (RDBMS), running in corporate data centers, has held the bulk of the world's data. This cannot continue. RDBMS technology can no longer keep pace with the velocity, volume, and variety of data being created and consumed. For this new world of Big Data, NoSQL databases are required.

Migrating to these open-source cloud databases, however, requires some preparation for enterprise IT that grew up with RDBMS.

How Big Data is changing everything

There's nothing wrong with the traditional RDBMS. It simply doesn't fit the world we live in anymore. Mobile, social, cloud: these and other trends complicate the variety of data and dramatically increase the volume of data being stored in the enterprise.

As RedMonk analyst James Governor argues:

The database market is back in play after a 30-year old freeze in which Oracle dominated the high end, and Microsoft the midmarket. Then along came open source, the cloud, NoSQL, in memory and everything changed....The idea that everything is relational? Those days are gone.

This isn't something that only concerns so-called web companies like Google and Foursquare. Big Data is equally relevant for "old school" organizations in finance, healthcare, government, retail, and other vertical industries, and Gartner projects it to drive $28 billion in IT spending in 2012. As organizations grapple with their Big Data problems, when data grow beyond one server or start out distributed, they generally find themselves on the same road as the web companies: open-source, NoSQL databases.

While Big Data often gets associated with data analytics technologies like Hadoop and Storm, it's actually much broader than this, and far more concerned with data storage than analytics. After all, if an enterprise can't scale storage effectively, it will never have a "Big Data" problem to analyze. Hence, of the $30 billion global database market, only 25 percent is analytics, with the rest being OLTP or operational databases. Ironically, the recent rise of data analytics innovations like Hadoop stems from the RDBMS's failure to cope with Gartner's three V's of Big Data: high volume, high velocity, and high variety.

Migrating from RDBMS to NoSQL

It's clear that the database is critical to successfully managing the explosion of data. What's less clear is how to transition from legacy RDBMS to modern NoSQL databases. Successfully migrating from a relational world to a NoSQL world requires careful planning.

In fact, one of the biggest dings against NoSQL databases like MongoDB or Neo4j is that they're so easy to work with that developers end up jumping in headfirst, without bothering to properly construct their data model, thereby causing problems later. While NoSQL databases do provide significantly more developer agility and flexibility, they still shouldn't be used willy-nilly.

This is particularly true for those starting from an RDBMS background, as NoSQL differs markedly from relational. In the RDBMS world, an engineer designs the data schema from the outset, and SQL queries are then run against the database. If business/application changes then require changes to the database, a DBA must get involved. It's not an easy process, as the DBA must navigate complex joins (i.e., inter-table relationships). NoSQL databases better fit modern application development, and provide significant database performance and developer agility benefits, albeit at the expense of some functionality.
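To make the contrast concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for an RDBMS and a plain dict as a stand-in for a document in a store such as MongoDB. The table names, column names, and sample values are assumptions made up for illustration; they do not come from any real application.

```python
import sqlite3

# Relational side: the schema is declared up front and the data is
# split across two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE checkins (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        venue TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO checkins VALUES (10, 1, 'Coffee Shop'), (11, 1, 'Museum');
""")

# Reading a user together with their check-ins requires a join.
rows = conn.execute("""
    SELECT u.name, c.venue
    FROM users u JOIN checkins c ON c.user_id = u.id
    WHERE u.id = ?
    ORDER BY c.id
""", (1,)).fetchall()
relational_venues = [venue for _name, venue in rows]

# Document side: the same information held as one nested record.
# No schema is declared; the shape lives in the application code.
user_doc = {
    "_id": 1,
    "name": "alice",
    "checkins": [{"venue": "Coffee Shop"}, {"venue": "Museum"}],
}
document_venues = [c["venue"] for c in user_doc["checkins"]]
```

Both reads produce the same list of venues. The difference is where the structure lives: in the database schema and the join on the relational side, and in the shape of the record itself on the document side.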

Training

NoSQL databases are new enough that many database engineers will be RDBMS experts, but NoSQL neophytes. This shouldn't deter developers hoping to use NoSQL in a new project. After all, most NoSQL databases are open source and come with built-in communities, happy to help new users get up to speed.

Part of this shift is one of nomenclature. For example, as relational database expert Chris Bird points out, the syntax in NoSQL Land differs greatly from SQL, and may require some mental gymnastics for new users.

According to Daniel Doubrovkine, Art.sy's head of engineering, both NoSQL and RDBMS databases impose a learning curve on new users. The difference, he argues, is that NoSQL databases like MongoDB are simple to start with and get more complex over time, which works because a developer's expertise with the database matures over time, too. With SQL, Doubrovkine says, it's hard from the start and only becomes more complex at scale, if the requisite scale is even possible with RDBMS.

Of course, getting everything "perfect" from the start is difficult no matter what database technology you're using. As Jamie Zawinski, the noted Mozilla and XEmacs developer, opines:

The design process is definitely an ongoing thing; you never know what the design is until the program is done.

One of the great things about NoSQL, in fact, is the ability to iterate on one's data model as one's business requires it.

That's not to say that developers should go in blind. For some, checking forums and online documentation is enough. Whichever NoSQL database a developer prefers, plenty of online documentation is available for it.

For others, hands-on training is preferred. In addition to standard, classroom-based training offered by DataStax, Basho, and other vendors that sponsor open-source NoSQL databases, there is also free online training. As just one indication of how strong demand is for NoSQL training, 10gen registered over 30,000 people for its inaugural online training.

Migration

Armed with information on how to best develop an application using NoSQL technology, the next step for many new users is to migrate away from the decades-old relational world they know.

The bigger issue at this stage is careful planning of one's migration.

With over 25 million users and 2.5 billion check-ins, Foursquare runs at serious scale. But it didn't start that way. Though Foursquare now logs check-ins on Mars, just a few short years ago it logged its first check-in on Earth. As the company grew, Foursquare's development team had to scramble to ensure its data infrastructure could keep up with its user adoption.

Foursquare originally started with MySQL. When Harry Heymann, Foursquare's vice president of Engineering, joined in 2009, he moved Foursquare to PostgreSQL because it better suited the tools he was using. That all changed when the service took off with users, as Jon Hoffman, Foursquare's storage infrastructure engineering lead, has indicated. Scaling PostgreSQL promised to involve significant work, so Heymann started reviewing other options, including MongoDB, Cassandra, CouchDB, and sharded MySQL.

Once Foursquare determined that MongoDB best fit its requirements, the company moved into MongoDB slowly given the requisite code changes and potential risk of breaking things in the transition. For a period of time, Foursquare duplicated data, storing the data in PostgreSQL and MongoDB in two parallel, synced sets. For one collection, this process took several months, as it already had one million users and significant traffic. For smaller collections, the data migrations were faster.
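The parallel, synced approach described above is often called a dual-write migration. The following Python sketch shows the general idea only. It is not Foursquare's code: DictStore and DualWriteStore are hypothetical names, and in-memory dicts stand in for PostgreSQL and MongoDB.

```python
class DictStore:
    """In-memory stand-in for a real database client (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = dict(value)

    def get(self, key):
        return self._rows.get(key)

    def keys(self):
        return set(self._rows)


class DualWriteStore:
    """Writes go to both stores; reads come from the legacy store until cutover."""

    def __init__(self, legacy, target):
        self.legacy = legacy
        self.target = target
        self.read_from_target = False

    def put(self, key, value):
        self.legacy.put(key, value)
        self.target.put(key, value)

    def get(self, key):
        store = self.target if self.read_from_target else self.legacy
        return store.get(key)

    def backfill(self):
        """Copy records that existed before dual writes began."""
        for key in self.legacy.keys() - self.target.keys():
            self.target.put(key, self.legacy.get(key))

    def in_sync(self):
        """True when both stores hold the same keys and values."""
        if self.legacy.keys() != self.target.keys():
            return False
        return all(self.legacy.get(k) == self.target.get(k)
                   for k in self.legacy.keys())

    def cut_over(self):
        """Switch reads to the new store, but only if the two agree."""
        if not self.in_sync():
            raise RuntimeError("stores differ; refusing to cut over")
        self.read_from_target = True


legacy, target = DictStore(), DictStore()
legacy.put("user:1", {"name": "alice"})   # data that predates the migration
store = DualWriteStore(legacy, target)
store.put("user:2", {"name": "bob"})      # a new write lands in both stores
store.backfill()                          # copy the older record across
store.cut_over()                          # reads now come from the new store
```

Keeping the legacy store authoritative for reads until the two sets are verified to match is what lets a team migrate one collection at a time and back out if something breaks.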

Doing so, however, paid off. One of the biggest, early wins was moving the geographic query functions from PostgreSQL to MongoDB, which enabled Foursquare to handle the same load with fewer resources.

Art.sy, which indexes and makes searchable high-quality images of 30,000-plus works of art from over 3,000 artists, also transitioned from RDBMS to NoSQL, though its transition was much more straightforward than Foursquare's. Its migration from relational to NoSQL happened while the company was still in the midst of a closed beta. One cold restart later, the company had moved from its relational beginnings to NoSQL.

Importantly, the process for data migration will depend on which NoSQL technology a company chooses. The process for moving RDBMS data into a columnar database like Cassandra differs from data migrations to key-value stores like Riak, or to MongoDB.

Even so, the process in each case largely involves the same four steps, as Kristina Chodorow and Michael Dirolf identify:

  • Get to know your NoSQL database. Download it, read the tutorials, try some toy projects.
  • Think about how to represent your model in its document store [or key/value, column, graph, as appropriate].
  • Migrate the data from the relational database to your NoSQL database, probably simply by writing a bunch of SELECT * FROM statements against the database and then loading the data into your NoSQL document [or key/value, column, graph] model using the language of your choice.
  • Rewrite your application code to query your NoSQL database through statements such as insert() or find().

This process will look different depending on the style of NoSQL database, but it's a good rough guide.
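As a rough illustration of the third and fourth steps, the Python sketch below reads rows out of a relational database with SELECT * FROM statements and loads them into a document model, then queries the result. Everything named here is an assumption for illustration: the artists and works tables, the sample rows, and the Collection class, which is a hypothetical in-memory stand-in whose insert() and find() only mimic the style of a document database client rather than reproducing any real driver's API. The standard-library sqlite3 module stands in for the RDBMS.

```python
import sqlite3


class Collection:
    """Hypothetical in-memory stand-in for a document collection."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, query=None):
        """Return documents whose fields equal every key/value in query."""
        query = query or {}
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]


def migrate(conn, artists):
    """Read relational rows and load one nested document per artist."""
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    for artist in conn.execute("SELECT * FROM artists ORDER BY id"):
        works = conn.execute(
            "SELECT * FROM works WHERE artist_id = ? ORDER BY id",
            (artist["id"],),
        )
        artists.insert({
            "_id": artist["id"],
            "name": artist["name"],
            # The former join becomes an embedded list.
            "works": [{"title": w["title"], "year": w["year"]} for w in works],
        })


# Made-up relational data to migrate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE works (id INTEGER PRIMARY KEY, artist_id INTEGER,
                        title TEXT, year INTEGER);
    INSERT INTO artists VALUES (1, 'Artist A'), (2, 'Artist B');
    INSERT INTO works VALUES (1, 1, 'Untitled', 2010), (2, 1, 'Study', 2011),
                             (3, 2, 'Landscape', 2009);
""")

artists = Collection()
migrate(conn, artists)

# Step four: application code now queries documents instead of joining tables.
found = artists.find({"name": "Artist A"})
```

The modeling decision from step two shows up in migrate(): here each artist's works are embedded in the artist document, which suits reads that always fetch the two together. A key-value, column, or graph store would call for a different target shape.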

Concluding remarks

Foursquare's and Art.sy's applications may be somewhat unique, but their need to embrace a flexible, scalable data infrastructure is not. Smart companies architect for scale from the very beginning, which generally will mean turning to NoSQL. For those that start with a relational database, all is not lost: the process for migrating from an RDBMS to NoSQL is now well-trod, with a great deal of information available online and offline to help with the process.

Matt Asay is Vice President of Corporate Strategy at 10gen, the company behind MongoDB, the leading NoSQL database. With more than a decade spent in open source, Matt is a recognized open source advocate and board member emeritus of the Open Source Initiative (OSI).

3 comments
stuartcook

Wonderful article. The possibilities in a NoSQL database are explained in a very simple way. This article is surely helpful for those who are planning to migrate from a relational to a NoSQL cloud database. I've also found another article on Data Center migration by Delta Computer Group which may also be useful.

http://www.deltacomputergroup.com/it-services/data-center-migration

whitewash

Great article in terms of describing various possibilities with NoSQL databases,


However I must disagree with the point that RDBMSs have suddenly become legacy and unsuitable for anything solid and that they should all be replaced.


NoSQL DBMSs are GREAT when you want to use them in the way they were designed to be used. For instance, we are now using neo4j on one project where endless joining would be a huge pain in the ass. We have a data model that really is suited to NoSQL. But for the non-graph portion of our data, we still use a traditional RDBMS with its nice ACID guarantees (for handling financial operations/transfers, for instance) and everything, and are satisfied with it.


But the data model you have is not always suitable for e.g. a Cassandra (or other big-table-based) database. The argument that Google, Facebook etc. use it is quite solid, but these are all use cases that have the following in common:

- their data model in terms of entities and their relationships is quite simple. Facebook, for instance, in its core, is all about posts, status updates, likes (that can be viewed as a special form of status update with an additional "index") etc. that only grow in time in a predefined way. So does Twitter. So does Foursquare. In their nature, these are really simple services, whose selling point is their massiveness; their technical concern is "how to scale out something really simple, but in massive volumes"

- they mostly work in "append only" mode, like system logs (for which BigTable is also a perfect choice). Facebook rarely deletes or changes something that is already there (compared to how often it gets a new status update, hence, an append)

- because their data model is quite simple, they can survive a little bit of denormalization and can afford it (duplication of used disk space etc.). In such large datacenters, a few (hundred) disks more or less do not really matter.

- "eventual consistency" is pretty good for them. Or even a low percentage of inconsistency. Nobody really cares about one lost Twitter update, if it does not happen noticeably often. Missing one money transfer, on the other hand, would be quite serious :)


On the other hand, in a typical enterprise application (such as some enterprise information system, CRM, ERP, etc...) items get changed/updated frequently. Those systems have many different entities, with many complicated relationships, where one update can trigger many other consequences, where workflows change from time to time. Denormalizing in such case would be a suicide, updating some entity or relation in a denormalized data model would be a harakiri. Not speaking about complicated relation graph between entities (and even their relations) which quite often can have cycles of some type. Denormalizing in this case would lead to an infinite loop with infinitely large data-sets :)

In these systems, you can still use RDBMS for day-to-day OLTP system and then use a Hadoop cluster with Hive (and/or other tools, such as Pig) on top of it for an OLAP system. You feed OLTP data into OLAP from time to time (e.g. once a day) and have a nice distributed DWH, which you can scale-out as you wish. And, if, for some reasons, the requirements for the OLAP system change drastically in the future, you don't have to worry about incompatible data structure, you just alter OLAP data scheme, rewrite your script for OLTP -> OLAP feeding and rebuild the whole thing from scratch.


To conclude this, I am a big fan of "using the right tool for the right job". While NoSQL certainly has its uses (and very interesting ones), it is not a silver bullet and should never be considered as such. Everyone should consider his use cases and requirements for the system he is going to build and then decide on what is best. Not automatically use some fancy new NoSQL DBMS because it's cool right now (and the NoSQL term is actually a bit of a hype these days) and then, after a few months, run away screaming from the project.
