Matt Asay describes the approach one needs to take in migrating a standard relational database to an open source NoSQL cloud database.
By Matt Asay
The industry is on the cusp of tectonic changes in how and where data are stored and processed. For over 30 years, the venerable relational database management system (RDBMS), running in corporate data centers, has held the bulk of the world's data. This cannot continue. RDBMS technology can no longer keep pace with the velocity, volume, and variety of data being created and consumed. For this new world of Big Data, NoSQL databases are required.
Migrating to these open-source cloud databases, however, requires some preparation for enterprise IT that grew up with RDBMS.
How Big Data is changing everything
There's nothing wrong with the traditional RDBMS. It simply doesn't fit the world we live in anymore. Mobile, social, cloud: these and other trends complicate the variety of data and dramatically increase the volume of data being stored in the enterprise.
As RedMonk analyst James Governor argues:
The database market is back in play after a 30-year old freeze in which Oracle dominated the high end, and Microsoft the midmarket. Then along came open source, the cloud, NoSQL, in memory and everything changed....The idea that everything is relational? Those days are gone.
This isn't something that only concerns so-called web companies like Google and Foursquare. It's equally relevant for "old school" organizations in finance, healthcare, government, retail, and other vertical industries, and Gartner projects that Big Data will drive $28 billion in IT spending in 2012. As organizations grapple with their Big Data problems, whether their data outgrow a single server or start out distributed, they generally find themselves on the same road as the web companies: open-source NoSQL databases.
While Big Data often gets associated with data analytics technologies like Hadoop and Storm, it's actually much broader than this, and far more concerned with data storage than analytics. After all, if an enterprise can't scale storage effectively, it will never have a "Big Data" problem to analyze. Hence, of the $30 billion global database market, only 25 percent is analytics, with the rest being OLTP or operational databases. Ironically, the recent rise of data analytics innovations like Hadoop stems from the failure of the RDBMS to cope with Gartner's three V's of Big Data: high volume, high velocity, and high variety.
Migrating from RDBMS to NoSQL
It's clear that the database is critical to successfully managing the explosion of data. What's less clear is how to transition from legacy RDBMS to modern NoSQL databases. Successfully migrating from a relational world to a NoSQL world requires careful planning.
In fact, one of the biggest dings against NoSQL databases like MongoDB or Neo4j is that they're so easy to work with that developers end up jumping in headfirst, without bothering to properly construct their data model, thereby causing problems later. While NoSQL databases do provide significantly more developer agility and flexibility, they still shouldn't be used willy-nilly.
This is particularly true for those starting from an RDBMS background, as NoSQL differs markedly from relational. In the RDBMS world, an engineer designs the data schema from the outset, and SQL queries are then run against the database. If business/application changes then require changes to the database, a DBA must get involved. It's not an easy process, as the DBA must navigate complex joins (i.e., inter-table relationships). NoSQL databases better fit modern application development, and provide significant database performance and developer agility benefits, albeit at the expense of some functionality.
NoSQL databases are new enough that many database engineers will be RDBMS experts, but NoSQL neophytes. This shouldn't deter developers hoping to use NoSQL in a new project. After all, most NoSQL databases are open source and come with built-in communities, happy to help new users get up to speed.
Part of this shift is one of nomenclature. For example, as relational database expert Chris Bird points out, the syntax in NoSQL Land differs greatly from SQL, and may require some mental gymnastics for new users.
According to Daniel Doubrovkine, Art.sy's head of engineering, both NoSQL and RDBMS databases impose a learning curve on new users. The difference, he argues, is that NoSQL databases like MongoDB are simple to start with and get more complex over time, which works because a developer's expertise with the database matures over time, too. With SQL, Doubrovkine says, it's hard from the start and only becomes more complex at scale, if the requisite scale is even possible with RDBMS.
Of course, getting everything "perfect" from the start is difficult no matter what database technology you're using. As noted Mozilla and XEmacs developer Jamie Zawinski opines:
The design process is definitely an ongoing thing; you never know what the design is until the program is done.
One of the great things about NoSQL, in fact, is the ability to iterate on one's data model as one's business requires it.
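As a rough illustration of that flexibility, the sketch below contrasts the two approaches using only the Python standard library. The `venues` table, its columns, and the list of dictionaries standing in for a document collection are all hypothetical; a real document database would store the documents for you, but the shape of the change is similar.

```python
import sqlite3

# Relational side: the schema is declared up front (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venues (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO venues (name) VALUES (?)", ("Cafe Uno",))

# A new business requirement (store a category) needs a schema change
# before any row can carry the new attribute.
conn.execute("ALTER TABLE venues ADD COLUMN category TEXT")
conn.execute("INSERT INTO venues (name, category) VALUES (?, ?)", ("Bar Dos", "bar"))
rows = conn.execute("SELECT id, name, category FROM venues ORDER BY id").fetchall()

# Document side: a plain list of dicts stands in for a collection.
# New documents simply carry the extra fields; old ones are left alone.
venues = [{"_id": 1, "name": "Cafe Uno"}]
venues.append({"_id": 2, "name": "Bar Dos", "category": "bar", "tags": ["late-night"]})

# Application code has to tolerate documents written before the change.
categories = [doc.get("category", "unknown") for doc in venues]
```

The trade-off is that the database no longer enforces the shape of the data. That responsibility moves into application code, which is one reason planning the model still matters.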
That's not to say that developers should go in blind. For some, checking forums and online documentation is enough. Whichever NoSQL database a developer prefers, plenty of online documentation is available for it.
For others, hands-on training is preferred. In addition to standard, classroom-based training offered by DataStax, Basho, and other vendors that sponsor open-source NoSQL databases, there is also free online training. As just one indication of how strong demand is for NoSQL training, 10gen registered over 30,000 people for its inaugural online training.
Armed with information on how to best develop an application using NoSQL technology, the next step for many new users is to migrate away from the decades-old relational world they know.
The bigger issue at that point, however, is planning the migration carefully.
With over 25 million users and 2.5 billion check-ins, Foursquare runs at serious scale. But it didn't start that way. Though Foursquare now logs check-ins on Mars, just a few short years ago it logged its first check-in on Earth. As the company grew, Foursquare's development team had to scramble to ensure its data infrastructure could keep up with its user adoption.
Foursquare originally started with MySQL. When Harry Heymann, Foursquare's vice president of Engineering, joined in 2009, he moved Foursquare to PostgreSQL because it better suited the tools he was using. That all changed when the service took off with users, as Jon Hoffman, Foursquare's storage infrastructure engineering lead, has indicated. Scaling PostgreSQL promised to involve significant work, so Heymann started reviewing other options, including MongoDB, Cassandra, CouchDB, and sharded MySQL.
Once Foursquare determined that MongoDB best fit its requirements, the company moved to MongoDB slowly, given the requisite code changes and the risk of breaking things in the transition. For a period of time, Foursquare duplicated data, storing it in PostgreSQL and MongoDB in two parallel, synced sets. For one collection, this process took several months, as the service already had one million users and significant traffic. For smaller collections, the data migrations were faster.
That caution paid off, however. One of the biggest early wins was moving the geographic query functions from PostgreSQL to MongoDB, which enabled Foursquare to handle the same load with fewer resources.
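The parallel-write period described above can be sketched roughly as follows. This is not Foursquare's code: the `DualWriter` class and the two dictionary-backed stores are hypothetical stand-ins, used only to show the general pattern of writing to both systems and checking that they agree before reads are switched over.

```python
class DualWriter:
    """Writes every record to both the old and the new store (hypothetical)."""

    def __init__(self, old_store, new_store):
        self.old_store = old_store      # stand-in for the relational database
        self.new_store = new_store      # stand-in for the NoSQL database
        self.read_from_new = False      # flipped once the new store is trusted

    def save(self, key, record):
        # Both stores receive a copy of every write during the transition.
        self.old_store[key] = dict(record)
        self.new_store[key] = dict(record)

    def load(self, key):
        # Reads keep coming from the old store until the cutover.
        store = self.new_store if self.read_from_new else self.old_store
        return store[key]

    def mismatched_keys(self):
        # Keys whose records differ between stores, or exist in only one.
        keys = set(self.old_store) | set(self.new_store)
        return sorted(k for k in keys
                      if self.old_store.get(k) != self.new_store.get(k))


old, new = {"u1": {"name": "Ann"}}, {}          # u1 predates the dual writes
writer = DualWriter(old, new)
writer.save("u2", {"name": "Bob"})

pending = writer.mismatched_keys()              # records still to backfill
for key in pending:
    new[key] = dict(old[key])                   # backfill from the old store

if not writer.mismatched_keys():
    writer.read_from_new = True                 # cut reads over
```

In a real system the backfill and comparison would run in batches against live databases, which helps explain why a large collection could take months while small ones moved quickly.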
Art.sy, which indexes and makes searchable high-quality images of 30,000-plus works of art from over 3,000 artists, also transitioned from RDBMS to NoSQL, though its transition was much more straightforward than Foursquare's. Its migration from relational to NoSQL happened while the company was still in the midst of a closed beta. One cold restart later, the company had moved from its relational beginnings to NoSQL.
Importantly, the process for data migration will depend on which NoSQL technology a company chooses. The process for moving RDBMS data into a columnar database like Cassandra differs from data migrations to key-value stores like Riak, or to MongoDB.
In each case, however, the process largely involves the same four steps, as Kristina Chodorow and Michael Dirolf identify:
- Get to know your NoSQL database. Download it, read the tutorials, try some toy projects.
- Think about how to represent your model in its document store [or key/value, column, graph, as appropriate].
- Migrate the data from the relational database to your NoSQL database, probably simply by writing a bunch of SELECT * FROM statements against the database and then loading the data into your NoSQL document [or key/value, column, graph] model using the language of your choice.
- Rewrite your application code to query your NoSQL database through statements such as insert() or find().
This process will look different depending on the style of NoSQL database, but it's a good rough guide.
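Steps two through four might look roughly like the following for a document store. The sketch uses Python's built-in `sqlite3` module as the relational source and a plain list of dictionaries as a stand-in for the target collection. The `users` and `checkins` tables, their columns, and the `find` helper are hypothetical and only mimic the style of a document database's query call.

```python
import sqlite3

# Hypothetical relational source: users and their check-ins in two tables.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row          # rows can be indexed by column name
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE checkins (id INTEGER PRIMARY KEY, user_id INTEGER, venue TEXT);
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO checkins VALUES (1, 1, 'Cafe Uno'), (2, 1, 'Bar Dos'), (3, 2, 'Cafe Uno');
""")

# Step 3a: read everything out with plain SELECT * statements.
users = conn.execute("SELECT * FROM users ORDER BY id").fetchall()
checkins = conn.execute("SELECT * FROM checkins ORDER BY id").fetchall()

# Steps 2 and 3b: model each user as one document with check-ins embedded,
# replacing the join the relational schema needed.
collection = []                          # stand-in for a document collection
for user in users:
    collection.append({
        "_id": user["id"],
        "name": user["name"],
        "checkins": [c["venue"] for c in checkins if c["user_id"] == user["id"]],
    })


def find(docs, **criteria):
    """Toy equality-match query, loosely in the spirit of a find() call."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Step 4: application code queries documents instead of joining tables.
ann = find(collection, name="Ann")
```

Whether to embed related rows, as here, or keep them in a separate collection and reference them depends on how the application reads the data. A key-value or column store would call for a different target shape.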
Foursquare's and Art.sy's applications may be unusual, but their need to embrace a flexible, scalable data infrastructure is not. Smart companies architect for scale from the very beginning, which generally will mean turning to NoSQL. For those that start with a relational database, all is not lost: the path from an RDBMS to NoSQL is now well trodden, with a great deal of information available online and offline to help with the process.
Matt Asay is Vice President of Corporate Strategy at 10gen, the company behind MongoDB, the leading NoSQL database. With more than a decade spent in open source, Matt is a recognized open source advocate and board member emeritus of the Open Source Initiative (OSI).