Graph databases aren't niche, argues Neo4j founder Emil Eifrem. And they're hella fast. Matt Asay explains.
NoSQL databases are clearly on the rise, but not all NoSQL is created equal.
After all, 451 Research recently discontinued its longstanding tracking of NoSQL database popularity, arguing that since "none of the top 10 look like changing places any time soon, and none of the players outside stand any chance of breaking into the top 10, the time has come to retire the NoSQL LinkedIn Skills Index."
That is, there's a heck of a lot of MongoDB and Cassandra still to come, but RavenDB...? Not so much.
But what about graph databases like Neo4j? Often the unsung hero of NoSQL, graph databases have a bright future, argues Neo4j founder Emil Eifrem in an interview. According to Eifrem, "The era of the one-size-fits-all database is over and ultimately most applications of a decent size and scope will use multiple databases," one of which will almost certainly be a graph database.
Big data, big graphs
A graph database is a general purpose database management system where data relationships are treated as first-class citizens. While that sounds great, what does it mean?
Let's break it down.
A graph database stores individual data points (such as key-value pairs or documents), but it also stores the data relationships between them. In fact, these relationships are critical. What Andy Oliver wrote in 2013 remains true today: in a graph database "Relationships matter as much as, if not more than, the data itself."
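To make "relationships as first-class citizens" a little more concrete, here is a minimal in-memory sketch of a property graph. It is an illustration only, not how Neo4j or any other product stores data; the class names (`Node`, `Relationship`, `PropertyGraph`) and the sample people and products are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A data point carrying arbitrary key-value properties."""
    node_id: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A stored, typed, directed connection that can carry its own properties."""
    start: int      # node_id of the source node
    end: int        # node_id of the target node
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy graph store: relationships are kept alongside the nodes they connect."""

    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> list of Relationship objects

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.outgoing.setdefault(node.node_id, [])

    def relate(self, start, end, rel_type, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.outgoing[start].append(rel)
        return rel

    def neighbors(self, node_id, rel_type):
        """Follow stored relationships of one type from a node."""
        return [self.nodes[r.end] for r in self.outgoing[node_id]
                if r.rel_type == rel_type]


graph = PropertyGraph()
graph.add_node(Node(1, {"Person"}, {"name": "Alice"}))
graph.add_node(Node(2, {"Person"}, {"name": "Bob"}))
graph.add_node(Node(3, {"Product"}, {"name": "Running shoes"}))
graph.relate(1, 2, "KNOWS", since=2012)
graph.relate(2, 3, "BOUGHT", quantity=1)

knows = [n.properties["name"] for n in graph.neighbors(1, "KNOWS")]
print(knows)
```

The point of the sketch is that each connection is a record in its own right, with a type and properties, rather than something reconstructed at query time by matching keys across tables.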
This has led to a misconception, according to Eifrem:
"The idea that graph databases are just good for social is a common misconception. It was even more widespread three years ago. It's a particularly dangerous myth, because it's partially true! Social data is a good fit for a graph database. It's just not the only fit, nor even necessarily the best fit. It's just the most intuitive for a lot of people."
Hence, Eifrem claims Neo4j increasingly finds itself assuming the role of The Everyman, happily taking on the role of general purpose database. Perhaps so, but Gartner analyst Nick Heudecker argues that "General purpose workloads are atypical" for graph databases, and it's a (mis)conception that persists.
But that's not to say they're niche.
This relationship-centric view of data results in a number of benefits, according to Eifrem, with "the most spectacular benefit being performance."
When you have a highly connected dataset--for example, in a fraud detection system or a recommendation engine or an identity management application--then a graph database can run significantly faster than a relational database.
How much faster? Eifrem claims that "a graph database can easily be a million times faster than a relational database."
That sort of claim sounds fantastical, and when I pushed back, Eifrem explained: "It's basically 1000 times performance improvements, despite a 1000 times increase in data size." In other words, a graph database accelerates traversals and also maintains that performance, even as the database grows.
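One way to see why traversal speed need not degrade as data grows is a toy breadth-first traversal over adjacency lists. This is a sketch under simplifying assumptions, not Neo4j's engine, and it makes no attempt to reproduce Eifrem's figures; the `within_hops` and `ring_graph` helpers and the ring-shaped test data are hypothetical. It only shows that the work done depends on the neighborhood being explored, not on the total number of nodes.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Breadth-first traversal from start, limited to max_hops.

    Returns (reachable_nodes, edges_examined).
    """
    seen = {start}
    frontier = deque([(start, 0)])
    edges_examined = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            edges_examined += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen, edges_examined


def ring_graph(size):
    """Hypothetical test data: each node links to the next two nodes in a ring."""
    return {i: [(i + 1) % size, (i + 2) % size] for i in range(size)}


small = ring_graph(100)
large = ring_graph(100_000)  # 1000 times more nodes

reached_small, work_small = within_hops(small, 0, 2)
reached_large, work_large = within_hops(large, 0, 2)

print(sorted(reached_small), work_small)  # [1, 2, 3, 4] 6
print(sorted(reached_large), work_large)  # [1, 2, 3, 4] 6
```

In this toy case the two-hop query examines the same six edges whether the graph holds 100 nodes or 100,000. A relational system answering the same question typically does so with joins, whose cost tends to depend on index lookups or scans over the whole table.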
This type of performance improvement, as Eifrem suggests, is the "red pill" that allows graph databases to help organizations:
- Track and stop fraud in real time rather than merely detect it after the fact
- Offer rich real-time recommendations based on current session and historical data (e.g., Adidas uses Neo4j to offer highly customized content based on current session behavior to make their online pages "stickier" and increase sales)
- Provide a 360-degree view of customers
- Use up-to-the-moment network topology to make critical operations decisions (e.g., eBay and well-known parcel delivery providers benefit from the graph's ability to detect and change route connections in real time)
- Onboard new customers in an instant with real-time identity and access management
One graph to rule them all
I asked Eifrem how graph databases are typically used. After all, there are plenty of other NoSQL databases that more or less apply to the same types of problems, and some of them (like document databases) aim to be a one-stop-shop for most data needs.
In his response, Eifrem starts out diplomatically:
"The era of the one-size-fits-all database is over, and ultimately most applications of a decent size and scope will use multiple databases. The role of the data architect is going to be to look at their big dataset (because all datasets are or will be big) and identify shapes in the data and the workloads. And for the tabular parts, put that in a relational database. For simple, high volume key-value pairs ('tall skinny tables'), put that in a key value store. For the messy, constantly changing or highly connected parts, put that in a graph database."
In other words, graph databases can't do it all, but Eifrem argues that a "do-it-all database" is actually a bad idea.
This jibes with Martin Fowler's prediction that we're heading into an era of "polyglot persistence" for databases, "where any decent-sized enterprise will have a variety of different data storage technologies for different kinds of data."
This is true... up to a point.
That point is developer fatigue. As nice as it sounds to learn 20 different databases to handle 20 different types of data, the reality is that it's simply impractical. On the RDBMS side, we ended up with Oracle, Microsoft SQL Server, MySQL, and Postgres (and IBM DB2 for a dwindling population). In NoSQL land, as indicated by 451 Research's decision to discontinue the NoSQL skills index, we're settling on a few general-purpose NoSQL databases (particularly MongoDB and Cassandra), with graph databases filling an increasingly important, though still specialized, role.
But Eifrem isn't content to be a small part of a larger data story.
First among equals
As he told me, "There's a concept of the first database for your project. Most projects will start out small, and then you will start out with a single database. We increasingly see Neo4j being used in those situations."
So, first among equals, as it were. "Lord of the databases."
That's an optimistic view, but is it credible? Sure, Neo4j can credibly claim five to 10 times as many production deployments as any other graph database, but can a graph database truly be general purpose like a document or wide column NoSQL database?
Here Eifrem's response is a little less diplomatic:
"In the long arc, I believe that Neo4j will see a wider adoption than systems like MongoDB and Cassandra. There's a lot of databases out there that can store and retrieve isolated data elements, be they shaped as key value pairs or documents or rows. But while a graph database can do that too, with Neo4j, we have higher aspirations than just flipping bits on disk: we want to make sense of data and provide insights in real-time through the connections in your data."
Though some will dispute the idea of a general-purpose graph database, part of the problem, it seems, is the fixation on graph. "To date, we've emphasized the graph-only use cases for Neo4j because we wanted people to understand how it's different from all other databases," Eifrem notes. "And some of those use cases are very strong and highly differentiated: they basically can't be solved with any other technology. That's an amazing foundation for growing a market."
A future built on graph?
But it's not the end game. As Eifrem acknowledges, one side effect of this strategy is that "graph databases are today probably seen as only useful when dealing with highly connected data. The truth is, it's already being used as the primary data store for transactional, operational business applications around the globe."
In other words, "Neo4j solves lots of problems because it's a graph database, but that does not mean it can only solve 'graph problems.'" Eifrem sees Neo4j becoming a general-purpose database of sorts. It's not there yet.
This is similar to the aspirations of OrientDB, as its president (and my former colleague) Luca Olivari told me in an interview. Graph databases are somewhat niche, but they're adding capabilities (moving to multi-model, in the case of OrientDB) to make them more general purpose.
This doesn't mean that they'll make it.
After all, 451 Research's Matt Aslett points out that "there is only one trend: the total dominance of MongoDB." Cassandra is also on a tear, and both are already general purpose, with no significant improvements needed.
Still, it's nice to see Neo4j taking the leading graph database beyond its graph roots. That may push all NoSQL databases to emphasize data relationships to a higher degree, which would be a very good thing.