Companies grapple with data every day - after all, we are in the era of Big Data. They have to decide how to gather it, analyze it and interpret it. But one important part of dealing with data is figuring out how and where to store it. Below are 10 things to think about when choosing the right data storage technologies for your enterprise or project.
1. Consider all your options.
Relational databases may still be dominant, but their hold has slipped. While they have been a successful, leading data storage technology for 20 years, IT architects have been challenged by the impedance mismatch between the relational model and in-memory data structures, and by the unstructured nature of much of today's data. There is now a movement away from using databases as integration points, and the need to support large volumes of data by running on clusters is changing how data is stored. Relational databases still provide advantages and, for now, will continue to be used in most cases. However, multiple database options are now available, and the right one depends on the nature of the data stored and how it will be manipulated.
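The impedance mismatch is easier to see with a small example. The sketch below is illustrative only: the `Order` and `LineItem` classes and the `orders` / `order_items` table names are hypothetical, not taken from any real system. It shows one in-memory object being split into rows for two tables, and the same object kept whole as a nested document.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: list[LineItem] = field(default_factory=list)


def to_relational_rows(order: Order) -> dict[str, list[tuple]]:
    """Flatten one in-memory order into rows for two hypothetical tables."""
    return {
        "orders": [(order.order_id, order.customer)],
        "order_items": [
            (order.order_id, item.product, item.quantity) for item in order.items
        ],
    }


def to_document(order: Order) -> dict:
    """Keep the same order as a single nested document."""
    return {
        "order_id": order.order_id,
        "customer": order.customer,
        "items": [
            {"product": item.product, "quantity": item.quantity}
            for item in order.items
        ],
    }


order = Order(1001, "Ann", [LineItem("keyboard", 1), LineItem("cable", 2)])
rows = to_relational_rows(order)  # two tables, joined again on every read
doc = to_document(order)          # one unit, shaped like the object itself
```

Neither shape is better in general. The relational rows are easier to query across orders, while the document mirrors what the application code already holds in memory.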
2. How big is your data?
When evaluating data storage technologies, it's important to know how much data you're dealing with and in what format. With organizations grappling with massive amounts of unstructured data, a new class of data storage technology has emerged as "king" of Big Data: NoSQL. The need for rapid access to lots of unstructured data has led to the growing use of NoSQL databases, which can process large volumes of data on clusters of machines more efficiently than relational databases.
3. If developer productivity and large-scale data are your pain points, NoSQL may be a good choice.
The term NoSQL is generally applied to a number of recent non-relational databases such as Cassandra, MongoDB and Riak. The common characteristics of NoSQL databases include:
- Not using the relational model
- Running well on clusters
- Built for 21st century web estates
- Horizontally scalable
The two main reasons for using NoSQL technology are to improve programmer productivity by using a database that better matches an application's needs and to improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.
4. Different business problems need different solutions.
Storing user activity on websites is a very different problem from finding out which of your users is most connected to other users, or from dealing with huge write volumes such as capturing a live stream of data. These different problems need different solutions. IT architects should make sure they understand the problem and choose the right solution rather than simply making the default choice.
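As a rough illustration of how differently these problems behave, the sketch below contrasts a relationship question with a write-heavy one. The `follows` data, `most_connected` and `record_event` are all hypothetical names made up for this example, and plain Python structures stand in for real data stores.

```python
from collections import defaultdict

# Hypothetical relationships between users.
follows = [("ann", "bob"), ("ann", "cho"), ("bob", "cho"), ("dee", "ann")]


def most_connected(edges):
    """Return the user linked to the largest number of other users."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return max(neighbours, key=lambda user: len(neighbours[user]))


# The write-heavy problem needs no relationships at all, only fast appends.
activity_log = []


def record_event(user, action):
    """Append one activity event; nothing is read or joined on the write path."""
    activity_log.append({"user": user, "action": action})


top_user = most_connected(follows)
record_event("ann", "viewed_product")
record_event("bob", "added_to_cart")
```

The first question depends on traversing relationships, while the second depends only on how quickly records can be appended, which is why a single default store rarely suits both.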
5. If you're working with NoSQL databases, consider the data model types.
There is a common approach to categorizing NoSQL databases according to their data models. These include:
- Key-Value Databases - Key-value stores are simple hash tables, primarily used when all access to the database is via a primary key. These are the simplest NoSQL data stores to use from an API perspective. Examples include Riak, Redis and MemcachedDB.
- Document Databases - Document databases store and retrieve documents. These are self-describing, hierarchical tree data structures, which can consist of maps, collections, and scalar values. Examples include MongoDB, CouchDB, Terrastore and others.
- Column-Family Stores - Column family stores, such as Cassandra, HBase and Amazon SimpleDB, allow you to store data with keys mapped to values and the values grouped into multiple column families, each column family being a map of data.
- Graph Databases - Graph databases, such as Neo4J, Infinite Graph or OrientDB, allow you to store entities, also known as nodes, and the relationships between these entities.
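The four data models above can be sketched with plain Python structures. This is only an approximation of the shapes involved, not the API of any of the databases named. All keys, field names and the `purchased_by` helper are hypothetical.

```python
import json

# Key-value: an opaque value reachable only through its key.
key_value_store = {"session:42": b'{"user": "ann", "cart": ["keyboard"]}'}
# The store does not look inside the value; the application must interpret it.
session = json.loads(key_value_store["session:42"])

# Document: a self-describing tree of maps, collections and scalar values.
document = {
    "_id": "order-1001",
    "customer": {"name": "Ann", "city": "Chicago"},
    "items": [{"product": "keyboard", "quantity": 1}],
}

# Column-family: row key -> column family -> map of columns.
column_family_row = {
    "ann": {
        "profile": {"name": "Ann", "city": "Chicago"},
        "orders": {"order-1001": "keyboard"},
    }
}

# Graph: nodes plus explicit relationships between them.
nodes = {"ann": {"type": "customer"}, "keyboard": {"type": "product"}}
relationships = [("ann", "PURCHASED", "keyboard")]


def purchased_by(customer):
    """Follow PURCHASED relationships outward from one customer node."""
    return [
        target
        for source, kind, target in relationships
        if source == customer and kind == "PURCHASED"
    ]
```

The same customer and order appear in every shape. What changes is which questions are cheap to ask: lookup by key, retrieval of a whole document, reads of one column family, or traversal of relationships.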
6. Scale solutions to suit growth of data.
The rate of growth of data is no longer predictable. Gone are the days when we could plan for three-year cycles to upgrade hardware and do capacity planning. Many NoSQL databases allow scaling for performance and volume with little or no downtime by letting clusters expand transparently.
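One common technique that makes this kind of cluster expansion cheap is consistent hashing. The article does not name it, so treat the sketch below as an assumption about how a clustered store might distribute keys rather than a description of any particular product. The `HashRing` class, the node names and the `replicas` parameter are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node, to even out load
        self._ring = []           # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place the node at several positions around the ring."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        """Return the node owning the first ring position at or after the key."""
        index = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]


keys = [f"user:{n}" for n in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")  # expand the cluster by one machine
after = {key: ring.node_for(key) for key in keys}

# Only keys claimed by the new node change owner; the rest stay where they were.
moved = [key for key in keys if before[key] != after[key]]
```

Because only a fraction of the keys move, and all of them move to the new node, the existing nodes can keep serving the data they already hold while the cluster grows.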
7. You may need more than one data storage technology.
The most important outcome of the rise of NoSQL is the acceptance of database technologies beyond relational databases. However, NoSQL is only one set of data storage technologies, and other data storage technologies should be considered whether or not they bear the NoSQL label. Other options include file systems, event sourcing, memory image, version control, XML databases and object databases. This broader range of choices has led to a new era of "Polyglot Persistence."
Polyglot persistence is about using different data storage technologies to handle varying data storage needs. It can apply across an enterprise or within a single application.
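Within a single application, polyglot persistence might look something like the sketch below. It is a minimal illustration under stated assumptions: `ShoppingApp` and its methods are hypothetical, and the in-memory dictionaries and lists are stand-ins for real key-value, relational or document, and graph stores.

```python
class ShoppingApp:
    """Routes each kind of data to a store suited to it (all in-memory stand-ins)."""

    def __init__(self):
        self.sessions = {}         # stand-in for a key-value store
        self.orders = []           # stand-in for a relational or document store
        self.purchases = {}        # stand-in for a graph store: customer -> products

    def save_session(self, session_id, data):
        """Session data is only ever read back by its key."""
        self.sessions[session_id] = data

    def place_order(self, customer, product):
        """Record the order and the customer-product relationship separately."""
        self.orders.append({"customer": customer, "product": product})
        self.purchases.setdefault(customer, set()).add(product)

    def also_bought(self, product):
        """Products bought by customers who also bought `product`."""
        related = set()
        for products in self.purchases.values():
            if product in products:
                related |= products - {product}
        return related


app = ShoppingApp()
app.save_session("session:42", {"user": "ann"})
app.place_order("ann", "keyboard")
app.place_order("ann", "mouse")
app.place_order("bob", "keyboard")
suggestions = app.also_bought("keyboard")
```

The application code sees one interface, while each method is free to use whichever storage technology fits its access pattern. The cost is that `place_order` now writes to two stores, which is the kind of added complexity that has to be weighed against the benefit.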
8. NoSQL solutions can be introduced in existing applications.
In existing applications, functions that don't need a relational database, such as searching, indexing content, or tracking relationships between customers and products, can be moved to NoSQL solutions, allowing the applications to scale and react to emerging customer needs.
9. Remember to consider the complexities.
Employing more data storage technologies increases complexity in programming and operations, so the advantages of a good data storage fit must be weighed against this complexity before moving forward with a specific technology.
Only by working with NoSQL and other technologies - and discovering their strengths and weaknesses - can IT architects understand these new data storage options. In the future, organizations will use many data technologies. Data professionals will need to be familiar with these different approaches and know how to match them to different problems. When you introduce different data storage technologies, you will need to think about new approaches to data modeling, data consistency, and evolution.
Learning the concepts is an important first step, but to really understand multiple storage technologies, you'll need to get the experience of building representative systems using them.
Martin Fowler, Chief Scientist, Thoughtworks and Pramod Sadalage, Principal Consultant, Thoughtworks. To learn more about NoSQL and other data storage technologies, check out "NoSQL Distilled" at http://martinfowler.com/nosql.html.