Remember one year ago when a big data architecture was defined as an architecture "to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems"?
That was a large enough concept to digest—but as big data assumes its place in daily IT production, the concept of big data architecture is becoming even more complex.
First, there are the many different engines you might choose to run with your big data. You could choose Splunk to analyze log files, or Hadoop for large file batch processing, or Spark for data stream processing. Each of these specialized big data engines requires its own data universe, and ultimately, the data from these universes must come together—which is where the DBA is called in to do the stitching.
But that's not all.
Organizations are now mixing and matching on-premise and cloud-based big data processing and storage. In many cases, they are using multiple cloud vendors as well. Once again, data and intelligence from these various repositories must be blended together at some point, as the business requires.
"This is a system integration problem that vendors need to help their clients solve," said Anoop Dawar, SVP of product management and marketing for MapR, a converged data platform for big data. "You have to not only be able to provide a platform for all of the different big data processing engines and data stores that are out there, but you must also be able to rapidly provide access to new big data processing engines and data stores as they emerge."
Here's why it matters.
In today's IT shops, whether you process big data on premise or in the cloud, the tendency is to simply allocate another compute cluster for each big data app that requires its own engine or some type of hybrid processing you don't presently have. Every time you do this, you multiply your big data clusters, and this complicates your big data architecture because you are now faced with integrating cluster silos.
"What you end up with is data duplication and fragmentation," Dawar said. "It becomes a big problem when you try to navigate through this data maze to facilitate data-driven decisions quickly."
How do you avoid this quandary? Here are five strategies that can help.
1. Use a single data platform
Whether you process big data on premise, in the cloud, or in a hybrid on-premise/cloud combination, your data ultimately should live on a single platform that it can be pulled from. This prevents data duplication and keeps users from working with different versions of the data, which can lead to conflicting business decisions.
2. Limit your active data and eliminate or archive the rest
"You don't necessarily need access to all of your images from the last five years," Dawar said. "Maybe you only need access to images from the last six months, and the rest can be archived." By slimming down the amount of data that your algorithms and queries operate against, you improve performance and get to results faster.
3. Ultimately look for a cloud-based solution
Cloud-based solutions offer greater agility and elasticity to scale to the needs of your big data processing and storage. There will always be companies with highly sensitive data that prefer to keep this data on premise, but the majority of big data processing and storage can be done in the cloud.
4. Include disaster recovery in your big data architecture planning
"You don't need to do a full recovery if a disaster strikes, but you do need to restore the subset of data that your applications minimally require," Dawar said. "This subset of data is what you should focus on with your cloud providers, or on premise, if you're processing your big data that way. The important thing is to actually test the DR to make sure that it works the way you think it will."
5. Include sandbox areas for algorithm experimentation
Whether your users are experimenting with algorithms for new product development, financial risk analysis, or market segmentation, your big data architecture should have adequate sandbox areas where proofs of concept can be tried and refined before they are placed into production. The cloud is a great place to deploy your sandboxes since it can be easily scaled upward or downward based upon demand.
Have your big data initiatives become complicated and difficult to manage? What strategies have worked best for you? Share your advice and experiences with fellow TechRepublic members.
Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President of Product Research and Software Development for Summit Information Systems, a computer software company; and Vice President of Strategic Planning and Technology at FSI International, a multinational manufacturing company in the semiconductor industry. Mary is a keynote speaker and has more than 1,000 articles, research studies, and technology publications in print.