At the 2017 Dell EMC World conference, Dell EMC systems engineer Cory Minton explained how IT leaders can better think through their big data deployments.
Big data promises much in terms of business value, but it can be difficult for businesses to determine how to go about deploying the architecture and tools needed to take advantage of it.
Everything from descriptive statistics to predictive modeling to artificial intelligence is powered by big data. And what an organization wants to accomplish with big data will determine the tools it needs to roll out.
At the 2017 Dell EMC World conference on Monday, Cory Minton, a principal systems engineer for data analytics at Dell EMC, gave a presentation explaining the biggest decisions an organization must make when deploying big data. Here are six questions that every business must ask before getting started in the space:
1. Buy vs. build?
The first question to ask is whether your organization wants to buy a big data system or build one from scratch. Popular products from Teradata, SAS, SAP, and Splunk can be bought and simply implemented, while Hortonworks, Cloudera, Databricks, and Apache Flink can be used to build out a big data system.
Buying offers a shorter time to value, Minton said, as well as simplicity and good value for commodity use cases. However, that simplicity usually comes with a higher price, and these tools usually work best with low-diversity data. If your organization has an existing relationship with a vendor, it can be easier to phase in new products and try out big data tools.
Many of the popular tools for building a big data system are cheap or free to use, and they make it easier to capitalize on a unique value stream. The building path provides opportunities for massive scale and variety, but these tools can be very complex. Interoperability is often one of the biggest issues faced by admins who go this route.
2. Batch vs. streaming data?
Batch processing, offered by products like Oracle, Hadoop MapReduce, and Apache Spark, is descriptive and can handle large volumes of data, Minton said. Batch jobs can also be scheduled, and they are often used to build out a playground of sorts where data scientists can experiment.
Products like Apache Kafka, Splunk, and Flink provide streaming capabilities, and the data they capture can be used to create potentially predictive models. With streaming data, speed trumps data fidelity, Minton said, but streaming also offers massive scale and variety. It is also more useful for organizations that subscribe to a DevOps culture.
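To make the distinction concrete, here is a minimal Python sketch, not tied to any of the products named above. It computes the same average two ways: once over a complete dataset (batch) and once incrementally as each value arrives (streaming). The function names and the sample readings are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(values: Sequence[float]) -> float:
    """Batch style: the full dataset is available, so compute over all of it at once."""
    return sum(values) / len(values)


def streaming_average(values: Iterable[float]) -> Iterator[float]:
    """Streaming style: update a running result as each event arrives."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        # An up-to-date answer is available after every event.
        yield total / count


# Hypothetical sample data.
readings = [10.0, 20.0, 30.0, 40.0]

batch_result = batch_average(readings)
stream_results = list(streaming_average(readings))

print(batch_result)
print(stream_results)
```

The batch function needs the whole dataset before it can answer, which suits scheduled jobs. The streaming function produces an answer after every event, which is where the emphasis on speed comes from.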
3. Kappa vs. lambda architecture?
Twitter is one example of lambda architecture. Data is split into two paths, one of which is fed to a speed layer for quick insights, while the other path leads to batch and service layers. Minton said that this model gives an organization access to both batch and streaming insights, and balances lossy streams well. The challenge here, he said, is that you have to manage two code and app bases.
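The two-path idea can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not how Twitter or any specific product implements it: the class name `LambdaPipeline`, its methods, and the event names are all hypothetical.

```python
from collections import Counter


class LambdaPipeline:
    """Toy lambda architecture: a batch path and a speed path merged at query time."""

    def __init__(self):
        self.master = []             # every raw event, kept for batch recomputation
        self.batch_view = Counter()  # counts as of the last batch run
        self.speed_view = Counter()  # counts for events seen since the last batch run

    def ingest(self, event):
        # Each event goes down both paths.
        self.master.append(event)
        self.speed_view[event] += 1

    def run_batch(self):
        # Batch layer: recompute the view from all raw data, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        # Serving layer: combine the batch view with the recent speed view.
        return self.batch_view[key] + self.speed_view[key]


pipeline = LambdaPipeline()
for event in ["click", "view", "click"]:
    pipeline.ingest(event)

before_batch = pipeline.query("click")  # answered from the speed layer only
pipeline.run_batch()
pipeline.ingest("click")
after_batch = pipeline.query("click")   # batch view plus one recent event

print(before_batch, after_batch)
```

Even in this small sketch the counting logic appears twice, once in `ingest` and once in `run_batch`. That duplication is a miniature version of the two code bases Minton describes.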
Kappa architecture treats everything as a stream, but it's a stream that aims to maintain data fidelity and process in real time. All data is written to an immutable log that changes are checked against. It is hardware efficient, with less code, and it is the model that Minton recommends for an organization that is starting fresh with big data.
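A comparable sketch of the kappa approach, again a simplified in-memory illustration with hypothetical names (`KappaPipeline`, `count_by_type`, `total_amount`) and made-up events: everything is appended to a single log, and any view is produced by replaying that log through one processing function.

```python
class KappaPipeline:
    """Toy kappa architecture: one append-only log, one processing path."""

    def __init__(self):
        self.log = []  # append-only; existing entries are never modified

    def append(self, event):
        self.log.append(event)

    def build_view(self, process, from_offset=0):
        # Derive a view by replaying the log through a single stream processor.
        state = {}
        for event in self.log[from_offset:]:
            process(state, event)
        return state


def count_by_type(state, event):
    state[event["type"]] = state.get(event["type"], 0) + 1


def total_amount(state, event):
    state["total"] = state.get("total", 0) + event["amount"]


pipeline = KappaPipeline()
pipeline.append({"type": "order", "amount": 10})
pipeline.append({"type": "order", "amount": 20})
pipeline.append({"type": "refund", "amount": -5})

counts = pipeline.build_view(count_by_type)
totals = pipeline.build_view(total_amount)  # new logic: just replay the same log

print(counts)
print(totals)
```

Because there is no separate batch layer, changing the logic means writing a new processing function and replaying the log, which is one way to read the "less code" point above.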
4. Public vs. private cloud?
Public and private cloud for big data require many of the same considerations. For starters, an organization must consider what environment its talent is most comfortable working in. Data provenance, security and compliance needs, and elastic consumption models should also be taken into account.
5. Virtual vs. physical?
Years ago, the debate around virtual vs. physical infrastructure was much more heated, Minton said. However, virtualization has become competitive enough with physical hardware that the two are now similar when it comes to big data deployments. It boils down to what your administrators are more comfortable with and what works for your existing infrastructure.
6. DAS vs. NAS?
Direct-attached storage (DAS) used to be the only way to deploy a Hadoop cluster, Minton said. However, now that IP networks have increased their bandwidth, the network-attached storage (NAS) option is more feasible for big data.
With DAS, it is easy to get started, and the model works well with software-defined concepts. It is built to handle linear growth in performance and storage, and it does well with streaming data.
NAS handles multi-protocol needs well, provides efficiency at scale, and can address security and compliance needs.