Creating an integrated pipeline for big data workflows is complex. Read about several factors to consider.
Organizations that elect to develop their own big data analytics are quickly discovering what they learn on every other IT project: vendors' inability to agree on standards in a timely manner can obstruct integration and slow the flow of data through big data pipelines.
What is the big data pipeline?
The pipeline is the end-to-end data flow designed to produce value from big data. It begins with data collection and proceeds to the cleaning or filtering of this data. Next comes the structuring of data into repositories for easy access, followed by the development of effective query tools that can dig into the data to uncover uncommon answers to pressing business problems.
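As a rough illustration of those four stages, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the sample records, the function names, and the choice of an in-memory SQLite table as the "repository." A production pipeline would swap each stage for a tool suited to its data volumes.

```python
import sqlite3

# Hypothetical raw records standing in for whatever a collection step gathers.
RAW_RECORDS = [
    {"device": "sensor-1", "reading": "21.5"},
    {"device": "sensor-2", "reading": ""},  # incomplete record
    {"device": "sensor-1", "reading": "22.1"},
    {"device": "sensor-3", "reading": "19.8"},
]


def collect():
    """Stage 1, collection: here it simply returns the in-memory sample."""
    return list(RAW_RECORDS)


def clean(records):
    """Stage 2, cleaning: drop records with no reading, convert the rest to floats."""
    return [
        (r["device"], float(r["reading"]))
        for r in records
        if r["reading"].strip()
    ]


def structure(rows):
    """Stage 3, structuring: load cleaned rows into a queryable repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def query(conn):
    """Stage 4, querying: average reading per device, sorted by device name."""
    cursor = conn.execute(
        "SELECT device, AVG(reading) FROM readings GROUP BY device ORDER BY device"
    )
    return cursor.fetchall()


def run_pipeline():
    """Chain the four stages in order."""
    return query(structure(clean(collect())))
```

Calling `run_pipeline()` returns one `(device, average)` row for each device that survived cleaning. Keeping each stage as its own function mirrors the point that each step tends to need its own skills and tools.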
Putting together this pipeline isn't easy, nor is it necessarily straightforward.
For starters, incorporating the rush of incoming data can be overwhelming. Much of the big data originating from Internet of Things logs must be separated from "just noise" signals that chronicle communications handoffs and other systemic processes but yield no useful information.
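A minimal sketch of that kind of noise filtering might look like the following. The log format (`timestamp device event value`), the set of event types treated as noise, and the function names are all hypothetical; real IoT logs vary widely by device and vendor.

```python
# Hypothetical event types that record systemic chatter rather than measurements.
NOISE_EVENTS = {"handoff", "heartbeat", "keepalive"}


def parse_line(line):
    """Split 'timestamp device event value' into a dict; return None if malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    timestamp, device, event, value = parts
    return {"timestamp": timestamp, "device": device, "event": event, "value": value}


def filter_noise(lines):
    """Yield only well-formed records whose event type is not in NOISE_EVENTS."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["event"] not in NOISE_EVENTS:
            yield record


# Hypothetical sample log lines.
SAMPLE_LOG = [
    "2016-01-01T00:00:00 sensor-1 temperature 21.5",
    "2016-01-01T00:00:01 sensor-1 handoff tower-7",
    "2016-01-01T00:00:02 sensor-2 heartbeat ok",
    "garbled line",
    "2016-01-01T00:00:03 sensor-2 temperature 19.8",
]

kept = list(filter_noise(SAMPLE_LOG))
```

Because `filter_noise` is a generator, it can process a stream line by line without holding the whole log in memory.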
Then, there are the end users. Finance, for instance, might want to merge systems of record data with data streaming in from the internet. Pulling all of this data together to yield the expected results can be complicated, with each task along the data pipeline requiring different sets of skills and tools.
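To make the finance example concrete, here is a small sketch of joining streamed events against a system of record. The account table, event fields, and function names are invented for illustration, and the choice to silently skip unmatched events is an assumption; a real pipeline might quarantine them for review instead.

```python
from collections import defaultdict

# Hypothetical system-of-record data: account id -> account name.
ACCOUNTS = {"A100": "Acme Corp", "A200": "Globex"}


def enrich(events, accounts):
    """Attach the account name from the system of record to each streamed event."""
    for event in events:
        name = accounts.get(event["account_id"])
        if name is None:
            continue  # assumption: drop events with no matching account
        yield {**event, "account_name": name}


def total_by_account(enriched_events):
    """Sum event amounts per account name."""
    totals = defaultdict(float)
    for event in enriched_events:
        totals[event["account_name"]] += event["amount"]
    return dict(totals)


# Hypothetical events arriving from an internet-facing stream.
STREAMED_EVENTS = [
    {"account_id": "A100", "amount": 10.0},
    {"account_id": "A999", "amount": 5.0},  # no matching account
    {"account_id": "A100", "amount": 2.5},
    {"account_id": "A200", "amount": 7.0},
]

totals = total_by_account(enrich(STREAMED_EVENTS, ACCOUNTS))
```

Even in this toy form, the join depends on both sources agreeing on an identifier, which is one reason pulling such data together can be complicated.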
"The challenges for many organizations working with their own big data pipelines is finding the 'glue' to stick all of these different skills and toolsets together into a cohesive fabric," said Ion Stoica, CEO of Databricks and a professor at the University of California, Berkeley. "When you use Hadoop, Storm, GraphLab, and other big data solutions, each comes with its own set of tools. Each also uses its own programming language, whether it is Java, C++, or something else. For enterprises constructing their own big data pipelines, it is left up to them to stick these tools together into a collective infrastructure that can manage all of the various areas of the pipeline."
Databricks' own product uses Apache Spark, which Stoica says is a single processing and storage engine capable of supporting a diversity of computing platforms and tools. "With Apache Spark, a site has to worry about managing only one API (application programming interface), which makes managing the data pipeline easier," he said.
Another approach to data pipeline management that more organizations are warming to is managing the entire pipeline in the cloud and using third-party cloud service providers such as Databricks to do the heavy integration work. "A cloud-based solution is easy to implement and to manage," said Stoica. "It is literally possible to instantiate an end-to-end data pipeline in seconds." This enables executives throughout the enterprise to get their big data results in real time through interpretative dashboards that also give them the flexibility to look at the data in different ways and to drill down into it.