Apache Spark rises to become most active open source project in big data

Interest in adopting Spark has topped interest in MapReduce, says a new survey. Driving that interest are the need for speed, greater agility, and higher revenues.

Apache Spark continues to attract attention in the big data world, where it's expected to help drive the next wave of innovation.

A survey on Hadoop from big data company Syncsort showed that 70% of survey participants are most interested in Spark, higher even than MapReduce, the current adoption leader, at 55%.

Syncsort surveyed 250 IT professionals. Of that group, 66% were from firms with more than $100 million in annual revenue.

A healthy interest is not a surprise. In Apache Spark's relatively short life, there's been much discussion of its ascendancy. In September, Databricks, the company behind Spark, released results from a survey showing that Spark is the most active open source project in big data, with more than 600 contributors within the past year, up from 315 in 2014. Plus, Spark is in use not just in the IT industry, but also in areas like finance, retail, advertising, education, health care, and more. That survey also showed that 51% of Spark users are using three or more Spark components.

It's also helpful to have backing from a company like Cloudera, which announced in September its own initiative to improve Spark. That initiative also positions Spark to replace MapReduce as the default processing engine for Hadoop, a real-life example of the much-buzzed shift from one to the other.

SEE: Spark promises to up-end Hadoop, but in a good way

In November 2015, Cloudera published a Year in Review for Apache Spark saying that Spark has 50% more activity than the core Apache Hadoop project itself. The report also said that Cloudera has more clients running Spark than all other Hadoop distributions combined.

So, yes. Spark is popular.

Last year, TechRepublic's Matt Asay spoke with Databricks' Ion Stoica.

"Stoica feels strongly that Spark will replace MapReduce as Hadoop's default execution engine but was equally sure that it would continue to complement the rest of the Hadoop ecosystem," Asay wrote.

Asay boiled down the reasons for Spark's popularity to these: simplicity, performance, and flexibility.

Gartner analyst Nick Heudecker wrote in a 2015 blog post that companies realize they can't take a wait-and-see approach, like they did previously with Apache Hadoop. Wait-and-see can mean missing the bus.

Hype is a powerful driver, though. As Asay pointed out in another article, big data changes quickly, and it wasn't so long ago that MapReduce was the "it" technology.

TechRepublic spoke via email with Syncsort's general manager of big data, Tendü Yoğurtçu, about Spark development and the other main trends in the survey. Along with her insights, she shared what IT leaders need to know about each of the trends.

It is useful to look at what's driving those trends in terms of the three Vs of big data: volume, variety, and velocity. The need for speed (velocity) is evident in the shift to real-time data. Volume and variety are on display in enterprise efforts to pool different types of data, from streaming and legacy sources, to cut costs and boost agility.

And there may even be a fourth V—visibility—when it comes to Hadoop-based approaches to data governance and security.

Trend: Apache Spark will move from a talking point into deployment

Tendü Yoğurtçu: While MapReduce will remain the prevalent compute framework in production, we can expect to see more Spark deployments in 2016 as the need for real-time insights increases. Several factors are contributing to this trend, one of which is that vendors like Cloudera, Hortonworks, IBM and MapR are backing Apache Spark as a compute framework over Hadoop. Apache Spark has the promise of being the single compute framework for a variety of workloads, for real-time and batch, for interactive queries and predictive analytics, etc., [which is] a clear benefit for enterprises.

What IT leaders need to know: Organizations that are considering Apache Spark today and want to keep their options open in the future should look for tools that allow them to visually design data transformations once and run them in a standalone or distributed manner across multiple compute frameworks including Hadoop, MapReduce, and Spark, on premises or in the cloud, without the need to recompile or rewrite applications. While Apache MapReduce is very much designed for batch workloads, Apache Spark can also be used for streaming and interactive queries. It's important to clearly identify the goals and specific use case for your big data project, and then select the products that best fit your needs.
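To make the "design once, run anywhere" idea concrete, here is a minimal sketch in plain Python. It is not taken from Syncsort's tools or from Spark; the names `count_events`, `run_standalone`, and `run_partitioned` are hypothetical, and worker threads only stand in for a distributed engine. The point is that the transformation itself never changes, only the runner does.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


# Hypothetical transformation, defined once and unaware of any engine.
def count_events(records):
    """Count records per event type."""
    return Counter(r["event"] for r in records)


def run_standalone(transform, records):
    """Run the transformation in one pass over all records."""
    return transform(records)


def run_partitioned(transform, records, partitions=3):
    """Split the records, transform each partition in a worker thread,
    then merge the partial counts (a stand-in for a distributed run)."""
    chunks = [records[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(transform, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


records = [{"event": e} for e in ["click", "view", "click", "buy", "view", "click"]]
standalone = run_standalone(count_events, records)
partitioned = run_partitioned(count_events, records)
```

Both runners return the same counts here because counting can be merged across partitions in any order. A transformation without that property would need a different merge step.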

Trend: More organizations will leverage streaming, real-time data sources

Tendü Yoğurtçu: Important business decisions often require the most recent data available. What good is monitoring for fraud detection or pulling data for insurance claim validation if the data itself is outdated? This is why enterprises are looking to leverage streaming and real-time data analytics to make informed decisions, quickly. Also, with the increase in the number of connected devices and IoT use cases, more and more organizations are looking into having a single data pipeline where they can accommodate both batch and streaming data sources.

What IT leaders need to know: Simplifying and unifying the interface for batch and streaming data while taking advantage of platform optimizations will be a main focus for these initiatives. The ability to transform and prepare data in flight will be more important, eliminating the need for staging increasing volumes of data. Though challenging, this will also create an opportunity to deliver next-generation data integration products, future-proofing users' applications while taking advantage of highly scalable and distributed platforms including Apache Hadoop and Apache Spark.
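A small, hedged sketch of what one interface for batch and streaming data can look like, again in plain Python rather than any product named in the survey. The function `clean` and the generator `stream_source` are hypothetical. The same function accepts a stored list or a live feed and transforms each record in flight, without staging the full dataset first.

```python
from typing import Iterable, Iterator


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Drop records without an amount and normalize the rest, one at a time."""
    for r in records:
        if r.get("amount") is None:
            continue  # skip incomplete records as they arrive
        yield {"id": r["id"], "amount": round(float(r["amount"]), 2)}


def stream_source() -> Iterator[dict]:
    """Hypothetical stand-in for a live feed that yields records over time."""
    yield {"id": 3, "amount": "7.5"}
    yield {"id": 4, "amount": None}
    yield {"id": 5, "amount": 12}


# Batch input: a list that already exists in full.
batch = [{"id": 1, "amount": "19.999"}, {"id": 2, "amount": None}]

batch_result = list(clean(batch))
stream_result = list(clean(stream_source()))
```

Because `clean` is a generator, it holds only one record at a time, which is the property that makes in-flight preparation possible in this toy version.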

Trend: Offloading from expensive platforms into Hadoop will continue to increase in numbers and scope

Tendü Yoğurtçu: Operational efficiency use cases continue to be the low-hanging fruit for big data initiatives. Offloading expensive workloads from legacy platforms into Hadoop helps organizations liberate data and budgets while gaining business agility.

What IT leaders need to know: The first step in offloading these expensive workloads is to build the enterprise data hub. This requires the ability to easily access all enterprise data, whether it comes from mainframes, relational databases, or clickstream applications, which can be a daunting task. Having a single view of all enterprise data, simplified access with point-and-click interfaces, and security integration with the Hadoop stack will be critical for maximizing ROI and for the successful implementation of these initiatives.

Trend: Data governance and security will be major areas of focus as organizations move to production deployments

Tendü Yoğurtçu: The emergence of new and more efficient tools for data management has opened the door for businesses to adopt a Hadoop-first approach to data management. With the enterprise data hub and data-as-a-service implementations, the requirement for data governance and security has become more critical than ever. As Hadoop matures as a data platform, enterprise focus will shift more to governance and compliance.

What IT leaders need to know: IT leaders will need to get a 360-degree view of the data across the enterprise and also partner with the lines of business they are servicing to define the right level of access for that data. Security integration with Hadoop will be critical.

Trend: Companies will seek to drive ROI for big data projects by simplifying the technology infrastructure

Tendü Yoğurtçu: As the number of applications in production deployments increases, the ability to leverage a single software environment for accessing all enterprise data—batch and streaming—has become more important than ever. This simplified infrastructure allows organizations to more efficiently allocate resources and scale workloads for whatever their needs may be. This helps maximize ROI on big data projects, especially with regard to real-time analytics that create more insights for businesses.

What IT leaders need to know: To realize the full strategic benefits of Hadoop and big data, businesses need a streamlined approach to managing all the different tools. This includes looking for an easy graphical interface for analysis and relying on developers who have a deep understanding of these tools and their underlying frameworks. It also means insulating the applications from the rapidly changing technology stack and integrating with highly scalable and distributed frameworks like MapReduce and Spark.

Brian Taylor is a contributing writer for TechRepublic. He covers the tech trends, solutions, risks, and research that IT leaders need to know about, from startups to the enterprise. Technology is creating a new world, and he loves to report on it.
