Over the years, the Apache Cassandra community has demonstrated the best and worst of open source collaboration. But a funny thing happened on the way to Cassandra’s 4.0 (beta) release: A sometimes fractious family of contributors came together to deliver something truly exceptional. Cassandra is already one of the world’s most popular databases (currently ranked #10 on DB-Engines.com), and its 4.0 beta release promises new levels of stability while also rediscovering flair. As Instaclustr CTO Ben Bromhead put it, “I’m an absolute sucker for process and quality improvement and Cassandra 4.0 has this in spades, but the improvements around Netty and Zero Copy Streaming also look super cool.”
To learn more about the release, and why enterprises comfortable in their relational data models should care, I talked to Josh McKenzie, Apache Cassandra Committer and PMC (Project Management Committee) Member.
An open source community grows up and together
Of course, if you’ve been paying attention to the world of databases over the years, you know that data is not sitting comfortably in the tidy rows and columns of relational databases. Modern data often doesn’t fit. Asked about this, McKenzie noted, “We don’t know what the data of tomorrow looks like,” making it critical to rely on open source while also exploring non-relational approaches to data management.
While Cassandra has long been a popular option with enterprises, for years the community neglected key stability issues. What had once been a strength became a weakness.
But this is also where Cassandra becomes such an interesting success story. For years I’ve argued that unless users of open source contribute back, open source won’t achieve its maximum impact. Vendors are nice, but open source users have unique perspectives on how to improve software.
In the case of Cassandra, key users include Apple, Netflix, and Instagram, which increased their participation in the project even as some vendors reduced theirs. The 4.0 release represents a near-perfect confluence of vendors and users coming together to make Cassandra dramatically better, as McKenzie pointed out:
The Cassandra community is incredibly robust at this point. While it’s somewhat bimodal between contributors employed by DataStax and Apple with regards to the lines of code in the 4.0 release, the number of humans involved and contributors scratching their own itch represents the majority of commits on the project. While committers are of course involved in every merge to the code-base (as per the Apache Way), on 60+% of the tickets the other side of that work is someone that’s contributing their time and energy into the project. That kind of diversity is crucial to the long-term resilience of an open-source community and we’re quite happy with how things look on that front coming up to 4.0.
One key area where users, in particular, have contributed is Cassandra stability.
Making Apache Cassandra stable…together
As McKenzie related, Cassandra 4.0’s improved stability comes, in large part, from “a significant amount of real-world workload testing” going on at big contributors that replay real use-cases through the system to ensure that clusters are healthy both in mixed-version mode (i.e., during an upgrade) and after the upgrade completes. For example, Netflix engineers have done performance testing at scale.
The result? As McKenzie related, the 4.0 release has over 30% more bug fixes and improvements in it than the 3.0 release and “is the best tested, most stable .0 release of Cassandra ever.” The addition of Zero Copy Streaming, mentioned above, means scaling clusters will be up to 5x faster without vnodes (virtual nodes), and recovery from hardware failure should be 5x faster, as well. “We’ve never seen the community really rally around quality and stability in this way,” he said.
At the same time, the addition of full query logging, real-time audit logging, and workload replay gives administrators significantly more visibility into what people are doing in the database. Ultimately, therefore, “4.0 is targeting everyone that runs Cassandra, making all the core basics of how it’s used more robust, visible, and elastic,” said McKenzie. The payoff: performance improvements of better than 20% in many of the workloads the community has been using for regression testing.
As for what comes next (in Cassandra 5.0), the project is “moving towards a pluggable, modular storage engine and adding new ways to visualize and explore the data in your system, all while keeping the scale and availability guarantees users demand from the database,” McKenzie noted. Furthermore, he stressed, “We’re keenly aware that Cassandra needs to keep evolving to keep up with innovation in other adjacent and complementary spaces and meet users where they are, helping them solve the interesting, fast-paced problems they’re looking to solve in modern, cloud-native application development.”
Because the Cassandra community has learned how to bring vendors and users together, it’s well poised to deliver on this promise.
Disclosure: I work for AWS but the views expressed herein are mine, and don’t reflect those of my employer.