Why is machine learning finally real? It’s the data, stupid. Lots (and lots) of data.

That’s a key message from Cloudera co-founder Mike Olson’s Strata + Hadoop World keynote earlier this week in San Jose, California. As he declared: “The algorithms that early researchers and current practitioners use are ravenous for data and we finally have enough data on the planet to feed them. They also need scale-out computation and storage at low cost.”

In fact, the mountains of data that we now enjoy are a direct result of high-quality open source software running on commodity hardware: More applications churning out more data for more people.

A game only the rich can play

Despite this low-cost hardware and software, and its impact on machine learning, let’s be clear: Big enterprises are the primary beneficiaries. Why? As Olson went on to explain, among enterprises doing over $1 billion a year in revenue (Cloudera’s target customers), “the appetite for these [machine learning] capabilities is insatiable” as they “absolutely have the data at scale.”

Data, after all, is necessary to train the machines. A small company could have big plans, but without big data to feed those plans, it’s a losing battle. As such, large enterprises are in a prime position to use big data to enrich themselves and effectively hold off smaller would-be competitors.

(As a side note, as useful as open source has been, we really need to have open data sets. Stanford has been exemplary in this, annotating data to make it more readily useful for machine learning. This is a new frontier in “open source,” and we need to explore it more.)

Sparking ML

One thing that aids these big companies has come from an egalitarian source: Apache Spark. I’ve written about Spark’s impact on multiple occasions, but it’s easy to understate just how important it has been. Indeed, though Cloudera recognized the importance of Apache Spark early on, Olson noted in his keynote, one aspect of it has “taken them by surprise.”

“[Spark] allowed people to build and deploy scale-out machine learning applications much faster than they had previously done. [Why?] Its flexibility and ease of programming meant that you could build machine learning apps, train up models on massive data very, very quickly. That has led to huge interest in the ecosystem.”

People thought Spark was simply a better Hadoop. It turns out that it offers much, much more.

Additionally, this pace of innovation isn’t going to slow down. The opposite is true, according to Olson. He went on to explain that, as impressive as the pace of innovation has been in machine learning and other big data software, “fasten your gravity belts” because it’s about to get much faster, given the investments that hardware vendors like Intel are making in optimizing chips and other hardware to improve computational speed, among other things.

In theory, such benefits could accrue to any company. In practice, only those in the billion-dollar club can really afford to play, because only they have the right currency: Data. Copious quantities of data.