Data science is all about experimentation, which is why the cloud is increasingly the go-to platform for big data, as Amazon's GM of Data Science told TechRepublic.
Two years ago, everyone was talking about big data, but few had the slightest clue how to do it productively. Today, that's changing, but arguably not for everyone.
The common denominator for big data success is the cloud, suggested Matt Wood, general manager of Data Science at Amazon Web Services (AWS), in a conversation this week. Given how much big data depends on experimentation, anyone still trying to apply fixed data center assets to ever-changing data sets and business problems is doomed to fail.
Have data center, will wrangle data?
Not that everyone believes this, of course. Forrester analyst Richard Fichera, for example, argued that Hadoop requires dedicated data center infrastructure to yield adequate performance.
While Fichera made some good points, he missed the key attribute of big data success: experimentation and iteration.
Part of the reason so many organizations fail to get much value from their data is that they approach it in the wrong way. When Gartner asked for the top big data challenges, the common theme was "We don't know what we're doing" (Figure A):
Figure A: Big data challenges.
That's not surprising to Matt Wood, however. As he told me, the biggest inhibitor to big data productivity has been inflexible data infrastructure.
"The skills were there two years ago" to analyze and derive value from data, he suggested, "but the cost of running a big data project was too high." It's hard to get a manager or purchasing department excited about buying expensive hardware or software for an experiment, but by their very nature, big data projects are experimental.
As he told me, "You're going to fail a lot of the time, and so it's critical to lower the cost of experimentation." After all, you don't necessarily know which data you need to collect in advance or the questions you should ask of it.
You need to be able to scale your infrastructure as your experiment justifies it.
Cloud and big data: Bosom buddies
The cloud, and AWS in particular, has dramatically lowered the cost of scaling infrastructure, which "enables and reduces the blast radius of experimentation," declared Wood.
Again, consider running a big data project in your data center. While all that dedicated hardware and software sounds like a great idea, the question becomes, "dedicated to what?"
As Wood described,
"Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on. You need an environment that is flexible and allows you to quickly respond to changing big data requirements. Your resource mix is continually evolving - if you buy infrastructure it's almost immediately irrelevant to your business because it's frozen in time. It's solving a problem you may not have or care about any more."
This has led a bevy of organizations of different sizes, across all verticals, to flock to AWS and buy into its full suite of data services.
But it has also led to a new generation of data-savvy professionals with real experience running big data projects.
Experimentation builds expertise
When I asked how real this shift was, Wood told me the cloud has led to "an exponential curve" of big data talent growth. In other words, as more people can work with their data on a trial-and-error basis in the cloud, the level of expertise has grown profoundly.
AWS has helped to change things, lowering the bar to getting started and becoming productive with an organization's data:
"It was true 18 months ago that people didn't know what they were doing with big data. Now, however, because customers can start experimenting easily at low cost, the growth in skills has been astronomical over the last two years."
This isn't about petabytes of data or any particular definition of "big data." According to Wood, "We want to enable customers to work productively with data at any scale. It's not about favoring one technology over another. It's not about petabytes of data. It's about offering high-quality, easy-to-use services at low cost to drive adoption."
It's working. Take Redshift, for example. The data warehouse service is AWS' fastest-growing service, both because it makes hitherto complicated enterprise data warehousing easier and cheaper for those shackled to expensive, complex EDW solutions and because Redshift "brings data warehousing to a greater breadth of customers."
This could become AWS' lasting legacy: while the world talked about big data for years, AWS made it a reality for much of the market.