Commentary: At last, storage is fueling the big data hype, which is also fueling artificial intelligence.
We spent a lot of time talking about big data in the early 2010s, but much of it was just that: talk. A few companies figured out how to effectively put large quantities of highly varied, voluminous data to use, but they were more the exception than the rule. Since then, more companies are finding success with AI and other data-driven technologies. What happened?
According to investor Matt Turck, big data finally became real when it became easy. Whereas early efforts to store and process massive quantities of data like Apache Hadoop were more of a "headfake," he suggested, more modern "cloud data warehouses...provide the ability to store massive amounts of data in a way that's useful, not completely cost-prohibitive and doesn't require an army of very technical people to maintain."
Big data, in other words, became truly "big" the moment it became more usable by mainstream enterprises. Think of this more approachable, affordable data as the fuel. The question is what we'll use it to power. Oh, and who will sell the big data pickaxes and shovels?
Raining on the clouds
On this last question, it's fascinating to note that some of the most important companies in this data infrastructure world aren't the clouds. Even more interesting, companies like Databricks and Snowflake happily run on top of the compute from AWS, Google Cloud and Microsoft. The cloud providers have massive quantities of data (no one has done more to modernize how enterprises run than Amazon's S3 storage service), run their own data warehouse services and yet still have ceded ground to comparatively tiny competitors.
If you're a startup, this should give you hope.
As I've pointed out, while some cloud providers may not like customers to consider "multicloud," these data infrastructure startups increasingly hedge their cloud bets by ensuring they run equally well across the big three cloud providers. Data is the critical component of strategic advantage. By giving customers easy ways to move application data between clouds, these startups ensure that they, not the underlying clouds, steer their customers' data destinies.
This is one reason that venture funding for AI startups is on an absolute tear. As Turck mentioned, CB Insights pegged AI funding at $36 billion in 2020; in just the first six months of 2021, AI startup funding topped $38 billion. Few seem to be betting on the big clouds scooping up all the returns on AI investments. Nor are VCs leaving the clouds to define data infrastructure.
So where does Turck see data infrastructure and AI heading over the next year?
Where the money goes
In data infrastructure, Turck called out the following trends:
Data mesh: Like microservices in software development, the idea is to "create independent data teams that are responsible for their own domain and provide data 'as a product' to others within the organization."
DataOps: Like DevOps but for data, it involves "building better tools and practices to make sure data infrastructure can work and be maintained reliably and at scale."
Real time: We've been talking about this for years, but Confluent's IPO and continued success indicate a desire to work with real-time data streaming across a broader range of use cases than originally supposed.
Metrics stores: Building trust in enterprise data by "standardiz[ing] definition of key business metrics and all of its dimensions, and provid[ing] stakeholders with accurate, analysis-ready data sets based on those definitions."
Reverse ETL: "[S]its on the opposite side of the warehouse from typical ETL/ELT tools and enables teams to move data from their data warehouse back into business applications like CRMs, marketing automation systems, or customer support platforms to make use of the consolidated and derived data in their functional business processes."
Data sharing: Helps companies to "share data with their ecosystem of suppliers, partners and customers for a whole range of reasons, including supply chain visibility, training of machine learning models, or shared go-to-market initiatives."
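To make the reverse ETL idea from the list above a little more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular vendor does it: an in-memory SQLite database stands in for the data warehouse, a plain dictionary stands in for a CRM, and the table, column and function names (orders, lifetime_spend, reverse_etl) are all hypothetical.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory SQLite database standing in for a data warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("c1", 120.0), ("c1", 80.0), ("c2", 15.5)],
    )
    return conn


def reverse_etl(conn, crm):
    """Copy a derived metric (lifetime spend) from the warehouse into CRM records."""
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    for customer_id, lifetime_spend in rows:
        # Hypothetical CRM "API": here it is just a dict keyed by customer id.
        crm.setdefault(customer_id, {})["lifetime_spend"] = lifetime_spend
    return crm


# The CRM already holds operational data; reverse ETL enriches it in place.
crm_records = {"c1": {"name": "Acme"}, "c2": {"name": "Globex"}}
reverse_etl(build_warehouse(), crm_records)
```

The point of the sketch is the direction of flow: the consolidated, derived value is computed in the warehouse and pushed back into the business application, rather than the other way around.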
And what about the world of AI that emerges from this data infrastructure?
Feature stores: "It acts as a centralized place to store the large volumes of curated features ['an individual measurable input property or characteristic'] within an organization, runs the data pipelines which transform the raw data into feature values, and provides low latency read access directly via API."
ModelOps: "[A]ims to operationalize all AI models including ML at a faster pace across every phase of the lifecycle from training to production."
AI content generation: Models like GPT-3 are used for "creating content across all sorts of mediums, including text, images, code, and videos."
Continued emergence of a separate Chinese AI stack: "With nationalist sentiment at a high, localization to replace western technology with homegrown infrastructure has picked up steam."
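The feature store description in the list above names three jobs: storing curated features, running the pipelines that turn raw data into feature values, and serving those values with low-latency reads. A toy sketch of those three jobs in Python might look like the following. The class and method names (FeatureStore, register, materialize, get_features) are hypothetical and not taken from any real product, and an in-process dictionary stands in for the low-latency serving layer.

```python
from typing import Callable, Dict, List


class FeatureStore:
    """Toy feature store: register transforms, materialize values, serve reads."""

    def __init__(self) -> None:
        self._transforms: Dict[str, Callable[[dict], float]] = {}
        self._values: Dict[str, Dict[str, float]] = {}

    def register(self, name: str, transform: Callable[[dict], float]) -> None:
        """Declare a feature and the function that derives it from raw data."""
        self._transforms[name] = transform

    def materialize(self, raw_rows: Dict[str, dict]) -> None:
        """Run every registered transform over the raw data (the 'pipeline')."""
        for entity_id, raw in raw_rows.items():
            self._values[entity_id] = {
                name: fn(raw) for name, fn in self._transforms.items()
            }

    def get_features(self, entity_id: str, names: List[str]) -> List[float]:
        """Read precomputed feature values for one entity, in the order asked."""
        row = self._values[entity_id]
        return [row[name] for name in names]


store = FeatureStore()
store.register("order_count", lambda raw: float(len(raw["orders"])))
store.register(
    "avg_order_value", lambda raw: sum(raw["orders"]) / len(raw["orders"])
)
store.materialize({"c1": {"orders": [120.0, 80.0]}, "c2": {"orders": [15.5]}})
features = store.get_features("c1", ["order_count", "avg_order_value"])
```

Because the values are computed ahead of time in materialize, a read at prediction time is only a lookup, which is the property the "low latency read access" part of the description is getting at.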
Of course, not all of Turck's predictions will pan out. But if history proves a reliable guide, we'll continue to see explosive growth in data infrastructure and AI, supported and nurtured by the big clouds but not controlled by them. That's good for customers, and it's good for those who want to try to build the next Databricks.
Disclosure: I work for MongoDB, but the views expressed herein are mine.