With Google, Facebook, and other big web companies handing out the world's best data infrastructure for free, it has become a truism that all essential data infrastructure is now open source.
Except when it's not.
I spent several years at MongoDB, a leading open source NoSQL database. MongoDB's success has attracted scores of venture-backed market entrants, one of which is MemSQL. Where most of these new NoSQL entrants followed the traditional open source development and business model route, MemSQL moved in a totally different direction. MemSQL's software is (gasp!) not open source and (gasp again!) they embraced SQL. Even so, it keeps managing to grow in popularity, as measured by DB-Engines.
I recently caught up with MemSQL founder and CEO Eric Frenkiel to find out more about the company's latest moves. Frenkiel claims to be building the industry's first real-time data warehouse.
Faster than fast
TechRepublic: What are the trends you see driving the industry towards your concept of a real-time data warehouse?
Frenkiel: We like to put it this way: if you are running your analytics daily, how much better would it be for your business to run those analytics intra-daily, meaning multiple times per day? Nearly every business would say, "Yeah, I'd like to get more insights into my business every hour, rather than once every 24." It's no longer sufficient to wait a day, or even half a day, to find an answer. So, our customers come to us interested in real-time business intelligence, for example, and also in telemetry and monitoring of what's going on in their business. If anything changes, they can jump on it and react. That's the real shift we see in customers today.
A new kind of data warehouse
TechRepublic: How does this real-time data warehouse from MemSQL fit in?
Frenkiel: We melded a transactional capability into the data warehouse so we can ensure that any data loaded in real time is loaded with 100% accuracy and consistency. It's definitely finer-grained than just saying an order of magnitude boost in operations per second, or something to that effect. Typically, that kind of consideration is just how fast you can write something that you would then read later. But that type of thinking is old hat and no longer relevant in an era where you want to be analyzing what is happening now, as well as correlating that against what happened in the past, at any given time boundary.
SEE: How MemSQL is helping push streaming data forward (TechRepublic)
So, when we talk about the need to ingest in real time, we're introducing a new notion of updatable, fast ingestion. Data warehouses can only append; they cannot update. Since MemSQL can update in real time, that means we can be ingesting lots of data--millions of events per second--and we can still ensure that we are writing and updating and reading all at once.
Even the notion of how to describe this in a conventional sense of operations per second is really only scratching the surface of what customers want to derive value from, which is business intelligence and the ability to respond in real time to a change in their business.
Let me illustrate with a couple of customer examples that showcase how performant our system is. One popular social network site customer is ingesting 1 gigabyte per second into MemSQL for real-time processing. That's a high-water mark (to say the least) for getting that type of structured data ingested into our system. But we have others that really like the update capability. A content delivery network customer is processing nearly 10 million updates per second with MemSQL as well.
But just loading faster is not the point. If we can compress an ETL window substantially, that's great, but really what you want to do is get into streaming, into pipelining, into continual loading of data--in which case you're no longer thinking of writing to disk and then analyzing later.
Real-time for the rest of us
TechRepublic: What's a typical MemSQL customer use case?
SEE: Why some of the fastest growing databases are also the most experimental (TechRepublic)
Frenkiel: One of our core use cases is Customer 360. A company wants to track everything about a customer's experience--from clickstream, to previous purchases, to intent, and what they might do next. That requires joining many different datasets together, as well as writing very quickly when you're talking about millions of users. We do this for one global customer in a traditional industry to provide real-time Customer 360 to all 5,000 of their sales reps. When reps walk through the doors of their own customers, they know exactly the latest activity that's happened in the account. We have some of the world's largest Customer 360 customers using us to better service multi-million end-user accounts.
TechRepublic: Are your early customers mostly these corner cases, so-called early-adopter enterprises that have big data and impose atypical performance demands?
Frenkiel: No. This is a huge theme for us. When people talk about streaming and real-time pipelines, they can have these connotations about petabytes of data and milliseconds of latency. That's absolutely not the case for most customers. Our customers are businesses doing things in daily or weekly batches who just want to get to hourly or minute-level updates, or better yet, continuous ones. You don't want "batch" updates; you want "continuous" updates. Whatever timescale your business runs on, you want your analytics to be in sync with that timescale.
One of the risks of pointing to extreme, maximum-throughput use cases is that you might think you don't need this type of technology. On the contrary, you don't necessarily need web-scale data volumes to have a real-time data warehousing use case. It really depends on how much the business is delayed by its ETL, by its conventional batch processing, by its conventional batch reporting.