Big data isn't just for Silicon Valley's elite.
Though the Valley still leads in both creating and consuming weird-sounding big data projects like Hadoop and Mesos, mainstream enterprises are increasingly catching on.
Take, for example, Markerstudy Limited.
Markerstudy, an insurance company based in the United Kingdom with over 4,000 employees, won't be a household name for most, but that's what makes its adoption of Hadoop and Spark so interesting. It's not trying to optimize ad clicks. Instead, it's saving $7.5 million annually through better fraud detection and cutting customer cancellation rates by 50%.
You know, the boring, real-world business problems that mainstream enterprises actually need to solve.
Hadoop as a competitive advantage
In 2013, Markerstudy launched an ambitious platform called the Insurer-Hosted Rating Hub (IHR or "The Hub") where its 3,000+ insurance underwriting partners could share a common portal for centralized rating.
The Hub eliminated the need to publish rates in multiple places, ensuring timely information and consistency across intermediaries. The Hub also allows underwriters to change rates easily, giving the insurer control over every aspect of the rating process.
The following July, Markerstudy launched a complementary company IT initiative, the Big Data Insight project, which shifted The Hub's data collection, analysis, and reporting to a Cloudera Enterprise Data Hub/Apache Hadoop-based solution, leveraging Spark for near real-time processing and Zoomdata for big data visual analytics. (The New York Times reported that Zoomdata runs under the hood of Amazon's recently announced QuickSight big data analytics service.)
As I learned in an interview with Markerstudy's enterprise data solutions architect Nicholas Turner, the payoff was immediate.
Markerstudy's new platform for the Hub could now analyze hundreds of millions of quotes in seconds, processing 100% of insurer quotes rather than sampling, as Turner told me:
We analyze more data much faster than ever before so we can react to market changes almost as they happen. Today we analyze hundreds of millions of quotes in seconds compared to seven hours to analyze 400,000 quotes under the previous system. It also gives us a data platform that can be integrated with other components for easier analysis and presentation to other parts of the business and senior managers.
Importantly, Markerstudy's big data stack now enables it to look at far more data, not just snapshots.
Letting all that data in
Before Markerstudy moved to Cloudera's Enterprise Data Hub, capacity, storage, and traffic restrictions meant that only a 3% sample of quote data could be handled by its existing systems. Data was managed and analyzed using separate platforms and largely by different teams.
This resulted in a disconnected view of the customer, and extracting value from the data took considerable effort, resources, and cost. Records were shuttled between SQL Server and SAS in an overnight process that often ran for more than seven hours.
After five months of development and testing, the new platform could process 100% of the company's quote data and make it available for analysis, reporting, and visualization within seconds. Testing showed that the architecture can ingest over 20,000 messages per second at peak if required, more than 50 times the current volume, with ample capacity for projected growth.
What's even more interesting is that all of this runs on a 10-node cluster with four ingestion nodes, plus additional virtual machines for visualization. Not too shabby.
By far the largest volumes of data come from the more than 20 million insurance quotes generated each day, each individually rated to produce a premium based on various pricing factors. These include risk details provided by the customer; external enrichment such as credit scores, identity checks, vehicle and license information, and fraud data; and customer information, including existing customer data and profiling.
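To make the rating step concrete, here is a minimal sketch of how a quote might be rated against customer-declared risk details and external enrichment. The function name, factor names, and multiplicative loadings are illustrative assumptions, not Markerstudy's actual pricing model.

```python
# Hypothetical per-quote rating sketch. Factor names and values are
# invented for illustration; real insurer rating tables are far richer.

def rate_quote(base_premium, risk_factors, enrichment):
    """Apply multiplicative loadings from customer-supplied risk details
    and external enrichment data to a base premium."""
    premium = base_premium
    # Customer-declared risk details (e.g. postcode band, occupation class)
    for factor in risk_factors.values():
        premium *= factor
    # External enrichment: credit score, identity checks, fraud data, etc.
    for factor in enrichment.values():
        premium *= factor
    return round(premium, 2)

quote = rate_quote(
    base_premium=500.00,
    risk_factors={"postcode_band": 1.10, "occupation": 0.95},
    enrichment={"credit_score": 1.05, "fraud_flag": 1.00},
)
```

Rating each of 20 million daily quotes this way is embarrassingly parallel, which is part of why the workload maps naturally onto a Hadoop/Spark cluster.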
But it's not just Hadoop.
As Turner told me, Markerstudy turned to Spark to help identify previously hidden differences between customers for important new insights.
"You really gain new insights into your business by linking internal and external data," he said. "We can now identify previously undiscovered patterns between groups of customers that we can analyze against external factors, customer behavior, and profiles."
For example, Markerstudy can now surface fraud indicators earlier in the customer journey and make pricing decisions based on behavior. In particular, Spark is a key component of the real-time quote manipulation detection service, which spots changes to risk factors (such as postcode, occupation, or no-claims discount) made in order to lower a premium.
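The core idea behind such manipulation detection can be sketched simply: compare successive quote attempts from the same applicant and flag revisions where a watched risk factor changed and the premium dropped. This is a hypothetical illustration of that rule; the field names and flagging logic are assumptions, not the production Spark job.

```python
# Hypothetical quote-manipulation check. Field names and the flagging
# rule are illustrative; the real service runs as a streaming Spark job.

WATCHED_FIELDS = ("postcode", "occupation", "no_claims_discount")

def flag_manipulation(previous_quote, revised_quote):
    """Flag a revised quote whose watched risk factors changed in a way
    that lowered the premium relative to the previous attempt."""
    changed = [
        field for field in WATCHED_FIELDS
        if previous_quote[field] != revised_quote[field]
    ]
    premium_dropped = revised_quote["premium"] < previous_quote["premium"]
    return bool(changed) and premium_dropped, changed

suspicious, fields = flag_manipulation(
    {"postcode": "SW1A 1AA", "occupation": "courier",
     "no_claims_discount": 2, "premium": 910.00},
    {"postcode": "SW1A 1AA", "occupation": "office clerk",
     "no_claims_discount": 5, "premium": 640.00},
)
# suspicious is True; fields lists the altered risk factors
```

Applied per-applicant across a live quote stream, a rule like this is what lets fraud indicators surface while the customer is still mid-journey rather than after the policy is bound.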
To make all that data readable by people, Markerstudy turned to Zoomdata, which lets users filter and drill down into the data to identify complex patterns and trends, then display them in easily understood visualizations.
Some of that work is traditional retrospective analysis across very large data sets for modelling and scenario analysis, but Turner has also used Zoomdata to develop what he calls "speed of thought" analysis: business users can explore data in meetings, raise and answer questions as they come up, and monitor and report on operational data.
Big data for the rest of us
There are hordes of data geeks at Facebook and Google figuring out click-optimization algorithms, but that's an alien planet compared to most enterprise needs.
Even so, companies like Markerstudy are showing how Silicon Valley's preferred tools can be used to deliver significant customer benefit and competitive differentiation. In fact, as I've argued before, it is in the hinterlands that we can expect to uncover the biggest benefits of big data.
Matt Asay is a veteran technology columnist who has written for CNET, ReadWrite, and other tech media. Asay has also held a variety of executive roles with leading mobile and big data software companies.