It's almost a joke at this point — data scientists are essentially janitors.
The New York Times reported last year that data scientists spend anywhere from 50 to 80 percent of their time cleaning up data sets in order to find usable insights, the kinds of insights that can help businesses run more efficiently and provide better services.
H2O.ai wants to cut through that tedious data wrangling by giving data scientists, developers, business analysts, and the enterprise in general access to open source machine learning on which to build their applications.
For the unfamiliar, a smart application is a "new category of application software designed to support business activities that are people-intensive, highly variable, loosely structured, and subject to frequent change," according to Forrester Research.
Oleg Rogynskyy, H2O's vice president of marketing and growth, chatted via email with TechRepublic about the evolution of machine learning platforms, simplifying data wrangling, and the business need for predictive modeling.
TechRepublic: How can enterprises benefit from developing smart applications? What new capabilities are possible?
Oleg Rogynskyy: Smart applications are applications that predict or anticipate user behavior, events, or data points. A good example of machine learning in action is the product suggestion system on Amazon. Amazon takes your past purchase history and predicts what you'll want to buy next. However, a system like Amazon's required teams of data scientists and years of work to produce. In other words, machine learning and smart applications have been around for a while, but they weren't easy to use. The benefits, however, are tremendous, allowing enterprises to look into the future with a certain degree of accuracy.
TechRepublic: What are the major trends in your competitive space, machine learning platforms?
Oleg Rogynskyy: What we're seeing is a flattening and standardization of the machine learning stack. As machine learning tools evolve, they form better platforms for data scientists, developers, and business analysts to work with. As a result of this simplification, non-PhD developers and business analysts are able to truly explore their data for the first time. In addition, data scientists are able to be more efficient in learning new insights or detecting anomalies like fraud. We're seeing the appearance of more and more smart applications on the market, which let developers skip the data science and quickly put together applications that take advantage of leading machine learning models.
TechRepublic: In your experience, what are the common challenges that companies face when implementing a machine learning and predictive analytics platform?
Oleg Rogynskyy: The main challenge is that it's not easy to get to actionable insights. Only the largest companies are able to make the capital investments necessary to get results, while small to mid-size companies get left out. Part of this is due to the fact that developing a smart application requires you to string together a series of diverse technologies. The complexity of the technology means that you're reliant on data scientists to get the job done, and there aren't enough of them to go around. The massive size of the data sets organizations are working with today also means that they need a system that scales easily to accommodate more complex workflows, Extract, Transform, and Load (ETL) processes, and automation. Finally, the process of cleaning up data and preparing it for analysis, known as data munging [or data wrangling], is a nightmare. Ninety percent of the work that data scientists do today is data munging, which is hardly a good use of their time.
TechRepublic: How would you describe machine learning to a room full of executives?
Oleg Rogynskyy: Simply put, machine learning allows you to predict and anticipate future events or detect patterns. For example, by inputting all of your data on the customers you've lost, machine learning can tell you why they've left and, more importantly, who is likely to churn next.
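The churn example above can be sketched in miniature. The data, features, and model here are invented purely for illustration (a platform like H2O would handle this at far greater scale); a basic logistic regression, trained from scratch in plain Python, captures the core idea of learning from the customers you've already lost in order to score the ones you might lose next:

```python
# Toy churn-prediction sketch. All data and feature choices here are
# hypothetical; this is not H2O's API, just the underlying idea.
import math

def sigmoid(z):
    # Clamp extreme values so math.exp never overflows.
    if z > 60:
        return 1.0
    if z < -60:
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, lr=0.1, epochs=1000):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Each past customer: [support tickets filed, months since last purchase],
# labeled 1 if they churned, 0 if they stayed.
past_customers = [[0, 1], [1, 2], [5, 9], [6, 11], [0, 2], [4, 10]]
churned = [0, 0, 1, 1, 0, 1]

w, b = train(past_customers, churned)

# Score a current customer with a churn-like profile:
# many tickets, long gap since the last purchase.
risk = sigmoid(sum(wi * xi for wi, xi in zip(w, [5, 8])) + b)
print("churn risk: %.2f" % risk)
```

The model learns that high ticket counts and long purchase gaps correlate with churn, so the scored customer comes out high-risk while a recent, quiet customer would score low.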
TechRepublic: What differentiates your main product, H2O, an open source predictive analytics platform? Could you share one of the many use cases?
Oleg Rogynskyy: H2O was written from scratch in Java to work on any cloud or on-premises infrastructure. Most of our users are running H2O on commodity hardware, AWS, and Azure. Honestly, since it's written in Java, we could even run on your toaster! H2O has also re-implemented classic algorithms like GBM, GLM, Random Forest, and Deep Learning to allow for true parallelization; a cluster of computers can be scaled up or down at will.
In addition, H2O has simplified the data munging process so that data scientists can focus their time on the real work: getting answers from their data sets. H2O is also integrated with all popular data science languages including R, Python, and Scala and can work with any data repository, be it SQL, NoSQL, Hadoop, or Spark. Finally, H2O is easy to deploy out of the box, allowing you to bypass months of development by deploying smart applications at the push of a button.
To give you an example of H2O in action, Cisco maintains 60,000 models that it uses to forecast product demand. By using H2O to score their models, Cisco Principal Data Scientist Lou Carvalheira found that they were able to get to insights 10 to 15 times faster with three to seven times better accuracy.
TechRepublic: What is the business need for having a predictive model factory, and how does your solution fulfill that?
Oleg Rogynskyy: Organizations have massive amounts of data, and working with these large data sets is incredibly slow and limits the amount of time that data scientists can spend on actual model building and predictive analytics. A tool like H2O helps accelerate the speed of analysis and modeling by several orders of magnitude, thus freeing up data scientists to build more models and ask more questions. A predictive model factory, like what Cisco built, lets them segment their user base and customize their models for each type of customer. It also helps them update their models at much shorter intervals to take advantage of newer data. The ability to do something like that is an extremely powerful differentiator for organizations in today's competitive landscape.
Brian Taylor is a contributing writer for TechRepublic. He covers the tech trends, solutions, risks, and research that IT leaders need to know about, from startups to the enterprise. Technology is creating a new world, and he loves to report on it.