While there are well-established tools for building analytics and data-mining models to help businesses spot fraudulent transactions or recommend follow-up purchases to customers, plugging these models into applications can be a painful process.
As more businesses call upon these models to interrogate increasingly large datasets, it will become necessary to have an easy way to export and share these models between applications.
Sean Owen, director of data science at Hadoop specialist Cloudera, expects the next big growth area in big data will be in tools that make it simpler to share these models between applications.
“It seems to be the common problem, the wheel that keeps getting reinvented by customers,” he said.
“The default thing to do is someone makes a model in [the statistical modelling language] R and they say ‘Here’s a bunch of coefficients, go program this into some Java code and use this on the website’.
“That requires some expertise on behalf of the developer too, it’s very manual.
“They need something that the web service can ask in some standard simple way ‘Here’s a new data point, classify it for me’.”
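To make the manual hand-off Owen describes concrete, here is a minimal sketch, assuming the model is a logistic regression whose coefficients a developer has copied by hand from the modelling tool. The feature names, the coefficient values and the `classify` function are all hypothetical, invented for illustration only, and Python is used purely for brevity.

```python
import math

# Hypothetical coefficients, as a developer might copy them by hand from a
# model fitted elsewhere (for example in R). The values are illustrative only.
INTERCEPT = -3.0
COEFFICIENTS = {"amount": 0.002, "foreign": 1.5}


def fraud_probability(point):
    """Logistic regression score: sigmoid(intercept + sum(coef * value))."""
    z = INTERCEPT + sum(coef * point[name] for name, coef in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(point, threshold=0.5):
    """Label a new data point as 'fraud' or 'ok' using the hard-coded model."""
    return "fraud" if fraud_probability(point) >= threshold else "ok"
```

Every time the model is retrained, someone has to update the constants by hand, which is the kind of repeated, error-prone work Owen is pointing to.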
One candidate for a standardised way to share these models is the Predictive Model Markup Language (PMML) – an XML-based language for representing data mining and statistical models.
PMML can represent not only the statistical techniques used to learn patterns from data, such as artificial neural networks and decision trees, but also pre-processing of raw input data and post-processing of the model output.
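As a rough illustration of what an XML-based model looks like to a consuming application, the sketch below reads a small linear regression from a PMML-style document and scores one data point. The field names and coefficients are made up, the fragment is simplified and not checked against the full PMML schema, and the `load_linear_model` and `predict` helpers are hypothetical rather than part of any PMML library. It handles only plain linear terms, not the pre-processing or post-processing steps PMML can also describe.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative PMML fragment for a linear regression model.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="items" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="items"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="amount" coefficient="0.5"/>
      <NumericPredictor name="items" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Namespace prefix mapping used in the element paths below.
NS = {"p": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Extract the intercept and per-field coefficients from the document."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        el.get("name"): float(el.get("coefficient"))
        for el in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, point):
    """Score one data point: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * point[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
prediction = predict(model, {"amount": 20.0, "items": 3.0})
```

Because the coefficients live in the document rather than in the application code, a retrained model can in principle be swapped in by replacing the file.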
A wide range of data mining tools can import or export models as PMML, and the standard itself is developed by the Data Mining Group, a vendor-led consortium whose members include IBM, MicroStrategy, SAS and SPSS.
Developing a standard way of representing and interacting with these models would be a “big deal” in the coming year, said Owen.
“You would think there would be a server for this and there really isn’t. SAS has an expensive proprietary tool that does that and there’s one open source package that kind of does it,” he said.
“If I’ve got a model, surely I should be able to load it up in something and then query it with standard APIs and client libraries? We need to standardise and have a suite of mature solutions to do this.”