Decent data scientists are reportedly so rare that they are sometimes referred to as unicorns.
However, firms struggling to source data science skills may soon be able to turn to machine learning.
While deep learning and automated processing are already used to identify who and what is in an image, or to pluck answers to questions from a piece of text, that same automation has not been applied to key elements of data science.
Now researchers at MIT have developed what they call the Data Science Machine – a system that can compete with some of the best data scientists in the world, and in a fraction of the time.
The Data Science Machine performs the intuition part of big data analytics, work which today is usually carried out by humans. This work involves deciding which variables in a dataset should be studied in order to make a prediction.
MIT gives the example of company data showing sales promotions and weekly profits. When predicting which promotion it would be profitable to repeat, it would normally fall to a human analyst to decide which data would be most useful to look at.
To test the Data Science Machine’s intuition, the team from MIT entered it in three data science competitions: the KDD Cup 2014, IJCAI, and the KDD Cup 2015. In these challenges the machine competed against human teams to find predictive patterns in unfamiliar datasets.
Of the 906 teams participating in the contests, the Data Science Machine finished ahead of 615. In two of the three challenges, the predictions made by the machine were 94 percent and 96 percent as accurate as the winning submissions. In the third competition this figure was 87 percent.
Where the machine decisively beat human competitors was in how rapidly it completed its work. While it typically took teams of people months to devise prediction algorithms, the Data Science Machine took between two and 12 hours to produce each of its entries.
“We view the Data Science Machine as a natural complement to human intelligence,” says Max Kanter, whose MIT master’s thesis in computer science is the basis of the Data Science Machine. “There’s so much data out there to be analyzed. And right now it’s just sitting there not doing anything. So maybe we can come up with a solution that will at least get us started on it, at least get us moving.”
How it works
The workings of the machine are based on what MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has learned from applying machine-learning techniques to practical problems in big-data analysis.
“What we observed from our experience solving a number of data science problems for industry is that one of the very critical steps is called feature engineering,” said Kanter’s thesis advisor, Kalyan Veeramachaneni, a research scientist at CSAIL. “The first thing you have to do is identify what variables to extract from the database or compose, and for that, you have to come up with a lot of ideas.”
The Data Science Machine first correlates data, for example linking data stored in different database tables via common numerical identifiers, such as product item numbers. The machine can then generate potential data features that may be useful for making a prediction about those items, such as total cost per order, average cost per order, minimum cost per order, and so on.
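The aggregation step described above can be illustrated with a minimal sketch. It is not the MIT team's implementation: the table, the column names (`order_id`, `cost`) and the helper `aggregate_features` are all hypothetical, chosen only to show how rows linked by a shared identifier might be turned into total, average and minimum features.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: one row per line item, linked to an order by order_id.
line_items = [
    {"order_id": 1, "cost": 10.0},
    {"order_id": 1, "cost": 30.0},
    {"order_id": 2, "cost": 5.0},
]

# Aggregations applied to each group of related rows (an assumed, illustrative set).
AGGREGATES = {"total": sum, "average": mean, "minimum": min}


def aggregate_features(rows, key, column):
    """Group rows by a shared identifier and compute one feature per aggregate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {
        k: {f"{name}_{column}": fn(values) for name, fn in AGGREGATES.items()}
        for k, values in groups.items()
    }


order_features = aggregate_features(line_items, "order_id", "cost")
print(order_features[1])
```

Each order ends up with one candidate feature per aggregate, which is why the number of generated features can grow quickly as more tables and aggregates are added.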
The machine also looks for categorical data, meaning data restricted to a limited range of values, such as days of the week or brand names. It then generates further data features by dividing up existing features across categories.
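As a rough illustration of dividing a feature across categories, the sketch below splits a per-order total by day of the week. The data, the column names and the helper `split_by_category` are assumptions made for the example, not details taken from the MIT system.

```python
from collections import defaultdict

# Hypothetical rows: each sale has a categorical column (day) and a numeric one (cost).
sales = [
    {"order_id": 1, "day": "Mon", "cost": 10.0},
    {"order_id": 1, "day": "Tue", "cost": 30.0},
    {"order_id": 1, "day": "Mon", "cost": 5.0},
    {"order_id": 2, "day": "Tue", "cost": 8.0},
]


def split_by_category(rows, key, category, column):
    """Sum a numeric column per identifier, with one feature per category value."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        feature_name = f"total_{column}_{row[category]}"
        totals[row[key]][feature_name] += row[column]
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {k: dict(v) for k, v in totals.items()}


per_day = split_by_category(sales, "order_id", "day", "cost")
print(per_day)
```

A single feature such as total cost becomes one feature per category value, for instance total cost on Mondays and total cost on Tuesdays.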
After generating these candidate features, the machine begins to whittle them down by identifying features whose values appear to be correlated. It then tests this reduced set of features on sample data, combining them in different ways to optimise the accuracy of the predictions they yield.
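The article does not say exactly how correlated features are handled, so the following is only one plausible reading: keep a feature unless it is strongly correlated with one already kept. The Pearson coefficient, the 0.95 threshold, the greedy ordering and the helper names are all assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column is treated as uncorrelated in this sketch
    return cov / (spread_x * spread_y)


def prune_correlated(features, threshold=0.95):
    """Greedily keep features that are not strongly correlated with any kept one."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate features, one value per order.
features = {
    "total_cost": [40.0, 5.0, 22.0, 13.0],
    "average_cost": [20.0, 2.5, 11.0, 6.5],  # exactly half of total_cost
    "item_count": [2.0, 2.0, 5.0, 1.0],
}

selected = prune_correlated(features)
print(selected)
```

Here `average_cost` carries the same information as `total_cost`, so it is dropped, while `item_count` is kept because it varies independently.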
Margo Seltzer, a professor of computer science at Harvard University who was not involved in the work, predicts that the group’s findings will have a wider impact on the field of data science.
“I think what they’ve done is going to become the standard quickly – very quickly.”