With big data analytics comes a new role for IT workers, the data scientist. But just what is a data scientist and how do you become one?
The data scientist role is designed to help organisations make sense of the large number of disparate data sources that analytics platforms like Hadoop can interrogate.
Data scientists help organisations work out which of the many different internal and external datasets they should link together and query to generate useful insights for their business.
Big data analytics is sometimes sold as a boon for IT workers, with analyst house Gartner predicting that within three years there will be 4.4 million staff working on big data projects. But data scientist isn’t a role just for computer science graduates; it is also suited to experts in other scientific and mathematical disciplines.
Kirk Dunn, chief operating officer at Hadoop specialist Cloudera, said the job requires someone who on the one hand understands large-scale machine learning algorithms and programming and on the other is a statistician.
Dunn said that Cloudera has been training up scientists from inside and outside the IT industry to become data scientists - teaching statistical experts the necessary computer science and computer scientists the necessary statistical skills: “You can’t hire this generation of data scientists, you have to build them.”
Academics who specialise in data analytics and research, such as econometricians or epidemiologists, are well suited to the job, he said.
What’s important for a data scientist is knowing how and why businesses should be looking to link up data, he said.
“The data scientist takes a higher order view of things: for example, if they’re at a retailer, correlating weather data with point of sale information and looking at their relationship to the supply chain.
“There are these differing types of data that aren’t normalised for the same use but a data scientist should be able to architect something that says ‘When this happens over here let’s look over here to see if there’s a result’.
“It’s understanding the relationships between data and how they interact with each other.”
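The retail example Dunn describes can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Cloudera or a retailer is known to use: the dates, temperatures and sales figures are invented, and the names `link_by_date` and `pearson` are hypothetical helpers. It links two datasets on a shared key (the date) and then measures how closely they move together.

```python
from math import sqrt

# Hypothetical daily figures, invented purely for illustration.
weather = {  # date -> maximum temperature in degrees Celsius
    "2013-06-01": 18.0,
    "2013-06-02": 21.0,
    "2013-06-03": 25.0,
    "2013-06-04": 28.0,
    "2013-06-05": 30.0,  # no matching sales record, so it is dropped
}
sales = {  # date -> units of ice cream sold at the tills
    "2013-06-01": 120,
    "2013-06-02": 150,
    "2013-06-03": 210,
    "2013-06-04": 260,
    "2013-06-06": 90,  # no matching weather reading, so it is dropped
}


def link_by_date(left, right):
    """Return (left_value, right_value) pairs for dates present in both."""
    shared = sorted(left.keys() & right.keys())
    return [(left[d], right[d]) for d in shared]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two linked records")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


pairs = link_by_date(weather, sales)
r = pearson(pairs)
print(f"{len(pairs)} linked days, correlation = {r:.2f}")
```

A coefficient close to 1 or -1 would suggest the two datasets are worth investigating together, for instance against supply chain records. It would not show that one causes the other, which is where the scientist's habit of questioning validity comes in.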
Courses for aspiring data scientists are growing in number. Cloudera runs its own introduction to data science course, as do Columbia University, the University of Washington, UC Berkeley and storage giant EMC. As well as full, paid-for courses costing more than $1,000, there are also a variety of online and DVD courses offered by the likes of EMC and Cloudera.
Brendan Moran, data scientist at EMC, said that software developers who want to become data scientists need to be willing to reappraise how they approach problem solving.
“It is about the mindset and the difference between being an engineer and a scientist. A coder (engineer) will take a problem, and then start building the solution. A scientist will start questioning if it is possible, and if it is valid. Developers will therefore need to move away from the defined solution mindset,” he said.
While CIOs may be holding off investing in big data, Cloudera’s Dunn said the scarcity of data scientists means that companies are willing to pay well for qualified candidates, and that demand will only continue to grow.
“It’s a very rare and scarce commodity. As scarce as it is that makes it precious,” he said.
Not just data scientists
Big data analytics requires more than data science skills and Cloudera has also trained 15,000 Hadoop developers and administrators.
While queries can be written for Hadoop in SQL, taking full advantage of the system does require existing database administrators to update their skills to take in the breadth of uses Hadoop can be put to.
“Hadoop has 15 open source projects - Pig, Hive, ZooKeeper etc,” said Dunn.
“Each of those has their own particular capability and use. To be certified on Hadoop is to understand all those things and have some proficiency in all or some of them.”
Cloudera offers a range of certification and training programmes for Hadoop administration and development - none longer than one week.
Prices vary but, as an indication, training as a Hadoop-certified administrator costs in the region of $3,000.
Certain courses have prerequisites, such as familiarity with SQL or database concepts, but there are no entrance exams yet.