Mention big-data analytics to Bob Harris, CTO at UK broadcaster Channel 4, and he immediately produces a list of pet hates. Top of the bugbears are IT people moaning about the technology’s inherent difficulties.
“All the time I come across people who tell me why they cannot do things. I don’t know about you but my job is to do things, not can’t do things. In reality, people hide behind the complexity,” Harris said.
“If you use the communities, you can meet people who are doing the same stuff. It’s just about finding out how people are overcoming problems and what people are using the technology for,” he said.
Vast quantities of data
According to Harris, rapidly collecting and analysing vast quantities of data is at the heart of C4’s drive to improve the experience of viewers and differentiate the channel from rivals.
“It starts in the R&D strand within Channel 4 and I’m always playing with the next thing. For me right now that tends to be that real-time Storm stuff and things like that,” he said.
Business intelligence has been well established at C4 for years, Harris said, with industry-standard proprietary models and real-time data warehousing. But now Hadoop and Amazon’s Elastic MapReduce are the organisation’s primary big-data platform and Harris is also experimenting with the R statistical analytics language and Mahout for machine learning.
At the recent Whitehall Media Big Data Analytics conference in London, Harris set out his take on a list of preconceptions that bedevil the technology:
1. Relational databases can do big data
“I meet people who tell me this is nothing that can’t be done on RDBMS. If you believe that fundamentally, quit now. This cannot be done on an RDBMS and I’ve been working with those since they started. If you think you can do it with last-generation technology, you’re probably not doing big data.”
2. Big-data analytics is a completely different approach
“When I started in IT, it was called data processing and we did everything in batch. We crunched the data, we printed it out and we did it again. You look at the way Hadoop works, it takes that big dataset, breaks it into little pieces, rips through them sequentially, puts them into a shuffled sort and then whacks them through the reducer and out come the results. It is actually a batch pipeline. People like me started there originally.”
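The batch pipeline Harris describes (split the dataset, map each piece, shuffle and sort, then reduce) can be sketched in a few lines of plain Python. This is a toy, in-memory illustration and not the Hadoop API: the function names (`split`, `map_chunk`, `shuffle_sort`, `reduce_group`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import islice


def split(records, chunk_size):
    """Break the input into small chunks, as a batch framework would."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    for line in chunk:
        for word in line.split():
            yield word.lower(), 1


def shuffle_sort(pairs):
    """Shuffle/sort step: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a total."""
    return key, sum(values)


def run_job(records, chunk_size=2):
    """Run the whole pipeline sequentially over the chunks."""
    mapped = (
        pair
        for chunk in split(records, chunk_size)
        for pair in map_chunk(chunk)
    )
    return [reduce_group(key, values) for key, values in shuffle_sort(mapped)]


if __name__ == "__main__":
    print(run_job(["a b a", "b c", "a"]))
```

A real cluster runs the map and reduce steps in parallel across many machines, but the shape of the job is the same batch pipeline.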
3. Open source is the only option
“No, it isn’t but I am amused by how many products from companies that are more associated with proprietary products are actually using open source in there somewhere. I’m a cloud man, I’m an open-source man but for me open source is largely the future.”
4. It’s really difficult
“Well, it has got a steep learning curve, that’s certainly true. To sell it to our own teams, I spent a very long weekend hacking Python code in MapReduce just to demonstrate that I could rip through a few million lines of data very quickly. When I was confident enough to think, ‘I can write this stuff’, that’s when you go and find the people in your teams who really want to move forward with this technology.”
5. Big data is immature and lacks tools
“That’s true. In reality Hadoop, which we’re pretty much all hanging our futures on, went 1.0.0 in 2011. So if you’ve got a policy that says you do nothing before the 3.0 version, you’re in trouble. Hive is 0.11, Pig is 0.11.1, so most of this stuff hasn’t even got to 1.0 yet. It is immature.”
6. It’s totally incompatible with your BI platform and tools
“Most importantly, this is not incompatible with what you’ve already got. When you crunch through 20 billion rows of data and get 10 million rows of results out the end, what is the best place to put that? It’s in an RDBMS. You put it back into an RDBMS, you put it back into your current reporting system and you use your sunk investment in your current reporting to make use of that. Think of this as ETL on steroids.”
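The step of putting reduced results back into an RDBMS might look something like the sketch below. It uses Python’s built-in `sqlite3` module purely as a stand-in for whatever relational database sits behind an existing reporting system; the table name `programme_views`, its columns and the helper functions are hypothetical, not anything Channel 4 has described.

```python
import sqlite3


def load_results(conn, rows):
    """Load (programme, views) result rows into a reporting table.

    The table and column names are hypothetical, for illustration only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS programme_views ("
        "programme TEXT PRIMARY KEY, views INTEGER NOT NULL)"
    )
    # Re-running the load overwrites earlier totals for the same programme.
    conn.executemany(
        "INSERT OR REPLACE INTO programme_views (programme, views) "
        "VALUES (?, ?)",
        rows,
    )
    conn.commit()


def top_programmes(conn, limit):
    """Query the loaded results the way a reporting tool might."""
    cur = conn.execute(
        "SELECT programme, views FROM programme_views "
        "ORDER BY views DESC, programme ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_results(conn, [("news", 120), ("drama", 300), ("sport", 200)])
    print(top_programmes(conn, 2))
```

Once the aggregated rows are in a table like this, existing SQL-based reports can query them without knowing anything about the batch job that produced them.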
7. It’s difficult to find skilled and experienced staff
“Yes it is. So go hang out where they hang out. Go to the Meetups, go to the user groups. Dress down a bit, put your baseball cap on, go mix with them – it’s a lot of fun. I was at the Storm user group Meetup and we had Nathan Marz, the author of it, on Skype from the US. You get a chance to say to the guy, ‘How are we meant to be doing this?’, ‘Did you think about that?’ And it’s brilliant depth you can get into. With the best will in the world, you can’t do that with proprietary products.”