If using DevOps methods makes projects more agile, and working with big data requires agility, then marrying the two can ensure that big data initiatives meet the needs of the business before the application becomes obsolete. Thanks to the open source movement, developers have more tools at their disposal to iterate analytics applications quickly and deliver value to business users.
Past methods of building and deploying big data applications may have involved weeks or months of selecting and installing Hadoop distribution software and big data analytics tools. In the meantime, data scientists didn’t have access to the Hadoop clusters, or if they needed another cluster, they would have to go through the IT department. It’s a time-consuming process.
However, DevOps is the catalyst for helping data scientists do their jobs, according to Bob Brodie, chief technology officer of SUMO Heavy. In his experience, data scientists will ask for an instance to run queries or conduct analysis, and the DevOps and IT teams are tasked with getting it set up.
Open source tools like Chef and Puppet make it easier to provision these instances, Brodie said. In addition, AWS CloudFormation can be used to define and set up an environment in AWS using infrastructure as code. Ansible is another good tool for provisioning, he added.
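To make "infrastructure as code" concrete, here is a minimal sketch in Python that builds a CloudFormation-style template for a single analysis instance and prints it as JSON. It is an illustration only, not Brodie's setup: the resource name AnalysisInstance, the default instance type, and the tag are assumptions made for the example, and the machine image is left as a parameter rather than guessed.

```python
import json


def build_template(instance_type="t3.medium"):
    """Return a CloudFormation-style template describing one EC2 instance.

    The resource name, tag, and default instance type are illustrative
    assumptions, not values taken from the article.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Analysis instance for a data science team (illustrative)",
        "Parameters": {
            # The AMI is supplied at deploy time so no image ID is hard-coded.
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
            "InstanceType": {"Type": "String", "Default": instance_type},
        },
        "Resources": {
            "AnalysisInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": {"Ref": "InstanceType"},
                    "Tags": [{"Key": "team", "Value": "data-science"}],
                },
            }
        },
        "Outputs": {
            # Expose the instance ID so other tooling can look it up.
            "InstanceId": {"Value": {"Ref": "AnalysisInstance"}},
        },
    }


def render(template):
    """Serialize the template to JSON text that can be kept in version control."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_template()))
```

Because the environment is described in a file rather than set up by hand, a request for another instance becomes a reviewed change to that file instead of a ticket to the IT department.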
Build tools are also critical to ensure that instances run correctly. Brodie cited Jenkins, CircleCI, and Travis CI as three examples. These build tools allow developers to pull code out of a repository, such as Mercurial, run commands against it, and produce artifacts. “So if you need to deploy an environment, you could use Chef, install MySQL on it, then use Jenkins or CircleCI to pull code in and deploy to the environment,” he said. Because the process is automated, no one has to push buttons.
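The sequence Brodie describes can be pictured as an ordered list of steps that runs unattended and stops at the first failure. The Python sketch below is only a model of that idea under stated assumptions: the Step class, the run_pipeline function, and the step names are hypothetical, and each action is a placeholder where a real setup would invoke Chef, Jenkins, or a similar tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One stage of a hypothetical deployment pipeline."""
    name: str
    action: Callable[[], bool]  # returns True when the stage succeeds


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for step in steps:
        if not step.action():
            break  # later stages must not run on a broken environment
        completed.append(step.name)
    return completed


def example_steps() -> List[Step]:
    # Placeholder actions: each lambda stands in for a call to a real tool.
    return [
        Step("provision environment", lambda: True),
        Step("install MySQL", lambda: True),
        Step("pull code from repository", lambda: True),
        Step("build artifacts", lambda: True),
        Step("deploy to environment", lambda: True),
    ]


if __name__ == "__main__":
    for name in run_pipeline(example_steps()):
        print("done:", name)
```

Stopping at the first failed step is the design choice that matters here: code is never deployed to an environment that failed to provision.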
Automation reduces time to deployment
“As part of the DevOps process, automation is key. Being able to automate everything you can is really important,” said Craig Warsaw, principal solution architect at Booz Allen. One of the main tools he uses is Jenkins, which provides the automation framework.
Testing is also automated, making it easier to check code before it’s deployed to end users. Among the open source testing tools Warsaw uses are Selenium WebDriver, which makes direct calls to a browser using the browser’s native support for automation, and JMeter, an Apache project that can be used for load testing to analyze and measure performance.
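For readers unfamiliar with load testing, the Python sketch below shows the basic idea in miniature: call a target many times from several threads, record how long each call takes, and summarize the results. It is not JMeter and does not use JMeter's API. The function names are hypothetical, and fake_request is a stand-in that sleeps briefly instead of contacting a real service.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(target):
    """Call target once; return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        target()
        ok = True
    except Exception:
        ok = False  # count the failure instead of aborting the whole run
    return ok, time.perf_counter() - start


def load_test(target, total_calls=50, concurrency=5):
    """Invoke target total_calls times across a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(target), range(total_calls)))
    latencies = [elapsed for _, elapsed in results]
    return {
        "calls": total_calls,
        "failures": sum(1 for ok, _ in results if not ok),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def fake_request():
    """Stand-in for a real request; sleeps for about a millisecond."""
    time.sleep(0.001)


if __name__ == "__main__":
    print(load_test(fake_request))
```

In an automated pipeline, a report like this could be compared against a latency or failure threshold so that a slow build is rejected before it reaches end users.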
Digging into big data and analytics-specific tools
The open source community has also been hard at work on tools specific to analytics and big data. After the foundation is built using open source tools, big data scientists have the option of continuing to use open source to drill into data and produce reports.
DataTorrent, which was created by the team behind Apache Apex, is one such tool. It allows for real-time streaming data processing and batch processing. It’s built on Hadoop and was named as one of the seven leaders in big data streaming analytics solutions in The Forrester Wave: Big Data Streaming Analytics, Q1 2016. According to the report, DataTorrent provides a visual development tool and a library of more than 400 operators in addition to streaming analytics.
Genie is another open source big data tool, this one created by Netflix and also running on Hadoop. While Netflix originally created Genie for its own use, according to the Netflix blog it was released as open source to get community feedback. In his blog, Forrester Research analyst Brian Hopkins said Genie is a critical part of the middleware Netflix has developed to create analytic workloads on demand. Netflix also released Lipstick as open source, which allows for graphical representation of Hadoop Pig jobs.
Meanwhile, LinkedIn developed Pinot, a distributed OLAP datastore with real-time scalable analytics, as well as Samza, which relies on the Kafka messaging system and Hadoop YARN to conduct distributed stream processing. Pivotal, which was spun off from EMC and VMware, open sourced GemFire, Greenplum Database, and HAWQ, the parts of its data suite that help build data-driven applications, create analytics-enabled databases, and use SQL analytics in Hadoop.
As evidenced by the plethora of open source tools released by large companies, big data and analytics aren’t limited to strictly enterprise-created tools. The open source community is a vast resource for both building environments and creating data lakes, databases, and applications, and these are just a few of the ways DevOps teams and data scientists can approach it.