Is Extreme Programming (aka XP) a good match for data science and big data analytics?

It seems like a leading question, and if you know me well, you might be right. I absolutely believe the values and practices provided by the Extreme Programming movement of the early 2000s deliver the best opportunity for data science teams to build high-quality analytic solutions with big data. But it’s not easy.


My journey with Extreme Programming and agile concepts started in 2004. I was consulting at Hitachi Data Systems, and we had just completed a grueling three-month build of a custom reporting system that tracked the schedule and results of its interoperability testing. We were up against a hard deadline, and we needed to get a lot of moving parts working together (data warehouse, transactional database, web interface, transformation mappings, etc.). We delivered on time, but we nearly killed ourselves in the process: long, frustrating days of continuous development and testing. We knew another build and release was coming next, and there was no way we would go through that experience again. One of our teammates suggested Extreme Programming, and that's the direction we went. Looking back, that was one of the most significant inflection points of my life.

Over the years, I’ve developed a passion for agile database development. And when data science hit the scene, things got even more interesting.

With all its benefits, Extreme Programming has never been easy to implement, and when you throw persistent data into the mix, it gets even harder. When you add big data, sophisticated algorithms, machine learning, and data visualization, it becomes a formidable feat.

All that said, Extreme Programming practices can work on a data science team, but you must be vigilant about doing them right.


Why bother with Extreme Programming?

But why even bother with Extreme Programming? Wouldn’t it be much easier to just set up a traditional software development lifecycle with a structured and phased movement through requirements, design, build, test, and deployment?

The answer is yes: it would be easier to execute this way. However, that was the thinking back in the latter half of the twentieth century, and there are compelling reasons why Extreme Programming and the other agile methodologies were born. The bottom line is: If you follow a more traditional development methodology, you'll likely miss the optimal solution from the end user's perspective. This will manifest in a number of ways: misalignment on business requirements, compromises on features, and production-level bugs, just to name a few. So, although the traditional approach is easier for the data science team to practice, an agile approach will ultimately deliver a better solution to the end user, and that's what it's all about, isn't it?

In addition, there's a lot to be said for the experience everyone goes through when Extreme Programming is done right. If you properly embrace the philosophy, it's a better experience for everyone, including the data science team. No more punishing long hours trying to hit an unrealistic deadline. No more frustration trying to find an end user to clarify a requirement. No more unpleasant surprises during system and user testing. And let's not forget about management: Extreme Programming makes your progress extremely transparent, so you know exactly what you have at any point in time. And everything is time-boxed with flexible scope, which is much easier to manage than forcing a time/cost/scope equation to meet stakeholder expectations.

Perhaps the biggest benefit of Extreme Programming is the most obvious one: flexibility. In this fast-moving world, it’s difficult for an end user to predict what they’ll need in six months to a year (or more), which is the overall development cycle for a typical analytic solution. With a traditional approach, the end user must lock in their requirements far ahead of the actual deployment. When you compare that with Extreme Programming, in which the suggested iteration length is two weeks, that’s a lot of flexibility to move wherever you need to go at any given point throughout the development cycle.

How to do Extreme Programming the right way

If you like the case for Extreme Programming, you'll need to do it right, and that means embracing all the values and following all the rules. The most common mistake data science teams make when experimenting with Extreme Programming is practicing the rules selectively. It's very easy to convince yourself that certain rules don't apply. For instance, you might decide that pair programming isn't necessary, or that iterations can be a little flexible on time if you're close to finishing a major feature set. Don't do this.

The Extreme Programming environment is a delicate ecosystem. It doesn't take long for everything to get completely out of balance when you start tweaking the rules. This is especially true for a data science team, which has so much more to integrate than a web developer does (data, algorithms, transformations, visualizations, etc.).

Here are the core practices of Extreme Programming, along with quick tips for data science teams.

  • Test Driven Development: This is difficult with big data. Use small test data sets, and use database views as an interface between production and test tables.
  • Planning Game: Have fun with it. Treat it like a game, but make sure to follow the rules.
  • Whole Team: Physically locate the end users with the data science team or the other way around.
  • Pair Programming: Don't worry about how it looks; it works.
  • Continuous Integration: Create an automated test harness. Drop and rebuild the solution with every test.
  • Design Improvement (Refactor): Freeze test scripts when refactoring. You’re not changing the functionality. You should have at least one refactoring integration every week.
  • Small Releases: Set up 13-week releases and two-week iterations. Use the first week for release planning and the first two hours of every iteration for iteration planning. Build and test new code every day.
  • Simple Design: Never think ahead because you don’t know what tomorrow will bring. Just build for today.
  • System Metaphor: Be creative and have fun. Tap into the creativity of your data visualization experts.
  • Collective Code Ownership: Make sure to move people around the code so they don’t get attached to one function or module.
  • Coding Standard or Convention: Create a set of naming conventions as an early team building exercise. Make sure to cover all components of the solution, including algorithms and data visualizations.
  • Sustainable Pace: A 30-hour week is reasonable. Schedule three two-hour blocks per day when everyone (including the end users) is working together.
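The Test Driven Development tip above deserves a concrete illustration. Here is a minimal sketch in Python with SQLite showing the idea of a view as the interface between production and test tables: the transformation code reads only through the view, so the same code path can be exercised against a small, hand-checked fixture instead of the full production data set. All table, view, and function names here (`sales_prod`, `vw_sales`, `total_by_region`) are hypothetical, invented for this example.

```python
import sqlite3

# An in-memory database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_prod (region TEXT, amount REAL);
    CREATE TABLE sales_test (region TEXT, amount REAL);

    -- The view is the interface: all queries read from vw_sales.
    -- Pointing it at the small test table exercises exactly the
    -- code path that production will use; re-creating it against
    -- sales_prod later requires no change to the code under test.
    CREATE VIEW vw_sales AS SELECT region, amount FROM sales_test;
""")

# A small, hand-checked fixture instead of the full big-data set.
conn.executemany(
    "INSERT INTO sales_test VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 25.0)],
)

def total_by_region(conn, region):
    """The transformation under test; it reads only through the view."""
    row = conn.execute(
        "SELECT SUM(amount) FROM vw_sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]

# Test-first assertions against answers known from the fixture.
assert total_by_region(conn, "east") == 150.0
assert total_by_region(conn, "west") == 25.0
```

In a real project the assertions would live in an automated harness (the Continuous Integration bullet above), with the schema dropped and rebuilt on every run.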


Extreme Programming can be an asset or a liability on a data science team. Although there are many benefits to using Extreme Programming, it’s very difficult for a data science team to implement properly.

Integrating extremely large data sets with sophisticated analytic techniques is difficult under any conditions, so the thought of pushing the envelope with Extreme Programming can be intimidating. Don't let that dissuade you. Find a way to follow all the rules, without compromise. With enough practice of the right techniques, you can have an extreme data science team.