As data science becomes critical to every organization, it has become just as important to determine the right tools to help master it. The two most popular languages for tackling data science problems are Python and R. Both programming languages are open source with big communities. But Python and R also bring their own unique strengths to data science, making it harder to decide which to use.
R vs. Python: The main differences
R is an open-source, interactive environment for doing statistical analysis. It’s not really a programming language at all, but it includes a programming language to help with analysis.
As outlined on the R project’s site, “R is an integrated suite of software facilities for data manipulation, calculation and graphical display [which] includes … a large, coherent, integrated collection of intermediate tools for data analysis … .” While not the first such tool, R was early to data science and has been a staple of academia for some time.
Python, by contrast, is an open-source, “interpreted, object-oriented, high-level programming language with dynamic semantics,” according to the project’s website. This doesn’t really do it justice, however. Python is an easy-to-learn, general-purpose language that is often the first language a developer will learn, as it has long been a teaching language.
“It’s easy to use, easy to pick up, kids use it, non-programmers pick it up in a weekend,” Anaconda CEO Peter Wang once related. “This is not accidental [but rather] has been a hardcore part of the design from the very beginning and quite intentional.”
As a close corollary, Python has also always been great as a glue language. As RedMonk analyst Rachel Stephens has stressed, “In that sense, it makes a lot of sense for enterprises to invest in Python as a way of investing in their established code.” Python, in other words, helps enterprises make legacy code part of their more recent aspirations to do data science.
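To make the "glue language" idea concrete, here is a minimal sketch using only Python's standard library. Everything in it is an assumption for illustration: `LEGACY_COMMAND` is a hypothetical stand-in for an existing program (simulated here by a child Python process that prints CSV), and the `region` and `sales` columns are invented sample data. The point is only that a few lines of Python can run an existing tool, capture its output and feed it into an analysis step.

```python
import csv
import io
import statistics
import subprocess
import sys

# Hypothetical stand-in for a legacy program: a child process that prints CSV.
# In a real setting this would be the command line of an existing executable.
LEGACY_COMMAND = [
    sys.executable,
    "-c",
    "print('region,sales'); print('north,120'); print('south,95'); print('west,143')",
]


def run_legacy_report(command):
    """Run an external program and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def summarize_sales(csv_text):
    """Parse CSV text and compute a few simple summary statistics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    sales = [float(row["sales"]) for row in rows]
    top_row = max(rows, key=lambda row: float(row["sales"]))
    return {
        "count": len(sales),
        "mean": statistics.mean(sales),
        "max_region": top_row["region"],
    }


# Glue step: legacy output goes in, a Python data structure comes out.
summary = summarize_sales(run_legacy_report(LEGACY_COMMAND))
print(summary)
```

In practice, the parsed rows would more likely be handed to a data science library than to the `statistics` module, but the wrapping pattern would look much the same.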
This is perhaps where Python’s primary benefit for data science stands out: Everyone knows it.
“Python is the second best language for everything,” said Van Lindberg, general counsel for the Python Software Foundation. “R may be the best for stats, but Python is the second … and the second best for ML, web services, shell tools, and (insert use case here).”
Lindberg might be understating Python’s strength in some areas; it’s clearly not always second best, but his point is directionally correct: “If you want to do more than just stats, then Python’s breadth is an overwhelming win.”
In other words, Python is good enough that developers and others choose to use it for a wide array of use cases. Python, like Java, is a general-purpose programming language; however, unlike Java, it’s pretty easy to learn and to use. As such, it gets used for all sorts of things, leading to “explosive growth,” as Wang once described it. Small wonder, then, that if we analyze the relative growth and decline between Python and R in data scientist job postings, from 2019 through 2021, as Terence Shin has, then it’s clear that Python is gaining at R’s expense.
R vs. Python: Which is better for data science?
Though Python has proved more popular than R, that doesn’t mean it’s always better. As with most things in technology, it depends on what you’re hoping to accomplish. Python has a lower barrier to learning and becoming productive, and R’s non-standard approach can be cumbersome to learn, but for some tasks it pays to invest in learning R. And, of course, for some things, like data mining and basic data visualization, you’re probably fine choosing either.
What you choose, however, should flow from the problem you’re trying to tackle as well as the long-term investments you and your company plan to make.
For example, R is a better fit for statistical calculation and data visualization because R is purpose-built by statisticians for statistical and numerical analysis of large datasets. You don’t need to write much code in R to drive deep statistical analysis and data visualization.
It’s also the case that, for some areas like life sciences, the R packages might be particularly well-developed, making R a good choice. Much depends on what you’re building and your background. As Align BI partner Ryan Hobson said in an interview, “I think R is an easier language for statisticians who might not have a programming background.”
But it’s precisely that “programming background” that makes Python the clear winner for developers or others interested in big data, artificial intelligence (AI) and deep learning algorithms.
“Python had a broader scope [than R] from the beginning [with engineering and science] DNA baked into the Python core,” said Wang. It’s objectively true that Python is dramatically more popular, across a much wider array of use cases, than R, and becomes more so every day.
Then, there’s the reality that the very nature of data science is changing.
“There has also been an expansion beyond what was traditionally purely a data science team; for example, at Netflix, we have the role of Algorithms Product Manager,” noted Christine Doig, director of innovation for personalized experiences at Netflix. “There’s more integration with the design team, with creative teams.”
That expansion of data science specialization argues for a wider variety of people helping with the data science workload, which in turn favors a language like Python that is more broadly used.
Hence, there’s a very real question as to whether it’s worth investing in R to solve a relatively narrow set of use cases versus Python, which allows an organization to meet a broad array of use cases. The answer might be yes, but you need to consider it carefully.
Or perhaps you just need to wait. After all, the R and Python communities are both actively improving their respective capabilities, adding packages and libraries to deepen and extend their utility. In this area, however, the advantage goes to Python, both because of the relative size of its community and because of its glue code pedigree.
According to Wang, it’s very possible that rather than replace R for some use cases, “maybe someone will build a nice Python wrapper to expose a thin shim to expose some R capabilities.” In other words, it’s not hard to imagine Python embracing those native elements of R, so developers and data scientists don’t have to choose.
Both R and Python serve their respective constituencies well. Yes, the Python community is much bigger and is more likely to pull R packages into the Python ecosystem than the reverse, but which you’ll use may ultimately be a question of and, not or.
Disclosure: I work for MongoDB, but the views expressed herein are mine.