R vs. Python: Which is a better programming language for data science?

The Python vs. R debate rages on in the data scientist community. Here's how the two coding languages match up.


Python vs. R is a common debate among data scientists: both languages are useful for data work, and both are among the skills most frequently mentioned in job postings for data science positions. Each language has its own advantages and disadvantages, so the right choice depends on the work you are doing.

To help data scientists select the right language, Norm Matloff, a professor of computer science at the University of California, Davis, wrote a GitHub post aiming to shed some light on the debate.


Matloff compared R and Python across the following 10 domains to determine which programming language was the better choice:

Elegance

  • Winner: Python

While this is subjective, Python greatly reduces the use of parentheses and braces when coding, making it sleeker, Matloff wrote in the post.

Learning curve

  • Winner: R

Data scientists working with Python must learn a lot of material to get started, including NumPy, pandas, and matplotlib, whereas matrix types and basic graphics are already built into base R, Matloff wrote.

With R, "the novice can be doing simple data analyses within minutes," he added. "Python libraries can be tricky to configure, even for the systems-savvy, while most R packages run right out of the box."

Available libraries

  • Winner: Tie

The Python Package Index (PyPI) has over 183,000 packages, while the Comprehensive R Archive Network (CRAN) has over 12,000. However, PyPI is rather thin on data science, Matloff wrote.

"For example, I once needed code to do fast calculation of nearest-neighbors of a given data point. (NOT code using that to do classification.)" Matloff wrote. "I was able to immediately find not one but two [R] packages to do this. By contrast, just now I tried to find nearest-neighbor code for Python and at least with my cursory search, came up empty-handed; there was just one implementation that described itself as simple and straightforward, nothing fast."

When you search the following terms on PyPI, nothing comes up, Matloff added: log-linear model; Poisson regression; instrumental variables; spatial data; familywise error rate.
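For readers unfamiliar with the task Matloff describes, the following is a minimal sketch of a nearest-neighbor lookup (finding the stored points closest to a query point, with no classification step). It uses only the Python standard library and a brute-force scan, so it illustrates the problem rather than the fast implementation he was looking for; the function name nearest_neighbors and the sample points are made up for this example.

```python
import math


def nearest_neighbors(points, query, k=1):
    """Return the k points closest to query by Euclidean distance.

    Brute force: every stored point is compared against the query,
    so each lookup costs time proportional to the number of points.
    """
    return sorted(points, key=lambda p: math.dist(p, query))[:k]


# Hypothetical sample data, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]

print(nearest_neighbors(points, (1.2, 0.9), k=2))
# [(1.0, 1.0), (2.0, 2.0)]
```

A fast version would typically replace the full scan with a spatial index such as a k-d tree, which is the kind of package Matloff says he was searching for.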


Machine learning

  • Winner: Python (but not by much)

Python's massive growth in recent years is partially fueled by the rise of machine learning and artificial intelligence (AI). While Python offers a number of finely tuned libraries for image recognition, such as AlexNet, R versions could easily be developed as well, Matloff wrote.

"The Python libraries' power comes from setting certain image-smoothing ops, which easily could be implemented in R's Keras wrapper, and for that matter, a pure-R version of TensorFlow could be developed," Matloff wrote. "Meanwhile, I would claim that R's package availability for random forests and gradient boosting are outstanding."

Statistical correctness

  • Winner: R (by far)

Professionals working in machine learning who advocate for Python sometimes have a poor understanding of the statistical issues involved, Matloff wrote. R, on the other hand, was written by statisticians, for statisticians, he added.

Parallel computation

  • Winner: Tie

The base versions of R and Python do not have strong support for multicore computation, Matloff wrote. Python's multiprocessing package is not a good workaround, and neither is R's parallel package, he added.

"External libraries supporting cluster computation are OK in both languages," Matloff wrote. "Currently Python has better interfaces to GPUs."

C/C++ interface

  • Winner: R (but not by much)

R's Rcpp is a powerful tool for interfacing R to C/C++, Matloff wrote. While Python has tools like SWIG for doing the same, they are not as powerful, and the Pybind11 package is still being developed. R's new ALTREP idea also has potential for enhancing performance and usability, Matloff wrote; however, the Cython and PyPy variants of Python can sometimes remove the need for an explicit C/C++ interface at all, he added.

Object orientation, metaprogramming

  • Winner: R (but not by much)

Though functions are objects in both R and Python, R takes that more seriously, Matloff wrote.

"Whenever I work in Python, I'm annoyed by the fact that I cannot print a function to the terminal, which I do a lot in R," he wrote. "Python has just one OOP paradigm. In R, you have your choice of several, though some may debate that this is a good thing. Given R's magic metaprogramming features (code that produces code), computer scientists ought to be drooling over R."
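To illustrate the complaint about printing functions: in Python, printing a function object shows only its name and a memory address, not its body. The short sketch below shows that behavior, and also shows that the standard library's inspect.getsource can recover the source text when the defining file is available. The helper show_source and the example function square are hypothetical names introduced here, not something from Matloff's post.

```python
import inspect


def square(x):
    """Return x squared."""
    return x * x


def show_source(func):
    """Return the source text of func, or a fallback note if it is unavailable."""
    try:
        return inspect.getsource(func)
    except (OSError, TypeError):
        # Raised for built-ins, or when the defining source cannot be located.
        return f"<source unavailable for {func.__name__}>"


# Printing the function object shows only a name and a memory address.
print(square)

# Retrieving the body takes an explicit call.
print(show_source(square))
```

In R, by contrast, typing a function's name at the prompt prints its definition directly, which is the workflow Matloff says he relies on.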

Language unity

  • Winner: Python (by far)

While Python is transitioning from version 2.7 to 3.x, this will not cause very much disruption, Matloff wrote. R, however, is splitting into two dialects because of the influence of RStudio: ordinary R and the Tidyverse.

"It might be more acceptable if the Tidyverse were superior to ordinary R, but in my opinion it is not," Matloff wrote. "It makes things more difficult for beginners."

Linked data structures

  • Winner: Python (likely)

"Classical computer science data structures, e.g. binary trees, are easy to implement in Python," Matloff wrote. "While this can be done in R using its 'list' class, I'd guess that it is slow."

When it comes to job postings, there is significantly less demand for data engineers proficient in R than for those proficient in Python, according to a 2018 Cloud Academy report. Nearly 66% of data engineer job postings mentioned Python, compared to just 18% that mentioned R.

Outside of R and Python, other in-demand skills for data engineers include SQL, Spark, Hadoop, Java, Amazon Web Services (AWS), Scala, and Kafka, according to Cloud Academy.
