Commentary: The data science technology landscape is changing, but not always as fast as we might think. Also, it's time to master Python.
While the demand for data science skills keeps rising, the nature of that demand has remained roughly constant, according to a Jeff Hale analysis. Given how fast technologies in the data science space seem to rise and fall (remember Hadoop?), even over the course of a year we might expect to see more variance in technology preferences. Instead we find a (somewhat) remarkable stasis, one that continues to remind us: It's never a bad time to learn Python.
It's Python's world...
Starting on October 10, 2018, Hale pulled data science-related job listings from LinkedIn, Indeed, SimplyHired, Monster, and AngelList. In 2019, because LinkedIn data had become difficult to scrape, Hale dropped that source. If you look at the most popular data science technologies listed in job postings and resumes, and compare 2018 to 2019, it's remarkable just how much has not changed. Python was and remains the dominant programming language for data science, while R has slipped in popularity over the past year.
And yet there are changes between 2018 and 2019. For example, PyTorch is exploding in popularity, while more traditional, proprietary tools like SAS and Matlab continue to decline.
If you're not familiar with PyTorch, you soon will be. TensorFlow, developed by Google, is often top of mind for those looking at data science frameworks, but PyTorch, developed by Facebook, is popular for much the same reason that non-relational databases like MongoDB have grown in popularity: Flexibility.
The most important difference between the two is the way these frameworks define the computational graphs. While TensorFlow creates a static graph, PyTorch believes in a dynamic graph. So what does this mean? In TensorFlow, you first have to define the entire computation graph of the model and then run your ML model. But in PyTorch, you can define/manipulate your graph on-the-go. This is particularly helpful while using variable length inputs in RNNs.
Additionally, PyTorch shares Python's reputation for being approachable: it is easier to learn than TensorFlow, and "building ML models feels more intuitive," according to Jain. Still, as a relatively new ML framework, PyTorch lags TensorFlow in terms of community and other resources.
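To make the static-versus-dynamic distinction a little more concrete, here is a toy sketch in plain Python. It uses neither TensorFlow nor PyTorch, and the function names (`build_static_unrolled`, `run_dynamic`) and the simple recurrence are hypothetical, chosen only for illustration. It loosely mimics the idea that a graph fixed up front expects the shape it was built for, while a define-by-run loop simply follows whatever input it receives.

```python
# Toy illustration only: these helpers are hypothetical and are NOT the
# TensorFlow or PyTorch APIs. They mimic "define then run" versus
# "define by run" with a trivial recurrence: state = 0.5 * state + x.


def build_static_unrolled(steps):
    """Define-then-run: the number of recurrent steps is fixed up front."""

    def run(sequence):
        # The "graph" was unrolled for exactly `steps` inputs.
        if len(sequence) != steps:
            raise ValueError(
                f"graph was built for {steps} steps, got {len(sequence)}"
            )
        state = 0.0
        for i in range(steps):
            state = 0.5 * state + sequence[i]
        return state

    return run


def run_dynamic(sequence):
    """Define-by-run: ordinary Python control flow decides the step count."""
    state = 0.0
    for x in sequence:
        state = 0.5 * state + x
    return state


if __name__ == "__main__":
    static = build_static_unrolled(3)
    print(static([1.0, 2.0, 3.0]))      # matches the length it was built for
    print(run_dynamic([1.0, 2.0, 3.0]))  # same recurrence, same result
    print(run_dynamic([1.0, 2.0]))       # a shorter input needs no rebuild
```

In this sketch the static version rejects a two-element input unless it is rebuilt (or the input is padded), which is roughly the friction the variable-length RNN remark points at. Real static-graph frameworks offer their own mechanisms for handling this, so treat the example as an analogy rather than a description of either library.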
Speaking of community, that's the other obvious conclusion from Hale's findings: Open source as a whole is on the ascendant in data science. Yes, there are a few plucky, proprietary tools that keep putting in an appearance, but open source dominates the leaderboards. Whether individual projects grow or fall in popularity, open source as a category just goes from strength to strength.
So what should you do?
According to Hale, rather than trying to master every technology that shows up in the job listings, it's best to "focus on learning one technology at a time." In which order does he recommend learning them?
1. Python (for general programming)
2. Pandas (for data manipulation)
3. Scikit-learn library (for learning ML)
4. SQL (for querying)
5. Tableau (for data visualization)
6. Cloud platform (for running models/applications)
7. TensorFlow (most popular) or PyTorch (growing fastest) (for deep learning)
Fortunately, most of these are open source and/or easily accessible at low to no cost. That's one of the things that offers the most promise for a data science-driven future: The cost of entry is relatively low compared to what it was in the past.
Disclosure: I work for AWS, but nothing herein relates directly or indirectly to my work there.