The data science skills gap does not exist because there aren't enough people who can train and analyze data models. There are plenty of talented data modelers who understand conceptual data modeling, logical data modeling and more. The real challenge is finding people who can gather, prepare and cleanse data, and then put their models into production.
I am referring to professionals who understand how to query and connect to databases, know how to implement an object store, and can containerize models, convert them into APIs and embed them into edge devices. In short, people who can put their data science work to practical use.
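To make that concrete, here is a minimal sketch of what "converting a model into an API" can look like. It assumes a scikit-learn model has already been trained and serialized to a file named model.joblib (a hypothetical artifact name), and it uses Flask as one common open source way to expose it over HTTP; none of this is tied to a specific product or course.

```python
# Minimal sketch: serving a previously trained model as an HTTP API.
# Assumes a scikit-learn model was saved to "model.joblib" (hypothetical name).
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # load the serialized model once, at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    payload = request.get_json()
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

From there, the same script can be packaged into a container image and deployed alongside an application, which is precisely the kind of hand-off skill that is in short supply.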
This is where the shortage lies: data scientists who are nearly as skilled in software engineering as they are in data modeling. Enterprises need people who know how to productize their output so it can be applied to real-world use cases, not just people who can build an effective model. That's why Gartner identified AI engineering, the discipline of operationalizing AI models, as a top strategic technology trend for 2022.
Fortunately, colleges and universities have the tools required to provide fantastic environments for learning the engineering side of data science, and they hold the key to minimizing the current data science skills shortage.
It’s time for them to use it to open doors for the next generation of data science professionals.
Playing catch-up
So far, they’ve only propped the door open a little bit.
Too many professors still focus heavily on the theoretical and mathematical aspects of data science and far less on the practical expertise required to put it into practice. Maybe that's because they feel their role is to advance science, not necessarily to train people for a profession. Advancing science is important, but there needs to be a balance between the two. Things are getting better, and more colleges and universities are beginning to offer limited courses on how to apply data science and modeling to applications.
But they need to evolve their curricula more quickly to meet demand. That's difficult, as it can take years to create a single new course and get it approved. That pace is not acceptable when the technology advances every few months. The disconnect between what is taught and what is needed continues.
Meanwhile, companies that have the appropriate resources and knowledge are attempting to compensate. Many are hiring experienced database administrators and recent college graduates and training them on practical model deployment and data engineering.
There are drawbacks to this approach. First, an organization that is short on practical model deployment skills will not have the expertise necessary to train an incoming group of scientists on those skills. After all, they can’t teach what they don’t know. Second, training can be time-consuming, drain resources and undermine organizational efforts to become faster and more efficient.
This is not sustainable or feasible for most companies, particularly smaller organizations that may not have the means to properly train their employees. It's also not fair to students, who are already entering the workforce at a disadvantage.
But colleges and universities do not need to spend years creating new courses. Instead, they can use the open source tools they already have at their disposal to incorporate hands-on practical learning into their existing computer science courses.
Creating a data engineer
Higher education institutions have invested heavily in open source technologies for several years and are using the software to creatively solve a variety of challenges. They’re attracted by its interoperability, security and cost-effectiveness, among other benefits.
But they also understand that more companies are leveraging open source than ever before. In fact, 95% of respondents to a recent survey by Red Hat said that open source is important to their organization’s overall enterprise infrastructure. Indeed, open source is the new normal for IT. This makes teaching and using open source technologies vitally important.
We're already seeing some colleges and universities teaching courses on tools like Python and Jupyter Notebooks. Some have even incorporated these tools into their daily classroom settings. Now, it's time to take things further by creating a framework that brings these and other tools together and ties the theoretical aspects of model training to the more practical aspects of software development.
That’s not difficult to do, thanks to the open and flexible nature of open source software. Different technologies can easily be strung together to create a cohesive whole and give students a more complete view of how their work can be used to practical effect in an application.
For example, a college that teaches both Python and Jupyter Notebooks can combine the two tools in a single classroom setting. Professors can create a specialized section of the course that shows students not only how to work with Jupyter Notebooks, but also how to transfer that work to a developer. They can also show how an application developer using Python might incorporate data models into an application. Students can even be taught the basics of how Python works without being trained as application developers themselves.
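As one illustration of what that hand-off could look like, the sketch below uses scikit-learn's bundled iris dataset and a hypothetical file name; it is an assumed workflow, not a prescribed curriculum. The data scientist's notebook ends by serializing the trained model, and the application developer's code begins by loading it:

```python
# Sketch of a notebook-to-application hand-off, using scikit-learn's
# built-in iris dataset for illustration. The file name is hypothetical.

# --- In the data scientist's Jupyter Notebook ---
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "iris_model.joblib")  # the artifact handed to the developer

# --- In the application developer's code ---
import joblib

model = joblib.load("iris_model.joblib")
sample = [[5.1, 3.5, 1.4, 0.2]]  # one set of iris measurements
print(model.predict(sample))     # e.g. [0], the setosa class
```

The model itself is beside the point; the value of an exercise like this is that students see the exact boundary between their experimentation and the artifact a developer consumes.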
Essentially, colleges and universities can apply the principles of both science and engineering in a single class. Students can learn how to experiment with their models and how to put those models into motion, taking them from idea to deployment.
Filling the skills gap
The competition among enterprises to find talented data scientists is showing no signs of slowing. According to EY, organizations are still having trouble filling data-centric roles due to ineffective upskilling programs, a shortage of talent and more. Even powerhouse organizations like NASA are struggling to find the right people for the right data science roles.
The easiest and fastest way to fill this ever-widening skills gap is for colleges and universities to broaden the scope of some of their current courses. They should consider incorporating software engineering and operations training alongside their current data science offerings. This will give students a more well-rounded and more useful perspective, better preparing them for what lies ahead while giving enterprises the talent they're looking for.
Guillaume Moutier is a Senior Principal Data Engineering Architect in Red Hat Cloud Storage and Data Services, focusing on data services, AI/ML workloads and data science platforms. A former project manager, architect and CTO for large organizations, he is constantly looking for and promoting new and innovative solutions, with a focus on usability and business alignment informed by 20 years of IT architecture and management experience.