How NVIDIA hopes to industrialize AI with MLOps

Commentary: We still spend far too much time training machine learning models rather than deploying them. Could MLOps--which brings data scientists and DevOps together--be a solution?

Image: Chinnawat Ngamsom, Getty Images/iStockphoto

We're definitely beyond the science project years of artificial intelligence (AI) deployments, and we've got work to do before we get past the "brittle" phase. Some retailers, like Walmart, have used machine learning (ML) to improve forecasting while reducing their cost of inventory. But many others spend lots of time building models that never get deployed, said Tony Paikeday, director of product marketing for the NVIDIA DGX portfolio of AI supercomputers and NVIDIA accelerated data science platform, in an interview. 

What we need, he said, is to "industrialize AI," in part by bringing together data scientists with DevOps. NVIDIA seems to be all in on this new approach, called "MLOps," and the company might be onto something.

Not just a box

But first, an admission, one that I'll offer in case you've made the same mistake. I've always thought of NVIDIA as a chip company--you know, that company that created a graphics chip way back when that made video games cool. Since then, NVIDIA and others have expanded the use of these graphics processing units (GPUs) for areas like high-performance computing and artificial intelligence.

But still, hardware, right?

According to Paikeday, in the past customers would buy a GPU, stick it in a server, and run open source frameworks (like TensorFlow) on it. This worked, sort of, but it broke down as data sets grew: It became ever harder to parallelize such workloads across processors, and it became obvious that these customers needed cluster-aware software.

Even more, NVIDIA figured out it needed a full stack, optimized from the driver level up, so it could parallelize not just across a few GPUs but across multiple systems. The company simply couldn't afford to wait on a training run--it needed its models fast.

With all this as context, NVIDIA has been developing software for years. Indeed, the company now spends more time with its customers on software-related development than on the hardware alone.

Not just a bunch of boxes

This brings us to another part of NVIDIA's AI-related business: Where does the customer run NVIDIA hardware/software? The short answer is "wherever they want to," since NVIDIA is somewhat agnostic to where its solutions run. The customer, however, is not.

According to Paikeday, most of NVIDIA's customers start in the public cloud; whether they stay there depends on a few factors. Citing the adage "Train where your data lands," Paikeday said customers who generate data in the cloud will tend to train their AI models there. Customers may also keep training runs in the cloud so they can iterate and fail fast. Given how new AI remains for so many, the cloud is a great place to explore.

But as a customer's data sets and models grow, and their prototypes become more sophisticated, "the impact of data gravity starts to be felt," he said, making it cost effective to keep the data local and avoid data transit fees. Some of these customers, he said, may also get better results from fixed-cost infrastructure and from centralizing operations in an "AI center of excellence" of sorts, where they can cross-pollinate expertise across teams, groom new talent, and more.

Not a box at all

Wherever enterprises opt to run their training models, they quickly need to figure out how to operationalize their data science. One problem is that data scientists tend not to be trained engineers and don't necessarily follow good DevOps practices. Worse, data scientists, engineers, and IT operations often work in isolation. All of this contributes to making AI brittle and immature within the enterprise.

The Holy Grail, at least to some, is MLOps, which brings machine learning and operations together much as DevOps did for development and operations. The goals, as articulated by Kyle Gallatin, an ML engineer at Pfizer:

  • Reduce the time and difficulty to push models into production

  • Reduce friction between teams and enhance collaboration

  • Improve model tracking, versioning, monitoring, and management

  • Create a truly cyclical lifecycle for the modern ML model

  • Standardize the machine learning process to prepare for increasing regulation and policy
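The model tracking and versioning goal above can be made concrete with a small sketch. The `ModelRegistry` class and its record fields below are illustrative inventions, not an NVIDIA or Pfizer API (real teams typically reach for a tool like MLflow's model registry), but they show the core idea: every trained model gets an immutable, versioned record of its parameters, metrics, and artifact, and promotion to production is an explicit, auditable step.

```python
import hashlib
import time


class ModelRegistry:
    """Toy in-memory registry illustrating model versioning and tracking."""

    def __init__(self):
        self._models = {}  # model name -> list of version records

    def register(self, name, params, metrics, artifact_bytes):
        """Record a new version of a model with its params and metrics."""
        versions = self._models.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "params": params,
            "metrics": metrics,
            # A content hash lets operations verify exactly which
            # artifact a deployed model corresponds to.
            "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
            "registered_at": time.time(),
            "stage": "staging",
        }
        versions.append(record)
        return record["version"]

    def promote(self, name, version):
        """Move a version to production, archiving the previous one."""
        for record in self._models[name]:
            if record["stage"] == "production":
                record["stage"] = "archived"
        self._models[name][version - 1]["stage"] = "production"

    def production_version(self, name):
        """Return the record currently serving in production, if any."""
        for record in self._models[name]:
            if record["stage"] == "production":
                return record
        return None
```

A registry like this also supports the "truly cyclical lifecycle" goal: rolling back is just promoting an earlier version, since every version's parameters and artifact hash are still on record.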

Sounds straightforward, right? Well, of course it isn't, or we wouldn't have the brittle AI that we do today. NVIDIA, said Paikeday, is building a platform that allows data scientists to work closely with DevOps folks and thus reduce the friction between these sparring groups. 

It's a good goal. It's also not what you'd expect from a chip company that helps to make video game graphics sing. But then, that's not really what NVIDIA does anymore. At least, not completely. It's even more focused today on building the software that will serve as connective tissue for data scientists and DevOps within the enterprise so that AI moves from artisanal to industrial. 

Disclosure: I work for AWS, but nothing herein relates to my employment there.
