High Performance Computing in the cloud: Why there's hope

Commentary: The HPC market has struggled to modernize with cloud, but startups like Rescale may be able to change this.

AWS is having its big re:Invent conference this month, but the High Performance Computing (HPC) industry may not have noticed. Even as organizations increasingly embrace cloud computing, HPC has been a relative laggard in cloud adoption, trailing enterprise adoption by as much as a decade. Sure, Forrester and others point to some signs of life, but the question remains: What will make this $39.1 billion HPC market embrace the cloud?

Why cloudy change matters

There is no shortage of success stories for industries getting ahead with the cloud, but things get interesting in markets that have traditionally depended on HPC. Take aerospace, for example. New market entrants enjoy a competitive advantage over legacy incumbents largely made possible by their immediate embrace of cloud computing. Aerospace depends on HPC to simulate the countless variables involved in making liftoff, landing, and everything in between safe and effective.

But newer aerospace companies tend to do more with a lot less.

At legacy companies, expensive engineering talent typically has to queue to run workloads on highly specialized HPC infrastructure. Indeed, across all science-related sectors, which may account for half of the HPC market, these brilliant, expensive minds sit idle while their HPC workloads wait in a queue for a week.

Meanwhile, startups use the cloud to take a shortcut through the critical path of that design cycle. They iterate constantly. Their PhDs don't have to queue. By definition, startups are capital constrained; hence, to have a shot at survival and success, they need to focus engineers and money on their core business, not on becoming experts in infrastructure that they have to maintain and improve constantly.

Something has to change.

Making HPC work in the cloud

Fortunately, there are signs of hope. Startups like Rescale have created what amounts to cloud brokering middleware businesses, specializing in HPC. Rescale and its competitors are positioned to bridge the on-premises and public cloud worlds for HPC workloads. (Many of the aerospace startups achieving early success are already on platforms like Rescale.) Companies like Rescale not only find optimal resources to run HPC jobs and automate those workflows, but they also provide real-time granular bookkeeping of the kind of details that HPC customers require. That bookkeeping matters: every HPC customer's worst nightmare is a PhD grad student who hits submit on a $50,000 cloud job by mistake.
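
To make the bookkeeping point concrete, here is a minimal sketch of a pre-submission cost check. It is not Rescale's API or any vendor's pricing; the `JobRequest` fields, the rate, and the spending limit are all hypothetical, chosen only to show how a control plane could catch the accidental $50,000 job before it runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of a cloud HPC job (not a real vendor schema)."""
    name: str
    nodes: int
    hours: float
    usd_per_node_hour: float  # assumed flat rate, for illustration only


def estimated_cost(job: JobRequest) -> float:
    """Estimate spend as nodes x hours x hourly rate."""
    return job.nodes * job.hours * job.usd_per_node_hour


def check_budget(job: JobRequest, limit_usd: float) -> tuple[bool, float]:
    """Return (approved, estimated cost); approve only if cost is within the limit."""
    cost = estimated_cost(job)
    return cost <= limit_usd, cost


if __name__ == "__main__":
    # A job that would cost $50,000 against a hypothetical $5,000 per-user limit.
    job = JobRequest(name="cfd-sweep", nodes=500, hours=50, usd_per_node_hour=2.0)
    approved, cost = check_budget(job, limit_usd=5_000)
    print(f"{job.name}: estimated ${cost:,.0f}, approved={approved}")
```

A real platform would track actual metered usage rather than a flat estimate, but the principle is the same: the check happens before the submit button does any damage.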

If these "intelligent control planes" for HPC workloads are successful, they provide incentives to cloud vendors to invest more in specialized hardware and thus sell more cloud services. In turn, this could open up HPC to a much broader universe of users, making today's $39.1 billion HPC market into something much, much bigger. 

Historically, HPC rode Moore's Law down the cost curve, but performance improvements are now slowing. Hyperscale players like Google put GPUs to work on machine learning (ML) and built custom hardware. But the reason Google will build a TPU (a significant capex investment) is that TensorFlow is used for search, autonomous driving, and other core Google businesses. Its scale and ability to spread that investment across multiple core businesses justify the total cost of ownership (TCO).

With Moore's Law slowing, traditional HPC did the same thing and built increasingly bespoke architectures over the past decade. But each HPC application or algorithm requires its own silo of specialized hardware. For individual companies, even behemoths like Boeing, there is no sustainable way to spread those costs across other workloads over time and still stay current on the latest hardware to remain competitive.

High-performance computers were a huge advance for aerospace, automotive, oil and gas, and other traditional HPC industries when they came on the scene. Remember, the old way of building products was to physically make them and see how the different models worked in testing. When computers got powerful enough, companies quickly learned to test digital representations instead.

Today there's clearly a need to bridge from custom, brittle HPC systems to take advantage of cloud cost savings. Rescale is interesting because it profiles an application that an automotive company wants to run (e.g., crash simulation), determines the optimal combination of hardware, operating system, libraries, etc. to power that application, and then runs it across internal and external (public cloud) infrastructure. This combination of traditional HPC with the cloud has the potential to make HPC much more responsive to fluctuating enterprise needs, building on elastic infrastructure.
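
As a rough illustration of what "determining the optimal combination" could mean, the sketch below picks the cheapest resource that can still finish a job by a deadline, whether that resource is an on-premises cluster with a long queue or a public cloud pool with none. This is not how Rescale actually works; the `Resource` fields, the numbers, and the assumption that runtime scales perfectly with core count are all simplifications made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resource:
    """Hypothetical compute option, on-premises or public cloud."""
    name: str
    cores: int
    usd_per_core_hour: float
    queue_wait_hours: float  # expected time before the job starts


def turnaround_hours(core_hours: float, resource: Resource) -> float:
    """Queue wait plus runtime, assuming perfect scaling across cores."""
    return resource.queue_wait_hours + core_hours / resource.cores


def pick_resource(
    core_hours: float, deadline_hours: float, resources: Sequence[Resource]
) -> Optional[Resource]:
    """Return the cheapest resource that meets the deadline, or None if none can."""
    feasible = [
        r for r in resources if turnaround_hours(core_hours, r) <= deadline_hours
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda r: core_hours * r.usd_per_core_hour)


if __name__ == "__main__":
    options = [
        # Made-up numbers: a cheap in-house cluster with a week-long queue...
        Resource("on-prem-cluster", cores=1000, usd_per_core_hour=0.03,
                 queue_wait_hours=168),
        # ...and a pricier cloud pool that is available immediately.
        Resource("cloud-pool", cores=4000, usd_per_core_hour=0.05,
                 queue_wait_hours=0),
    ]
    choice = pick_resource(core_hours=48_000, deadline_hours=24, resources=options)
    print(choice.name if choice else "no resource meets the deadline")
```

With a one-day deadline the cloud pool wins despite its higher rate; relax the deadline to two weeks and the cheaper on-premises cluster becomes the better choice. That trade-off between queue time and price is the kind of decision such a control plane would be expected to automate.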

In turn, Rescale offers hope to legacy incumbents that are weighed down by existing infrastructure. But it's equally useful for hungry startups that might otherwise be blocked by the hefty capital expenditures historically required to compete in markets like automotive. In both cases, these companies are working on some of the world's hardest problems. This isn't fluffy, silly Silicon Valley butt-of-the-joke startup stuff. To quote Peter Thiel (an investor in, not coincidentally, Rescale), "We wanted flying cars, instead we got 140 characters." I predict that we're going to see a lot more flying cars--and even better--much sooner. 

Disclosure: I work for AWS, but the views expressed herein are mine.
