In The Hitchhiker’s Guide to the Galaxy, all Arthur Dent needed to do to understand any language in the universe was slot a small, leech-like fish into his ear.

In recent years, real-world technology has begun to catch up with the imagination of Douglas Adams and started to break down language barriers.

But rather than sitting snugly in your inner ear, the tech that is starting to make real-time translation possible today would struggle to fit inside a warehouse.

At the REWORK Deep Learning Summit in London, Facebook AI researcher Shubho Sengupta laid bare the immense computing power that underpins systems devoted to understanding human language and speech, whether it’s the real-time translation abilities of services like Microsoft Translator and Google Translate, or the speech recognition in virtual assistants like Amazon Echo and Google Home.

“It’s a very, very large network. We need very high compute, very high bandwidth and very low latency,” he says.

The machine-learning systems that have powered recent AI breakthroughs in areas like speech and image recognition are underpinned by large neural networks. These brain-inspired networks are interconnected layers of algorithms that feed data into each other, and which can be trained to carry out specific tasks by modifying the importance of input data as it passes between the layers.
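That layered structure can be sketched in a few lines of plain Python. The weights and inputs below are hand-picked and purely hypothetical; the point is only to show that a network’s weights are what set the “importance” of each input as data passes between layers, and that training means adjusting those weights.

```python
# Minimal sketch of a two-layer feed-forward network (illustrative only).
# Each neuron computes a weighted sum of its inputs plus a bias, then
# applies a ReLU non-linearity. Training would adjust the weights.

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums passed through ReLU."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs

# Hypothetical hand-picked weights, purely for illustration.
hidden = layer([0.5, -1.0],                # two input features
               [[0.8, 0.2], [-0.4, 0.9]],  # two hidden neurons
               [0.1, 0.0])
output = layer(hidden, [[1.0, -1.0]], [0.0])  # one output neuron
print(output)
```

An industrial-scale network differs from this toy in degree, not kind: tens of millions of such weights rather than six, which is where the compute figures below come from.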

These networks ingest huge amounts of data during training, and Sengupta says that a typical “industrial-scale” automatic speech recognition (ASR) network requires distributed compute clusters that together are capable of about 20 to 50 exaflops [an exaflop is a billion billion floating-point operations per second], to train models with between 20 and 50 million parameters. The training is spread across the vast clouds of compute power available to the tech giants, as it requires more processing than any one system can deliver in a reasonable timeframe.

By way of comparison, the world’s fastest supercomputer, the Chinese Sunway TaihuLight, has a maximum performance of about 0.093 exaflops. Real-time translation systems like Google Translate need combined computational power several times greater in order to process models with “anywhere from 100 million to 400 million parameters”.
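Some back-of-envelope arithmetic puts those figures in perspective. The 4-bytes-per-parameter assumption and the peak-rate comparison below are illustrative simplifications, not numbers from Sengupta’s talk:

```python
# Rough scale of the figures quoted above: memory for a 50-million-
# parameter model (assuming 4-byte floats), and how many TaihuLight-class
# machines the upper end of the quoted cluster capability corresponds to.

params = 50_000_000
model_gb = params * 4 / 1e9  # assumes 32-bit (4-byte) parameters
print(f"parameter memory: {model_gb:.1f} GB")

cluster_exaflops = 50         # upper end of Sengupta's quoted range
taihulight_exaflops = 0.093   # TaihuLight's peak performance
print(f"TaihuLight equivalents: {cluster_exaflops / taihulight_exaflops:.0f}")
```

Even under these generous assumptions, the training clusters described amount to hundreds of the world’s fastest supercomputer running at peak.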

And the top-performing speech-recognition and machine-translation systems dwarf even the specs shared by Sengupta, according to Jonas Loof, deep learning solutions architect at Nvidia.


In 2016, Baidu’s Deep Speech 2 was lauded for its “superhuman performance” on speech recognition, he says, but the system behind it took 20 exaflops of compute power and a model with 300 million parameters to train. Similarly, Google’s Neural Machine Translation system is capable of producing “near-human” accuracy in language translation, but required 100 exaflops of processing power and 8.7 billion parameters to train.

Describing the complexity of neural networks as exploding, Loof says: “The amount of parameters and compute we need to train these kind of models has been growing over the past years,” adding, “this is not stopping here, obviously”.

In a recent paper, the deep-learning research group Google Brain said it had run 1,000 years’ worth of CPU computation as part of its research into recurrent neural networks, a type of neural net particularly well suited to language processing and speech recognition.

Optimizing these systems is a key concern, says Sengupta, with companies using clusters of CPUs and GPUs, as well as application-specific integrated circuits (ASICs) tailored to machine-learning tasks, such as Google’s Tensor Processing Unit (TPU).

“The entire TPU paper was essentially ‘How can we develop a chip that can train and infer these massive networks that we’re building?’,” he says.

At the other end of the scale, Google is also engaged in developing very simple machine-learning models for recognizing a handful of simple but common voice commands, such as “on” and “off”. Google hopes these stripped-down convolutional neural network-based systems will eventually be simple enough to run on the very low-power hardware found in cheap phones and home appliances.
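As a loose illustration of the core operation in such stripped-down convolutional models (this is not Google’s actual system, and the signal and filter values are invented), a single one-dimensional filter sliding over an audio-energy envelope can flag the onset of a spoken command:

```python
# Illustrative 1-D convolution: the basic building block of a small
# convolutional keyword spotter. A short filter slides across a sequence
# of audio features, producing one response score per position.

def conv1d(signal, kernel):
    """Slide `kernel` across `signal`, producing one score per position."""
    span = len(kernel)
    return [sum(k * s for k, s in zip(kernel, signal[i:i + span]))
            for i in range(len(signal) - span + 1)]

# Hypothetical energy envelope of an utterance; this filter responds to
# a rise between adjacent frames, such as the onset of a word like "on".
signal = [0.0, 0.1, 0.9, 1.0, 0.2, 0.0]
kernel = [-1.0, 1.0]  # scores the increase between neighbouring frames
scores = conv1d(signal, kernel)
print(scores)  # the largest score marks the steepest onset
```

Because the same tiny filter is reused at every position, models built from such layers have very few parameters, which is what makes them candidates for low-power hardware.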


Ultimately, Sengupta remains confident that new techniques for training language and speech recognition systems will reduce the vast amount of compute needed. He cites recent Facebook research that achieved state-of-the-art machine translation using convolutional neural networks, which are more commonly used in image recognition and offer a more efficient alternative to traditional approaches.

Regardless of the power needed to train machines to understand language, Facebook’s Sengupta says the rate of progress in the field has outstripped what was thought possible until very recently, highlighting the rapid rise of virtual assistants like Amazon Alexa and Google Home.

“What is very interesting to me is that these devices have only been possible for the last couple of years,” he says.

“If I came here five years back and said these devices would be a reality, a lot of you would be very skeptical about that.”