Building a slide deck, pitch, or presentation? Here are the big takeaways:
- Google researchers unveiled a deep learning audio-visual model for isolating a single speech signal from a mix of sounds, including other voices and background noise.
- The model has potential applications in speech enhancement and recognition in videos and in video conferencing.
When you find yourself in a noisy conference hall or networking event, it’s usually pretty easy to focus your attention on the particular person you’re talking to, while mentally “muting” the other voices and sounds around you. This capability, known as the cocktail party effect, comes naturally to humans, but automatically separating an audio signal into its individual speech sources has remained a challenge for computers.
At least, until now: Google researchers have developed a deep learning audio-visual model for isolating a single speech signal from a mix of sounds, including other voices and background noise. As detailed in a new paper, the researchers were able to computationally produce videos in which a specific person’s voice is enhanced while all other sounds are suppressed.
The method allows someone watching a video to select the face of the person they want to hear, or to have an algorithm select that person based on context. This could make it easier for business users to transcribe meetings or conference presentations, especially those filmed in a crowded conference hall.
SEE: IT leader’s guide to deep learning (Tech Pro Research)
“We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking,” according to a recent Google Research blog post.
Google’s technique uses both the audio and visual signals from the video to separate the speech, the post noted, matching the movements of a speaker’s mouth to the sounds they produce to help identify which parts of the audio correspond to that person. This greatly improves speech separation quality when more than one person is present, according to the post.
Google trained its method on a collection of 100,000 videos of lectures and talks from YouTube, extracting segments with clean speech and a single visible speaker. That yielded roughly 2,000 hours of video clips, each with a single speaker on camera and no background noise. The researchers then used that data to generate synthetic mixtures, combining face videos and their corresponding speech from separate sources with added background noise.
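The post doesn’t spell out the exact mixing recipe, but the general idea of building such training pairs—overlaying a clean on-camera clip with interfering speech and noise while keeping the clean clip as the target—can be sketched roughly as follows. The function name, the SNR-based noise scaling, and the placeholder waveforms are illustrative assumptions, not Google’s code:

```python
import numpy as np

def mix_speech(clean_a: np.ndarray, clean_b: np.ndarray,
               noise: np.ndarray, noise_snr_db: float = 10.0) -> np.ndarray:
    """Overlay two clean single-speaker clips and add background noise
    at a target signal-to-noise ratio (illustrative sketch only; the
    paper's actual mixing procedure may differ)."""
    n = min(len(clean_a), len(clean_b), len(noise))
    clean_a, clean_b, noise = clean_a[:n], clean_b[:n], noise[:n]

    speech_mix = clean_a + clean_b                      # two overlapping speakers
    speech_power = np.mean(speech_mix ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mixture reaches the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (noise_snr_db / 10)))
    return speech_mix + scale * noise

# Usage with placeholder 16 kHz waveforms; clean_a remains the training target
# for the speaker who is on camera.
sr = 16000
clean_a = 0.1 * np.random.randn(sr * 3).astype(np.float32)
clean_b = 0.1 * np.random.randn(sr * 3).astype(np.float32)
noise = 0.05 * np.random.randn(sr * 3).astype(np.float32)
mixture = mix_speech(clean_a, clean_b, noise)
```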
These synthetic mixtures allowed the researchers to train a “multi-stream convolutional neural network-based model” to separate the mixed audio into a clean stream for each speaker.
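The article doesn’t detail the network’s layout, so the following is only a rough PyTorch sketch of what a two-stream, mask-predicting audio-visual model of this kind can look like. The layer sizes, the per-frame face-embedding input, and the spectrogram-mask output are assumptions for illustration, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Illustrative two-stream sketch (not Google's architecture):
    one stream encodes the noisy spectrogram, one encodes per-frame face
    embeddings for the chosen speaker, and the fused features predict a
    time-frequency mask applied to the mixture to recover that speaker."""

    def __init__(self, freq_bins: int = 257, face_dim: int = 1024, hidden: int = 256):
        super().__init__()
        # Audio stream: 1-D convolutions over time, treating frequency bins as channels.
        self.audio_net = nn.Sequential(
            nn.Conv1d(freq_bins, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Visual stream: face embeddings, assumed already resampled to the
        # spectrogram frame rate.
        self.visual_net = nn.Sequential(
            nn.Conv1d(face_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fusion: concatenated streams -> per-bin mask in [0, 1].
        self.fusion = nn.Sequential(
            nn.Conv1d(hidden * 2, freq_bins, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, mix_spec: torch.Tensor, face_emb: torch.Tensor) -> torch.Tensor:
        # mix_spec: (batch, freq_bins, time); face_emb: (batch, face_dim, time)
        a = self.audio_net(mix_spec)
        v = self.visual_net(face_emb)
        mask = self.fusion(torch.cat([a, v], dim=1))
        return mask * mix_spec  # estimated spectrogram of the selected speaker

# Usage with dummy tensors: a batch of 2 clips, 100 spectrogram frames each.
model = AudioVisualSeparator()
estimate = model(torch.randn(2, 257, 100), torch.randn(2, 1024, 100))
```

Feeding a different speaker’s face embeddings into the visual stream would, in this kind of setup, steer the mask toward that speaker’s voice instead, which is what lets a viewer “select” whom to hear.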
This method has a number of potential applications, such as pre-processing for speech recognition and automatic video captioning. Google is currently exploring how it may be integrated into the company’s products, according to the post.
