How Google is turning its Cloud Speech-to-Text AI into a real business tool

Google's Speech-to-Text now includes improved phone call and video transcription, automatic punctuation, and recognition metadata.

Building a slide deck, pitch, or presentation? Here are the big takeaways:
  • Google has updated its Cloud Speech-to-Text API with improved phone call and video transcription to make the service more useful for businesses.
  • Google's Speech-to-Text update includes automatic punctuation and optional recognition metadata.

On Monday, Google announced a major update to its Cloud Speech-to-Text technology, including improved phone call and video transcription, that is meant to make the API more useful for businesses.

The announcement follows Google's March unveiling of its Cloud Text-to-Speech offering, which allows developers to power voice response systems for call centers, enable Internet of Things (IoT) devices to talk back to users, and convert text-based media into a spoken format. Taken together, the two releases could signal that the tech giant is increasingly interested in bringing its artificial intelligence (AI)-powered tools to the enterprise.

Cloud Speech-to-Text, formerly known as the Cloud Speech API, was first unveiled in 2016 and has been generally available for about a year. Usage of the API has more than doubled every six months, according to a Google blog post from Dan Aharon, product manager of Cloud AI.

The Cloud Speech-to-Text update includes speech recognition models that are tailored for specific use cases, including phone call transcription and transcription of audio from video, according to the post. Customers can choose the model that best fits their business needs.

The update also includes one of the industry's first opt-in programs for data logging, along with a model called "enhanced phone_call" that uses customer data to improve the system. Customers who choose to participate in the program will gain access to the model, which has 54% fewer errors than the basic "phone_call" model, according to the post.

Google also revealed the video model, which has been optimized to process audio from videos or audio with multiple speakers, the post said. The video model uses machine learning similar to that used by YouTube captioning, and offers a 64% reduction in errors compared to the default model.
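For developers, picking one of these models could come down to a single field in the request. The Python sketch below builds the JSON body for a recognize request without sending it. The field names `model` and `useEnhanced` reflect our reading of Google's public REST reference, not anything stated in the blog post, and the helper `build_recognize_request` and the bucket path are hypothetical, so check the current documentation before relying on them.

```python
import json

# Model names mentioned in the article; "default" is the baseline model.
KNOWN_MODELS = {"default", "phone_call", "video"}


def build_recognize_request(gcs_uri, model="default", use_enhanced=False,
                            language_code="en-US"):
    """Return a JSON-serializable request body (hypothetical helper)."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    config = {"languageCode": language_code, "model": model}
    if use_enhanced:
        # Assumption: the enhanced variant is requested with a boolean flag
        # alongside the base model name.
        config["useEnhanced"] = True
    return {"config": config, "audio": {"uri": gcs_uri}}


# Hypothetical Cloud Storage path, for illustration only.
body = build_recognize_request(
    "gs://example-bucket/support-call.wav",
    model="phone_call",
    use_enhanced=True,
)
print(json.dumps(body, indent=2))
```

Keeping the model name as a plain parameter means the same helper could serve a call center pipeline and a video captioning job with no other changes.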

Cloud Speech-to-Text also now includes automatic punctuation in speech transcriptions thanks to a new LSTM neural network. The model, now available in beta, can automatically suggest commas, question marks, and periods in text. This could be helpful for transcribing conference calls or taking notes by voice.
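As a rough illustration of how punctuation might be switched on and read back, the Python sketch below sets a config flag and stitches a transcript together from a response. The `enableAutomaticPunctuation` field and the `results` / `alternatives` / `transcript` response shape are assumptions based on Google's public REST reference, and the sample response is hand-written for this example, not real API output.

```python
def punctuation_config(language_code="en-US"):
    """Config fragment that asks for punctuated transcripts (assumed field name)."""
    return {"languageCode": language_code, "enableAutomaticPunctuation": True}


def join_transcripts(response):
    """Concatenate the top alternative of each result into one string."""
    parts = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            parts.append(alternatives[0].get("transcript", "").strip())
    return " ".join(part for part in parts if part)


# Hand-written sample, shaped like a recognize response; not real output.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "Can everyone hear me?"}]},
        {"alternatives": [{"transcript": " Great, let's get started."}]},
    ]
}

print(punctuation_config())
print(join_transcripts(sample_response))
```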

Users can also tap optional recognition metadata to tag and group transcription workloads, and provide feedback to the Google team to improve the product, the post noted. For example, you can describe your transcribed audio or video with tags such as "voice commands for a shopping app" or "basketball sports tv shows," and Google aggregates that information across Cloud Speech-to-Text users to determine its next project, according to the post.
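To make the tagging idea concrete, here is a minimal Python sketch that attaches a description to a request config. The `metadata` block and its `audioTopic`, `interactionType`, and `originalMediaType` fields are assumptions drawn from Google's public REST reference, and `with_metadata` is a hypothetical helper; the blog post itself only describes free-form tags.

```python
def with_metadata(config, audio_topic, interaction_type=None,
                  original_media_type=None):
    """Return a copy of config with a recognition metadata block attached."""
    metadata = {"audioTopic": audio_topic}
    if interaction_type is not None:
        # Assumed enum-style value, e.g. "VOICE_COMMAND" or "PHONE_CALL".
        metadata["interactionType"] = interaction_type
    if original_media_type is not None:
        # Assumed enum-style value, e.g. "AUDIO" or "VIDEO".
        metadata["originalMediaType"] = original_media_type
    merged = dict(config)  # shallow copy so the caller's dict is untouched
    merged["metadata"] = metadata
    return merged


base_config = {"languageCode": "en-US"}
tagged = with_metadata(
    base_config,
    audio_topic="voice commands for a shopping app",
    interaction_type="VOICE_COMMAND",
)
print(tagged)
```

Because the helper returns a copy, the same base config can be reused across workloads that carry different tags.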

"Access to quality speech transcription technology opens up a world of possibilities for companies that want to connect with and learn from their users," Aharon wrote in the post. "With this update to Cloud Speech-to-Text, you get access to the latest research from our team of machine learning experts, all through a simple REST API."

Both the "enhanced phone_call" and video models are now available for English language transcription, and will soon be available for additional languages, according to the post. The API costs $0.006 per 15 seconds of audio for all models except the video model, which costs $0.012 per 15 seconds. However, Google is providing the new video model for $0.006 per 15 seconds for a limited trial period through May 31.
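To see what those rates mean for a real workload, the Python sketch below estimates a bill from audio length. It uses the list prices quoted above and assumes each request is rounded up to the next 15-second increment; that rounding rule is our assumption, and the estimate ignores the trial discount and any other promotions. The function name `estimate_cost` is hypothetical.

```python
import math
from decimal import Decimal

INCREMENT_SECONDS = 15

# List prices per 15-second increment, as quoted in the article.
RATE_PER_INCREMENT = {
    "default": Decimal("0.006"),
    "phone_call": Decimal("0.006"),
    "video": Decimal("0.012"),
}


def estimate_cost(duration_seconds, model="default"):
    """Estimate the cost in US dollars of transcribing one audio file."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    # Assumption: partial increments are billed as a full increment.
    increments = math.ceil(duration_seconds / INCREMENT_SECONDS)
    return RATE_PER_INCREMENT[model] * increments


# One hour of audio on each pricing tier.
print("default:", estimate_cost(3600, "default"))
print("video:  ", estimate_cost(3600, "video"))
```

Under these assumptions, an hour of audio spans 240 increments, so the video model comes to twice the cost of the other models at list price.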

You can learn more or try out a demo on the Speech-to-Text product page.

About Alison DeNisco Rayome

Alison DeNisco Rayome is a Staff Writer for TechRepublic. She covers CXO, cybersecurity, and the convergence of tech and the workplace.
