Voice recognition in Windows is nothing new, and Word has had multi-language dictation built in for some time. But that’s intended for when you want to say exactly what you want to see on the page. Capturing an entire conversation or meeting — so you can listen, think and talk without also feverishly typing notes to refer back to later — is a different kind of workflow.
There are already a number of options, from the meeting transcription option inside Teams (which only works for the organisation hosting the meeting, but not for any external guests) to services like Otter that record in a browser or on your smartphone and transcribe in the cloud. Developers can use the Azure Cognitive Services APIs to create their own transcription app (Azure CTO Mark Russinovich created a smartphone app to record his meetings a few years ago).
All Office users can now get transcriptions using that same backend with the new Transcribe feature in Word, which turns what’s said in a meeting or conversation into a transcript that you can use as a reference while writing a document.
You can type your own notes while the transcript is being recorded, but you don’t see the transcription being done in real time (because the text appearing on-screen can be distracting or make people self-conscious about talking). When you stop the recording, the transcript appears in the Transcribe pane very quickly and you can copy it — in part or in whole — from the pane into the document itself. If you just want a section, you click the + button that appears when you hover over it; if you want the entire transcript, click the ‘Add all to document’ button at the bottom of the pane.
If you already have a recording (in WAV, MP3, MP4 or M4A), you can upload that inside a Word document to have it transcribed. There’s no limit to the number of meetings and conversations you can record and transcribe live, but you can only upload 300 minutes (5 hours) of audio a month and audio files can’t be more than 200MB (although that will increase). How many minutes of recording that file size covers varies with the file format and codec you use; if you’re recording in MP3 on an iPhone, 200MB will store more than three hours of speech.
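As a rough illustration of how the 200MB cap translates into recording time, here is a minimal Python sketch. It assumes a constant-bitrate recording, treats 1MB as 1,000,000 bytes and ignores container overhead; the bitrates shown are illustrative assumptions rather than figures from Microsoft, and the function name is hypothetical.

```python
def max_minutes(size_limit_mb: float, bitrate_kbps: float) -> float:
    """Approximate minutes of audio that fit in size_limit_mb at a constant bitrate.

    Assumptions (not from Microsoft): 1 MB = 1,000,000 bytes, constant bitrate,
    no container or metadata overhead.
    """
    size_bits = size_limit_mb * 1_000_000 * 8
    seconds = size_bits / (bitrate_kbps * 1_000)
    return seconds / 60


# Illustrative bitrates only; real recordings vary by app, format and codec.
for kbps in (64, 128, 256):
    print(f"{kbps} kbps: about {max_minutes(200, kbps):.0f} minutes in 200MB")
```

Under these assumptions a 128kbps file reaches the 200MB cap after roughly 208 minutes, which is in line with the "more than three hours" figure above, while a 64kbps file could run to about 417 minutes, longer than the 300-minute monthly upload allowance.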
According to Microsoft, uploading and transcribing an audio file will currently take about the same time as the file length, but that will get faster in the future; our test of a 51-minute MP3 took slightly less than 50 minutes.
There are also new options for the Dictate feature that’s already in Word, which doesn’t need a Microsoft 365 subscription; you just need to be signed in with a Microsoft account. As well as dictating text, you can now use Voice Commands in Word for the web to format text, create lists and add comments to a document you’re reviewing; leaving comments is particularly convenient if you’re doing it on a tablet rather than a laptop because you don’t have to pull out a keyboard. Voice commands are only in Word for the web and mobile initially, but will be in desktop Word on Windows and Mac before the end of 2020 for Microsoft 365 subscribers.
Great with some rough edges
Transcribe is a very useful feature, and having it built into a mainstream product like Word will bring it to a much wider audience. But it has a number of limitations.
To begin with, it’s only available in Word on the web, with the mobile versions of Word to follow. “Having transcribing integrated into Word on the web means that it works on any computer with any meeting software, and it’ll also be available by the end of the year for both Android and iOS phones,” Dan Parish, Microsoft principal group PM manager for Natural User Interface & Incubation, said. That means it’s not available in the desktop Word app on Windows or macOS. “We are currently looking into when and how we might want to expand to other platforms,” Parish said.
You need to be using either Chrome or the new Edge browser and to have a Microsoft 365 subscription (that includes enterprise, consumer and education plans).
US English is the only language Transcribe currently supports (the Dictate feature supports nine languages, including three English dialects, with another 12 in preview). Microsoft said it’s working to have Transcribe available in more locales and languages.
For the American and UK accents that we briefly tested Word’s transcription with, accuracy is high but not perfect. (Incidentally, Word does recognise swear-words correctly, but they are converted into asterisks in the transcript.)
For comparison, we transcribed the same conversation from a high-quality Teams chat in Word (in the new Edge browser on a Surface Book 2) and in Otter on an Android phone (which is less accurate than using Otter integrated into a Zoom meeting), and saw a similar number of mistakes; there were some words both tools got wrong (‘pain’ for ‘pane’, for example) and other mistakes they made differently.
On the other hand, Word did extremely well at transcribing an uploaded recording of someone speaking through a microphone in a large room with some noise in the recording and occasional coughing and rustling going on, which is the kind of quality some transcription tools balk at. There were a couple of homonym mistakes (‘are’ for ‘our’ and ‘bill’ for ‘build’) and some very strange neologisms (‘engenh’ for ‘engine’ and ‘kenik’ for ‘can it’), but we spotted only a couple of dozen mistakes in the 10,000-word transcription.
The likelihood of occasional mistakes makes having the transcript in a pane at the side of the window where you can play back the audio to hear what was said (and correct the mistakes yourself) extremely useful. But jumping to the right portion of the audio is confusing, because you have to click on the timestamp for the section of the transcript you want, not the text of the transcript — which is what feels natural (especially if you’ve used other transcription services).
You can edit the transcript in the Transcribe pane, but spell checking doesn’t work there and neither does search. So if you want to check the spelling or find a specific phrase in a long conversation, you’ll need to add the transcript to the document.
You can choose whether to leave the transcript attached when you share a document, or to remove it. But the recipient will need to open the document in Word on the web to see the attached transcript. In our tests, opening the document in the desktop version of Word and then saving it stripped off the transcript and resulted in an error when we tried to access it in Word for the web. You can recover the transcript by going back to an earlier version of the document, but the interface guides you to discard the transcription and start a new one instead.
Currently, you can’t add multiple transcripts to the same document, so if you want to use multiple sources for research, you either need to create a new document for each, or discard the first transcript after extracting what you need and insert a new one.
For privacy reasons, Word splits up the transcription between different speakers but doesn’t try to identify who’s speaking; if you type in a name for one of the speakers, that name will be used throughout the transcript. And even if you record a second or third conversation with the same person, Word won’t learn their voice (or yours) and put a name to it; again, that’s for privacy reasons, because the transcription service doesn’t store the audio or the transcription.
“Your audio files are sent to Microsoft, but only to provide you with a service; when the transcription is done, your audio and transcription results are not stored by our service at all. The audio file itself is stored in your own personal OneDrive,” Parish said.
This is slightly less convenient than building up a library of voices and names, but it also fits with the privacy principles for all the Office connected experiences. It would be nice to see options like the ability to point at an Outlook meeting to get names instead of having to type them in by hand.
Disappointingly for organisations that might want to tap their backlog of recorded meetings and events for research, you can’t buy extra transcription minutes for uploads (although Microsoft will consider that as an option if customers ask for it). If you want to transcribe more than 300 minutes of existing recordings per person per month, you can always find a quiet place and play the recordings live into a Word document. But any organisation wanting to do this at scale will find other services more convenient (whether that’s a transcription service like Otter that has unlimited enterprise plans, or building your own system with Azure Cognitive Services and Power Automate).
Some of these limitations are minor rough edges; others are drawbacks we expect to see addressed in time. With the extra stress that the pandemic is putting on everyone, every feature that can improve productivity is welcome, and anyone who needs to extract information from meetings and calls will find this very useful, so it makes sense that Microsoft has released it quickly even if it’s not as polished as we’d like.
But because it’s so useful, we’d like to see a much richer version of the Transcribe feature. And while the ability to insert snippets and quotes straight into the Word document you’re working on is useful, we’d also like to see it come to more Office applications — especially OneNote, now that development has restarted on the desktop app precisely so it can adopt more of the features in the rest of Office.