Keep talkin': .NET Speech SDK 1.0 beta

The beta release of .NET's Speech SDK allows you to build speech-enabled Web applications. Learn what's good and what's bad about this new slice of Web development.

Microsoft’s .NET Speech SDK 1.0 beta is designed to allow Web developers to add speech recognition and synthesis to their ASP.NET applications. Speech-enabling a Web application seemed like technology for technology’s sake until I tried to imagine using a computer if I were blind. Another reason to speech-enable a Web application is as near as most people’s hip: telephony, Web-enabled cell phones, and personal digital assistants.

Getting started
You can download the .NET Speech SDK, which consists of:
  • ASP.NET speech controls to add speech input and output to ASP.NET applications.
  • Visual Studio .NET add-ins for speech-enabling ASP.NET Web applications.
  • Tutorials with sample applications.
  • An Internet Explorer speech add-in that handles Speech Application Language Tags (SALT) to allow running and testing speech-enabled applications.

The system requirements for the .NET Speech SDK are a machine running Windows 2000/XP SP3 with a 450-MHz processor or better and about 200 MB of disk space. You also need IIS, .NET Framework SP2 (installed after IIS), Visual Studio .NET, and IE6. In addition, you will need an audio input device, such as a headset or desktop microphone, and an audio output device: a sound card with a headset or speakers.

One major caveat: The Microsoft .NET Speech SDK is a beta product. This means that it is subject to all of the little flukes that beta products are prone to. It also means that there are no guarantees that the final release version won’t contain major changes from the current iteration.

Because the Microsoft .NET Speech SDK is intended for developing Web-based applications rather than PC-based applications, there are some minor differences from other speech recognition software packages. The most notable of these differences is that the application is not trained to handle the speaker's speech pattern. Instead, the speaker must tailor his or her speech pattern to what the machine can handle. Although this didn't present a problem with my New Jersey accent, it might with other accents.

With mobile devices, the spoken audio is compressed and streamed to the server, where the actual speech recognition takes place. The results are sent back to the device as XML, and the appropriate action occurs. In contrast, with robust clients like Internet Explorer on a PC, the majority of the processing takes place on the client, which reduces the bandwidth used by speech-enabled applications.

Prompts and grammars
For speech recognition, Microsoft .NET Speech SDK uses what are referred to as prompts and grammars. Prompts, with speech-enabled applications as well as with traditional applications, are a way to communicate with the user when input is expected. The only difference is that with speech-enabled applications, prompts can be audio.

Prompting for input, with the exception of telephony, can be accomplished through text, speech synthesis, or recorded speech. The first has been part of ASP.NET since day one, and the SDK provides the ability to use the latter two. One of the problems I have encountered with speech synthesis is that the quality hasn't really improved. The technology for speech synthesis is 40 years old. You'd think, after all that time, synthesized speech would sound like the HAL 9000. Instead, it sounds like a primitive voice mail system.
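To give a flavor of how a prompt is expressed, here is a minimal sketch of a SALT prompt of the kind the IE speech add-in interprets. The element names follow the published SALT draft; the id value and prompt text are invented for illustration, and the exact markup the SDK's controls emit may differ.

```xml
<!-- A hypothetical SALT prompt: the browser add-in speaks this text
     (via synthesis) when the prompt is started. -->
<salt:prompt id="askQuantity">
  Please say the quantity you would like to order.
</salt:prompt>
```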

Grammars are a list of words or phrases that an application will accept as input. Although this limits the number of words that an application can accept, it greatly improves the recognition by limiting the number of possible word choices. In addition, a single application can use multiple grammars.
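As a rough sketch of what such a list looks like on disk, here is a hypothetical grammar in W3C SRGS-style XML, the format the SDK's Grammar Editor works with. The rule name and word choices are invented for illustration; the SDK's exact schema may vary.

```xml
<!-- A hypothetical "Size" grammar: the recognizer will accept
     exactly one of the listed alternatives as input. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="Size">
  <rule id="Size">
    <one-of>
      <item>small</item>
      <item>medium</item>
      <item>large</item>
    </one-of>
  </rule>
</grammar>
```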

Consider order entry applications. The grammars required to sell clothes would differ from the grammars required to sell books. A clothes application would require separate grammars for SKU, size, color, and quantity, while a book application would require grammars for genre, title, and quantity. Because grammars are written in XML, any XML or text editor can be used to write them. However, the Microsoft .NET Speech SDK provides a Grammar Editor (Figure A), which provides a graphical way to construct grammars.

Figure A

Grammars may seem strange at first, but they offer flexibility that Microsoft never mentions in the scant documentation provided with the SDK. Consider the grammar for the numbers from one to nine shown in Listing A.

Using this grammar provides the ability to recognize the English numbers from one to nine. By replacing this grammar with the Spanish grammar in Listing B, Spanish speakers can now use the application.

While a single grammar combining English and Spanish could be used, it would probably be best to limit the size of each grammar to avoid recognition errors.
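For illustration, a combined bilingual grammar might simply list both languages' words as alternatives in one rule, as in this invented sketch; as noted above, keeping grammars small tends to give better recognition.

```xml
<!-- A hypothetical grammar accepting a digit spoken in either
     English or Spanish; doubling the alternatives also doubles
     the chances of a misrecognition. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" root="Digit">
  <rule id="Digit">
    <one-of>
      <item>one</item>
      <item>uno</item>
      <item>two</item>
      <item>dos</item>
      <item>three</item>
      <item>tres</item>
    </one-of>
  </rule>
</grammar>
```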

When trying the Microsoft .NET Speech SDK 1.0 beta release, I had to keep reminding myself that it is beta. With this in mind, I was willing to cut Microsoft a little slack. The surprising thing was that with the exception of speech synthesis, Microsoft really didn’t need much slack. I was able to hang a Web application only twice while doing speech recognition, and it never hung during speech synthesis.

While I wouldn’t use a beta release for production applications, it does appear to perform admirably in the role for which it was intended. I am looking forward to future releases of the Microsoft .NET Speech SDK.
