Editor’s note: This article originally appeared in TechRepublic’s Web Development Zone TechMail. Subscribe, and you’ll receive information on Web development-related projects and trends.

Are you ready for the next shift in computing? With faster processors and cheap RAM, voice interfaces are becoming a practical reality. For example, Microsoft has voice-enabled Office XP, the current version of Office. Another promising technology is the convergence of the Web and telephony through VoiceXML.

VoiceXML is a new flavor of XML that defines structures for presenting output to the user over the telephone, either by playing prerecorded voice prompts or by generating speech from text. The user’s response is captured either as DTMF (touch-tone) key presses or through speech recognition.

The World Wide Web Consortium (W3C) is defining the standards for VoiceXML through the working drafts of its “Voice Browser” activity. The W3C aims to expand access to the Web by letting people interact with Web sites through spoken commands. This technology lets any telephone reach Web-based services and is especially helpful to people with disabilities. It can also improve interaction with display-based Web content when a mouse and keyboard are unavailable or inconvenient.

Developers writing VoiceXML set up a <field> element so a phone application can “listen” for caller commands. Just as text boxes on HTML pages receive the user’s keyboard input, fields in VoiceXML pages receive the caller’s voice or DTMF input. Nested inside the <field> tag are child tags that control the program flow. The following are examples of <field> child tags:

  • <grammar>: The grammar specifies the set of caller inputs that a field should listen for. A VoiceXML field cannot take a best guess at arbitrary speech; it must know ahead of time every input it can expect, although the grammar can be very large.
  • <prompt>: The prompt asks a caller for input, for example, “Say the name of a restaurant” or “Say or dial a ten-digit phone number.”
  • <nomatch>: This tag becomes active whenever the caller provides an input that is not found in the field’s grammar.
  • <noinput>: This tag becomes active whenever the caller fails to provide any input in response to a field prompt.
  • <filled>: When the caller provides a recognized spoken or DTMF command, the filled section becomes active. This tag is used primarily to decide what the application does next in response to the caller’s command.
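
To make the tags above concrete, here is a minimal sketch in Python that embeds a small VoiceXML page and parses it with the standard library. The markup assumes VoiceXML 2.0 syntax with an inline XML grammar; the form id, the field name "drink", and the three menu words are invented for illustration. The handler_for function is only a toy model of which child tag becomes active for a given caller input. It is not a VoiceXML interpreter and performs no speech recognition.

```python
import xml.etree.ElementTree as ET

# Namespace used by VoiceXML 2.0 documents (an assumption of this sketch).
VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical page: one <field> with the five child tags described above.
VXML_DOC = """\
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <grammar type="application/srgs+xml" root="drink" version="1.0"
               mode="voice" xml:lang="en-US">
        <rule id="drink">
          <one-of>
            <item>coffee</item>
            <item>tea</item>
            <item>milk</item>
          </one-of>
        </rule>
      </grammar>
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>I did not understand that. <reprompt/></nomatch>
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def field_children(doc):
    """Return the names of the first <field>'s child tags, in document order."""
    root = ET.fromstring(doc)
    field = root.find(f".//{{{VXML_NS}}}field")
    return [local_name(child.tag) for child in field]


def grammar_items(doc):
    """Return the words listed as <item> entries in the inline grammar."""
    root = ET.fromstring(doc)
    return [item.text.strip() for item in root.iter(f"{{{VXML_NS}}}item")]


def handler_for(caller_input, accepted):
    """Toy model: name the child tag that becomes active for this input.

    None stands for silence; anything outside the grammar is a no-match.
    """
    if caller_input is None:
        return "noinput"
    if caller_input.strip().lower() not in accepted:
        return "nomatch"
    return "filled"


if __name__ == "__main__":
    words = grammar_items(VXML_DOC)
    print(field_children(VXML_DOC))
    for spoken in (None, "juice", "tea"):
        print(repr(spoken), "->", handler_for(spoken, words))
```

Running the script lists the field’s child tags and shows that silence, an out-of-grammar word, and an in-grammar word select <noinput>, <nomatch>, and <filled> respectively. A real voice platform would make the same decision from recognizer results rather than from string comparison.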

Are you including speech recognition in your Web design?

Have your clients begun requesting speech recognition features in their Web projects? What are the limitations in current speech recognition technology? Post your comments below.