In recent articles about CCXML
and SCXML we talked about voice applications and their logic,
and often referred to the VoiceXML language. It is still the
backbone of voice applications and dialog systems, and in this article I’d like
to talk about it and its future. VoiceXML 2.0 was
developed by the W3C
Voice Browser Working Group. In 2000, the VoiceXML Forum (formed by AT&T, IBM, Lucent, and
Motorola) released VoiceXML 1.0 to the public.
Shortly thereafter, VoiceXML 1.0 was submitted to the
W3C as the basis for the creation of a new international standard. VoiceXML
2.0 is the result of this work, based on input from W3C Member companies.

What is it for?

Originally, it was designed for
creating audio dialogs that feature synthesized speech, digitized audio,
recognition of spoken and DTMF key input, recording of spoken input, telephony,
and mixed initiative conversations. Its major goal is to bring the advantages
of Web-based development and content delivery to interactive voice response
(IVR) applications.

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs:
forms and menus. Forms present information and gather input; menus offer
choices of what to do next. Listing A
shows the simplest possible application, with one main form and no successor dialog (hello-world.vxml.txt):

Listing A

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

VoiceXML’s main goal is to bring the full power of Web development
and content delivery to voice response applications, and to free the authors of
such applications from low-level programming and resource management. It
enables integration of voice services with data services using the familiar
client-server paradigm.

A voice service is viewed as a
sequence of interaction dialogs between a user and an implementation platform.
The dialogs are provided by document servers, which may be external to the
implementation platform. Document servers maintain overall service logic,
perform database and legacy system operations, and produce dialogs.

A VoiceXML
document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation
and is collected into requests submitted to a document server. The document
server replies with another VoiceXML document to
continue the user’s session with other dialogs. Summarized, VoiceXML
is a markup language that:

  • Shields
    application authors from low-level, platform-specific details;
  • Separates
    user interaction code (in VoiceXML) from service logic (e.g., CGI scripts);
  • Minimizes
    client/server interactions by specifying multiple interactions per document;
  • Promotes
    service portability across implementation platforms. VoiceXML is a common
    language for content providers, tool providers, and platform providers;
  • Is
    easy to use for simple interactions, and yet provides language features to
    support complex dialogs.

Requirements for hardware and software

According to the spec, the “http” URI scheme must be supported for
document acquisition. In some cases, the document request is generated by the
interpretation of a VoiceXML document, while other
requests are generated by the interpreter context in response to events outside
the scope of the language, for example an incoming phone call. An
implementation platform must support audio output using audio files and
text-to-speech (TTS). An implementation platform is also required to detect and
report character and/or spoken input simultaneously and to control input
detection interval duration with a timer whose length is specified by a VoiceXML document.

The VoiceXML
application platform must report characters (for example, DTMF) entered by a
user, and must support the XML form of DTMF grammars described in the W3C Speech Recognition
Grammar Specification
(SRGS). It also must be able to record audio received
from the user.
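As a sketch of the recording requirement, the <record> element captures audio spoken by the user. The form id, field name, prompt wording, and servlet URI below are invented for illustration:

```xml
<form id="voicemail">
  <!-- Record up to 20 seconds of audio; a beep signals the start,
       and any DTMF key press terminates the recording -->
  <record name="msg" beep="true" maxtime="20s" dtmfterm="true">
    <prompt>Please leave a message after the beep.</prompt>
  </record>
  <block>
    <!-- Recorded audio must be posted as multipart form data -->
    <submit next="/servlet/save_message" namelist="msg"
            method="post" enctype="multipart/form-data"/>
  </block>
</form>
```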

So how does it work?

We have already seen this basic concept
in the SCXML
language. A VoiceXML
document (or a set of related documents called an application) forms a
conversational finite state machine. The user is always in one conversational
state, or dialog, at a time. Each dialog determines the next dialog transition.
Transitions are specified using URIs, which define
the next document and dialog to use. Execution is terminated when a dialog does
not specify a successor, or if it has an element that explicitly exits the conversation.

There are two kinds of dialogs: forms
and menus. Forms define an interaction that collects values for a set of
form item variables. Each field may specify a grammar that defines the
allowable inputs for that field. If a form-level grammar is present, it can be
used to fill several fields from one utterance. A menu presents the user with a
choice of options and then transitions to another dialog based on that choice.
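As a sketch of the second dialog type, a menu lists its choices and jumps to the document named by the matched choice. The document URIs here are hypothetical:

```xml
<menu>
  <!-- <enumerate/> speaks the text of each choice in order -->
  <prompt>Say one of: <enumerate/></prompt>
  <choice next="http://www.example.com/weather.vxml">
    Weather
  </choice>
  <choice next="http://www.example.com/news.vxml">
    News
  </choice>
  <!-- Reprompt if the user stays silent -->
  <noinput>Please say one of <enumerate/></noinput>
</menu>
```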

A subdialog
is like a function call, in that it provides a mechanism for invoking a new
interaction, and returning to the original form. Variable instances, grammars,
and state information are saved and are available upon returning to the calling
document. Subdialogs can be used, for example, to
create a confirmation sequence that may require a database query; to create a
set of components that may be shared among documents in a single application;
or to create a reusable library of dialogs shared among many applications.
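A confirmation subdialog could be sketched as follows; the element usage follows the spec, but the form ids, field names, and servlet URI are invented:

```xml
<!-- Calling form: invokes the "checker" form and submits its result -->
<form id="main">
  <subdialog name="result" src="#checker">
    <filled>
      <!-- The subdialog's returned variable is visible as result.confirmed -->
      <submit next="/servlet/process" namelist="result.confirmed"/>
    </filled>
  </subdialog>
</form>

<!-- Subdialog form: returns control and data with <return> -->
<form id="checker">
  <field name="confirmed" type="boolean">
    <prompt>Is that correct?</prompt>
    <filled>
      <return namelist="confirmed"/>
    </filled>
  </field>
</form>
```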

A session begins when the user
starts to interact with a VoiceXML interpreter
context, continues as documents are loaded and processed, and ends when
requested by the user, a document, or the interpreter context.

An application is a set of
documents sharing the same application root document. Whenever the user
interacts with a document in an application, its application root document is
also loaded. The application root document remains loaded while the user is
transitioning between other documents in the same application, and it is
unloaded when the user transitions to a document that is not in the
application. Figure A (fig1.gif) shows the transition of documents (D) in an application
that share a common application root document (root).
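A sketch of the idea: a root document can declare variables that every leaf document naming it can read. The file names and the variable below are invented:

```xml
<!-- app-root.vxml (hypothetical): state shared across the application -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <var name="username" expr="'guest'"/>
</vxml>

<!-- leaf.vxml (hypothetical): names its root via the application attribute,
     so app-root.vxml is loaded alongside it -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="app-root.vxml">
  <form>
    <block>
      Hello <value expr="application.username"/>
    </block>
  </form>
</vxml>
```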

Figure A

Document transitions

Each dialog has one or more
speech and/or DTMF grammars associated with it. In machine directed
applications, each dialog’s grammars are active only when the user is in that
dialog. In mixed initiative applications, where the user and the machine
alternate in determining what to do next, some of the dialogs are flagged to
make their grammars active (i.e., listened for) even when the user is in
another dialog in the same document, or in another loaded document in the same application.

VoiceXML provides a form-filling mechanism for handling
“normal” user input. In addition, VoiceXML
defines a mechanism for handling events not covered by the form mechanism.
Events are thrown by the platform under a variety of circumstances, such as
when the user does not respond, doesn’t respond intelligibly, requests help,
etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or
their syntactic shorthand.
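For instance, a field can handle repeated silence with counted catches; <noinput> and <nomatch> are the syntactic shorthands for the corresponding <catch> elements. The prompt wording below is invented:

```xml
<field name="city">
  <prompt>What city?</prompt>
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <!-- First timeout: gentle reprompt -->
  <noinput count="1">
    <prompt>Sorry, I did not hear you. What city?</prompt>
  </noinput>
  <!-- Second timeout: give more guidance -->
  <noinput count="2">
    <prompt>Please say the name of a city, for example Tbilisi.</prompt>
  </noinput>
  <!-- Unintelligible input -->
  <nomatch>
    <prompt>I did not understand. Please say a city name.</prompt>
  </nomatch>
</field>
```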

Example application

Let’s look at an application
with the simplest and most common type of form, in which the form items are
executed exactly once, in sequential order, to implement a computer-directed
interaction. This will be a weather information service (Listing B) that uses such a form and provides weather information
for a specified country and city (weather.vxml.txt).

Listing B

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
 <form id="weather_info">
  <block>Welcome to the weather information service.</block>
  <field name="country">
   <prompt>What country?</prompt>
   <grammar src="country.grxml" type="application/srgs+xml"/>
   <catch event="help">
     Please speak the country for which you want the weather.
   </catch>
  </field>
  <field name="city">
   <prompt>What city?</prompt>
   <grammar src="city.grxml" type="application/srgs+xml"/>
   <catch event="help">
     Please speak the city for which you want the weather.
   </catch>
  </field>
  <block>
   <submit next="/servlet/weather" namelist="city country"/>
  </block>
 </form>
</vxml>

This dialog might proceed as follows:

C (computer): Welcome to the weather information service. What country?
H (human): Help
C: Please speak the country for which you want the weather.
H: Georgia
C: What city?
H: Macon
C: I did not understand what you said. What city?
H: Tbilisi
C: The conditions in Tbilisi Georgia are sunny and clear at 11 AM …
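The /servlet/weather endpoint that receives the submitted fields is expected to reply with another VoiceXML document. A hypothetical response (its wording invented to match the transcript) might simply speak the result and end the session:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      The conditions in Tbilisi Georgia are sunny and clear at 11 AM.
      <!-- No successor dialog is specified, and <exit/> ends the session -->
      <exit/>
    </block>
  </form>
</vxml>
```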

User input

The <grammar> element is
used to provide either a speech grammar or a DTMF grammar. A speech grammar
specifies a set of utterances that a user may speak to perform an action or
supply information, and for a matching utterance, returns a corresponding
semantic interpretation. The following (Listing C) is an example of an inline
grammar defined by the XML form of the W3C Speech Recognition Grammar
Specification (SRGS) (grammar1.xml.txt).

Listing C

<grammar mode="voice" xml:lang="en-US" version="1.0" root="command">
  <!-- Command is an action on an object -->
  <!-- e.g. "open a window" -->
  <rule id="command" scope="public">
    <ruleref uri="#action"/> <ruleref uri="#object"/>
  </rule>

  <rule id="action">
    <one-of>
      <item> open </item>
      <item> close </item>
      <item> delete </item>
      <item> move </item>
    </one-of>
  </rule>

  <rule id="object">
    <item repeat="0-1">
      <one-of> <item> the </item> <item> a </item> </one-of>
    </item>
    <one-of>
      <item> window </item>
      <item> file </item>
      <item> menu </item>
    </one-of>
  </rule>
</grammar>

A DTMF grammar specifies a set of key presses that a user may use to perform an action or
supply information, and for matching DTMF input, returns a corresponding
semantic interpretation. All VoiceXML platforms are
required to support the DTMF grammar XML format. The following (Listing D) is an example of a simple
inline XML DTMF grammar that accepts as input either “1 2 3” or
“#” (grammar2.xml.txt).

Listing D

<grammar mode="dtmf" version="1.0" root="root">
  <rule id="root" scope="public">
    <one-of>
      <item> 1 2 3 </item>
      <item> # </item>
    </one-of>
  </rule>
</grammar>

System output

The <prompt> element
controls the output of synthesized speech and prerecorded audio. Conceptually,
prompts are instantaneously queued for play, so interpretation proceeds until
the user needs to provide an input. At this point, the prompts are played, and
the system waits for user input. Once the input is received from the speech
recognition subsystem (or the DTMF recognizer), interpretation proceeds.

The content of the <prompt> element is modeled
on the W3C Speech
Synthesis Markup Language
(SSML). A good introduction to SSML is
also available.
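As an illustrative sketch, a prompt can mix prerecorded audio with SSML markup; the audio file name and wording here are invented:

```xml
<!-- bargein="true" lets the user interrupt the prompt with speech or DTMF -->
<prompt bargein="true">
  <!-- Prerecorded audio, with TTS fallback text if the file is unavailable -->
  <audio src="welcome.wav">Welcome to the weather information service.</audio>
  <break time="300ms"/>
  <emphasis>What country?</emphasis>
</prompt>
```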

Beyond average

Certainly, this article is just
an introduction and cannot cover all the details and features of VoiceXML. VoiceXML is a W3C-endorsed markup language that allows developers to write
advanced telephony applications with a simplicity undreamed of until recent
years. VoiceXML allows the average Web developer to
write telephony applications with the ease and simplicity of writing an average
HTML Web page. As VoiceXML is a tag-based markup language, its structure is very
similar to HTML in many ways; but instead of being a primarily visual medium, VoiceXML is an auditory medium that lets the end user
navigate through his ‘telephony page’ by using voice commands, rather than by
clicking a button on a Web page. With VoiceXML
you do not need to invest in expensive hardware and software for a telephony
application, or in a dedicated location to store all your telephony equipment.
Many voice application hosts and providers such as Skype are ready to provide you
with a free voice application, or one with enhanced functionality for a
small extra payment.