The W3C Multimodal Interaction Activity group is developing specifications for a new breed of Web application that allows multiple modes of interaction—for instance, speech, handwriting, and keypresses for input, and spoken prompts, audio, and visual displays for output. Specification drafts include:

The specifications also include some explanations and details about how these multiple modes might work together in the future.

In multimodal systems, an event is a representation of some asynchronous occurrence of interest to the system, such as a mouse click, a phone hang-up, a speech recognition result, or an error. Events may carry information about the user interaction, such as the location where the mouse was clicked. Interaction (input and output) between the user and the application can often be conceptualized as a series of dialogs managed by an interaction manager. A dialog is an interaction between the user and the application that involves turn taking. In each turn, the interaction manager (working on behalf of the application) collects input from the user, processes it (using the session context and, possibly, external knowledge sources), computes a response, and updates the presentation for the user.
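The turn-taking cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` and `InteractionManager` classes, their fields, and the event kinds are hypothetical names chosen for this example, not part of any W3C specification.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """An asynchronous occurrence, e.g. a mouse click or a recognition result."""
    kind: str
    data: dict[str, Any] = field(default_factory=dict)


class InteractionManager:
    """Hypothetical manager that runs one dialog turn per incoming event."""

    def __init__(self) -> None:
        self.session: dict[str, Any] = {"turns": 0}  # session context
        self.presentation: list[str] = []            # what the user sees or hears

    def process(self, event: Event) -> str:
        # Interpret the input and compute a response for it.
        if event.kind == "speech":
            return f"You said: {event.data.get('text', '')}"
        if event.kind == "click":
            return f"Clicked at {event.data.get('x')},{event.data.get('y')}"
        return f"Unhandled event: {event.kind}"

    def turn(self, event: Event) -> str:
        self.session["turns"] += 1           # update the session context
        response = self.process(event)       # process input, compute response
        self.presentation.append(response)   # update the presentation
        return response


manager = InteractionManager()
manager.turn(Event("click", {"x": 10, "y": 20}))
manager.turn(Event("speech", {"text": "hello"}))
```

A real interaction manager would also consult external knowledge sources and drive actual output components; here the "presentation" is just a list of strings.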

The range of potential use cases for multimodal interaction is large. The devices involved can be classified by how "thick" the client is:

  • Thin client—A device with little processing power or capabilities that can be used to capture user input (microphone, touch display, stylus, etc.) as well as nonuser input, such as GPS
  • Thick client—A device such as a PDA or notebook
  • Medium client—A device capable of input capture and some degree of interpretation; the processing is distributed in a client/server or a multidevice architecture

You can view several use cases for multimodal interaction in a special W3C Note.

The framework
The purpose of the W3C multimodal interaction framework is to identify and relate markup languages for multimodal interaction systems. The framework identifies the major components for every multimodal system. Each component represents a set of related functions. The framework identifies the markup languages used to describe information required by components and for data flowing among components. The framework will build upon a range of existing W3C markup languages together with the W3C DOM. The DOM defines interfaces whereby programs and scripts can dynamically access and update the content, structure, and style of documents.

Figure A illustrates the basic components of the framework.

Figure A
Framework components

The user enters input into the system and observes and hears information presented by the system. The interaction manager is the logical component that coordinates data and manages execution flow from various input and output modality component interface objects. The interaction manager maintains the interaction state and context of the application and responds to input from component interface objects and changes in the system and environment. The interaction manager then manages these changes and coordinates input and output across component interface objects. The session component provides an interface to the interaction manager to support state management and temporary and persistent sessions for multimodal applications. The environment component enables the interaction manager to find out about and respond to changes in device capabilities, user preferences, and environmental conditions—such as which of the available modes the user wants to use and whether the user muted audio input.
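As a rough sketch of how the environment component and the interaction manager might cooperate, the Python below lets an environment object notify the manager when the user's preferred modes change. All class names, method names, and the `audio_enabled` setting are assumptions made for illustration; the framework itself does not prescribe an API.

```python
from typing import Callable


class Environment:
    """Hypothetical environment component: stores settings, notifies listeners."""

    def __init__(self) -> None:
        self._settings: dict[str, object] = {"audio_enabled": True}
        self._listeners: list[Callable[[str, object], None]] = []

    def subscribe(self, listener: Callable[[str, object], None]) -> None:
        self._listeners.append(listener)

    def set(self, key: str, value: object) -> None:
        self._settings[key] = value
        for listener in self._listeners:
            listener(key, value)


class InteractionManager:
    """Routes output to whichever output modalities are currently enabled."""

    def __init__(self, environment: Environment) -> None:
        # Each list stands in for an output modality component.
        self.outputs: dict[str, list[str]] = {"visual": [], "audio": []}
        self.enabled: set[str] = {"visual", "audio"}
        environment.subscribe(self.on_environment_change)

    def on_environment_change(self, key: str, value: object) -> None:
        # React to a change in user preferences or device capabilities.
        if key == "audio_enabled":
            if value:
                self.enabled.add("audio")
            else:
                self.enabled.discard("audio")

    def present(self, message: str) -> None:
        for modality in self.enabled:
            self.outputs[modality].append(message)


env = Environment()
manager = InteractionManager(env)
manager.present("Welcome")
env.set("audio_enabled", False)   # the user turns audio off
manager.present("Next step")      # now delivered visually only
```

The design choice here is a simple publish/subscribe link, so the manager never polls the environment; other arrangements would fit the framework's description equally well.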

Markup languages
Now, let’s take a look at two XML-based markup languages specified for use within the multimodal interaction framework: InkML and EMMA.

As more electronic devices with pen interfaces become available for entering and manipulating information, applications need to make more effective use of this method of input. Handwriting is an input modality that is familiar to most users, so they will tend to use it for input and control when it is available.

Hardware and software vendors have typically stored and represented digital ink using proprietary or restrictive formats. The lack of a public and comprehensive digital ink format has severely limited the capture, transmission, processing, and presentation of digital ink across heterogeneous devices developed by multiple vendors. In response to this need, InkML provides a simple and platform-neutral data format to promote the interchange of digital ink between software applications.

With a nonproprietary ink standard in place, a number of applications, old and new, can be extended so that the pen serves as a convenient and natural form of input. The current InkML specification defines a set of primitive elements sufficient for all basic ink applications. Few semantics are attached to these elements.

All content of an InkML document is contained within a single <ink> element. The fundamental data element in an InkML file is the <trace>. A trace represents a sequence of contiguous ink points—e.g., the X and Y coordinates of the pen’s position. A sequence of traces accumulates to meaningful units, such as characters and words. The <traceFormat> element defines the format of data within a trace.
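To make the trace structure concrete, here is a minimal Python sketch that reads traces out of a small ink document. It assumes the simplest case: no XML namespace, no `<traceFormat>` override, and each point written as an X value and a Y value separated by whitespace, with points separated by commas. The coordinates are invented for illustration, and the `parse_traces` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ink data; the coordinates are invented for illustration.
INK_DOCUMENT = """
<ink>
  <trace>10 0, 9 14, 8 28, 7 42</trace>
  <trace>25 0, 25 42</trace>
</ink>
""".strip()


def parse_traces(xml_text: str) -> list[list[tuple[float, float]]]:
    """Return one list of (x, y) points per <trace> element."""
    root = ET.fromstring(xml_text)
    traces = []
    for trace in root.iter("trace"):
        points = []
        # Points are comma-separated; coordinates are whitespace-separated.
        for point in (trace.text or "").split(","):
            values = point.split()
            if len(values) >= 2:
                points.append((float(values[0]), float(values[1])))
        traces.append(points)
    return traces


traces = parse_traces(INK_DOCUMENT)
print(len(traces), "traces;", "first trace has", len(traces[0]), "points")
```

A production parser would also need to honor the declared trace format and any additional channels, such as pressure, which this sketch ignores.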

Ink traces can have certain attributes, such as color and width, which are captured in the <brush> element. Traces that share the same characteristics, such as being written with the same brush, can be grouped together with the <traceGroup> element. In its simplest form, an InkML file or message looks like Listing A, which is the trace for the user input “Hello,” shown in Figure B.

Figure B
Trace of “Hello”

InkML is a rich yet simple language. For a more detailed look, see the W3C current draft page.

The EMMA markup language is intended for use by systems that provide semantic interpretations for a variety of inputs, including speech, natural language text, GUI, and ink input. The language is focused on annotating the interpretation information of single and composed inputs, as opposed to (possibly identical) information that might have been collected over the course of a dialog. It provides a set of elements and attributes that are focused on accurately representing annotations on the input interpretations.

An EMMA document typically contains three parts: instance data, data model, and metadata. Instance data is an application-specific markup corresponding to input information that is meaningful to the consumer of an EMMA document. Instances are built by input processors at runtime. Given that utterances may be ambiguous with respect to input values, an EMMA document may hold more than one instance. The data model imposes constraints on the structure and content of an instance. Metadata represents annotations associated with the data contained in the instance. Annotation values are added by input processors at runtime.
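The three parts can be modeled loosely in Python to show how they relate. This sketch says nothing about EMMA's actual syntax: the `Interpretation` class, the `conforms` and `best` helpers, the toy data model (a set of required field names), and the `confidence` annotation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Interpretation:
    """One candidate reading of the user's input."""
    instance: dict[str, str]                                # instance data
    metadata: dict[str, Any] = field(default_factory=dict)  # annotations


def conforms(instance: dict[str, str], data_model: set[str]) -> bool:
    """Toy data model: the exact set of field names an instance must provide."""
    return set(instance) == data_model


def best(interpretations: list[Interpretation]) -> Interpretation:
    """Pick the interpretation whose 'confidence' annotation is highest."""
    return max(interpretations, key=lambda i: float(i.metadata.get("confidence", 0.0)))


# An ambiguous utterance yields more than one instance.
data_model = {"origin", "destination"}
candidates = [
    Interpretation({"origin": "Denver", "destination": "Austin"}, {"confidence": 0.68}),
    Interpretation({"origin": "Denver", "destination": "Boston"}, {"confidence": 0.75}),
]

valid = [c for c in candidates if conforms(c.instance, data_model)]
print(best(valid).instance["destination"])
```

In this arrangement the input processor fills in both the instances and their annotations at runtime, while the data model is fixed by the application in advance.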

The Multimodal Interaction Working Group is currently considering the role of the Resource Description Framework (RDF) in EMMA syntax and processing. It appears useful for EMMA to adopt the spirit of the RDF conceptual triples model and, thereby, enable RDF processing in RDF environments.

However, there is concern that requiring all EMMA environments to support the RDF syntax and its related constructs would introduce unnecessary processing overhead. An inline syntax would remove this requirement, provide a more compact representation, and enable queries on annotations using XPath, just as for queries on instance data. For these reasons, there are currently three syntax proposals: an inline XML syntax, an RDF/XML syntax, and a mixed inline+RDF syntax. You can see the detailed description on the W3C RDF page.
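The XPath argument for an inline syntax can be illustrated with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The markup below is a hypothetical inline form in which annotations are plain attributes on the element wrapping each instance; the element and attribute names are invented for this example and are not taken from any EMMA draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline syntax: annotations are attributes on the element
# that wraps each instance. All names here are invented for illustration.
RESULT = """
<result>
  <interpretation id="int1" confidence="0.75" mode="speech">
    <destination>Boston</destination>
  </interpretation>
  <interpretation id="int2" confidence="0.68" mode="speech">
    <destination>Austin</destination>
  </interpretation>
</result>
""".strip()

root = ET.fromstring(RESULT)

# Query instance data with a path expression.
destinations = [d.text for d in root.findall("./interpretation/destination")]

# Query annotations with the same mechanism, using attribute predicates.
speech_ids = [i.get("id") for i in root.findall("./interpretation[@mode='speech']")]
first = root.find("./interpretation[@id='int1']")

print(destinations, speech_ids, first.get("confidence"))
```

Because annotations and instance data live in the same tree, one query mechanism covers both, which is the compactness benefit the inline proposal is after.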

The general purpose of EMMA is to represent information automatically extracted from a user’s input by an interpretation component. Input is to be taken in the general sense of a meaningful user input in any modality supported by the platform. In the architecture shown in Figure A, EMMA conveys content between user input modality components and an interaction manager.

Components that generate EMMA markup include speech recognizers, handwriting recognizers, natural language understanding engines, and components that process Dual Tone Multi-Frequency (DTMF) signals, keyboard input, and input from pointing devices such as a mouse. Components that use EMMA include the interaction manager and multimodal integration components.

EMMA is still under development, but it has tremendous potential for integrating different devices on the Web. More than that, it can help make the Web accessible in the full sense of the word. For future Web applications, there should be no difference between interacting with a user via phone DTMF tones, PDA ink pens, or voice browsers for users with disabilities.