Interaction in 3D Graphics

Vol.32 No.4 November 1998

Put That Where? Voice and Gesture at the Graphics Interface

Mark Billinghurst
Human Interface Technology Laboratory


A person stands in front of a large projection screen on which is shown a checked floor. They say, “Make a table,” and a wooden table appears in the middle of the floor.

“On the table, place a vase,” they say, gesturing with a fist relative to the palm of their other hand to show the relative location of the vase on the table. A vase appears at the correct location.

“Next to the table place a chair.” A chair appears to the right of the table.

“Rotate it like this,” they say while rotating their hand, and the chair turns towards the table.

“View the scene from this direction,” they say while pointing one hand towards the palm of the other. The scene rotates to match their hand orientation.

In a matter of moments, a simple scene has been created using natural speech and gesture. The interface of the future? Not at all; Koons, Thorisson and Bolt demonstrated this work in 1992 [23]. Although research such as this has shown the value of combining speech and gesture at the interface, most computer graphics are still being developed with tools no more intuitive than a mouse and keyboard. This need not be the case. Current speech and gesture technologies make multimodal interfaces with combined voice and gesture input easily achievable. There are several commercial versions of continuous dictation software currently available, while tablets and pens are widely supported in graphics applications. However, having this capability doesn’t mean that voice and gesture should be added to every modeling package in a haphazard manner. There are numerous issues that must be addressed in order to develop an intuitive interface that uses the strengths of both input modalities.

In this article we describe motivations for adding voice and gesture to graphical applications, review previous work showing different ways these modalities may be used and outline some general interface guidelines. Finally, we give an overview of promising areas for future research. Our motivation for writing this is to spur developers to build compelling interfaces that will make speech and gesture as common on the desktop as the keyboard and mouse.

Why Multimodal Interfaces?

There are several strong reasons for creating interfaces that allow combined voice and gestural input. The first is purely practical: ease of expression. As Martin points out [14], typical computer interaction modalities are characterized by an ease vs. expressiveness trade-off. Ease corresponds to the efficiency with which commands can be remembered, and expressiveness to the size of the command vocabulary. Common interaction devices range from the mouse, which maximizes ease, to the keyboard, which maximizes expressiveness. Multimodal input overcomes this trade-off; speech and gestural commands are easy to execute while retaining a large command vocabulary.

Voice and gesture complement each other and, when used together, create an interface more powerful than either modality alone. Cohen [4] summarizes this complementary relationship and shows how natural language interaction is suited for descriptive techniques, while gestural interaction is ideal for direct manipulation of objects. The strengths of each modality match the weaknesses of the other. For example, unlike gestural input, voice is not tied to a spatial metaphor, so it can be used to interact with objects regardless of their degree of visual exposure. This is particularly valuable in graphical environments where objects may be hidden or occluded. Allowing all types of input maximizes the usefulness of an interface by broadening the range of tasks that can be done in an intuitive manner.

For many types of applications, users prefer using combined voice and gestural communication to either modality alone. Hauptmann and McAvinney [6] used a simulated speech and gesture recognizer in a typical graphics task to evaluate user preference. Users overwhelmingly preferred combined voice and gestural recognition due to the greater expressiveness possible. When combined input was possible, subjects used speech and gesture together 71 percent of the time as opposed to voice only (13 percent) or gesture only (16 percent). The more spatial the task, the more users prefer multimodal input. In a verbal task 56 percent of the users preferred combined pen/voice input over either modality alone, as did 89 percent in the numerical task and 100 percent of the users in a spatial map task [17]. This suggests that for tasks common in graphics applications, users will prefer multimodal input to voice or gesture alone.

Combining speech and gesture improves recognition accuracy and reduces the length of speech input, producing faster task completion times compared to speech-only interfaces. Using a speech and pen based interface, Oviatt [17] found that multimodal input produced a 36 percent reduction in task errors and 23 percent fewer spoken words, resulting in 10 percent faster completion times compared to speech-only input. Bolt [2] also showed how either modality could be used to resolve ambiguities in the other. If the user says “Create a blue box there” but the machine fails to recognize “Create,” the string “* a blue box there” can still be resolved by considering where the user is pointing. If there is no box, then the likely word is “Create.”
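
Bolt's example can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the original system: the function name `guess_missing_verb`, the grid-cell locations and the `scene` dictionary are hypothetical names introduced here.

```python
def guess_missing_verb(words, pointed_at, scene):
    """Guess an unrecognized verb from the pointing context.

    words:      recognized words, with None where recognition failed
    pointed_at: location the user is pointing to (a hypothetical grid cell)
    scene:      maps a location to the list of object names found there
    """
    if words[0] is not None:
        return words[0]            # the verb was recognized; nothing to repair
    # Assumption: the noun is the word just before "there".
    noun = words[words.index("there") - 1]
    if noun not in scene.get(pointed_at, []):
        return "create"            # no such object there, so "create" is likely
    return None                    # an object exists; other evidence is needed


heard = [None, "a", "blue", "box", "there"]
print(guess_missing_verb(heard, (3, 4), scene={}))  # create
```

When a box is already present at the pointed-at location the sketch declines to guess, since the source only describes the case where no box is found.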

There are also psychological reasons for integrating speech and gesture recognition into graphics applications. A person’s ability to perform multiple tasks is affected by whether these tasks use the same or different sensory modes, for example visuo/spatial or verbal modes [24]. In multimodal interfaces users can perform visuo/spatial tasks at the same time as giving verbal commands with little cognitive interference. Martin [14] finds that people using a CAD program with the addition of speech input were more productive than those using a traditional interface. They remained visually focused on the screen while using speech commands, improving performance by 108 percent.

Figure 1: DreamSpace: an unencumbered interface.

Lessons from Previous Interfaces

The seminal multimodal graphical interface was that demonstrated in the “Put-That-There” work of the Architecture Machine Group [2]. The interface consisted of a large room, one wall of which was a back projection panel. Users sat in a chair in the center of the room, wearing magnetic position sensing devices on their wrists to measure hand position. Users could use speech, gesture, or a combination of the two to add, delete and move graphical objects shown on the wall projection panel. Although nearly 20 years old, a video of their work is still an impressive example of an effective multimodal interface. Part of this is because they discovered that by integrating speech and gesture recognition with contextual understanding, neither had to be perfect provided they converged on the user’s intended meaning. In this case the computer responds to users’ commands by using speech and gesture recognition and taking the current context into account. Figure 1 shows a user interacting with the modern equivalent of “Put-That-There,” the DreamSpace interface of Lucente et al. [12], which uses machine vision to free the user from encumbering tracking devices.

Subsequent multimodal interfaces have mainly used combined voice and mouse input in command and control type applications, but have also reinforced the importance of using contextual knowledge to aid imperfect recognition. For example, the Boeing “Talk and Draw” project [20] developed a multimodal AWACS workstation for directing military operations. Vocal commands were combined with mouse selection of graphical objects. The graphical commands were processed first to create an interaction context, and the speech input was then parsed within that context. If the speech recognizer produced incomplete recognition, knowledge of the graphical context could be used to prompt the user for the relevant input.

CUBRICON is a similar interface developed by Neal and Shapiro for Air Force Command and Control [15]. However in this case the relationship between the speech, text and gestural input is formalized in modified ATN grammars. Like traditional speech recognition grammars, they are used to identify user input, but in this case each noun phrase can consist of zero or more words of text along with zero or more pointing references. Contextual knowledge sources such as the discourse or domain specific knowledge bases are then used to interpret the input and determine the referent of any multimodal commands.

Successful multimodal interfaces divide the command vocabulary according to the strengths of each modality. In a surface modeling program developed by Weimer and Ganapathy [27], speech was used for invoking menu commands, whose magnitude and graphics parameters were then determined by gesturing with a dataglove. This meant the users never had to remove their hands from the modeling surface to issue a command. They found that adding speech recognition markedly improved the interface, especially because of the sensible partitioning of the voice and gesture semantics. This was based on three basic assumptions about gesture and speech:

  • Gesture recognition is not as efficient as speech.
  • A spoken vocabulary has a more standard interpretation than gesture.
  • Hand gesturing and speech complement each other.

Marsh et al. clarify these assumptions by summarizing several years’ experience of developing multimodal command and control interfaces [13]. They found that gestures have the following limitations: there is little relationship between operations, commands must be entered sequentially, previous operations cannot be referred to and the user must follow an arbitrary order of operations. As a result there is a set of discourse and dialogue properties that should be supported in a multimodal interface to significantly enhance its functionality, including reference, deixis and ellipsis, focus of attention and context maintenance.

These examples show the importance of modeling the user and designing interfaces that draw on existing behaviors. “Put-That-There” was so successful because it responded to a natural input style that is used in everyday conversation. Oviatt [18] emphasizes the importance of creating systems that accommodate the features of everyday conversation. In the case of speech this includes disfluencies, prosodic modulation of language and ungrammatical input. Similarly there are several well-established relationships between gesture and speech. For example, Kendon finds that the stroke of a gesture slightly precedes the peak speech syllable in co-occurring speech, and that the gesture finishes before the accompanying speech [9].

Although the dominant gestures used in these multimodal systems are pointing or deictic gestures, they need not be. Sparrell and Koons developed a system that used iconic gesture and speech input [21]. Iconics are the gestures made when the hand represents an object and its motion — such as the gesture accompanying the phrase “The plane moved like this.” By considering relative hand position and orientation the user could intuitively position and orient objects in a graphical scene. If the user says, “rotate the chair like this,” while rotating their hands, the chair will rotate the same amount as their hands do. Iconic gesture can only be understood in the context of the speech input, so they use a top down approach to multimodal understanding where the speech input determines how the gestural input is to be processed and understood.

A key finding from their work is that the graphical environment needs to be represented in different knowledge representations for each of the input modalities in order for the interaction to be smooth and seamless [10]. For example, speech can be used to describe non-visual properties of objects such as their spatio-temporal relationships, and so requires a representation that encodes those relationships. As summarized in Table 1, each of the input modalities has different representational needs.

Representational Needs

  Shape/visual appearance
  Spatio-temporal information

  Iconic - shape
  Deictic - spatial

Table 1: Multimodal representation requirements.

Each object in the scene must be stored using multiple encodings to meet these needs, including:

  • Algebraic Representation
    Categorical or abstract relations.
  • Visual Representation
    Object shape and appearance.
  • Metric Representation
    Object spatial and temporal relationships.

Algebraic knowledge representations such as frames come from the artificial intelligence field, while the graphics and vision communities have provided visual representations such as object primitives. Metric representation is still an open research area: how can we encode the notion of “between” or “on top of” in an unambiguous manner?
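
To make the three encodings concrete, here is a minimal sketch of a scene object that carries all of them. The class `SceneObject`, its fields and the `on_top_of` test are assumptions made for illustration; in particular, the bounding-box test is just one possible reading of “on top of” and does not settle the open question of metric representation.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One scene object stored under three encodings (hypothetical layout)."""
    name: str
    # Algebraic: categorical or abstract relations, here a simple frame.
    frame: dict = field(default_factory=dict)
    # Visual: shape and appearance, here a primitive type and a color.
    shape: str = "box"
    color: str = "grey"
    # Metric: spatial extent as an axis-aligned bounding box,
    # (min_corner, max_corner) in scene units with y pointing up.
    bounds: tuple = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))


def on_top_of(a, b, tol=1e-6):
    """One possible metric reading of "a is on top of b" (an assumption):
    a's bottom face rests on b's top face and their footprints overlap."""
    (ax0, ay0, az0), (ax1, _, az1) = a.bounds
    (bx0, _, bz0), (bx1, by1, bz1) = b.bounds
    resting = abs(ay0 - by1) <= tol
    overlap_x = ax0 < bx1 and bx0 < ax1
    overlap_z = az0 < bz1 and bz0 < az1
    return resting and overlap_x and overlap_z


table = SceneObject("table", frame={"is_a": "furniture"}, color="brown",
                    bounds=((0.0, 0.0, 0.0), (2.0, 1.0, 1.0)))
vase = SceneObject("vase", frame={"is_a": "container"}, shape="cylinder",
                   bounds=((0.8, 1.0, 0.4), (1.0, 1.3, 0.6)))
print(on_top_of(vase, table))  # True
```

A spoken query such as “the brown furniture” would be answered from the frame and visual fields, while “on the table” would need the metric field.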

Once the graphics environment has been represented in multiple forms, interpreting user input involves three steps:

  • The identification of the key input features.
  • Representing these features in algebraic, visual and metric form.
  • Mapping the input representation onto the representations of the graphical environment.

Although many interesting multimodal interfaces can be built which use sequential voice and gesture input, the most intuitive interfaces allow users to express simultaneous speech and gestural commands. In order to respond to the user’s voice and gesture the interface needs to integrate the input into a single semantic representation that can be used to generate the appropriate system commands and responses. The general approach is to time stamp the raw input data from each modality, parse it into an intermediate level semantic form and then use temporal or contextual information to merge related inputs. Koons, Sparrell and Thorisson show how a frame-based architecture can be used to integrate partial input from speech, gesture and gaze into a single coherent command [11].
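
The general approach described above (time stamp, parse to an intermediate form, then merge) might be sketched as follows. This is not the architecture of [11]; the `InputEvent` class, the slot names and the half-second `slack` are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class InputEvent:
    modality: str   # "speech" or "gesture"
    start: float    # time stamp in seconds
    end: float
    slots: dict     # partial semantic frame; None marks an unfilled slot


def overlaps(a, b, slack=0.5):
    """True if the two events overlap in time, allowing `slack` seconds."""
    return a.start <= b.end + slack and b.start <= a.end + slack


def merge(speech, gesture):
    """Fill the speech frame's empty slots from a temporally related gesture.

    Returns the merged frame, or None if the events are too far apart.
    """
    if not overlaps(speech, gesture):
        return None
    frame = dict(speech.slots)
    for key, value in gesture.slots.items():
        if frame.get(key) is None:
            frame[key] = value
    return frame


speech = InputEvent("speech", 0.0, 1.2,
                    {"action": "create", "object": "box", "location": None})
point = InputEvent("gesture", 0.9, 1.1, {"location": (3, 4)})
print(merge(speech, point))
```

Here the pointing gesture supplies the location that the word “there” left open, and a gesture made several seconds later would be rejected by the time test.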

A slightly more sophisticated approach is that described by Nigay and Coutaz [16]. They present a generic fusion engine that can be embedded in a multiagent multimodal architecture. Their fusion engine attempts three types of data fusion in order:

  • Microtemporal — Input information produced at the same time is merged together.
  • Macrotemporal — Sequential input information is merged together.
  • Contextual — Input information related by contextual features is merged.
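
The ordering of the three fusion attempts can be illustrated with a small classifier. This is a sketch of the idea only, not Nigay and Coutaz's engine: the dictionary keys and both window sizes are assumptions chosen for the example.

```python
def fusion_type(a, b, micro_window=0.2, macro_window=2.0):
    """Classify how two inputs could be fused, trying each strategy in order.

    `a` and `b` are dicts with "start", "end" (seconds) and "context" keys.
    The window sizes are illustrative assumptions, not published values.
    """
    # Gap between the two intervals; zero or negative when they overlap.
    gap = max(a["start"], b["start"]) - min(a["end"], b["end"])
    if gap <= micro_window:
        return "microtemporal"   # produced at (nearly) the same time
    if gap <= macro_window:
        return "macrotemporal"   # sequential, but close together
    if a["context"] == b["context"]:
        return "contextual"      # related only by a shared context
    return None                  # no fusion attempted


speech = {"start": 0.0, "end": 1.0, "context": "map"}
click = {"start": 0.5, "end": 0.6, "context": "map"}
print(fusion_type(speech, click))  # microtemporal
```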

Figure 2: Speech input structure.

Figure 3a: Point type structure.

Figure 3b: Line type structure.

Figure 4: Final integrated frame.

However, as Johnston et al. [7] point out, there is no general and well-defined mechanism for multimodal integration. In addition, many multimodal interfaces are speech-driven, so that gestural interpretation is delayed until required to interpret speech expressions. They propose a method for multimodal integration that overcomes these limitations. The basic approach is to apply a unification operation to strongly typed feature structures that are semantic representations of the user’s speech and gestural input. For a given multimodal input, several structures will be generated with their associated probabilities. Each of these structures has several associated types that limit the ways in which it can be combined with other structures.

For example, if the user says “Barbed wire,” the feature structure shown in Figure 2 will be generated. The type of each structure is shown at the bottom left of the structures.

If they make a line gesture while speaking, all the possible interpretations of that gesture are generated, so in this case both point and line type structures are produced (see Figure 3).

The multimodal integration is now guided by tagging of the speech and gesture input as either complete or partial, and examination of the time stamps associated with the multimodal input. In this case the spoken phrase is incomplete and requires a line type structure to fill in the location slot. The correct structure is found from the gesture input and the final integrated result is produced (see Figure 4). For a given time frame, all of the possible gesture and speech input structures are integrated using type consistency, and the probability of each result is computed from the probabilities of its component structures. The result with the highest probability is the one selected.

The use of typed feature structures will work with any gestural input. It supports partiality, integrating incomplete input structures as shown in Figure 4. It is also impossible to generate results with incompatible types so the input modes can compensate for errors in speech or gesture recognition.
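
A heavily simplified sketch of this selection step is given below. It reduces unification to a single type check on the location slot and multiplies the component probabilities, which assumes they are independent; the tuple layouts, the function name `integrate` and the example numbers are all hypothetical and are not taken from [7].

```python
def integrate(speech_hyps, gesture_hyps):
    """Return the most probable type-consistent speech/gesture combination.

    Each speech hypothesis:  (frame, required_location_type, probability).
    Each gesture hypothesis: (type, coordinates, probability).
    Returns (filled_frame, score), or None if nothing is type-consistent.
    """
    best = None
    for frame, needed_type, p_speech in speech_hyps:
        for g_type, coords, p_gesture in gesture_hyps:
            if g_type != needed_type:
                continue  # incompatible types are never combined
            score = p_speech * p_gesture
            if best is None or score > best[1]:
                filled = dict(frame, location={"type": g_type, "coords": coords})
                best = (filled, score)
    return best


# "Barbed wire" needs a line-type location; the gesture is ambiguous.
speech_hyps = [({"object": "barbed_wire"}, "line", 0.9)]
gesture_hyps = [("point", [(1, 2)], 0.6),
                ("line", [(0, 0), (5, 5)], 0.4)]
frame, score = integrate(speech_hyps, gesture_hyps)
print(frame["location"]["type"])  # line
```

Although the point reading of the gesture is individually more probable in this made-up example, it is discarded because its type cannot fill the slot, which is how the speech input compensates for gesture ambiguity.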

Design Recommendations

As we have seen from these examples, powerful effects can be achieved by combining speech and gesture recognition with simple context recognition. Speech and gesture complement each other and the ability to have combined voice and gestural input in an interface is to be preferred over restricting the user to either modality alone.

There are several recommendations that can be made from an interface perspective:

  • Speech should be used for non-graphical command and control tasks.
  • Gestures should be used for visuo/spatial input, including deictic and iconic.
  • Contextual knowledge should be used to resolve ambiguous input.
  • Multiple knowledge representations should be used for understanding the user’s input.
  • A grammar or type-constrained unification method should be used for input integration.

There are also guidelines that are applicable to each of the modalities. For speech, the interface should use a special, acoustically distinct command vocabulary, provide constant feedback about the recognizer’s activities and separate speech from the graphics modality as much as possible [8]. For gesture, hand tension should signify the start of gestural commands, commands should be fast, incremental and reversible, and natural gestures should be used to favor ease of learning [1].

However, the most important recommendation is to carefully consider the appropriateness of multimodal input for the particular application and to evaluate the user interface at every step of the design process. Sturman and Zeltzer provide an example of a general methodology for evaluating gesture interfaces [22], while Coutaz et al. discuss methods for evaluating multimodal interfaces [5].

Future Research

Although the research of the past has demonstrated the power of multimodal interfaces for graphics applications, there is still much work that remains to be done.

As Koons points out, there is no generally accepted method for representing object spatial and metric relationships [21]. This is needed so that commands such as “Place a cube between those cones” can be understood. Clay and Wilhelms [3] have completed some research in this area, but there is still work remaining in describing parts of objects and paths or regions between objects.

Input integration is another fertile area for research. The methods we have reviewed use time stamping as a key guide for integration. However, this need not be the case. Vo and Waibel describe a method for semantic rather than temporal integration that combines objects with related meaning [26]. There may also be value in integration techniques that combine several methods. For example, Vergo uses neural networks, frame integration and agents in a heterogeneous integration mechanism [25].

Finally, a more ambitious goal is adaptive multimodal interfaces. Current interfaces require the user to learn specific speech commands or gestures. Future multimodal systems should adapt to the user’s distinctive speech and gesture styles. First steps in this area have been made by Roy and Pentland who have developed a multimodal interface in which the user can teach the system words for object parameters such as shape and color [19].

Mark Billinghurst is a fourth year Ph.D. student at the University of Washington’s Human Interface Technology Laboratory. His work focuses on conversational interfaces that combine multimodal input and output. He was co-organizer of the VRAIS 96, VRST 96, and Visual 98 tutorials on conversational interfaces and has published several papers and book chapters in the field. A native of New Zealand, Billinghurst graduated from Waikato University, Hamilton, New Zealand, with honors in 1990 and earned his master’s degrees in applied mathematics and physics in 1992.

Mark Billinghurst
Human Interface Technology Laboratory
University of Washington
Box 352-142
Seattle, WA 98195

The copyright of articles and images printed remains with the author unless otherwise indicated.

References

  1.  Baudel, T. and M. Beaudouin-Lafon. “Charade: Remote Control of Objects Using Free-Hand Gestures,” Communications of the ACM, 36(7), 1993, pp. 28-35.
  2.  Bolt, R.A. “Conversing with Computers,” In R. Baecker, W. Buxton, (Eds.). Readings in Human-Computer Interaction: A Multidisciplinary Approach, California: Morgan-Kaufmann, 1987.
  3.  Clay, S. and J. Wilhelms. “Put: Language-Based Interactive Manipulation of Objects,” IEEE Computer Graphics and Applications, 16(2), March 1996, pp. 31-39.
  4.  Cohen, P. “The Role of Natural Language in a Multimodal Interface,” Proceedings of the UIST ‘92 Conference, 1992, pp.143-149.
  5.  Coutaz, J., D. Salber and B. Balbo. “Towards Automatic Evaluation of Multimodal User Interfaces,” Knowledge-Based Systems, 6(4), December 1993, pp. 258-266.
  6.  Hauptmann, A.G. and P. McAvinney. “Gestures with Speech for Graphics Manipulation,” Intl. J. Man-Machine Studies, 38, 1993, pp. 231-249.
  7.  Johnston, M., P. R. Cohen, D. McGee, S. L. Oviatt, J. A. Pittman and I. Smith. “Unification-based multimodal integration,” Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.
  8.  Jones, D., K. Hapeshi and C. Frankish. “Design Guidelines for Speech Recognition Interfaces,” Applied Ergonomics, 20(1), 1989, pp. 47-52.
  9.  Kendon, A. “Gesticulation and Speech: Two Aspects of the Process of Utterance,” In M. Key (Ed.) The Relation between Verbal and Nonverbal Communication. The Hague: Mouton, 1980, pp. 207-227.
  10.  Koons, D.B. “Capturing and Interpreting Multi-Modal Descriptions with Multiple Representations,” AAAI Spring 1994 Symposium, Intelligent Multi-Modal Multi-Media Interface Systems, Stanford, March 21-23, 1994.
  11.  Koons, D. B., C. J. Sparrell, and K. R. Thorisson. “Integrating Simultaneous Input from Speech, Gaze, and Hand Gestures,” In M. Maybury, (Ed.). Intelligent Multimedia Interfaces, Menlo Park: AAAI/MIT Press, 1993, pp. 243-261.
  12.  Lucente, M., G. Zwart and A. George. “Visualization Space: A Testbed for Deviceless Multimodal User Interface,” Proceedings of the Intelligent Environments Symposium, AAAI Spring Symposium Series, March 23-25, 1998, Stanford University.
  13.  Marsh, E., K. Wauchope and J. Gurney. “Human-Machine Dialogue for Multi-Modal Decision Support Systems,” AAAI Spring Symposium Series on Intelligent Multi-Media Multi-Modal Systems, Stanford University, 1994.
  14.  Martin, G.L. “The Utility of Speech Input in User-Computing Interfaces,” Intl. J. Man-Machine Studies, 30, 1989, pp. 355-375.
  15.  Neal, J. and S. Shapiro. “Intelligent multi-media interface technology,” In Intelligent User Interfaces, J. Sullivan, S. Tyler (Eds.) ACM Press, New York, New York, 1991, pp. 45-68.
  16.  Nigay, L. and J. Coutaz. “A Generic Platform for Addressing the Multimodal Challenge,” In Conference on Human Factors in Computing Systems (CHI ‘95), May 1995, ACM Press, pp. 98-105.
  17.  Oviatt, S. “Multimodal Interfaces for Dynamic Interactive Maps,” Proceedings of the Conference on Human Factors in Computing Systems: CHI ‘96, Vancouver, Canada. ACM Press, New York, 1996, pp. 95-102.
  18.  Oviatt, S. “User-Centered Modeling for Spoken Language and Multimodal Interfaces,” IEEE Multimedia, Winter 1996, pp. 26-35.
  19.  Roy, D. and A. Pentland. “Multimodal Adaptive Interfaces. Vision and Modeling,” Tech. Report #438, Media Laboratory, MIT, Cambridge, MA, 1997.
  20.  Salisbury, M., J. Henderson, T. Lammers, C. Fu and S. Moody. “Talk and Draw: Bundling Speech and Graphics,” IEEE Computer, 1990, 23(8), pp. 59-65.
  21.  Sparrell, C. and D. Koons. “Interpretation of Coverbal Depictive Gestures,” Proceedings of Intelligent Multi-Modal Multi-Media Interface Systems, AAAI Spring Symposium Series, March 21-23, 1994, Stanford University.
  22.  Sturman, D. and D. Zeltzer. “A design method for “whole-hand” human-computer interaction,” ACM Trans. on Information Systems, 11(3), July 1993, pp. 219-238.
  23.  Thorisson, K., D. Koons and R. Bolt. “Multi-Modal Natural Dialogue,” CHI 92 Video Proceedings, 1992, p. 653.
  24.  Treisman, A. and A. Davies. “Divided Attention to Ear and Eye,” In S. Kornblum, (Ed.). Attention and Performance IV, New York: Erlbaum, 1973, pp. 101-117.
  25.  Vergo, J. “A Statistical Approach to Multimodal Natural Language Interaction,” Proceedings of the AAAI ‘98 Workshop on Representations for Multi-modal Human Computer Interaction, July 1998.
  26.  Vo, M. and A. Waibel. “Modeling and Interpreting Multimodal Inputs: A Semantic Integration Approach,” Technical Report CMU-CS-97-192, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1997.
  27.  Weimer, D. and S. K. Ganapathy. “A Synthetic Visual Environment with Hand Gesturing and Voice Input,” In Proceedings of Human Factors in Computing Systems (CHI’89), ACM Press, 1989, pp. 235-240.