IEEE/ACL 2006 Workshop on Spoken Language Technology
Demo Session
Session Chairs:
Ciprian Chelba, Google
Dilek Hakkani-Tür, ICSI
Tanja Schultz, CMU
- The Hidden Information State (HIS) Dialogue System
Steve Young, Jost Schatzmann, Blaise Thomson, Hui Ye, and Karl Weilhammer
The HIS Dialogue System is the first trainable and scalable implementation of a spoken dialogue system based on the partially observable Markov decision process (POMDP) model of dialogue. The system accepts and responds to N-best hypotheses from the ATK/HTK recogniser and maintains a probability distribution over dialogue states. It provides a visual display which shows in real time how competing state hypotheses are ranked (a schematic sketch of the underlying belief update follows the reference below). Apart from some very simple ontology definitions, the dialogue manager has no application-dependent heuristics.
For further information please see:
S. Young (2006). "Using POMDPs for Dialog Management." IEEE/ACL Workshop
on Spoken Language Technology (SLT 2006), Aruba.
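As an illustration of the belief tracking mentioned above, the following is a minimal sketch in Python of a POMDP-style belief update driven by an N-best list; the state space, transition model and observation model are hypothetical placeholders, not the HIS system's actual components.

    # Minimal sketch of a POMDP-style belief update over dialogue states,
    # driven by an N-best list of recognition hypotheses. All models here
    # are illustrative placeholders.
    def belief_update(belief, nbest, transition, observation, action):
        """belief: dict mapping state -> probability
        nbest: list of (hypothesis, confidence) pairs from the recogniser
        transition(s_new, s_old, action): P(s'|s, a)
        observation(hyp, s_new, action): P(o|s', a)
        """
        new_belief = {}
        for s_new in belief:
            # Predict s' from the previous belief and the last system action.
            predicted = sum(transition(s_new, s_old, action) * p
                            for s_old, p in belief.items())
            # Weight by how well s' explains each N-best hypothesis.
            evidence = sum(conf * observation(hyp, s_new, action)
                           for hyp, conf in nbest)
            new_belief[s_new] = predicted * evidence
        total = sum(new_belief.values()) or 1.0
        return {s: p / total for s, p in new_belief.items()}

Sorting the states by their updated probabilities gives the kind of ranking over competing state hypotheses that the system's display shows in real time.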
- Web-based Multimodal and Multimedia Interaction
James Glass, Scott Cyphers, Alexander Gruenstein, TJ Hazen, Lee Hetherington, Stephanie Seneff, and Greg Yu
In this work we demonstrate our recently deployed, publicly available
multimodal and multimedia interfaces that use conventional
web browsers. Multimodal prototypes expand the paradigm of
telephone-based spoken dialogue interaction that has been so effective
for spoken language research. Multimedia interfaces enable users to
search inside audio-visual content that would otherwise be tedious to
browse manually. During the demonstration we showcase our recent
web-based prototypes for speech and pen-based interaction and video
processing and retrieval. The multimodal prototypes include a
map-based city browser application, while the multimedia interface
allows users to search and browse processed videos of academic lecture
recordings.
- Efficient Navigation across Multi-media Archives Based on Spoken Document Understanding
Lin-shan Lee, Yi-cheng Pan, Sheng-yi Kong, Yi-sheng Fu, Yu-tsun Huang, and Chien-chih Wang
Multi-media archives are very difficult to display on a screen and
very difficult to browse. In this demo we show a complete system that addresses
this problem with the following functions:
(1) Automatic Generation of Titles and Summaries for each of the spoken
documents for easier browsing, (2) Global Semantic Structuring of the entire
spoken document archive presenting an overall picture, and (3) Query-based
Local Semantic Structuring for the spoken documents retrieved by the user's
query, providing the detailed semantic structure given the query. A
preliminary version of this system was demonstrated at Interspeech 2006 in
Pittsburgh, in the Special Session on Speech Summarization, under the paper
title "Multi-layered Summarization of Spoken Document Archives by Information
Extraction and Semantic Structuring".
- IBM MASTOR Speech-to-Speech Translation System
Yuqing Gao
The IBM MASTOR Speech-to-Speech Translation System combines
IBM's cutting-edge technologies in the areas of automatic speech recognition,
natural language understanding, machine translation and speech synthesis. The
tight coupling of speech recognition, understanding and translation effectively
mitigates the effects of speech recognition errors, resulting in a highly
robust system for limited domains (a toy pipeline sketch follows this entry).
MASTOR currently has bi-directional
English-Arabic and English-Mandarin translation capabilities on
unconstrained free-form natural speech input with a large vocabulary (50k
for English, 100k for Arabic and 20k for Chinese) in multiple domains such
as travel and emergency medical diagnosis. MASTOR runs in real-time either
on a laptop, or on a handheld PDA, with minimal performance degradation.
Both versions of the system displayed outstanding performance in the DARPA
evaluations.
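As a rough, hedged illustration of the cascade described in this entry, the sketch below shows a generic speech-to-speech flow in Python; every stage is a hypothetical placeholder callable, and none of MASTOR's actual components are reproduced.

    # Toy speech-to-speech translation cascade: ASR -> understanding ->
    # translation -> synthesis. The callables are illustrative stand-ins.
    def speech_to_speech(audio, recognize, understand, translate, synthesize):
        hypotheses = recognize(audio)      # e.g. N-best source-language strings
        concepts = understand(hypotheses)  # semantic/concept representation
        target_text = translate(concepts)  # text in the target language
        return synthesize(target_text)     # target-language audio

Routing translation through an understanding step, rather than translating recognised words directly, is one way such a cascade can absorb some recognition errors, which is the motivation behind the tight coupling described above.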
- Using Discourse Structure in Speech-based Computer Tutoring
Mihai Rotaru, Diane Litman, Hua Ai, Kate Forbes-Riley, Gregory Nicholas, Amruta Purandare, Scott Silliman, Joel Tetreault, and Art Ward
This demonstration illustrates a graphical representation of the discourse
structure in our speech-based computer tutor, ITSPOKE. Our system is a
speech-enabled version of the Why2-Atlas qualitative physics intelligent
tutoring system. We use the discourse structure hierarchy and a manual
labeling of the discourse segment purpose to produce a graphical
representation we call the Navigation Map. During the dialogue, the
Navigation Map is automatically updated to reflect topics discussed in the
past, the current topic and the next topic. A preliminary user study
investigates whether users prefer the ITSPOKE version with the Navigation
Map over the one without the Navigation Map.
- Demonstration of EVITA: Space Suit Astronaut Assistant
Jim Hieronymus and John Dowding
A spoken dialogue system has been developed at NASA Ames Research Center over the past 5 years to
provide information to an astronaut exploring the lunar surface. Tasks include navigation, scheduling,
display and robot control, asset management, and communication. The EVITA system has an approximately
1200 word vocabulary and a Gemini typed unification language model compiled to a Nuance PCFG. The system
listens continuously and accepts an utterance if it is semantically meaningful at that point in the
dialogue (a loose sketch of this gating follows this entry). System components include an always-listening
audio provider, a Nuance speech recognition engine, a Gemini parser, a dialogue manager, and Festival
speech synthesis.
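A loose Python sketch of the accept/ignore decision a continuously listening system of this kind makes; the parser and dialogue-context interfaces here are hypothetical stand-ins, not the actual Gemini or Nuance APIs.

    # Toy accept/ignore gate for continuous listening: an utterance is only
    # acted on if it parses and its interpretation is meaningful in the
    # current dialogue context. Everything here is a hypothetical stand-in.
    class DialogueContext:
        """An interpretation counts as meaningful if its topic is one the
        system currently handles (navigation, scheduling, robot control, ...)."""
        def __init__(self, active_topics):
            self.active_topics = set(active_topics)

        def is_meaningful(self, interpretation):
            return interpretation.get("topic") in self.active_topics

    def handle_utterance(text, parse, context):
        """Return an interpretation to act on, or None to ignore the input."""
        interpretation = parse(text)   # e.g. a dict with a "topic" key, or None
        if interpretation is None:
            return None                # not grammatical: ignore silently
        if not context.is_meaningful(interpretation):
            return None                # grammatical but irrelevant right now
        return interpretation          # hand off to the dialogue manager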
- Modeling and understanding human-human communication scenes
Herve Bourlard
We will demonstrate several results of research efforts towards
the modeling and understanding of human-human communication scenes.
This work typically revolves around instrumented meeting rooms,
which enable the collection, annotation, structuring, and browsing
of multimodal meeting recordings. For each meeting, audio, video,
slides, and textual information (notes, whiteboard text, etc) are
recorded and time-synchronized. Relevant information is extracted
from these raw multimodal signals using state-of-the-art statistical
modeling technologies. The resulting multimedia and information
streams are then available to be structured, browsed and queried
within an easily accessible archive.
More recent activities towards the joint interaction
between these multimedia recordings and the automatic extraction
(and exploitation) of social network topologies will also be demonstrated.
- The AT&T Multimodal Presentation Dashboard
Michael Johnston, Patrick Ehlen, David Gibbon, and Zhu Liu
The AT&T multimodal presentation dashboard allows users to control and browse
presentation content such as slides and diagrams through a multimodal interface
that supports speech and pen input. In addition to control commands, e.g.
"take me to slide 10", the system allows multimodal search over content
collections. For example, if the user says "get me a slide about internet telephony",
the system will present a ranked list of candidate slides that the user can then
select from using voice or pen (a toy sketch of such ranking follows this entry).
As presentations are loaded, their content is analyzed and language and
understanding models are built dynamically. This approach frees the user from
the constraints of linear order, allowing for a more dynamic and responsive
presentation style.
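A hedged, toy illustration in Python of ranking slides for a spoken query; the scoring is plain term overlap over slide text, an assumption made for illustration rather than the dashboard's actual indexing or understanding models.

    # Toy ranking of slides for a query such as
    # "get me a slide about internet telephony": score by term overlap.
    def rank_slides(query, slides,
                    stopwords=frozenset({"get", "me", "a", "about", "slide"})):
        """slides: list of (slide_id, slide_text); returns ids by descending score."""
        query_terms = {w for w in query.lower().split() if w not in stopwords}
        scored = []
        for slide_id, text in slides:
            score = len(query_terms & set(text.lower().split()))
            if score > 0:
                scored.append((score, slide_id))
        return [slide_id for score, slide_id in sorted(scored, reverse=True)]

    # Example: candidates the user could then select from by voice or pen.
    slides = [(10, "VoIP and internet telephony overview"),
              (11, "Quarterly sales results"),
              (12, "telephony hardware requirements")]
    print(rank_slides("get me a slide about internet telephony", slides))  # [10, 12]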
- Audio/Video Navigation with A/V X-Ray
Patrick Nguyen and Milind Mahajan