IEEE/ACL 2006 Workshop on Spoken Language Technology
Demo Session
Session Chairs:
Ciprian Chelba, Google
Dilek Hakkani-Tür, ICSI
Tanja Schultz, CMU
- The Hidden Information State (HIS) Dialogue System
Steve Young, Jost Schatzmann, Blaise Thomson, Hui Ye, and Karl Weilhammer
The HIS Dialogue System is the first trainable and scalable implementation of a spoken dialogue system based on the partially observable Markov decision process (POMDP) model of dialogue. The system accepts and responds to N-best hypotheses from the ATK/HTK recogniser and maintains a probability distribution over dialogue states. It provides a visual display which shows in real time how competing state hypotheses are ranked (a schematic sketch of the underlying belief update follows the reference below). Apart from some very simple ontology definitions, the dialogue manager has no application-dependent heuristics.
For further information please see:
S. Young (2006). "Using POMDPs for Dialog Management." IEEE/ACL Workshop
on Spoken Language Technology (SLT 2006), Aruba.
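As an illustration of the belief tracking mentioned above, the following is a minimal sketch in Python of a POMDP-style belief update driven by an N-best list; the state space, transition model and observation model are hypothetical placeholders, not the HIS system's actual components.

    # Minimal sketch of a POMDP-style belief update over dialogue states,
    # driven by an N-best list of recognition hypotheses. All models here
    # are illustrative placeholders.
    def belief_update(belief, nbest, transition, observation, action):
        """belief: dict mapping state -> probability
        nbest: list of (hypothesis, confidence) pairs from the recogniser
        transition(s_new, s_old, action): P(s'|s, a)
        observation(hyp, s_new, action): P(o|s', a)
        """
        new_belief = {}
        for s_new in belief:
            # Predict s' from the previous belief and the last system action.
            predicted = sum(transition(s_new, s_old, action) * p
                            for s_old, p in belief.items())
            # Weight by how well s' explains each N-best hypothesis.
            evidence = sum(conf * observation(hyp, s_new, action)
                           for hyp, conf in nbest)
            new_belief[s_new] = predicted * evidence
        total = sum(new_belief.values()) or 1.0
        return {s: p / total for s, p in new_belief.items()}

Sorting the states by their updated probabilities gives the kind of ranking over competing state hypotheses that the system's display shows in real time.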
- Web-based Multimodal and Multimedia Interaction
James Glass, Scott Cyphers, Alexander Gruenstein, TJ Hazen, Lee Hetherington, Stephanie Seneff, and Greg Yu
In this work we demonstrate our recently deployed, publicly available
multimodal and multimedia interfaces that use conventional
web browsers. Multimodal prototypes expand the paradigm of
telephone-based spoken dialogue interaction that has been so effective
for spoken language research. Multimedia interfaces enable users to
search inside audio-visual content that would otherwise be tedious to
browse manually. During the demonstration we showcase our recent
web-based prototypes for speech and pen-based interaction and video
processing and retrieval. The multimodal prototypes include a
map-based city browser application, while the multimedia interface
allows users to search and browse processed videos of academic lecture
recordings.
- Efficient Navigation across Multi-media Archives Based on Spoken Document Understanding
Lin-shan Lee, Yi-cheng Pan, Sheng-yi Kong, Yi-sheng Fu, Yu-tsun Huang, and Chien-chih Wang
Multi-media archives are very difficult to display on a screen and
very difficult to browse. In this demo we show a complete system that addresses
this problem with the following functions:
(1) Automatic Generation of Titles and Summaries for each of the spoken
documents for easier browsing, (2) Global Semantic Structuring of the entire
spoken document archive presenting an overall picture, and (3) Query-based
Local Semantic Structuring for the spoken documents retrieved by the user's
query, providing the detailed semantic structure given the query. A
preliminary version of this system was demonstrated at Interspeech 2006 in
Pittsburgh, in the Special Session on Speech Summarization, under the paper
title "Multi-layered Summarization of Spoken Document Archives by Information
Extraction and Semantic Structuring".
- IBM MASTOR Speech-to-Speech Translation System
Yuqing Gao
The IBM MASTOR Speech-to-Speech Translation System combines
IBM's cutting-edge technologies in the areas of automatic speech recognition,
natural language understanding, machine translation and speech synthesis. The
tight coupling of speech recognition, understanding and translation effectively
mitigates the effects of speech recognition errors, resulting in a highly
robust system for limited domains (a toy pipeline sketch follows this entry).
MASTOR currently has bi-directional
English-Arabic and English-Mandarin translation capabilities on
unconstrained free-form natural speech input with a large vocabulary (50k
for English, 100k for Arabic and 20k for Chinese) in multiple domains such
as travel and emergency medical diagnosis. MASTOR runs in real-time either
on a laptop, or on a handheld PDA, with minimal performance degradation.
Both versions of the system displayed outstanding performance in the DARPA
evaluations.
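As a rough, hedged illustration of the cascade described in this entry, the sketch below shows a generic speech-to-speech flow in Python; every stage is a hypothetical placeholder callable, and none of MASTOR's actual components are reproduced.

    # Toy speech-to-speech translation cascade: ASR -> understanding ->
    # translation -> synthesis. The callables are illustrative stand-ins.
    def speech_to_speech(audio, recognize, understand, translate, synthesize):
        hypotheses = recognize(audio)      # e.g. N-best source-language strings
        concepts = understand(hypotheses)  # semantic/concept representation
        target_text = translate(concepts)  # text in the target language
        return synthesize(target_text)     # target-language audio

Routing translation through an understanding step, rather than translating recognised words directly, is one way such a cascade can absorb some recognition errors, which is the motivation behind the tight coupling described above.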
- Using Discourse Structure in Speech-based Computer Tutoring
Mihai Rotaru, Diane Litman, Hua Ai, Kate Forbes-Riley, Gregory Nicholas, Amruta Purandare, Scott Silliman, Joel Tetreault, and Art Ward
This demonstration illustrates a graphical representation of the discourse
structure in our speech-based computer tutor, ITSPOKE. Our system is a
speech-enabled version of the Why2-Atlas qualitative physics intelligent
tutoring system. We use the discourse structure hierarchy and a manual
labeling of the discourse segment purpose to produce a graphical
representation we call the Navigation Map. During the dialogue, the
Navigation Map is automatically updated to reflect topics discussed in the
past, the current topic and the next topic. A preliminary user study
investigates whether users prefer the ITSPOKE version with the Navigation
Map over the one without the Navigation Map.
- Demonstration of EVITA: Space Suit Astronaut Assistant
Jim Hieronymus and John Dowding
A spoken dialogue system has been developed at NASA Ames Research Center over the past 5 years to
provide information to an astronaut exploring the lunar surface. Tasks include navigation, scheduling,
display and robot control, asset management, and communication. The EVITA system has an approximately
1200 word vocabulary and a Gemini typed unification language model compiled to a Nuance PCFG. The system
listens continuously and accepts an utterance if it is semantically meaningful at that point in the
dialogue (a loose sketch of this gating follows this entry). System components include an always-listening
audio provider, a Nuance speech recognition engine, a Gemini parser, a dialogue manager, and Festival
speech synthesis.
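A loose Python sketch of the accept/ignore decision a continuously listening system of this kind makes; the parser and dialogue-context interfaces here are hypothetical stand-ins, not the actual Gemini or Nuance APIs.

    # Toy accept/ignore gate for continuous listening: an utterance is only
    # acted on if it parses and its interpretation is meaningful in the
    # current dialogue context. Everything here is a hypothetical stand-in.
    class DialogueContext:
        """An interpretation counts as meaningful if its topic is one the
        system currently handles (navigation, scheduling, robot control, ...)."""
        def __init__(self, active_topics):
            self.active_topics = set(active_topics)

        def is_meaningful(self, interpretation):
            return interpretation.get("topic") in self.active_topics

    def handle_utterance(text, parse, context):
        """Return an interpretation to act on, or None to ignore the input."""
        interpretation = parse(text)   # e.g. a dict with a "topic" key, or None
        if interpretation is None:
            return None                # not grammatical: ignore silently
        if not context.is_meaningful(interpretation):
            return None                # grammatical but irrelevant right now
        return interpretation          # hand off to the dialogue manager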
- Modeling and understanding human-human communication scenes
Herve Bourlard
We will demonstrate several results of research efforts towards
the modeling and understanding of human-human communication scenes.
This work typically revolves around instrumented meeting rooms,
which enable the collection, annotation, structuring, and browsing
of multimodal meeting recordings. For each meeting, audio, video,
slides, and textual information (notes, whiteboard text, etc) are
recorded and time-synchronized. Relevant information is extracted
from these raw multimodal signals using state-of-the-art statistical
modeling technologies. The resulting multimedia and information
streams are then available to be structured, browsed and queried
within an easily accessible archive.
More recent activities towards the joint interaction
between these multimedia recordings and the automatic extraction
(and exploitation) of social network topologies will also be demonstrated.
- The AT&T Multimodal Presentation Dashboard
Michael Johnston, Patrick Ehlen, David Gibbon, and Zhu Liu
The AT&T multimodal presentation dashboard allows users to control and browse
presentation content such as slides and diagrams through a multimodal interface
that supports speech and pen input. In addition to control commands, e.g.
"take me to slide 10", the system allows multimodal search over content
collections. For example, if the user says "get me a slide about internet telephony",
the system will present a ranked list of candidate slides that the user can then
select from using voice or pen (a toy sketch of such ranking follows this entry).
As presentations are loaded, their content is analyzed and language and
understanding models are built dynamically. This approach frees the user from
the constraints of linear order, allowing for a more dynamic and responsive
presentation style.
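A hedged, toy illustration in Python of ranking slides for a spoken query; the scoring is plain term overlap over slide text, an assumption made for illustration rather than the dashboard's actual indexing or understanding models.

    # Toy ranking of slides for a query such as
    # "get me a slide about internet telephony": score by term overlap.
    def rank_slides(query, slides,
                    stopwords=frozenset({"get", "me", "a", "about", "slide"})):
        """slides: list of (slide_id, slide_text); returns ids by descending score."""
        query_terms = {w for w in query.lower().split() if w not in stopwords}
        scored = []
        for slide_id, text in slides:
            score = len(query_terms & set(text.lower().split()))
            if score > 0:
                scored.append((score, slide_id))
        return [slide_id for score, slide_id in sorted(scored, reverse=True)]

    # Example: candidates the user could then select from by voice or pen.
    slides = [(10, "VoIP and internet telephony overview"),
              (11, "Quarterly sales results"),
              (12, "telephony hardware requirements")]
    print(rank_slides("get me a slide about internet telephony", slides))  # [10, 12]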
- Audio/Video Navigation with A/V X-Ray
Patrick Nguyen and Milind Mahajan