Multimodal Contextualized Semantic Parsing from Speech (2024)

Jordan Voas, Raymond Mooney, David Harwath

We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents’ contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent’s knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
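The update step described in the abstract, where an agent's prior context is revised with newly parsed information, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SceneGraph` class, the `apply_update` function, and the node/attribute/edge layout are hypothetical and are not the VG-SPICE data format or the AViD-SP model. The parser that would turn speech and an image into the update is not modeled; the update is written by hand.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical scene graph: named nodes with attributes, plus relation triples."""
    # node name -> set of attribute strings
    nodes: dict[str, set[str]] = field(default_factory=dict)
    # (subject, predicate, object) triples
    edges: set[tuple[str, str, str]] = field(default_factory=set)


def apply_update(context: SceneGraph, update: SceneGraph) -> SceneGraph:
    """Return a new graph that merges `update` into `context` without mutating either."""
    # Copy the prior context so the caller's graph is left untouched.
    merged_nodes = {name: set(attrs) for name, attrs in context.nodes.items()}
    # Add new nodes and extend the attribute sets of existing ones.
    for name, attrs in update.nodes.items():
        merged_nodes.setdefault(name, set()).update(attrs)
    merged_edges = set(context.edges) | set(update.edges)
    # Make sure every edge endpoint also exists as a node.
    for subj, _, obj in merged_edges:
        merged_nodes.setdefault(subj, set())
        merged_nodes.setdefault(obj, set())
    return SceneGraph(merged_nodes, merged_edges)


# Prior context: the agent already knows about a brown dog.
prior = SceneGraph(nodes={"dog": {"brown"}})
# Hand-written stand-in for what a parser might extract from a new utterance
# such as "the small dog is on the couch".
parsed = SceneGraph(
    nodes={"dog": {"small"}, "couch": set()},
    edges={("dog", "on", "couch")},
)
updated = apply_update(prior, parsed)
```

Returning a new graph rather than editing the prior one in place keeps each dialogue turn's context available for inspection, which suits the structured, interpretable framing of the task.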

Citation:

Association for Computational Linguistics (ACL) (2024).

People

Raymond J. Mooney	Faculty	mooney [at] cs utexas edu
Jordan Voas	Ph.D. Student	jvoas [at] utexas edu

Areas of Interest

Connecting Language and Perception, Deep Learning, Language and Vision, Learning for Semantic Parsing, Speech

Labs

Machine Learning