Publications: Speech
Spoken Language Technology; language-audio processing
- Temporally Streaming Audio-Visual Synchronization for Real-World Videos
[Details] [PDF]
Jordan Voas, Wei-Cheng Tseng, Layne Berry, Xixi Hu, Puyuan Peng, James Stuedemann, and David Harwath
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), February 2025.
We introduce RealSync, a novel dataset designed to significantly enhance the training and evaluation of models for audio-visual synchronization (AV Sync) tasks. Sourced from high-quality YouTube channels, RealSync covers a wide range of content domains, providing improved scale, diversity, and alignment with broadcast content compared to existing datasets. It features extended-length video samples, catering to the critical need for more comprehensive, real-world training and evaluation materials. Alongside this dataset, we present StreamSync, a model tailored for real-world AV Sync applications. StreamSync is designed to be backbone-agnostic and incorporates a streaming mechanism that processes consecutive video segments dynamically, iteratively refining synchronization predictions. This approach enables StreamSync to outperform existing models, offering superior synchronization accuracy with minimal computational cost per iteration. Together, our dataset and the StreamSync model establish a new benchmark for AV Sync research, promising to drive the development of more robust and practical AV Sync methods.
https://github.com/jvoas655/StreamSync
ML ID: 435
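The streaming mechanism described in the entry above can be pictured as a loop that scores each incoming segment and folds the result into a running offset estimate. The sketch below is only illustrative: the SyncState fields, the confidence-weighted update rule, and the predict_offset callback are assumptions for exposition, not the StreamSync implementation.

```python
# Illustrative sketch of a streaming AV-sync loop: process consecutive video
# segments and iteratively refine a running offset estimate. The encoder
# callback and the update rule are hypothetical placeholders, not the
# actual StreamSync model.
from dataclasses import dataclass


@dataclass
class SyncState:
    offset_frames: float = 0.0   # current audio-visual offset estimate
    confidence: float = 0.0      # accumulated confidence in that estimate


def refine(state: SyncState, segment_offset: float, segment_conf: float) -> SyncState:
    """Blend a new per-segment prediction into the running estimate."""
    w = segment_conf / (segment_conf + state.confidence + 1e-8)
    return SyncState(
        offset_frames=(1 - w) * state.offset_frames + w * segment_offset,
        confidence=state.confidence + segment_conf,
    )


def stream_sync(segments, predict_offset):
    """predict_offset(segment) -> (offset_frames, confidence); backbone-agnostic."""
    state = SyncState()
    for seg in segments:
        offset, conf = predict_offset(seg)   # per-segment prediction from any backbone
        state = refine(state, offset, conf)  # iterative refinement across the stream
        yield state.offset_frames            # emit the current best estimate
```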
- Measuring Sound Symbolism in Audio-visual Models
[Details] [PDF] [Poster]
Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, and Raymond Mooney
In IEEE Spoken Language Technology (SLT) Workshop, December 2024.
Audio-visual pre-trained models have recently gained substantial attention and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations, known as sound symbolism, which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
ML ID: 434
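As a rough picture of the zero-shot, non-parametric protocol described in the entry above, one could score congruent versus incongruent audio-image pairings with the probed model and apply a rank-based test. The embed_audio/embed_image callbacks and the pairing format below are hypothetical stand-ins, not the paper's actual dataset or evaluation setup.

```python
# Illustrative sketch of a zero-shot, non-parametric sound-symbolism probe:
# score congruent vs. incongruent audio-image pairs with a pretrained
# audio-visual model and test whether congruent pairs score higher.
# `embed_audio` / `embed_image` are placeholders for whatever encoders the
# probed model exposes; the (audio, image, is_congruent) pairing is assumed.
import numpy as np
from scipy.stats import mannwhitneyu


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def sound_symbolism_test(pairs, embed_audio, embed_image):
    """pairs: iterable of (audio, image, is_congruent) synthesized examples."""
    congruent, incongruent = [], []
    for audio, image, is_congruent in pairs:
        score = cosine(embed_audio(audio), embed_image(image))
        (congruent if is_congruent else incongruent).append(score)
    # One-sided rank-based test: do congruent pairings score higher?
    _, p_value = mannwhitneyu(congruent, incongruent, alternative="greater")
    return np.mean(congruent), np.mean(incongruent), p_value
```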
- Multimodal Contextualized Semantic Parsing from Speech
[Details] [PDF] [Slides (PDF)] [Poster] [Video]
Jordan Voas, Raymond Mooney, and David Harwath
In Association for Computational Linguistics (ACL), August 2024.
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents’ contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent’s knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP), developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
ML ID: 431
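The contextual-update idea behind SPICE, described in the entry above, can be sketched as a dialogue loop in which each parsed utterance revises an explicit scene graph. The graph schema and the parse_utterance stub below are illustrative assumptions, not the VG-SPICE format or the AViD-SP architecture.

```python
# Illustrative sketch of a SPICE-style contextual update loop: an agent's
# scene-graph "knowledge" is revised after each spoken exchange. The triple
# schema and parse_utterance stub are assumptions for illustration only.
from collections import defaultdict


def empty_graph():
    return {"attributes": defaultdict(set), "relations": set()}


def apply_update(graph, triples):
    """Merge parsed (subject, predicate, object) triples into the scene graph."""
    for subj, pred, obj in triples:
        if pred == "has_attribute":
            graph["attributes"][subj].add(obj)
        else:
            graph["relations"].add((subj, pred, obj))
    return graph


def run_dialogue(exchanges, parse_utterance):
    """parse_utterance(speech, image, graph) -> triples; stands in for a parser like AViD-SP."""
    graph = empty_graph()
    for speech, image in exchanges:
        # Condition the parse on the prior context (the current graph) and the
        # multimodal input, then fold the result back into the context.
        triples = parse_utterance(speech, image, graph)
        graph = apply_update(graph, triples)
    return graph
```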