Machine Learning Research Group | University of Texas

Publications: Connecting Language and Perception

To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Ideally, an AI system would be able to learn language like a human child, by being exposed to utterances in a rich perceptual environment. The perceptual context would provide the necessary supervisory information, and learning the connection between language and perception would ground the system's semantic representations in its perception of the world. As a step in this direction, our research is developing systems that learn semantic parsers and language generators from sentences paired only with their perceptual context. It is part of our research on natural language learning. Our research on this topic is supported by the National Science Foundation through grants IIS-0712097 and IIS-1016312.

Grounded Language Learning [Video Lecture]

AAAI

Learning Language from its Perceptual Context [Video Lecture]

ECML-PKDD

Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
[Details] [PDF]
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
International Conference on Robotics and Automation (ICRA), June 2026.
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We propose MICoBot, a system that enables the human and robot, both using natural language, to take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated willingness to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models. See additional videos and materials at our project site: https://robin-lab.cs.utexas.edu/MicoBot/.
ML ID: 441
Reasoning about Actions with Large Multimodal Models
[Details] [PDF] [Slides (PDF)]
Vanya Cohen
October 2025. Ph.D. Proposal.
Large multimodal models have become central for solving sequential decision-making tasks, enabling improved learning in diverse areas such as home robotics and automated software development. However, leveraging these models for sequential decision-making requires robust action reasoning capabilities, which remain a significant challenge. This thesis aims to improve and evaluate action reasoning in large multimodal models. First, we introduce a method to improve the parsing of instructional texts into action sequences by integrating external symbolic planners and planning domains during autoregressive language model decoding. Next, we propose a method that leverages the compositional structure of language instructions to improve generalization and sample efficiency of acquiring new tasks with reinforcement learning. Last, we propose a new benchmark to evaluate the understanding of dependencies between actions described in instructional texts. Future work will focus on evaluating the world modeling limitations of frontier models. Current models struggle to reason about the effects of actions in multimodal entity state tracking tasks. We aim to extend entity state tracking evaluations to embodied domains. From this benchmark we derive a post-training method for improving the entity-state reasoning abilities of language models. Together these contributions enhance the understanding of how models reason about actions and provide insights toward their improvement for real-world sequential decision-making problems.
ML ID: 445
Augmenting Robotic Capabilities through Natural Language
[Details] [PDF] [Slides (PDF)]
Albert Yu
October 2025. Ph.D. Proposal.
Despite rapid advances in language and vision models, current robots still lag far behind human physical capabilities due to the relative scarcity of real-world data compared to online text and images. How can we leverage abundant language data to advance robotic capabilities? Language provides semantic structure that facilitates the understanding of diverse data, improving sample efficiency in scarce data regimes. It also provides a natural communicative medium when interacting with and learning from humans. To leverage the first benefit of language, we first take inspiration from how humans teach each other in video tutorials, through simultaneous video and language streams, to more efficiently teach robots new skills. We then show that language can bridge wide visual sim2real gaps, enabling robots to learn tasks with just a few real-world demonstrations by leveraging knowledge from imperfect simulation data. To leverage the second benefit of language, we explore how bidirectional dialog can enable robots to solve complex manipulation tasks by communicating and collaborating with a wide distribution of human collaborators in the real world. We develop a robotic framework that requests and proactively offers help through mixed-initiative, free-form dialog, enabling the robot to adapt to changing human preferences and each agent’s physical capabilities to be strategically utilized. Finally, we discuss avenues of future work, such as how human-robot collaboration can be facilitated through dialog-based replanning, how both agents can improve through bidirectional feedback, and how language-based guidelines extracted from manuals can enable robots to behave more safely and learn more quickly.
ML ID: 444
Temporally Streaming Audio-Visual Synchronization for Real-World Videos
[Details] [PDF]
Jordan Voas, Wei-Cheng Tseng, Layne Berry, Xixi Hu, Puyuan Peng, James Stuedemann, and David Harwath
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), February 2025.
We introduce RealSync, a novel dataset designed to significantly enhance the training and evaluation of models for audio-visual synchronization (AV Sync) tasks. Sourced from high-quality YouTube channels, RealSync covers a wide range of content domains, providing an improved scale, diversity, and alignment with broadcast content compared to existing datasets. It features extended-length video samples, catering to the critical need for more comprehensive, real-world training and evaluation materials. Alongside this dataset, we present StreamSync, a model tailored for real-world AV Sync applications. StreamSync is designed to be backbone agnostic and incorporates a streaming mechanism that processes consecutive video segments dynamically, iteratively refining synchronization predictions. This innovative approach enables StreamSync to outperform existing models, offering superior synchronization accuracy with minimal computational cost per iteration. Together, our dataset and the StreamSync model establish a new benchmark for AV Sync research, promising to drive the development of more robust and practical AV Sync methods. Project repository: https://github.com/jvoas655/StreamSync
ML ID: 435
Measuring Sound Symbolism in Audio-visual Models
[Details] [PDF] [Poster]
Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
In IEEE Spoken Language Technology (SLT) Workshop, December 2024.
Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations (known as sound symbolism), which are also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
ML ID: 434
Multimodal Contextualized Semantic Parsing from Speech
[Details] [PDF] [Slides (PDF)] [Poster] [Video]
Jordan Voas, Raymond Mooney, David Harwath
In Association for Computational Linguistics (ACL), August 2024.
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents’ contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent’s knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
ML ID: 431
What is the Best Automated Metric for Text to Motion Generation?
[Details] [PDF]
Jordan Voas
Masters Thesis, Department of Computer Science, UT Austin, Austin, TX, May 2023.
There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing meaningful training losses for supervising generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and rarely used coordinate errors show strong correlations. Several recently developed metrics are not recommended due to their low correlation compared to alternatives. Additionally, we propose multiple novel metrics that exhibit improved correlation and potential for future use.
ML ID: 423
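The distinction drawn in the abstract above, between sample-level and model-level agreement with human judgments, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the scores in `toy_scores` are invented numbers, not data from the thesis, and Pearson correlation is only one plausible choice of correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: for each model, a list of
# (automated_metric_score, human_rating) pairs, one pair per generated sample.
toy_scores = {
    "model_a": [(0.2, 3.0), (0.8, 2.0), (0.5, 2.5)],
    "model_b": [(0.9, 4.0), (0.4, 4.5), (0.7, 4.2)],
    "model_c": [(0.1, 1.0), (0.6, 1.5), (0.3, 0.8)],
}


def sample_level_correlation(scores):
    """Correlate metric and human scores over every individual sample."""
    metric = [m for pairs in scores.values() for m, _ in pairs]
    human = [h for pairs in scores.values() for _, h in pairs]
    return pearson(metric, human)


def model_level_correlation(scores):
    """Correlate per-model averages of metric and human scores."""
    metric_means = [sum(m for m, _ in pairs) / len(pairs) for pairs in scores.values()]
    human_means = [sum(h for _, h in pairs) / len(pairs) for pairs in scores.values()]
    return pearson(metric_means, human_means)


if __name__ == "__main__":
    print("sample level:", round(sample_level_correlation(toy_scores), 3))
    print("model level: ", round(model_level_correlation(toy_scores), 3))
```

On this toy data the metric ranks the three models in the same order as the human ratings even though it is noisy on individual samples, so the model-level correlation comes out higher than the sample-level one. That is the kind of gap the abstract reports, though the numbers here carry no evidential weight.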
Directly Optimizing Evaluation Metrics to Improve Text to Motion
[Details] [PDF]
Yili Wang
Masters Thesis, Department of Computer Science, UT Austin, May 2023.
There is a long-standing discrepancy between the training and testing processes of most generative models, including both text-to-text models like machine translation (MT) and multi-modal models like image captioning and text-to-motion generation. These models are usually trained to optimize a specific objective like log-likelihood (MLE) in the Seq2Seq models or the KL-divergence in the variational autoencoder (VAE) models. However, they are tested using different evaluation metrics such as the BLEU score and Fréchet Inception Distance (FID). Our paper aims to address this discrepancy in text-to-motion generation models by developing algorithms to directly optimize the target metric during training time. We explore three major techniques that were originally applied in natural language processing (reinforcement learning, contrastive learning methods, and differentiable metrics) and adapt them to the language-and-motion domain.
ML ID: 418
Systematic Generalization on gSCAN with Language Conditioned Embedding
[Details] [PDF] [Video]
Tong Gao, Qi Huang and Raymond J. Mooney
In The 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, December 2020.
Systematic Generalization refers to a learning algorithm’s ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. As shown in recent work, state-of-the-art deep learning models fail dramatically even on tasks for which they are designed when the test set is systematically different from the training data. We hypothesize that explicitly modeling the relations between objects in their contexts while learning their representations will help achieve systematic generalization. Therefore, we propose a novel method that learns objects’ contextualized embedding with dynamic message-passing conditioned on the input natural language and is end-to-end trainable with other downstream deep learning modules. To our knowledge, this model is the first one that significantly outperforms the provided baseline and reaches state-of-the-art performance on grounded SCAN (gSCAN), a grounded natural language navigation dataset designed to require systematic generalization in its test splits.
ML ID: 390
Dialog as a Vehicle for Lifelong Learning
[Details] [PDF] [Slides (PDF)] [Video]
Aishwarya Padmakumar, Raymond J. Mooney
In Position Paper Track at the SIGDIAL Special Session on Physically Situated Dialogue (RoboDial 2.0), July 2020.
Dialog systems research has primarily been focused around two main types of applications – task-oriented dialog systems that learn to use clarification to aid in understanding a goal, and open-ended dialog systems that are expected to carry out unconstrained “chit chat” conversations. However, dialog interactions can also be used to obtain various types of knowledge that can be used to improve an underlying language understanding system, or other machine learning systems that the dialog acts over. In this position paper, we present the problem of designing dialog systems that enable lifelong learning as an important challenge problem, in particular for applications involving physically situated robots. We include examples of prior work in this direction, and discuss challenges that remain to be addressed.
ML ID: 386
Learning a Policy for Opportunistic Active Learning
[Details] [PDF]
Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18), Brussels, Belgium, November 2018.
Active learning identifies data points to label that are expected to be the most useful in improving a supervised model. Opportunistic active learning incorporates active learning into interactive tasks that constrain possible queries during interactions. Prior work has shown that opportunistic active learning can be used to improve grounding of natural language descriptions in an interactive object retrieval task. In this work, we use reinforcement learning for such an object retrieval task, to learn a policy that effectively trades off task completion with model improvement that would benefit future tasks.
ML ID: 368
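The abstract above opens by describing active learning as choosing the data points whose labels should help the model most. A minimal sketch of one common selection rule, uncertainty sampling by prediction entropy, follows. This is a generic textbook strategy swapped in for illustration, not the learned reinforcement-learning policy the paper proposes; the function names and the `predictions` dictionary are hypothetical.

```python
from math import log


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * log(p) for p in probs if p > 0)


def most_uncertain(predictions):
    """Return the key whose predicted class distribution has the highest entropy.

    predictions: dict mapping an example id to a list of class probabilities.
    """
    return max(predictions, key=lambda key: entropy(predictions[key]))


if __name__ == "__main__":
    # Hypothetical classifier outputs for three unlabeled objects.
    predictions = {
        "object_1": [0.90, 0.10],
        "object_2": [0.50, 0.50],
        "object_3": [0.99, 0.01],
    }
    # The object the current model is least sure about is queried first.
    print(most_uncertain(predictions))
```

A fixed rule like this always asks about the most uncertain example. The opportunistic setting described in the abstract differs in that queries compete with completing the current task, which is why the paper learns when to ask rather than hard-coding the choice.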
Learning to Connect Language and Perception
[Details] [PDF]
Raymond J. Mooney
In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), 1598-1601, Chicago, IL, July 2008. Senior Member Paper.
To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Current natural language processing and computer vision systems make extensive use of machine learning to acquire the probabilistic knowledge needed to comprehend linguistic and visual input. However, to date, there has been relatively little work on learning the relationships between the two modalities. In this talk, I will review some of the existing work on learning to connect language and perception, discuss important directions for future research in this area, and argue that the time is now ripe to make a concerted effort to address this important, integrative AI problem.
ML ID: 216
Learning Language Semantics from Ambiguous Supervision
[Details] [PDF]
Rohit J. Kate and Raymond J. Mooney
In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), 895-900, Vancouver, Canada, July 2007.
This paper presents a method for learning a semantic parser from ambiguous supervision. Training data consists of natural language sentences annotated with multiple potential meaning representations, only one of which is correct. Such ambiguous supervision models the type of supervision that can be more naturally available to language-learning systems. Given such weak supervision, our approach produces a semantic parser that maps sentences into meaning representations. An existing semantic parsing learning system that can only learn from unambiguous supervision is augmented to handle ambiguous supervision. Experimental results show that the resulting system is able to cope with ambiguities and learn accurate semantic parsers.
ML ID: 200
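The setting described in the abstract above, sentences paired with several candidate meanings of which only one is correct, can be sketched with a toy iterative re-labeling loop. This is a simplified illustration under stated assumptions, not the paper's system: a word/meaning co-occurrence scorer stands in for the semantic parser, meanings are treated as atomic strings, and all names and example data are hypothetical.

```python
from collections import Counter, defaultdict


def train_scorer(pairs):
    """Count how often each word co-occurs with each meaning."""
    counts = defaultdict(Counter)
    for sentence, meaning in pairs:
        for word in sentence.split():
            counts[word][meaning] += 1
    return counts


def score(counts, sentence, meaning):
    """Sum, over the sentence's words, of the fraction of times
    each word was seen with this meaning."""
    total = 0.0
    for word in sentence.split():
        word_counts = counts.get(word)
        if word_counts:
            total += word_counts[meaning] / sum(word_counts.values())
    return total


def learn_from_ambiguous(data, max_iterations=5):
    """data: list of (sentence, [candidate meanings]).

    Start by treating every candidate as correct, then repeatedly retrain
    and keep only the best-scoring candidate for each sentence.
    """
    pairs = [(s, m) for s, candidates in data for m in candidates]
    for _ in range(max_iterations):
        counts = train_scorer(pairs)
        new_pairs = [
            (s, max(candidates, key=lambda m: score(counts, s, m)))
            for s, candidates in data
        ]
        if new_pairs == pairs:  # labels stopped changing
            break
        pairs = new_pairs
    return train_scorer(pairs), pairs


if __name__ == "__main__":
    toy_data = [
        ("the ball rolls", ["roll(ball)", "bark(dog)"]),
        ("the dog barks", ["bark(dog)", "roll(ball)"]),
        ("ball rolls fast", ["roll(ball)"]),
        ("dog barks loudly", ["bark(dog)"]),
    ]
    _, resolved = learn_from_ambiguous(toy_data)
    for sentence, meaning in resolved:
        print(f"{sentence!r} -> {meaning}")
```

The unambiguous sentences give words like "ball" and "dog" a consistent signal, which lets the loop pick the right candidate for the ambiguous ones. A real semantic parser would compose structured meaning representations rather than score whole strings.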
Learning Language from Perceptual Context: A Challenge Problem for AI
[Details] [PDF]
Raymond J. Mooney
In Proceedings of the 2006 AAAI Fellows Symposium, Boston, MA, July 2006.
We present the problem of learning to understand natural language from examples of utterances paired only with their relevant real-world context as an important challenge problem for AI. Machine learning has been adopted as the most effective way of developing natural-language processing systems; however, currently, complex annotated corpora are required for training. By learning language from perceptual context, the need for laborious annotation is removed and the system's resulting understanding is grounded in its perceptual experience.
ML ID: 192