Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: Connecting Language and Perception

To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Ideally, an AI system would be able to learn language like a human child, by being exposed to utterances in a rich perceptual environment. The perceptual context would provide the necessary supervisory information, and learning the connection between language and perception would ground the system's semantic representations in its perception of the world. As a step in this direction, our research is developing systems that learn semantic parsers and language generators from sentences paired only with their perceptual context. It is part of our research on natural language learning. Our research on this topic is supported by the National Science Foundation through grants IIS-0712097 and IIS-1016312.
  • Grounded Language Learning [Video Lecture]
  • Raymond J. Mooney, Invited Talk, AAAI, 2013.
  • Learning Language from its Perceptual Context [Video Lecture]
  • Raymond J. Mooney, Invited Talk, ECML-PKDD, 2008.

Sub-areas:
  1. Measuring Sound Symbolism in Audio-visual Models
    [Details] [PDF]
    Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney
    In IEEE Spoken Language Technology (SLT) Workshop, December 2024.
    Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks. This study investigates whether pre-trained audio-visual models demonstrate non-arbitrary associations between sounds and visual representations–known as sound symbolism–which is also observed in humans. We developed a specialized dataset with synthesized images and audio samples and assessed these models using a non-parametric approach in a zero-shot setting. Our findings reveal a significant correlation between the models' outputs and established patterns of sound symbolism, particularly in models trained on speech data. These results suggest that such models can capture sound-meaning connections akin to human language processing, providing insights into both cognitive architectures and machine learning strategies.
    ML ID: 434
  2. Multimodal Contextualized Semantic Parsing from Speech
    [Details] [PDF] [Slides (PDF)] [Poster] [Video]
    Jordan Voas, Raymond Mooney, David Harwath
    In Association for Computational Linguistics (ACL), August 2024.
    We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents’ contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent’s knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
    ML ID: 431
  3. What is the Best Automated Metric for Text to Motion Generation?
    [Details] [PDF]
    Jordan Voas
    Masters Thesis, Department of Computer Science, UT Austin, Austin, TX, May 2023.
    There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing meaningful training losses for supervising generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and rarely used coordinate errors show strong correlations. Several recently developed metrics are not recommended due to their low correlation compared to alternatives. Additionally, multiple novel metrics which exhibiting improved correlation and potential for future use.
    ML ID: 423
  4. What is the Best Automated Metric for Text to Motion Generation?
    [Details] [PDF] [Slides (PPT)] [Video]
    Jordan Voas, Yili Wang, Qixing Huang, Raymond Mooney
    In ACM SIGGRAPH Asia, December 2023.
    There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric exhibits extensive benefits over all current alternatives.
    ML ID: 422
  5. Directly Optimizing Evaluation Metrics to Improve Text to Motion
    [Details] [PDF]
    Yili Wang
    Masters Thesis, Department of Computer Science, UT Austin, May 2023.
    There is a long-existing discrepancy between training and testing process of most generative models including both text-to-text models like machine translation (MT), and multi-modal models like image captioning and text-to-motion generation. These models are usually trained to optimize a specific objective like log-likelihood (MLE) in the Seq2Seq models or the KL-divergence in the variational autoencoder (VAE) models. However, they are tested using different evaluation metrics such as the BLEU score and Fréchet Inception Distance (FID). Our paper aims to address such discrepancy in text-to-motion generation models by developing algorithms to directly optimize the target metric during training time. We explore three major techniques: reinforcement learning, contrastive learning methods, and differentiable metrics that are originally applied to natural language processing fields and adapt them to the language-and-motion domain.
    ML ID: 418
  6. Systematic Generalization on gSCAN with Language Conditioned Embedding
    [Details] [PDF] [Video]
    Tong Gao, Qi Huang and Raymond J. Mooney
    In The 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing , December 2020.
    Systematic Generalization refers to a learning algorithm’s ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. As shown in recent work, state-of-the-art deep learning models fail dramatically even on tasks for which they are designed when the test set is systematically different from the training data. We hypothesize that explicitly modeling the relations between objects in their contexts while learning their representations will help achieve systematic generalization. There- fore, we propose a novel method that learns objects’ contextualized embedding with dynamic message-passing conditioned on the input natural language and is end-to-end trainable with other downstream deep learning modules. To our knowledge, this model is the first one that significantly outperforms the provided baseline and reaches state-of-the-art performance on grounded SCAN (gSCAN), a grounded natural language navigation dataset designed to require systematic generalization in its test splits.
    ML ID: 390
  7. Dialog as a Vehicle for Lifelong Learning
    [Details] [PDF] [Slides (PDF)] [Video]
    Aishwarya Padmakumar, Raymond J. Mooney
    In Position Paper Track at the SIGDIAL Special Session on Physically Situated Dialogue (RoboDial 2.0), July 2020.
    Dialog systems research has primarily been focused around two main types of applications – task-oriented dialog systems that learn to use clarification to aid in understanding a goal, and open-ended dialog systems that are expected to carry out unconstrained “chit chat” conversations. However, dialog interactions can also be used to obtain various types of knowledge that can be used to improve an underlying language understanding system, or other machine learning systems that the dialog acts over. In this position paper, we present the problem of designing dialog systems that enable lifelong learning as an important challenge problem, in particular for applications involving physically situated robots. We include examples of prior work in this direction, and discuss challenges that remain to be addressed.
    ML ID: 386
  8. Generating Animated Videos of Human Activities from Natural Language Descriptions
    [Details] [PDF] [Poster]
    Angela S. Lin, Lemeng Wu, Rodolfo Corona , Kevin Tai , Qixing Huang , Raymond J. Mooney
    In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2018, December 2018.
    Generating realistic character animations is of great importance in computer graphics and related domains. Existing approaches for this application involve a significant amount of human interaction. In this paper, we introduce a system that maps a natural language description to an animation of a humanoid skeleton. Our system is a sequence-to-sequence model that is pretrained with an autoencoder objective and then trained end-to-end.
    ML ID: 369
  9. Learning a Policy for Opportunistic Active Learning
    [Details] [PDF]
    Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney
    In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18), Brussels, Belgium, November 2018.
    Active learning identifies data points to label that are expected to be the most useful in improving a supervised model. Opportunistic active learning incorporates active learning into interactive tasks that constrain possible queries during interactions. Prior work has shown that opportunistic active learning can be used to improve grounding of natural language descriptions in an interactive object retrieval task. In this work, we use reinforcement learning for such an object retrieval task, to learn a policy that effectively trades off task completion with model improvement that would benefit future tasks.
    ML ID: 368
  10. Learning to Connect Language and Perception
    [Details] [PDF]
    Raymond J. Mooney
    In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), 1598--1601, Chicago, IL, July 2008. Senior Member Paper.
    To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Current natural language processing and computer vision systems make extensive use of machine learning to acquire the probabilistic knowledge needed to comprehend linguistic and visual input. However, to date, there has been relatively little work on learning the relationships between the two modalities. In this talk, I will review some of the existing work on learning to connect language and perception, discuss important directions for future research in this area, and argue that the time is now ripe to make a concerted effort to address this important, integrative AI problem.
    ML ID: 216
  11. Learning Language Semantics from Ambiguous Supervision
    [Details] [PDF]
    Rohit J. Kate and Raymond J. Mooney
    In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), 895-900, Vancouver, Canada, July 2007.
    This paper presents a method for learning a semantic parser from ambiguous supervision. Training data consists of natural language sentences annotated with multiple potential meaning representations, only one of which is correct. Such ambiguous supervision models the type of supervision that can be more naturally available to language-learning systems. Given such weak supervision, our approach produces a semantic parser that maps sentences into meaning representations. An existing semantic parsing learning system that can only learn from unambiguous supervision is augmented to handle ambiguous supervision. Experimental results show that the resulting system is able to cope up with ambiguities and learn accurate semantic parsers.
    ML ID: 200
  12. Learning Language from Perceptual Context: A Challenge Problem for AI
    [Details] [PDF]
    Raymond J. Mooney
    In Proceedings of the 2006 AAAI Fellows Symposium, Boston, MA, July 2006.
    We present the problem of learning to understand natural language from examples of utterances paired only with their relevant real-world context as an important challenge problem for AI. Machine learning has been adopted as the most effective way of developing natural-language processing systems; however, currently, complex annotated corpora are required for training. By learning language from perceptual context, the need for laborious annotation is removed and the system's resulting understanding is grounded in its perceptual experience.
    ML ID: 192