Machine Learning Research Group | University of Texas

Publications: Language and Vision

To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Vision is the primary source of perception, and grounding language in vision is an important AI problem with many applications. Our group has focused particularly on automated video captioning, producing natural language descriptions of short video clips using both graphical models and deep neural networks.


Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
[Details] [PDF]
Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu and Ray Mooney
In European Chapter of the Association for Computational Linguistics (EACL), March 2026.
Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2 percent after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
ML ID: 446
Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
[Details] [PDF]
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
International Conference on Robotics and Automation (ICRA), June 2026.
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We propose MICoBot, a system that enables the human and robot, both using natural language, to take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated willingness to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models. See additional videos and materials at our project site: https://robin-lab.cs.utexas.edu/MicoBot/.
ML ID: 441
Augmenting Robotic Capabilities through Natural Language
[Details] [PDF] [Slides (PDF)]
Albert Yu
October 2025. Ph.D. Proposal.
Despite rapid advances in language and vision models, current robots still lag far behind human physical capabilities due to the relative scarcity of real-world data compared to online text and images. How can we leverage abundant language data to advance robotic capabilities? Language provides semantic structure that facilitates the understanding of diverse data, improving sample efficiency in scarce data regimes. It also provides a natural communicative medium when interacting with and learning from humans. To leverage the first benefit of language, we first take inspiration from how humans teach each other in video tutorials, through simultaneous video and language streams, to more efficiently teach robots new skills. We then show that language can bridge wide visual sim2real gaps, enabling robots to learn tasks with just a few real-world demonstrations by leveraging knowledge from imperfect simulation data. To leverage the second benefit of language, we explore how bidirectional dialog can enable robots to solve complex manipulation tasks by communicating and collaborating with a wide distribution of human collaborators in the real world. We develop a robotic framework that requests and proactively offers help through mixed-initiative, free-form dialog, enabling the robot to adapt to changing human preferences and each agent’s physical capabilities to be strategically utilized. Finally, we discuss avenues of future work, such as how human-robot collaboration can be facilitated through dialog-based replanning, how both agents can improve through bidirectional feedback, and how language-based guidelines extracted from manuals can enable robots to behave more safely and learn more quickly.
ML ID: 444
MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
[Details] [PDF]
Vanya Cohen, Raymond Mooney
Preprint, January 2025.
Entity tracking is a fundamental challenge in natural language understanding, requiring models to maintain coherent representations of entities. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using two structured domains, Chess and the Shell Game, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based tracking and that this performance gap stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet substantial limitations remain, especially in long-horizon multimodal scenarios. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.
ML ID: 437
Multimodal Contextualized Semantic Parsing from Speech
[Details] [PDF] [Slides (PDF)] [Poster] [Video]
Jordan Voas, Raymond Mooney, David Harwath
In Association for Computational Linguistics (ACL), August 2024.
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents’ contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent’s knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
ML ID: 431
Incorporating External Information for Visual Question Answering
[Details] [PDF] [Slides (PDF)]
Jialin Wu
PhD Thesis, Department of Computer Science, UT Austin, August 2022.
Visual question answering (VQA) has recently emerged as a challenging multi-modal task and has gained popularity. The goal is to answer questions that query information associated with the visual content in the given image. Since the required information could be from both inside and outside the image, common types of visual features, such as object and attribute detection, fail to provide enough materials for answering the questions. External information, such as captions, explanations, encyclopedia articles, and commonsense databases, can help VQA systems comprehensively understand the image, reason following the right path, and access external facts. Specifically, they provide concise descriptions of the image, precise reasons for the correct answer, and factual knowledge beyond the image. In this dissertation, we present our work on generating image captions that are targeted to help answer a specific visual question. We use explanations to recognize the critical objects to prevent the VQA models from taking language prior shortcuts. We introduce an approach that generates textual explanations and utilizes them to determine which answer is most supported. Finally, we explore retrieving and exploiting external knowledge beyond the visual content, which is indispensable, to help answer knowledge-based visual questions.
ML ID: 413
Zero-shot Video Moment Retrieval With Off-the-Shelf Models
[Details] [PDF] [Poster]
Anuj Diwan, Puyuan Peng, Raymond J. Mooney
In Workshop on Transfer Learning for Natural Language Processing at NeurIPS 2022, December 2022.
For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that does not perform any additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching and postprocessing, all using only off-the-shelf models. On the QVHighlights (Lei et al., 2021) benchmark for VMR, we vastly improve performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised by over 74%. Further, we also show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on mAP metrics; and that it also performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose interesting future directions.
ML ID: 412
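The three-step structure named in the abstract above (moment proposal, moment-query matching, postprocessing) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: sliding windows stand in for the proposal step, mean cosine similarity between per-frame embeddings and a query embedding stands in for matching, and greedy temporal non-maximum suppression stands in for postprocessing. The embeddings are toy vectors rather than outputs of any real off-the-shelf model.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def propose_moments(num_frames, window_sizes):
    """Step 1 (assumed): enumerate sliding windows [start, end) as candidates."""
    return [(s, s + w) for w in window_sizes for s in range(0, num_frames - w + 1)]

def score_moment(frame_embs, query_emb, moment):
    """Step 2 (assumed): mean frame-query similarity inside the window."""
    start, end = moment
    sims = [cosine(frame_embs[t], query_emb) for t in range(start, end)]
    return sum(sims) / len(sims)

def temporal_iou(a, b):
    """Intersection over union of two [start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def nms(scored, iou_threshold):
    """Step 3 (assumed): greedy temporal non-maximum suppression."""
    kept = []
    for moment, score in sorted(scored, key=lambda x: x[1], reverse=True):
        if all(temporal_iou(moment, k) <= iou_threshold for k, _ in kept):
            kept.append((moment, score))
    return kept

def retrieve(frame_embs, query_emb, window_sizes=(2, 3), iou_threshold=0.5):
    """Run the three hypothetical steps and return ranked (moment, score) pairs."""
    moments = propose_moments(len(frame_embs), window_sizes)
    scored = [(m, score_moment(frame_embs, query_emb, m)) for m in moments]
    return nms(scored, iou_threshold)

# Toy video: frames 2 and 3 match the query direction, the others do not.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]
results = retrieve(frames, query)
print(results[0])  # top-ranked moment and its score
```

In this toy run the window covering frames 2 and 3 ranks first, and the longer windows that heavily overlap it are suppressed. No training happens at any step, which is the property the abstract emphasizes.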
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
[Details] [PDF] [Poster] [Video]
Jialin Wu, Raymond Mooney
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), December 2022.
Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate. Retrievals are frequently too general and fail to cover specific knowledge needed to answer the question. Also, the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevancy. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, the currently largest outside-knowledge VQA dataset. We also combine the retrieved knowledge with state-of-the-art VQA models, and achieve a new state-of-the-art performance on OK-VQA.
ML ID: 411
Using Natural Language to Aid Task Specification in Sequential Decision Making Problems
[Details] [PDF] [Slides (PDF)] [Video]
Prasoon Goyal
PhD Thesis, Department of Computer Science, UT Austin, July 2022.
Building intelligent agents that can help humans accomplish everyday tasks, such as a personal robot at home or a robot in a work environment, is a long-standing goal of artificial intelligence. One of the requirements for such general-purpose agents is the ability to teach them new tasks or skills relatively easily. Common approaches to teaching agents new skills include reinforcement learning (RL) and imitation learning (IL). However, specifying the task to the learning agent, i.e. designing effective reward functions for reinforcement learning and providing demonstrations for imitation learning, are often cumbersome and time-consuming. Further, designing reward functions and providing a set of demonstrations that sufficiently disambiguates the desired task may not be particularly accessible for end users without a technical background.

In this dissertation, we explore using natural language as an auxiliary signal to aid task specification, which reduces the burden on the end user. To make reward design easier, we propose a novel framework that is used to generate language-based rewards in addition to the extrinsic rewards from the environment for faster policy training using RL. We show that using our framework, very simple extrinsic rewards along with a natural language description of the task are sufficient to teach new tasks to the learning agent. To ameliorate the problem of providing demonstrations, we propose a new setting that enables an agent to learn a new task without demonstrations in an IL setting, given a demonstration from a related task and a natural language description of the difference between the desired task and the demonstrated task. The techniques we develop for this setting would enable teaching multiple related tasks to learning agents by providing a small set of demonstrations and several natural language descriptions, thereby reducing the burden of providing demonstrations for each task.

The primary contributions of this dissertation include novel problem settings, benchmarks, and algorithms that allow using natural language as an auxiliary modality for task specification in RL and IL. We believe this dissertation will serve as a foundation for future research along these lines, to make progress toward having intelligent agents that can conveniently be taught new tasks by end users.
ML ID: 410
Multi-Modal Answer Validation for Knowledge-Based VQA
[Details] [PDF] [Video]
Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
In Proceedings of the AAAI Conference on Artificial Intelligence, February 2022.
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in a variety of forms, including visual, textual, and commonsense knowledge. The use of more knowledge sources, however, also increases the chance of retrieving more irrelevant or noisy facts, making it difficult to comprehend the facts and find the answer. To address this challenge, we propose Multi-modal Answer Validation using External knowledge (MAVEx), where the idea is to validate a set of promising answer candidates based on answer-specific knowledge retrieval. This is in contrast to existing approaches that search for the answer in a vast collection of often irrelevant facts. Our approach aims to learn which knowledge source should be trusted for each answer candidate and how to validate the candidate using that source. We consider a multi-modal setting, relying on both textual and visual knowledge resources, including images searched using Google, sentences from Wikipedia articles, and concepts from ConceptNet. Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results.
ML ID: 401
Towards Automated Error Analysis: Learning to Characterize Errors
[Details] [PDF] [Poster]
Tong Gao, Shivang Singh, Raymond J. Mooney
Short version appears in the 19th International Florida Artificial Intelligence Research Society Conference (FLAIRS), May 2022.
Characterizing the patterns of errors that a system makes helps researchers focus future development on increasing its accuracy and robustness. We propose a novel form of “meta learning” that automatically learns interpretable rules that characterize the types of errors that a system makes, and demonstrate these rules’ ability to help understand and improve two NLP systems. Our approach works by collecting error cases on validation data, extracting meta-features describing these samples, and finally learning rules that characterize errors using these features. We apply our approach to ViLBERT, for Visual Question Answering, and RoBERTa, for Common Sense Question Answering. Our system learns interpretable rules that provide insights into systemic errors these systems make on the given tasks. Using these insights, we are also able to “close the loop” and modestly improve performance of these systems.
ML ID: 400
Incorporating Textual Resources to Improve Visual Question Answering
[Details] [PDF] [Slides (PDF)]
Jialin Wu
September 2021. Ph.D. Proposal.
Recently, visual question answering (VQA) emerged as a challenging multi-modal task and gained in popularity. The goal is to answer questions that query information associated with the visual content in the given image. Since the required information could be from both inside and outside the image, common types of visual features, such as object and attribute detection, fail to provide enough materials for answering the questions. Textual resources, such as captions, explanations, and encyclopedia articles, can help VQA systems comprehensively understand the image, reason following the right path, and access external facts. Specifically, they provide concise descriptions of the image, precise reasons for the correct answer, and factual knowledge beyond the image. We presented completed work on generating image captions that are targeted to help answer a specific visual question. We introduced an approach that generates textual explanations and used these explanations to determine which answer is most supported. We used explanations to recognize the critical objects for solving the visual question and trained the VQA systems to be most influenced by these objects. We also explored using textual resources to provide external knowledge beyond the visual content that is indispensable for a recent trend towards knowledge-based VQA. We further propose to break down visual questions such that each segment, which carries a single piece of semantic content in the question, can be associated with its specific knowledge. This separation aims to help the VQA system understand the question structure to satisfy the need for linking different aspects of the question to different types of information within and beyond the image.
ML ID: 397
Improving VQA and its Explanations by Comparing Competing Explanations
[Details] [PDF] [Slides (PDF)]
Jialin Wu, Liyan Chen, Raymond J. Mooney
In The AAAI Conference on Artificial Intelligence (AAAI), Explainable Agency in Artificial Intelligence Workshop, February 2021.
Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer distribution given the question and visual content. As a result, these systems frequently take shortcuts, focusing on simple visual concepts or question priors. This phenomenon becomes more problematic as questions become more complex and require more reasoning and commonsense knowledge. To address this issue, we present a novel framework that uses explanations for competing answers to help VQA systems select the correct answer. By training on human textual explanations, our framework builds better representations for the questions and visual content, and then reweights confidences in the answer candidates using either generated or retrieved explanations from the training set. We evaluate our framework on the VQA-X dataset, which has more difficult questions with human explanations, achieving new state-of-the-art results on both VQA and its explanations.
ML ID: 387
Dialog Policy Learning for Joint Clarification and Active Learning Queries
[Details] [PDF] [Slides (PDF)] [Poster] [Video]
Aishwarya Padmakumar, Raymond J. Mooney
In The AAAI Conference on Artificial Intelligence (AAAI), February 2021.
Intelligent systems need to be able to recover from mistakes, resolve uncertainty, and adapt to novel concepts not seen during training. Dialog interaction can enable this by the use of clarifications for correction and resolving uncertainty, and active learning queries to learn new concepts encountered during operation. Prior work on dialog systems has focused exclusively on learning either how to perform clarification/information seeking or how to perform active learning. In this work, we train a hierarchical dialog policy to jointly perform both clarification and active learning in the context of an interactive language-based image retrieval task motivated by an on-line shopping application, and demonstrate that jointly learning dialog policies for clarification and active learning is more effective than the use of static dialog policies for one or both of these functions.
ML ID: 385
Dialog as a Vehicle for Lifelong Learning of Grounded Language Understanding Systems
[Details] [PDF] [Slides (PDF)]
Aishwarya Padmakumar
PhD Thesis, Department of Computer Science, The University of Texas at Austin, August 2020.
Natural language interfaces have the potential to make various forms of technology, including mobile phones and computers as well as robots or other machines such as ATMs and self-checkout counters, more accessible and less intimidating to users who are unfamiliar or uncomfortable with other types of interfaces. In particular, natural language understanding systems on physical robots face a number of challenges, including the need to ground language in perception, the ability to adapt to changes in the environment and novel uses of language, and to deal with uncertainty in understanding. To effectively handle these challenges, such systems need to perform lifelong learning - continually updating the scope and predictions of the model with user interactions. In this thesis, we discuss ways in which dialog interaction with users can be used to improve grounded natural language understanding systems, motivated by service robot applications. We focus on two types of queries that can be used in such dialog systems – active learning queries to elicit knowledge about the environment that can be used to improve perceptual models, and clarification questions that confirm the system’s hypotheses, or elicit specific information required to complete a task. Our goal is to build a system that can learn how to interact with users balancing a quick completion of tasks desired by the user with asking additional active learning questions to improve the underlying grounded language understanding components. We present work on jointly improving semantic parsers from and learning a dialog policy for clarification dialogs, that improve a robot’s ability to understand natural language commands. We introduce the framework of opportunistic active learning, where a robot introduces opportunistic queries, that may not be immediately relevant, into an interaction in the hope of improving performance in future interactions. 
We demonstrate the usefulness of this framework in learning to ground natural language descriptions of objects, and learn a dialog policy for such interactions. We also learn dialog policies that balance task completion, opportunistic active learning, and attribute-based clarification questions. Finally, we attempt to expand this framework to different types of underlying models of grounded language understanding.
ML ID: 389
Self-Critical Reasoning for Robust Visual Question Answering
[Details] [PDF] [Slides (PDF)] [Poster]
Jialin Wu and Raymond J. Mooney
In Proceedings of Neural Information Processing Systems (NeurIPS), December 2019.
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution [1]. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art, i.e., 49.5% using textual explanations and 48.5% using automatically annotated regions.
ML ID: 380
Hidden State Guidance: Improving Image Captioning using An Image Conditioned Autoencoder
[Details] [PDF] [Poster]
Jialin Wu and Raymond J. Mooney
In Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS 2019, December 2019.
Most RNN-based image captioning models receive supervision on the output words to mimic human captions. Therefore, the hidden states can only receive noisy gradient signals via layers of back-propagation through time, leading to less accurate generated captions. Consequently, we propose a novel framework, Hidden State Guidance (HSG), that matches the hidden states in the caption decoder to those in a teacher decoder trained on an easier task of autoencoding the captions conditioned on the image. During training with the REINFORCE algorithm, the conventional rewards are sentence-based evaluation metrics equally distributed to each generated word, no matter their relevance. HSG provides a word-level reward that helps the model learn better hidden representations. Experimental results demonstrate that HSG clearly outperforms various state-of-the-art caption decoders using either raw images, detected objects, or scene graph features as inputs.
ML ID: 379
Generating Question Relevant Captions to Aid Visual Question Answering
[Details] [PDF] [Slides (PPT)]
Jialin Wu, Zeyuan Hu, Raymond J. Mooney
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, August 2019.
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions.
ML ID: 375
Faithful Multimodal Explanation for Visual Question Answering
[Details] [PDF] [Slides (PPT)]
Jialin Wu and Raymond J. Mooney
In Proceedings of the Second BlackboxNLP Workshop at ACL, 103-112, Florence, Italy, August 2019.
AI systems’ ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have enabled significant progress on many challenging problems such as visual question answering (VQA). However, most of them are opaque black boxes with limited explanatory capability. This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning process while capturing the style of comprehensible human explanations. Extensive experimental evaluation demonstrates the advantages of this approach compared to competing methods using both automated metrics and human evaluation.
ML ID: 374
Learning a Policy for Opportunistic Active Learning
[Details] [PDF]
Aishwarya Padmakumar, Peter Stone, Raymond J. Mooney
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP-18), Brussels, Belgium, November 2018.
Active learning identifies data points to label that are expected to be the most useful in improving a supervised model. Opportunistic active learning incorporates active learning into interactive tasks that constrain possible queries during interactions. Prior work has shown that opportunistic active learning can be used to improve grounding of natural language descriptions in an interactive object retrieval task. In this work, we use reinforcement learning for such an object retrieval task, to learn a policy that effectively trades off task completion with model improvement that would benefit future tasks.
ML ID: 368
Explainable Improved Ensembling for Natural Language and Vision
[Details] [PDF] [Slides (PPT)] [Slides (PDF)]
Nazneen Rajani
PhD Thesis, Department of Computer Science, The University of Texas at Austin, July 2018.
Ensemble methods are well-known in machine learning for improving prediction accuracy. However, they do not adequately discriminate among underlying component models. The measure of how good a model is can sometimes be estimated from “why” it made a specific prediction. We propose a novel approach called Stacking With Auxiliary Features (SWAF) that effectively leverages component models by integrating such relevant information from context to improve ensembling. Using auxiliary features, our algorithm learns to rely on systems that agree not just on an output prediction but also on the source or origin of that output. We demonstrate our approach on challenging structured prediction problems in Natural Language Processing and Vision including Information Extraction, Object Detection, and Visual Question Answering. We also present a variant of SWAF for combining systems that do not have training data in an unsupervised ensemble with systems that do have training data. Our combined approach obtains a new state-of-the-art, beating our prior performance on Information Extraction. The state-of-the-art systems on many AI applications are ensembles of deep-learning models. These models are hard to interpret and can sometimes make odd mistakes. Explanations make AI systems more transparent and also justify their predictions. We propose a scalable approach to generate visual explanations for ensemble methods using the localization maps of the component systems. Crowdsourced human evaluation on two new metrics indicates that our ensemble’s explanation significantly qualitatively outperforms individual systems’ explanations.
ML ID: 364
Joint Image Captioning and Question Answering
[Details] [PDF] [Poster]
Jialin Wu, Zeyuan Hu and Raymond J. Mooney
In VQA Challenge and Visual Dialog Workshop at the 31st IEEE Conference on Computer Vision and Pattern Recognition (CVPR-18), June 2018.
Answering visual questions requires acquiring everyday common knowledge and modeling the semantic connections among different parts of an image, which is too difficult for VQA systems to learn from images with answers as the only supervision. Meanwhile, image captioning systems that use beam search tend to generate similar captions and fail to describe images diversely. To address these issues, we present a system in which the two tasks complement each other and which is capable of jointly producing image captions and answering visual questions. In particular, we utilize question and image features to generate question-related captions and use the generated captions as additional features to provide new knowledge to the VQA system. For image captioning, our system attains more informative results in terms of the relative improvements on VQA tasks as well as competitive results using automated metrics. Applying our system to the VQA tasks, our results on the VQA v2 dataset achieve 65.8% using generated captions and 69.1% using annotated captions on the validation set and 68.4% on the test-standard set. Further, an ensemble of 10 models results in 69.7% on the test-standard split.
ML ID: 362
Stacking With Auxiliary Features for Visual Question Answering
[Details] [PDF] [Poster]
Nazneen Fatema Rajani, Raymond J. Mooney
In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2217-2226, 2018.
Visual Question Answering (VQA) is a well-known and challenging task that requires systems to jointly reason about natural language and vision. Deep learning models in various forms have been the standard for solving VQA. However, some of these VQA models are better at certain types of image-question pairs than other models. Ensembling VQA models intelligently to leverage their diverse expertise is, therefore, advantageous. Stacking With Auxiliary Features (SWAF) is an intelligent ensembling technique which learns to combine the results of multiple models using features of the current problem as context. We propose four categories of auxiliary features for ensembling for VQA. Three out of the four categories of features can be inferred from an image-question pair and do not require querying the component models. The fourth category of auxiliary features uses model-specific explanations. In this paper, we describe how we use these various categories of auxiliary features to improve performance for VQA. Using SWAF to effectively ensemble three recent systems, we obtain a new state-of-the-art. Our work also highlights the advantages of explainable AI models.
ML ID: 360
Ensembling Visual Explanations for VQA
[Details] [PDF] [Poster]
Nazneen Fatema Rajani, Raymond J. Mooney
In Proceedings of the NIPS 2017 workshop on Visually-Grounded Interaction and Language (ViGIL), December 2017.
Explanations make AI systems more transparent and also justify their predictions. The top-ranked Visual Question Answering (VQA) systems are ensembles of multiple systems; however, there has been no work on generating explanations for such ensembles. In this paper, we propose different methods for ensembling visual explanations for VQA using the localization maps of the component systems. Our crowd-sourced human evaluation indicates that our ensemble visual explanation is superior to each individual system’s visual explanation, although the results vary depending on the individual system that the ensemble is compared against as well as the number of individual systems that agree with the ensemble model’s answer. Overall, our ensemble explanation is better 63% of the time when compared to any individual system’s explanation. Our algorithm is also efficient and scales linearly in the number of component systems in the ensemble.
ML ID: 359
Natural-Language Video Description with Deep Recurrent Neural Networks
[Details] [PDF] [Slides (PDF)]
Subhashini Venugopalan
PhD Thesis, Department of Computer Science, The University of Texas at Austin, August 2017.
For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting meaning from video pixels and generating a sentence description is a very complex problem. The goal of this thesis is to develop models that can automatically generate natural language descriptions for events in videos. It presents several approaches to automatic video description by building on recent advances in “deep” machine learning. The techniques presented in this thesis view the task of video description as akin to machine translation, treating the video domain as a source “language” and using deep neural net architectures to “translate” videos to text. Specifically, I develop video captioning techniques using a unified deep neural network with both convolutional and recurrent structure, modeling the temporal elements in videos and language with deep recurrent neural networks. In my initial approach, I adapt a model that can learn from paired images and captions to transfer knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that can jointly model a sequence of video frames and a sequence of words. To further improve grammaticality and descriptive quality, I also propose methods to integrate linguistic knowledge from plain text corpora. Additionally, I show that such linguistic knowledge can help describe novel objects unseen in paired image/video-caption data. Finally, moving beyond short video clips, I present methods to process longer multi-activity videos, specifically to jointly segment and describe coherent event sequences in full-length movies.
ML ID: 349
Using Explanations to Improve Ensembling of Visual Question Answering Systems
[Details] [PDF] [Poster]
Nazneen Fatema Rajani and Raymond J. Mooney
In Proceedings of the IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI), 43-47, Melbourne, Australia, August 2017.
We present results on using explanations as auxiliary features to improve stacked ensembles for Visual Question Answering (VQA). VQA is a challenging task that requires systems to jointly reason about natural language and vision. We present results applying a recent ensembling approach to VQA, Stacking with Auxiliary Features (SWAF), which learns to combine the results of multiple systems. We propose using features based on explanations to improve SWAF. Using explanations we are able to improve ensembling of three recent VQA systems.
ML ID: 346
Multi-Modal Word Synset Induction
[Details] [PDF]
Jesse Thomason and Raymond J. Mooney
In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17), 4116--4122, Melbourne, Australia, 2017.
A word in natural language can be polysemous, having multiple meanings, as well as synonymous, meaning the same thing as other words. Word sense induction attempts to find the senses of polysemous words. Synonymy detection attempts to find when two words are interchangeable. We combine these tasks, first inducing word senses and then detecting similar senses to form word-sense synonym sets (synsets) in an unsupervised fashion. Given pairs of images and text with noun phrase labels, we perform synset induction to produce collections of underlying concepts described by one or more noun phrases. We find that considering multi-modal features from both visual and textual context yields better induced synsets than using either context alone. Human evaluations show that our unsupervised, multi-modally induced synsets are comparable in quality to annotation-assisted ImageNet synsets, achieving about 84% of ImageNet synsets' approval.
ML ID: 344
Captioning Images with Diverse Objects
[Details] [PDF] [Slides (PDF)] [Poster]
Subhashini Venugopalan and Lisa Anne Hendricks and Marcus Rohrbach and Raymond Mooney and Trevor Darrell and Kate Saenko
In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR-17), 5753--5761, 2017.
Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets. Our model takes advantage of external sources -- labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text. We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets. We demonstrate that our model exploits semantic information to generate captions for hundreds of object categories in the ImageNet object recognition dataset that are not observed in MSCOCO image-caption training data, as well as many categories that are observed very rarely. Both automatic evaluations and human judgements show that our model considerably outperforms prior work in being able to describe many more categories of objects.
ML ID: 341
Stacking With Auxiliary Features: Improved Ensembling for Natural Language and Vision
[Details] [PDF] [Slides (PDF)]
Nazneen Fatema Rajani
November 2016. PhD proposal, Department of Computer Science, The University of Texas at Austin.
Ensembling methods are well known in machine learning for improving prediction accuracy. However, they are limited in the sense that they cannot effectively discriminate among underlying component models. Some models perform better at certain types of input instances than other models. The measure of how good a model is can sometimes be gauged from "where" it extracted the output and "why" it made the prediction. This information can be exploited to leverage the component models in an ensemble. In this proposal, we present stacking with auxiliary features that integrates relevant information from multiple sources to improve ensembling. We use two types of auxiliary features - instance features and provenance features. The instance features enable the stacker to discriminate across input instances while the provenance features enable the stacker to discriminate across component systems. When combined together, our algorithm learns to rely on systems that not just agree on an output but also the provenance of this output in conjunction with the input instance type.
We demonstrate our approach on three very different and difficult problems: Cold Start Slot Filling, Tri-lingual Entity Discovery and Linking, and ImageNet Object Detection. The first two problems are well known tasks in Natural Language Processing, and the third one is in the domain of Computer Vision. Our algorithm obtains state-of-the-art results on the first two tasks and significant improvements on the ImageNet task, thus verifying the power and generality of our approach. We also present a novel approach using stacking for combining systems that do not have training data in an unsupervised ensemble with systems that do have training data. Our combined approach achieves state-of-the-art on the Cold Start Slot Filling and Tri-lingual Entity Discovery and Linking tasks, beating our own prior performance on ensembling just the supervised systems.
We propose several short-term and long-term extensions to our work. In the short term, we focus our work on using more semantic instance-level features for all three tasks, and on using language-independent non-lexical features for the two NLP tasks. In the long term, we propose to demonstrate our ensembling algorithm on the Visual Question Answering task and to use textual/visual explanations as auxiliary features for stacking.
ML ID: 340
Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
[Details] [PDF] [Poster]
Subhashini Venugopalan and Lisa Anne Hendricks and Raymond Mooney and Kate Saenko
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 1961--1966, Austin, Texas, 2016.
This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos. Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality.
ML ID: 328
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
[Details] [PDF]
Lisa Anne Hendricks and Subhashini Venugopalan and Marcus Rohrbach and Raymond Mooney and Kate Saenko and Trevor Darrell
In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-16), 1--10, 2016.
While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe objects in context. In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets. Our method achieves this by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts. Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet. In contrast, our model can compose sentences that describe novel objects and their interactions with other objects. We demonstrate our model’s ability to describe novel concepts by empirically evaluating its performance on MSCOCO and show qualitative results on ImageNet images of objects for which no paired image-sentence data exist. Further, we extend our approach to generate descriptions of objects in video clips. Our results show that DCC has distinct advantages over existing image and video captioning approaches for generating descriptions of new objects in context.
ML ID: 327
Natural Language Video Description using Deep Recurrent Neural Networks
[Details] [PDF] [Slides (PDF)]
Subhashini Venugopalan
November 2015. PhD proposal, Department of Computer Science, The University of Texas at Austin.
For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting the meaning from video pixels and generating a sentence description is a very complex problem. The goal of my research is to develop models that can automatically generate natural language (NL) descriptions for events in videos. As a first step, this proposal presents deep recurrent neural network models for video to text generation. I build on recent "deep" machine learning approaches to develop video description models using a unified deep neural network with both convolutional and recurrent structure. This technique treats the video domain as another "language" and takes a machine translation approach using the deep network to translate videos to text. In my initial approach, I adapt a model that can learn on images and captions to transfer knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that can jointly model a sequence of video frames and a sequence of words. The second part of the proposal outlines a set of models to significantly extend work in this area. Specifically, I propose techniques to integrate linguistic knowledge from plain text corpora; and attention methods to focus on objects and track their interactions to generate more diverse and accurate descriptions. To move beyond short video clips, I also outline models to process multi-activity movie videos, learning to jointly segment and describe coherent event sequences. I propose further extensions to take advantage of movie scripts and subtitle information to generate richer descriptions.
ML ID: 324
Sequence to Sequence -- Video to Text
[Details] [PDF]
Subhashini Venugopalan and Marcus Rohrbach and Jeff Donahue and Raymond J. Mooney and Trevor Darrell and Kate Saenko
In Proceedings of the 2015 International Conference on Computer Vision (ICCV-15), Santiago, Chile, December 2015.
Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
ML ID: 319
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
[Details] [PDF] [Slides (PDF)]
Subhashini Venugopalan and Huijuan Xu and Jeff Donahue and Marcus Rohrbach and Raymond Mooney and Kate Saenko
In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics -- Human Language Technologies (NAACL HLT 2015), 1494--1504, Denver, Colorado, June 2015.
Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
ML ID: 313
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
[Details] [PDF]
Jesse Thomason and Subhashini Venugopalan and Sergio Guadarrama and Kate Saenko and Raymond Mooney
In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), 1218--1227, Dublin, Ireland, August 2014.
This paper integrates techniques in natural language processing and computer vision to improve recognition and description of entities and activities in real-world videos. We propose a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics. We use state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video. Our factor graph model combines these detection confidences with probabilistic knowledge mined from text corpora to estimate the most likely subject, verb, object, and place. Results on YouTube videos show that our approach improves both the joint detection of these latent, diverse sentence components and the detection of some individual components when compared to using the vision system alone, as well as over a previous n-gram language-modeling approach. The joint detection allows us to automatically generate more accurate, richer sentential descriptions of videos with a wide array of possible content.
ML ID: 304
Integrating Visual and Linguistic Information to Describe Properties of Objects
[Details] [PDF]
Calvin MacKenzie
2014. Undergraduate Honors Thesis, Computer Science Department, University of Texas at Austin.
Generating sentences from images has historically been performed with standalone Computer Vision systems. The idea of combining visual and linguistic information has been gaining traction in the Computer Vision and Natural Language Processing communities over the past several years. The motivation for a combined system is to generate richer linguistic descriptions of images. Standalone vision systems are typically unable to generate linguistically rich descriptions. This approach combines abundant available language data to clean up noisy results from standalone vision systems.
This thesis investigates the performance of several models which integrate information from language and vision systems in order to describe certain attributes of objects. The attributes used were split into two categories: color attributes and other attributes. Our proposed model was found to be statistically significantly more accurate than the vision system alone for both sets of attributes.
ML ID: 302
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition
[Details] [PDF] [Poster]
Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
In Proceedings of the 14th International Conference on Computer Vision (ICCV-2013), 2712--2719, Sydney, Australia, December 2013.
Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action, and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
ML ID: 295
A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities
[Details] [PDF]
Stephen Roller and Sabine Schulte im Walde
In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 1146--1157, Seattle, WA, October 2013.
Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
ML ID: 294
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
[Details] [PDF] [Slides (PPT)]
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama
In Proceedings of the NAACL HLT Workshop on Vision and Language (WVL '13), 10--19, Atlanta, Georgia, July 2013.
We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.
ML ID: 290
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
[Details] [PDF] [Slides (PPT)]
Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama
In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013), 541--547, July 2013.
We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61 percent of the time.
ML ID: 282
Improving Video Activity Recognition using Object Recognition and Text Mining
[Details] [PDF] [Slides (PPT)]
Tanvi S. Motwani and Raymond J. Mooney
In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), 600--605, August 2012.
Recognizing activities in real-world videos is a challenging AI problem. We present a novel combination of standard activity classification, object recognition, and text mining to learn effective activity recognizers without ever explicitly labeling training videos. We cluster verbs used to describe videos to automatically discover classes of activities and produce a labeled training set. This labeled data is then used to train an activity classifier based on spatio-temporal features. Next, text mining is employed to learn the correlations between these verbs and related objects. This knowledge is then used together with the outputs of an off-the-shelf object recognizer and the trained activity classifier to produce an improved activity recognizer. Experiments on a corpus of YouTube videos demonstrate the effectiveness of the overall approach.
ML ID: 274
Using Closed Captions as Supervision for Video Activity Recognition
[Details] [PDF]
Sonal Gupta, Raymond J. Mooney
In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010), 1083--1088, Atlanta, GA, July 2010.
Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, and rapid camera movements. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This paper explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting ‘labeled’ data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
ML ID: 242
Activity Retrieval in Closed Captioned Videos
[Details] [PDF]
Sonal Gupta
Masters Thesis, Department of Computer Sciences, University of Texas at Austin, August 2009. 64 pages.
Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, occlusion and rapid camera movements. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This thesis explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting “labeled” data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
ML ID: 236
Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval
[Details] [PDF]
Sonal Gupta and Raymond Mooney
In Proceedings of the CVPR-09 Workshop on Visual and Contextual Learning from Annotated Images and Videos (VCL), Miami, FL, June 2009.
Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, rapid camera movements, etc. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This paper explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting labeled data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
ML ID: 226
Watch, Listen & Learn: Co-training on Captioned Images and Videos
[Details] [PDF]
Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 457--472, Antwerp, Belgium, September 2008.
Recognizing visual scenes and activities is challenging: often visual cues alone are ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; and to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data our approach learns more accurate image and video classifiers than standard baseline algorithms.
ML ID: 221