UTCS Artificial Intelligence
Natural Language Video Description using Deep Recurrent Neural Networks (2015)
Subhashini Venugopalan
For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting the meaning from video pixels and generating a sentence description is a very complex problem. The goal of my research is to develop models that can automatically generate natural language (NL) descriptions for events in videos. As a first step, this proposal presents deep recurrent neural network models for video-to-text generation. I build on recent "deep" machine learning approaches to develop video description models using a unified deep neural network with both convolutional and recurrent structure. This technique treats the video domain as another "language" and takes a machine translation approach, using the deep network to translate videos to text. In my initial approach, I adapt a model that learns from images and captions, transferring knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that jointly models a sequence of video frames and a sequence of words. The second part of the proposal outlines a set of models to significantly extend work in this area. Specifically, I propose techniques to integrate linguistic knowledge from plain text corpora, and attention methods to focus on objects and track their interactions to generate more diverse and accurate descriptions. To move beyond short video clips, I also outline models to process multi-activity movie videos, learning to jointly segment and describe coherent event sequences. I propose further extensions that take advantage of movie scripts and subtitle information to generate richer descriptions.
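The two modeling ideas in the abstract can be illustrated with a minimal Python sketch; it is not taken from the proposal and leaves out the networks themselves. The first function shows one simple, assumed way to let an image-and-caption model accept a video clip: average the per-frame feature vectors into a single clip-level vector. The second lays out the inputs for a single recurrent pass that first reads a sequence of frames and then reads and predicts a sequence of words. The function names, the averaging choice, and the `<pad>`, `<bos>` and `<eos>` tokens are all hypothetical illustrations.

```python
from typing import List, Optional, Tuple

Vector = List[float]

# Hypothetical special tokens (assumptions, not from the proposal).
PAD = "<pad>"  # word input while the model is still reading frames
BOS = "<bos>"  # first word input when the model starts describing
EOS = "<eos>"  # target that marks the end of the description


def mean_pool(frame_features: List[Vector]) -> Vector:
    """Average per-frame feature vectors into one clip-level vector.

    Assumed adaptation: a model trained on single images can then take
    this one vector in place of an image feature.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n for i in range(dim)]


def two_phase_schedule(
    frames: List[Vector], words: List[str]
) -> List[Tuple[Optional[Vector], str, Optional[str]]]:
    """Build (frame input, word input, target word) triples for one pass.

    Reading phase: each step sees one frame and a padding word; no target.
    Describing phase: no frame (None); each step sees the previous word
    and is trained to predict the next one, ending with EOS.
    """
    steps: List[Tuple[Optional[Vector], str, Optional[str]]] = [
        (f, PAD, None) for f in frames
    ]
    word_inputs = [BOS] + words
    word_targets = words + [EOS]
    steps += [(None, w_in, w_out) for w_in, w_out in zip(word_inputs, word_targets)]
    return steps


if __name__ == "__main__":
    clip = [[1.0, 2.0], [3.0, 4.0]]  # two toy 2-d frame features
    print(mean_pool(clip))
    for step in two_phase_schedule(clip, ["a", "man", "runs"]):
        print(step)
```

Averaging discards frame order, which is why a model that steps through the frames one at a time before emitting words can, in principle, capture temporal structure that a single pooled vector cannot.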
Citation:
PhD proposal, Department of Computer Science, The University of Texas at Austin.
Bibtex:
@misc{venugopalan:proposal15,
  title={Natural Language Video Description using Deep Recurrent Neural Networks},
  author={Subhashini Venugopalan},
  month={November},
  note={PhD proposal, Department of Computer Science, The University of Texas at Austin},
  url="http://www.cs.utexas.edu/users/ai-labpub-view.php?PubID=127542",
  year={2015}
}
People
Subhashini Venugopalan
Ph.D. Alumni
vsub [at] cs utexas edu
Areas of Interest
Computer Vision
Deep Learning
Language and Vision
Machine Learning
Natural Language Processing
Labs
Machine Learning