University of Texas at Austin | Microsoft Research
Department of Computer Science | Natural Language Processing Group
David L. Chen | William B. Dolan
Description of the project
Traditional methods of collecting translation and paraphrase data can be prohibitively expensive, making construction of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. In this project we introduce a novel annotation task that uses short video clips (usually less than 10 seconds) as the stimulus to elicit parallel linguistic responses from the annotators. Descriptions of the same video in the same language can then be used as paraphrases of each other while descriptions in different languages can be used as translations of each other. Some of the advantages of this data collection method are:
Over a two-month period from July to September 2010, we collected 85K English descriptions for 2,089 video clips as well as over a thousand descriptions for each of a dozen more languages. In addition to providing training and testing data for paraphrase and translation engines, this data also provides natural language descriptions for a significant amount of video data. The video clips generally depict a single, unambiguous action or event.
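The pairing idea described above (same-language descriptions of one clip as paraphrases, cross-language descriptions of one clip as translations) can be sketched roughly as follows. This is only an illustration, not the authors' code: the `(video_id, language, text)` record layout, the function names, and the sample sentences are all assumptions invented for the example and are not taken from the corpus.

```python
from collections import defaultdict
from itertools import combinations, product


def group_by_video(records):
    """Group (video_id, language, text) records as {video_id: {language: [texts]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for video_id, language, text in records:
        grouped[video_id][language].append(text)
    return grouped


def paraphrase_pairs(grouped, language):
    """Pair up descriptions of the same video written in the same language."""
    pairs = []
    for by_language in grouped.values():
        pairs.extend(combinations(by_language.get(language, []), 2))
    return pairs


def translation_pairs(grouped, source, target):
    """Pair up descriptions of the same video written in two different languages."""
    pairs = []
    for by_language in grouped.values():
        pairs.extend(product(by_language.get(source, []), by_language.get(target, [])))
    return pairs


# Hypothetical toy records; the IDs and sentences are made up for illustration.
records = [
    ("vid1", "en", "A man is slicing a tomato."),
    ("vid1", "en", "Someone cuts a tomato."),
    ("vid1", "de", "Ein Mann schneidet eine Tomate."),
    ("vid2", "en", "A dog is running."),
]

grouped = group_by_video(records)
english_paraphrases = paraphrase_pairs(grouped, "en")
english_german = translation_pairs(grouped, "en", "de")
```

Because pairs are only formed within a single clip, a clip with n descriptions in one language yields n(n-1)/2 paraphrase pairs, which is why many descriptions per clip produce a large amount of parallel data.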
Publications and Talks
Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection
[Abstract]
[PDF]
[Slides (PPT)]
Collecting Highly Parallel Data for Paraphrase Evaluation
[Abstract]
[PDF]
[Slides (PPT)]
Data
Overview
The data consists of 122K descriptions for 2089 video clips. Below is a breakdown of the number of annotations obtained for each language:
We have also included some of the video clips that were used to gather these descriptions. Unfortunately, due to the volatility of YouTube, some of the videos were removed before we could archive them. A total of 1970 out of 2089 video clips are included in the tarball below.
2021 Fall Notes: The links to download the video description dataset from MSR no longer work. Unfortunately, we are only able to reconstruct the English corpus.
Citations
Please use the following citations when referencing the sources of the data:
@InProceedings{chen:acl11,
  title = "Collecting Highly Parallel Data for Paraphrase Evaluation",
  author = "David L. Chen and William B. Dolan",
  booktitle = "Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011)",
  address = "Portland, OR",
  month = "June",
  year = 2011
}
Downloads
To download the reconstructed English descriptions of the videos, please visit:
Here is a tarball of most of the video files (.avi):
Contact Information
If you have any questions or comments, please contact David Chen or Bill Dolan.