Automatic Alignment of Text and Video for Semantic Multimedia Analysis
Summary of Project Results
With the rise of deep learning, AI has seen rapid progress in conventional vision tasks such as object detection and recognition, semantic segmentation, and action recognition. The community is now moving towards a higher level of semantic abstraction and is aiming at joint vision-and-language tasks such as image/video captioning and question answering. This project focuses on the automatic understanding of stories told in TV series and movies. Motivated by prior successes in leveraging complementary textual information such as subtitles and transcripts, the project makes the following contributions:

(i) We propose to align videos with natural-language forms of text, namely plot synopses and books, unlocking applications such as story-based retrieval, summarization, and rich captioning. To this end, we develop several alignment models suitable for linear or nonlinear storylines. We show how characters play a central role in storytelling and, based on identity and dialog cues, bridge the gap between video and text. We evaluate our models on diverse, real-world TV series datasets and obtain encouraging performance. We also demonstrate that we can search large video collections for queries such as “Xander proposes Anya” and find the relevant clip. Together with a character-interaction visualization technique, we embed short event-like phrases from the plots to create visual summaries of TV episodes. Our alignment models can also reason about which parts of a movie were not in the original book.

(ii) We create a benchmark dataset and challenge to evaluate machine understanding of stories. In particular, we ask machines to answer questions about a story by “watching” or “reading” a movie. The questions cover remembering and reasoning, and require answering not only “Who” did “What” to “Whom”, but also “How” and “Why”. Our dataset covers over 400 movies and contains almost 15,000 multiple-choice questions with 5 answer options each. In a user study, we show that all multiple-choice options are plausible and, without access to the story, can even fool humans. We develop and evaluate several baselines that demonstrate the difficulty of the dataset. We believe the dataset will serve as a strong benchmark for tracking progress in AI for years to come. As of June 2017, it has received attention from over 130 teams from all over the world and hosts an active public leaderboard. Upon its first release, the dataset was also featured in MIT Technology Review and the NVIDIA News Center.

(iii) Our third contribution improves automatic person identification in TV series and movies. We show how subtitles and transcripts can be used to jointly tag all face tracks in a video; we create a large new dataset of face tracks based on the Harry Potter movie series to study the effect of aging on facial appearance; and we identify characters just as humans do, based only on name mentions in dialog.

Many of our datasets, as well as our source code, are publicly available for research in this field.
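To make the alignment idea in contribution (i) concrete: given a matrix of precomputed similarities between plot-synopsis sentences and video shots (for example, text similarity between a sentence and a shot's subtitles), a linear storyline can be aligned with a simple dynamic program that assigns shots to sentences monotonically. The Python sketch below illustrates only this generic idea; the function name, the similarity matrix, and the single-cue scoring are illustrative assumptions, and the project's actual models (described in the publications below) combine several cues and also handle nonlinear storylines.

    import numpy as np

    def align_monotonic(similarity):
        """Assign each video shot to one synopsis sentence so that the
        assignment is monotonically non-decreasing over time and the total
        sentence-shot similarity is maximized (simple dynamic program).

        similarity: (num_sentences, num_shots) matrix of precomputed scores.
        Returns a list where entry j is the sentence index aligned to shot j.
        Assumes there are at least as many shots as sentences.
        """
        S, T = similarity.shape
        score = np.full((S, T), -np.inf)   # best total similarity ending in (sentence i, shot j)
        back = np.zeros((S, T), dtype=int)  # sentence index of the previous shot

        score[0, 0] = similarity[0, 0]      # the first shot starts the first sentence
        for j in range(1, T):
            for i in range(S):
                # the previous shot was assigned to the same or the previous sentence
                prev_same = score[i, j - 1]
                prev_up = score[i - 1, j - 1] if i > 0 else -np.inf
                if prev_same >= prev_up:
                    score[i, j] = prev_same + similarity[i, j]
                    back[i, j] = i
                else:
                    score[i, j] = prev_up + similarity[i, j]
                    back[i, j] = i - 1

        # backtrack from the best final state
        assignment = [int(np.argmax(score[:, -1]))]
        for j in range(T - 1, 0, -1):
            assignment.append(int(back[assignment[-1], j]))
        return assignment[::-1]

    # Example: 3 synopsis sentences, 5 shots (random similarities for illustration)
    sim = np.random.rand(3, 5)
    print(align_monotonic(sim))  # e.g. [0, 0, 1, 2, 2]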
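For contribution (ii), the following gives a rough sense of what a simple MovieQA-style baseline can look like: each of the five answer options is scored by how well the question combined with that option matches some sentence of the movie's story (here, the plot synopsis), and the best-scoring option is returned. The sketch uses TF-IDF cosine similarity purely for illustration; the function and variable names are hypothetical, and this is not the model evaluated in the project.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def answer_by_retrieval(question, answer_options, plot_sentences):
        """Pick the answer option whose combination with the question is most
        similar to some sentence of the plot synopsis (illustrative baseline)."""
        vectorizer = TfidfVectorizer().fit(plot_sentences + [question] + answer_options)
        plot_vecs = vectorizer.transform(plot_sentences)

        best_option, best_score = 0, float("-inf")
        for k, option in enumerate(answer_options):
            query_vec = vectorizer.transform([question + " " + option])
            # score the option by its best match against any single plot sentence
            score = cosine_similarity(query_vec, plot_vecs).max()
            if score > best_score:
                best_option, best_score = k, score
        return best_option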
Project-related Publications (Selection)
- Accio: A Data Set for Face Track Retrieval in Movies Across Age. ACM International Conf. on Multimedia Retrieval (ICMR), Jun 2015
E. Ghaleb, M. Tapaswi, Z. Al-Halah, H. K. Ekenel and R. Stiefelhagen
- Aligning Plot Synopses to Videos for Story-based Retrieval. International Journal of Multimedia Information Retrieval (IJMIR) 4 (1), pp. 3-16, 2015
M. Tapaswi, M. Bäuml and R. Stiefelhagen
(Available online at https://doi.org/10.1007/s13735-014-0065-9)
- Book2Movie: Aligning Video scenes with Book chapters. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun 2015
M. Tapaswi, M. Bäuml and R. Stiefelhagen
- Improved Weak Labels using Contextual Cues for Person Identification in Videos. IEEE International Conf. on Automatic Face and Gesture Recognition (FG), May 2015
M. Tapaswi, M. Bäuml and R. Stiefelhagen
- MovieQA: Understanding Stories in Movies through Question-Answering. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun 2016
M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun and S. Fidler
(Available online at https://doi.org/10.1109/CVPR.2016.501)
- Naming TV Characters by Watching and Analyzing Dialogs. IEEE Winter Conf. on Applications of Computer Vision (WACV), Mar 2016
M.-L. Haurilet, M. Tapaswi, Z. Al-Halah and R. Stiefelhagen