Completely Unsupervised Multimodal Character Identification on TV Series and Movies
Final Report Abstract
With the rise of deep learning, AI has seen rapid progress in conventional vision tasks such as object detection and recognition, semantic segmentation, and action recognition. The community is now moving towards a higher level of semantic abstraction, aiming at joint vision-and-language tasks such as video understanding. This project focused on completely unsupervised character identification in TV series and movies. Motivated by prior successes in leveraging complementary temporal and multimodal information, the project makes the following contributions: (i) We aimed to learn a representation that exhibits small distances between samples from the same person and large inter-person distances in feature space. Metric learning can achieve this, as it comprises a pull-term, pulling data points from the same class closer together, and a push-term, pushing data points from different classes further apart. Metric learning improves feature quality, but requires some form of external supervision to label pairs as same or different. For face clustering in TV series, this supervision can be obtained from tracks, clustering, similarity, and other cues. Tracking acts as a form of high-precision clustering (grouping detections within a shot) and is used to automatically generate positive and negative pairs of face images. Building on this, we proposed: (a) two variants of discriminative approaches, the Track-supervised Siamese network (TSiam) and the Self-supervised Siamese network (SSiam); (b) Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses automatically discovered partitions obtained from our clustering algorithm (FINCH) as weak supervision, along with inherent video constraints, to learn discriminative face features; and (c) Face Grouping on Graphs (FGG), a method for unsupervised fine-tuning of deep face feature representations.
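The idea of using tracks as free supervision for a pull/push metric-learning loss can be illustrated with a minimal numpy sketch. This is a hypothetical illustration, not the released TSiam/SSiam code: `track_pairs` labels faces from the same track as positives and faces from different tracks as negatives, and `contrastive_loss` applies the pull-term (squared distance) to positives and a hinged push-term to negatives.

```python
import numpy as np

def track_pairs(track_ids):
    """Enumerate (i, j, label) pairs from per-face track IDs.
    Faces in the same track -> positive pair (label 1); faces from
    different tracks -> negative pair (label 0). Hypothetical sketch,
    not the authors' released code."""
    pairs = []
    n = len(track_ids)
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((i, j, int(track_ids[i] == track_ids[j])))
    return pairs

def contrastive_loss(fa, fb, label, margin=1.0):
    """Pull-term for same-track pairs, hinged push-term for different tracks."""
    d = np.linalg.norm(fa - fb)
    if label == 1:
        return d ** 2                    # pull same-track faces together
    return max(0.0, margin - d) ** 2     # push different tracks at least `margin` apart
```

In practice the pairs would feed a Siamese network whose shared backbone is fine-tuned by backpropagating this loss; the sketch only shows the supervision signal itself.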
(ii) True understanding of videos comes from a joint analysis of all their modalities: the video frames, the audio track, and any accompanying text such as closed captions. We presented a way to learn a compact multimodal feature representation that encodes all these modalities. To this end, we proposed the temporal ordering of video clips as a new task. Our dataset is built on top of the Large Scale Movie Description Challenge (LSMDC) and consists of 202 movies with 118,081 video clips; in total, there are 25,269 scenes in the training set, 1,784 scenes in the validation set, and 2,443 scenes in the test set. Further, we proposed Temporal Compact Bilinear Pooling (TCBP), an extension of the Tensor Sketch projection algorithm [Pham, 2013] that incorporates a temporal dimension for representing face tracks in videos. Using TCBP, we learned a multimodal clip representation that jointly encodes images, audio, video, and text for the video ordering task. Additionally, we showed that TCBP features transfer exceptionally well to video retrieval and video face clustering. All of our datasets and source code are publicly available for research in this field.
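The Tensor Sketch idea underlying TCBP can be sketched in a few lines of numpy: each frame feature's outer product with itself is approximated compactly via two Count Sketches combined by circular convolution (a product in Fourier space), and the per-frame sketches are then pooled over time. This is a simplified, hypothetical illustration of the principle, not the paper's TCBP implementation.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Count Sketch: project x (dim D) to dim d using hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed bins
    return y

def tensor_sketch_temporal(X, d, seed=0):
    """Approximate each frame's bilinear (outer-product) feature with Tensor
    Sketch [Pham, 2013], then average over the temporal axis.
    X: (T, D) array of per-frame features. Hypothetical sketch, not the
    authors' TCBP code."""
    T, D = X.shape
    rng = np.random.RandomState(seed)
    h1, h2 = rng.randint(0, d, D), rng.randint(0, d, D)
    s1, s2 = rng.choice([-1, 1], D), rng.choice([-1, 1], D)
    out = np.zeros(d)
    for x in X:
        f1 = np.fft.fft(count_sketch(x, h1, s1, d))
        f2 = np.fft.fft(count_sketch(x, h2, s2, d))
        out += np.real(np.fft.ifft(f1 * f2))  # circular convolution of the two sketches
    return out / T
```

Averaging over `T` is only one possible temporal aggregation; the contribution of TCBP is precisely to treat the temporal dimension inside the projection rather than as a naive post-hoc pool.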
Publications
- Deep multimodal feature encoding for video ordering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshop on Large Scale Holistic Video Understanding, 2019
Vivek Sharma, Makarand Tapaswi, and Rainer Stiefelhagen
- A simple and effective technique for face clustering in TV series. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR) Workshop on Brave New Motion Representations, 2017
Vivek Sharma, M. Saquib Sarfraz, and Rainer Stiefelhagen
- Classification-driven dynamic image enhancement. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2018
Vivek Sharma, Ali Diba, Davy Neven, Michael S. Brown, Luc Van Gool, and Rainer Stiefelhagen
(See online at https://doi.org/10.1109/CVPR.2018.00424)
- DynamoNet: Dynamic action and motion network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019
Ali Diba, Vivek Sharma, Luc Van Gool, and Rainer Stiefelhagen
(See online at https://doi.org/10.1109/ICCV.2019.00629)
- Efficient parameter-free clustering using first neighbor relations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
M. Saquib Sarfraz, Vivek Sharma, and Rainer Stiefelhagen
(See online at https://doi.org/10.1109/CVPR.2019.00914)
- Self-supervised face-grouping on graphs. In Proceedings of the ACM International Conference on Multimedia (ACMMM), 2019
Veith Röthlingshöfer, Vivek Sharma, and Rainer Stiefelhagen
(See online at https://doi.org/10.1145/3343031.3351071)
- Self-supervised learning of face representations for video face clustering. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2019
Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, and Rainer Stiefelhagen
(See online at https://doi.org/10.1109/FG.2019.8756609)
- Video face clustering with self-supervised representation learning. IEEE Transactions on Biometrics, Behavior, and Identity Science (T-BIOM), 2019
Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, and Rainer Stiefelhagen
(See online at https://doi.org/10.1109/TBIOM.2019.2947264)
- Clustering based contrastive learning for improving face representations. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2020
Vivek Sharma, Makarand Tapaswi, Saquib Sarfraz, and Rainer Stiefelhagen
(See online at https://doi.org/10.1109/FG47880.2020.00011)
- Large scale holistic video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2020
Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jurgen Gall, Rainer Stiefelhagen, and Luc Van Gool
(See online at https://doi.org/10.1007/978-3-030-58558-7_35)