Self-Supervised Learning Of Face Representations For Video Face Clustering

14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019): 360–367


Abstract

Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. […]

Introduction
  • Large videos such as TV series episodes or movies undergo several preprocessing steps to make the video more accessible, e.g. shot detection.
  • Character identification/clustering has become one such important step with several emerging research areas [25], [35], [37] requiring it.
  • Recent work [24] suggests that more meaningful captions can be achieved from an improved understanding of characters.
  • The ability to predict which characters appear when and where facilitates a deeper video understanding that is grounded in the storyline.
Highlights
  • We focus on the video face clustering problem
  • Clustering accuracy is measured as weighted purity: ACC = (1/N) Σ_{c=1}^{|C|} n_c · p_c, where N is the total number of tracks in the video, n_c is the number of samples in cluster c, and cluster purity p_c is the fraction of samples in cluster c that carry its most frequent label. |C| corresponds to the number of main cast members and, in our case, to the number of clusters
  • We proposed simple, unsupervised approaches for face clustering in videos, by distilling the identity factor from deep face representations
  • We showed that discriminative models can leverage dynamically generated positive/negative constraints based on ordered face distances, rather than relying only on the track-level information that is typically used
  • Our models are very fast to train and evaluate and outperform the state-of-the-art while operating on datasets that contain more tracks with large changes in appearance
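The weighted-purity accuracy described above can be sketched as a short function. This is a minimal illustration of the metric, not the authors' evaluation code:

```python
import numpy as np

def weighted_purity(labels, clusters):
    """Clustering accuracy as weighted purity:
    ACC = (1/N) * sum_c n_c * p_c, where p_c is the fraction of
    cluster c's samples that belong to its majority ground-truth label."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    N = len(labels)
    acc = 0.0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        n_c = len(members)
        # purity p_c: largest same-label fraction within the cluster
        p_c = np.bincount(members).max() / n_c
        acc += n_c * p_c
    return acc / N
```

With |C| fixed to the number of main cast members for all compared methods, this metric rewards large, pure clusters and is comparable across methods.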
Methods
  • The authors report the clustering performance on training videos in Table V
  • Note that both TSiam and SSiam are trained in an unsupervised manner, i.e., with automatically generated labels.
  • Both of the proposed models SSiam and TSiam provide a large performance boost over the base VGG2 features on BBT and BF.
  • Table VI reports clustering accuracy
  • Both SSiam and TSiam perform similarly to the base features, possibly due to overfitting
Results
  • The authors present the evaluation on three challenging datasets. The authors first describe the clustering metric, followed by a thorough analysis of the proposed methods, ending with a comparison to state-of-the-art.

  • The first column corresponds to query frames from the dataset.
  • The second column corresponds to nearest neighbors for each frame and can be used to form the set of positive pairs S+.
  • The last column corresponds to farthest neighbors and forms the set of negative pairs S−.
  • Each element of the above sets stores the query index b, the nearest/farthest neighbor index r, and the Euclidean distance d.
  • As the authors compare methods providing equal numbers of clusters, ACC is a fair metric for comparison.
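The pair-generation scheme in the bullets above (sorting Euclidean distances to collect nearest neighbors into S+ and farthest neighbors into S−) can be sketched as follows. The function name and the choice of a single nearest/farthest neighbor per query are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def generate_pairs(feats):
    """For each query feature b, add its nearest neighbor to the positive
    set S+ and its farthest neighbor to the negative set S-.
    Each element stores (query index b, neighbor index r, distance d)."""
    # pairwise Euclidean distances
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)  # exclude trivial self-matches
    s_pos, s_neg = [], []
    for b in range(len(feats)):
        order = np.argsort(dist[b])
        r_near = order[0]    # nearest neighbor -> candidate positive
        r_far = order[-2]    # farthest real neighbor (self is at order[-1])
        s_pos.append((b, int(r_near), float(dist[b, r_near])))
        s_neg.append((b, int(r_far), float(dist[b, r_far])))
    return s_pos, s_neg
```

The resulting pairs can then drive a Siamese network with a contrastive-style objective, without any manual identity labels.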
Conclusion
  • The authors proposed simple, unsupervised approaches for face clustering in videos, by distilling the identity factor from deep face representations.
  • The authors' proposed models are unsupervised and can be trained and evaluated efficiently as they involve only a few matrix multiplications.
  • The authors conducted experiments on three challenging video datasets, which differ in how they have been used in past works.
  • The authors' models are very fast to train and evaluate and outperform the state-of-the-art while operating on datasets that contain more tracks with large changes in appearance.
Tables
  • Table 1: Dataset statistics for BBT [41], [42], [47], BF [48], and ACCIO [9]
  • Table 2: Clustering accuracy computed at track-level on the training episodes, with a comparison to all evaluated methods
  • Table 3: Performance comparison of TSiam and SSiam with JFAC [48]
  • Table 4: Clustering accuracy on the base face representations
  • Table 5: Comparison of SSiam with pseudo-RF
  • Table 6: Comparison to state-of-the-art. Metric is clustering accuracy (%) evaluated at frame level. Note that many previous works use fewer tracks (# of frames) (also indicated in Table 1), making the task relatively easier. We use an updated version of the face tracks provided by [2]
  • Table 7: Clustering accuracy computed at track-level across episodes within the same TV series. Numbers are averaged
  • Table 8: Performance comparison of different methods on the ACCIO dataset
  • Table 9: Ignoring singleton tracks (and possibly characters) leads to a significant performance drop. Accuracy at track-level
  • Table 10: Clustering accuracy when evaluating across video series. Each row indicates that the model was trained on one episode of BBT / BF, but evaluated on all 6 episodes of the two series
Related Work
  • Over the last decade, video face clustering has typically been modeled using discriminative methods that improve face representations. In the following, we review related work in this area.

    Video face clustering. Clustering faces in videos commonly uses pairwise constraints obtained by analyzing tracks, together with some form of representation/metric learning. Face image pairs belonging to the same track are labeled positive (same character), while face images from co-occurring tracks provide negatives (different characters). This strategy has been exploited by learning a metric to obtain cast-specific distances (ULDML) [4]; by iteratively clustering and associating short sequences based on a hidden Markov Random Field (HMRF) [41], [42]; or by performing clustering in a sub-space obtained by a weighted block-sparse low-rank representation (WBSLRR) [43]. In addition to pairwise constraints, video editing cues have been used in an unsupervised way to merge tracks [34]; here, track and cluster representations are learned on-the-fly with dense-SIFT Fisher vectors [20]. Recently, Jin et al. [15] consider detection and clustering jointly, and propose a link-based (Erdős-Rényi) clustering based on rank-1 counts verification; the linking compares a given frame with a reference frame, and a threshold is learned to decide whether frames are merged.
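The track-level constraint mining described above (same-track frames as positives, temporally co-occurring tracks as negatives) can be sketched as follows. The data layout is a hypothetical simplification for illustration:

```python
from itertools import combinations

def track_constraints(tracks):
    """tracks: dict of track_id -> (start_frame, end_frame, face_ids).
    Faces within the same track form positive pairs (same character);
    faces of temporally overlapping tracks form negative pairs, since
    two faces visible at the same time cannot be the same person."""
    positives, negatives = [], []
    for tid, (start, end, faces) in tracks.items():
        # all pairs within a track are positives
        positives += list(combinations(faces, 2))
    for (t1, (s1, e1, f1)), (t2, (s2, e2, f2)) in combinations(tracks.items(), 2):
        if s1 <= e2 and s2 <= e1:  # tracks overlap in time -> negatives
            negatives += [(a, b) for a in f1 for b in f2]
    return positives, negatives
```

These automatically mined constraints are what allows metric learning for face clustering to proceed without manual identity annotation.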
Funding
  • Acknowledgements: This work is supported by the PLUMCOT project, funded by the DFG (German Research Foundation).
References
  • [1] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Who's that Actor? Automatic Labelling of Actors in TV Series Starting from IMDB Images. In ACCV, 2016.
  • [2] M. Bauml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data. In CVPR, 2013.
  • [3] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In FG, 2018.
  • [4] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised Metric Learning for Face Identification in TV Video. In ICCV, 2011.
  • [5] S. Datta, G. Sharma, and C. Jawahar. Unsupervised Learning of Face Representations. In FG, 2018.
  • [6] A. Diba, V. Sharma, and L. Van Gool. Deep Temporal Linear Encoding Networks. In CVPR, 2017.
  • [7] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" – Automatic Naming of Characters in TV Video. In BMVC, 2006.
  • [8] B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-supervised Video Representation Learning with Odd-One-Out Networks. In CVPR, 2017.
  • [9] E. Ghaleb, M. Tapaswi, Z. Al-Halah, H. K. Ekenel, and R. Stiefelhagen. Accio: A Dataset for Face Track Retrieval in Movies Across Age. In ICMR, 2015.
  • [10] M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric Learning Approaches for Face Identification. In ICCV, 2009.
  • [11] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In ECCV, 2016.
  • [12] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In CVPR, 2006.
  • [13] P. Hu and D. Ramanan. Finding Tiny Faces. In CVPR, 2017.
  • [14] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In ECCV Workshop on Faces in Real-life Images, 2008.
  • [15] S. Jin, H. Su, C. Stauffer, and E. Learned-Miller. End-to-end Face Detection and Cast Grouping in Movies using Erdős-Rényi Clustering. In ICCV, 2017.
  • [16] A. Miech, J.-B. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic. Learning from Video and Text via Large-Scale Discriminative Clustering. In ICCV, 2017.
  • [17] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. In ECCV, 2016.
  • [18] M.-L. Haurilet, M. Tapaswi, Z. Al-Halah, and R. Stiefelhagen. Naming TV Characters by Watching and Analyzing Dialogs. In WACV, 2016.
  • [19] A. Nagrani and A. Zisserman. From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV Series without a Script. In BMVC, 2017.
  • [20] O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman. A Compact and Discriminative Face Track Descriptor. In CVPR, 2014.
  • [21] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep Face Recognition. In BMVC, 2015.
  • [22] G. Paul, K. Elie, M. Sylvain, O. Jean-Marc, and D. Paul. A Conditional Random Field Approach for Audio-Visual People Diarization. In ICASSP, 2014.
  • [23] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking People in Videos with "Their" Names using Coreference Resolution. In ECCV, 2014.
  • [24] A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh, and B. Schiele. Generating Descriptions with Grounded and Co-Referenced People. In CVPR, 2017.
  • [25] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie Description. IJCV, 123(1):94–120, 2017.
  • [26] M. Roth, M. Bauml, R. Nevatia, and R. Stiefelhagen. Robust Multi-pose Face Tracking by Multi-stage Tracklet Association. In ICPR, 2012.
  • [27] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A Pose-Sensitive Embedding for Person Re-identification with Expanded Cross Neighborhood Re-ranking. In CVPR, 2018.
  • [28] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In CVPR, 2015.
  • [29] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. Van Gool, and R. Stiefelhagen. Classification Driven Dynamic Image Enhancement. In CVPR, 2018.
  • [30] V. Sharma, M. S. Sarfraz, and R. Stiefelhagen. A Simple and Effective Technique for Face Clustering in TV Series. In CVPR Workshop on Brave New Motion Representations, 2017.
  • [31] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?" – Learning Person Specific Classifiers from Video. In CVPR, 2009.
  • [32] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In CVPR, 2014.
  • [33] M. Tapaswi, M. Bauml, and R. Stiefelhagen. "Knock! Knock! Who is it?" Probabilistic Person Identification in TV Series. In CVPR, 2012.
  • [34] M. Tapaswi, O. M. Parkhi, E. Rahtu, E. Sommerlade, R. Stiefelhagen, and A. Zisserman. Total Cluster: A Person Agnostic Clustering Method for Broadcast Videos. In ICVGIP, 2014.
  • [35] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding Stories in Movies through Question-Answering. In CVPR, 2016.
  • [36] A. Vedaldi and K. Lenc. MatConvNet: Convolutional Neural Networks for MATLAB. In ACM Multimedia, 2015.
  • [37] P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler. MovieGraphs: Towards Understanding Human-Centric Situations from Videos. arXiv:1712.06761, 2017.
  • [38] X. Wang and A. Gupta. Unsupervised Learning of Visual Representations using Videos. In ICCV, 2015.
  • [39] J. H. Ward Jr. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58(301):236–244, 1963.
  • [40] L. Wolf, T. Hassner, and I. Maoz. Face Recognition in Unconstrained Videos with Matched Background Similarity. In CVPR, 2011.
  • [41] B. Wu, S. Lyu, B.-G. Hu, and Q. Ji. Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos. In ICCV, 2013.
  • [42] B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji. Constrained Clustering and its Application to Face Clustering in Videos. In CVPR, 2013.
  • [43] S. Xiao, M. Tan, and D. Xu. Weighted Block-sparse Low Rank Representation for Face Clustering in Videos. In ECCV, 2014.
  • [44] R. Yan, A. Hauptmann, and R. Jin. Multimedia Search with Pseudo-Relevance Feedback. In International Conference on Image and Video Retrieval, pages 238–247, 2003.
  • [45] R. Yan, A. G. Hauptmann, and R. Jin. Negative Pseudo-Relevance Feedback in Content-based Video Retrieval. In ACM Multimedia, 2003.
  • [46] L. Zhang, D. V. Kalashnikov, and S. Mehrotra. A Unified Framework for Context Assisted Face Clustering. In ACM Multimedia.
  • [47] S. Zhang, Y. Gong, and J. Wang. Deep Metric Learning with Improved Triplet Loss for Face Clustering in Videos. In Pacific Rim Conference on Multimedia, 2016.
  • [48] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Joint Face Representation Adaptation and Clustering in Videos. In ECCV, 2016.
  • [49] C. Zhou, C. Zhang, H. Fu, R. Wang, and X. Cao. Multi-cue Augmented Face Clustering. In ACM Multimedia, 2015.