Self-Supervised Learning Of Face Representations For Video Face Clustering
2019 14TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2019), (2019): 360-367
Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance.
- Large videos such as TV series episodes or movies undergo several preprocessing steps to make the video more accessible, e.g. shot detection.
- Character identification/clustering has become one such important step, required by several emerging research areas.
- Recent work suggests that more meaningful captions can be achieved through an improved understanding of characters.
- The ability to predict which characters appear when and where facilitates a deeper video understanding that is grounded in the storyline.
- We focus on the video face clustering problem
- Clustering accuracy (ACC) is computed as weighted purity: $\mathrm{ACC} = \frac{1}{N} \sum_{c=1}^{|C|} n_c \cdot p_c$, where N is the total number of tracks in the video, n_c is the number of samples in cluster c, and cluster purity p_c is measured as the fraction of the largest number of samples from the same label to n_c. |C| corresponds to the number of main cast members, and in our case the number of clusters.
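The weighted-purity metric above is straightforward to compute from predicted cluster assignments and ground-truth identity labels; a minimal sketch (function name is illustrative, not from the paper):

```python
import numpy as np

def weighted_clustering_purity(labels, clusters):
    """Clustering accuracy (ACC) as weighted purity: each cluster c
    contributes its purity p_c weighted by its size n_c / N."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    N = len(labels)
    acc = 0.0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        n_c = len(members)
        # p_c: fraction of the largest same-label group in the cluster
        _, counts = np.unique(members, return_counts=True)
        p_c = counts.max() / n_c
        acc += (n_c / N) * p_c
    return acc

# 6 tracks, 2 clusters; cluster 0 is pure, cluster 1 is 2/3 pure
labels   = [0, 0, 0, 1, 1, 0]
clusters = [0, 0, 0, 1, 1, 1]
print(weighted_clustering_purity(labels, clusters))  # 0.8333...
```

Note that with the number of clusters fixed to |C|, this metric is directly comparable across methods, as the paper argues.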
- We proposed simple, unsupervised approaches for face clustering in videos, by distilling the identity factor from deep face representations
- We showed that discriminative models can leverage dynamic generation of positive/negative constraints based on ordered face distances, and need not rely only on the track-level information that is typically used
- Our models are very fast to train and evaluate and outperform the state-of-the-art while operating on datasets that contain more tracks with large changes in appearance
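Siamese models of this kind are typically trained with a contrastive loss over the generated positive/negative pairs (pull matching faces together, push non-matching faces apart up to a margin). A generic numpy sketch of that loss; the margin value and batch layout are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Generic contrastive loss over a batch of pairs:
    d[i] is the embedding distance of pair i,
    y[i] = 1 for a positive pair, 0 for a negative pair."""
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    pos = y * d ** 2                                  # pull positives together
    neg = (1 - y) * np.maximum(margin - d, 0.0) ** 2  # push negatives past the margin
    return 0.5 * (pos + neg).mean()

# one well-separated pair of each kind incurs zero loss
print(contrastive_loss(d=[0.0, 2.0], y=[1, 0], margin=1.0))  # 0.0
```

Because the loss only needs pair distances and pair labels, it works identically whether the labels come from tracks (TSiam) or from sorted distances (SSiam).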
- The authors report the clustering performance on training videos in Table V
- Note that both TSiam and SSiam are trained in an unsupervised manner, i.e., with automatically generated labels.
- Both of the proposed models SSiam and TSiam provide a large performance boost over the base VGG2 features on BBT and BF.
- Table VI reports clustering accuracy
- Both SSiam and TSiam perform similarly to the base features, possibly due to overfitting
- The authors present the evaluation on three challenging datasets. The authors first describe the clustering metric, followed by a thorough analysis of the proposed methods, ending with a comparison to state-of-the-art.
- The second column corresponds to nearest neighbors for each frame and can be used to form the set of positive pairs S+.
- The last column corresponds to farthest neighbors and forms the set of negative pairs S−.
- Each element of the above sets stores: query index b, nearest/farthest neighbor r, and the Euclidean distance d.
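The construction of S+ and S− described above can be sketched as follows. This is a minimal illustration of sorting-based pair generation, not the authors' implementation; the function name is hypothetical:

```python
import numpy as np

def generate_pairs(features):
    """For each query index b, pair it with its nearest neighbor r as a
    positive triple (b, r, d) in S+, and with its farthest neighbor as a
    negative triple in S-. Illustrative sketch only."""
    X = np.asarray(features, dtype=float)
    # pairwise Euclidean distances between all face descriptors
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d_near = dist.copy()
    np.fill_diagonal(d_near, np.inf)   # a sample is not its own neighbor
    s_pos, s_neg = [], []
    for b in range(len(X)):
        r = int(np.argmin(d_near[b]))  # nearest neighbor: likely same identity
        s_pos.append((b, r, dist[b, r]))
        r = int(np.argmax(dist[b]))    # farthest neighbor: likely different identity
        s_neg.append((b, r, dist[b, r]))
    return s_pos, s_neg
```

The assumption behind this scheme is that in a reasonable base embedding, nearest neighbors usually share an identity and farthest neighbors usually do not, so the pairs act as automatically generated training labels.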
- As the authors compare methods providing equal numbers of clusters, ACC is a fair metric for comparison.
- The authors proposed simple, unsupervised approaches for face clustering in videos, by distilling the identity factor from deep face representations.
- The authors' proposed models are unsupervised and can be trained and evaluated efficiently as they involve only a few matrix multiplications.
- The authors conducted experiments on three challenging video datasets, discussing differences in how these datasets have been used in past work.
- The authors' models are very fast to train and evaluate and outperform the state-of-the-art while operating on datasets that contain more tracks with large changes in appearance.
- Table1: DATASET STATISTICS FOR BBT [41], [42], [47], BF [48] AND ACCIO
- Table2: CLUSTERING ACCURACY COMPUTED AT TRACK-LEVEL ON THE TRAINING EPISODES, WITH A COMPARISON TO ALL EVALUATED METHODS
- Table3: PERFORMANCE COMPARISON OF TSIAM AND SSIAM WITH JFAC [48]
- Table4: CLUSTERING ACCURACY ON THE BASE FACE REPRESENTATIONS
- Table5: COMPARISON OF SSiam WITH pseudo-RF
- Table6: COMPARISON TO STATE-OF-THE-ART. METRIC IS CLUSTERING ACCURACY (%) EVALUATED AT FRAME LEVEL. PLEASE NOTE THAT MANY PREVIOUS WORKS USE FEWER TRACKS (# OF FRAMES) (ALSO INDICATED IN TABLE I), MAKING THE TASK RELATIVELY EASIER. WE USE AN UPDATED VERSION OF FACE TRACKS PROVIDED BY [2]
- Table7: CLUSTERING ACCURACY COMPUTED AT TRACK-LEVEL ACROSS EPISODES WITHIN THE SAME TV SERIES. NUMBERS ARE AVERAGED
- Table8: PERFORMANCE COMPARISON OF DIFFERENT METHODS ON THE ACCIO DATASET
- Table9: IGNORING SINGLETON TRACKS (AND POSSIBLY CHARACTERS) LEADS TO SIGNIFICANT PERFORMANCE DROP. ACCURACY ON TRACK-LEVEL
- Table10: CLUSTERING ACCURACY WHEN EVALUATING ACROSS VIDEO SERIES. EACH ROW INDICATES THAT THE MODEL WAS TRAINED ON ONE EPISODE OF BBT / BF, BUT EVALUATED ON ALL 6 EPISODES OF THE TWO SERIES
- Over the last decade, video face clustering has typically been modeled using discriminative methods to improve face representations. In the following, we review some related work in this area.
Video face clustering. Clustering faces in videos commonly uses pairwise constraints obtained by analyzing tracks, together with some form of representation/metric learning. Face image pairs belonging to the same track are labeled positive (same character), while face images from co-occurring tracks provide negatives (different characters). This strategy has been exploited by learning a metric to obtain cast-specific distances (ULDML); by iteratively clustering and associating short sequences based on a hidden Markov Random Field (HMRF); or by performing clustering in a sub-space obtained by a weighted block-sparse low-rank representation (WBSLRR). In addition to pairwise constraints, video editing cues have been used in an unsupervised way to merge tracks; here, track and cluster representations are learned on-the-fly with dense-SIFT Fisher vectors. Recently, Jin et al. consider detection and clustering jointly, and propose a link-based (Erdős–Rényi) clustering based on rank-1 counts verification: a given frame is compared with a reference frame, and a threshold is learned to decide whether to merge the frames.
- Acknowledgements: This work is supported by the DFG (German Research Foundation) funded PLUMCOT project
- R. Aljundi, P. Chakravarty, and T. Tuytelaars. Who’s that Actor? Automatic Labelling of Actors in TV series starting from IMDB Images. In ACCV, 2016.
- M. Bauml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised Learning with Constraints for Person Identification in Multimedia Data. In CVPR, 2013.
- Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In FG, 2018.
- R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised Metric Learning for Face Identification in TV Video. In ICCV, 2011.
- S. Datta, G. Sharma, and C. Jawahar. Unsupervised Learning of Face Representations. In FG, 2018.
- A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In CVPR, 2017.
- M. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is... Buffy” Automatic Naming of Characters in TV Video. In BMVC, 2006.
- B. Fernando, H. Bilen, E. Gavves, and S. Gould. Self-Supervised Video Representation Learning with Odd-One-Out Networks. In CVPR, 2017.
- E. Ghaleb, M. Tapaswi, Z. Al-Halah, H. K. Ekenel, and R. Stiefelhagen. Accio: A Dataset for Face Track Retrieval in Movies Across Age. In ICMR, 2015.
- M. Guillaumin, J. Verbeek, and C. Schmid. Is that you? Metric Learning Approaches for Face Identification. In ICCV, 2009.
- Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In ECCV, 2016.
- R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping. In CVPR, 2006.
- P. Hu and D. Ramanan. Finding Tiny Faces. In CVPR, 2017.
- G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In ECCV Workshop on Faces in Reallife Images, 2008.
- S. Jin, H. Su, C. Stauffer, and E. Learned-Miller. End-to-end Face Detection and Cast Grouping in Movies using Erdős–Rényi Clustering. In ICCV, 2017.
- A. Miech, J.-B. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic. Learning from video and text via large-scale discriminative clustering. In ICCV, 2017.
- I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. In ECCV, 2016.
- M.-L. Haurilet, M. Tapaswi, Z. Al-Halah, and R. Stiefelhagen. Naming TV Characters by Watching and Analyzing Dialogs. In WACV, 2016.
- A. Nagrani and A. Zisserman. From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script. In BMVC, 2017.
- O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman. A Compact and Discriminative Face Track Descriptor. In CVPR, 2014.
- O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep Face Recognition. In BMVC, 2015.
- G. Paul, K. Elie, M. Sylvain, O. Jean-Marc, and D. Paul. A conditional random field approach for audio-visual people diarization. In ICASSP, 2014.
- V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with “their” names using coreference resolution. In ECCV, 2014.
- A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh, and B. Schiele. Generating Descriptions with Grounded and Co-Referenced People. In CVPR, 2017.
- A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele. Movie Description. IJCV, 123(1):94–120, 2017.
- M. Roth, M. Bauml, R. Nevatia, and R. Stiefelhagen. Robust Multipose Face Tracking by Multi-stage Tracklet Association. In ICPR, 2012.
- M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A posesensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, 2018.
- F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In CVPR, 2015.
- V. Sharma, A. Diba, D. Neven, M. S. Brown, L. Van Gool, and R. Stiefelhagen. Classification driven dynamic image enhancement. In CVPR, 2018.
- V. Sharma, M. S. Sarfraz, and R. Stiefelhagen. A simple and effective technique for face clustering in tv series. In CVPR: Brave New Motion Representations Workshop. IEEE, 2017.
- J. Sivic, M. Everingham, and A. Zisserman. “Who are you?” – Learning person specific classifiers from video. In CVPR, 2009.
- Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In CVPR, 2014.
- M. Tapaswi, M. Bauml, and R. Stiefelhagen. “Knock! Knock! Who is it?” Probabilistic Person Identification in TV-Series. In CVPR, 2012.
- M. Tapaswi, O. M. Parkhi, E. Rahtu, E. Sommerlade, R. Stiefelhagen, and A. Zisserman. Total Cluster: A Person Agnostic Clustering Method for Broadcast Videos. In ICVGIP, 2014.
- M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding Stories in Movies through Question-Answering. In CVPR, 2016.
- A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In ACMMM, 2015.
- P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler. MovieGraphs: Towards Understanding Human-Centric Situations from Videos. arXiv:1712.06761, 2017.
- X. Wang and A. Gupta. Unsupervised Learning of Visual Representations using Videos. In ICCV, 2015.
- J. H. Ward Jr. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58(301):236–244, 1963.
- L. Wolf, T. Hassner, and I. Maoz. Face Recognition in Unconstrained Videos with Matched Background Similarity. In CVPR, 2011.
- B. Wu, S. Lyu, B.-G. Hu, and Q. Ji. Simultaneous Clustering and Tracklet Linking for Multi-face Tracking in Videos. In ICCV, 2013.
- B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji. Constrained Clustering and its Application to Face Clustering in Videos. In CVPR, 2013.
- S. Xiao, M. Tan, and D. Xu. Weighted Block-sparse Low Rank Representation for Face Clustering in Videos. In ECCV, 2014.
- R. Yan, A. Hauptmann, and R. Jin. Multimedia Search with Pseudo-Relevance Feedback. In CIVR, 2003.
- R. Yan, A. G. Hauptmann, and R. Jin. Negative Pseudo-Relevance Feedback in Content-based Video Retrieval. In ACMMM, 2003.
- L. Zhang, D. V. Kalashnikov, and S. Mehrotra. A Unified Framework for Context Assisted Face Clustering. In ACMMM.
- S. Zhang, Y. Gong, and J. Wang. Deep Metric Learning with Improved Triplet Loss for Face Clustering in Videos. In Pacific Rim Conference on Multimedia, 2016.
- Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Joint Face Representation Adaptation and Clustering in Videos. In ECCV, 2016.
- C. Zhou, C. Zhang, H. Fu, R. Wang, and X. Cao. Multi-cue Augmented Face Clustering. In ACMMM, 2015.