Improving Landmark Localization with Semi-Supervised Learning

Computer Vision and Pattern Recognition (CVPR), 2018.

Keywords: class label, face alignment, landmark estimation, landmark localization, equivariant landmark transformation

Abstract:

We present two techniques to improve landmark localization in images from partially annotated datasets. Our primary goal is to leverage the common situation where precise landmark locations are only provided for a small data subset, but where class labels for classification or regression tasks related to the landmarks are more abundantly …

Introduction
  • Landmark localization – finding the precise location of specific parts in an image – is a central step in many complex vision problems.
  • The first key element of the work, illustrated in the second diagram of Figure 1, is that the authors use the indirect supervision of class labels to guide networks trained to localize landmarks.
  • A common approach [50, 52, 47, 9] to multi-task learning uses a traditional CNN, in which a final common fully-connected (FC) layer feeds into separate branches, each dedicated to the output for a different task.
  • This approach learns shared low-level features across the set of tasks and acts as a regularizer when the individual tasks have few labeled samples
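The contrast between the two designs can be sketched abstractly: in the common multi-tasking layout (Comm-MT) the class head and the landmark head both branch off a shared feature vector, whereas in the sequential layout (Seq-MT) the class prediction is computed from the predicted landmarks, so classification gradients must pass through the localizer. A toy functional sketch; the trunk and heads here are hypothetical linear stand-ins, not the paper's actual layers:

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(16, 8))     # stand-in shared trunk
W_lm = rng.normal(size=(8, 4))        # landmark head: 2 landmarks x (x, y)
W_cls_feat = rng.normal(size=(8, 3))  # class head off shared features
W_cls_lm = rng.normal(size=(4, 3))    # class head off landmarks

def comm_mt(x):
    """Common multi-tasking: both outputs branch from shared features."""
    f = np.tanh(x @ W_feat)
    return f @ W_lm, f @ W_cls_feat   # landmarks, class scores

def seq_mt(x):
    """Sequential multi-tasking: class scores are a function of the
    predicted landmarks, so class-label gradients reach the localizer."""
    f = np.tanh(x @ W_feat)
    landmarks = f @ W_lm
    return landmarks, landmarks @ W_cls_lm

x = rng.normal(size=16)
print(comm_mt(x)[1].shape, seq_mt(x)[1].shape)  # (3,) (3,)
```

The point of the sequential form is structural: any error signal on the class scores necessarily flows back through the landmark representation, which is what lets abundant class labels supervise the localizer.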
Highlights
  • Landmark localization – finding the precise location of specific parts in an image – is a central step in many complex vision problems
  • We propose a novel class of neural architectures which force classification predictions to flow through the intermediate step of landmark localization to provide complete supervision during backpropagation
  • In this paper we make the following contributions: 1) We propose a novel multi-tasking neural architecture, which a) predicts landmarks as an intermediate step before classification in order to use the class labels to improve landmark localization, b) uses soft-argmax for a fully-differentiable model in which end-to-end training can be performed, even from examples that do not provide labeled landmarks
  • We introduce a labeled set of ground-truth (GT) landmark locations and evaluate landmark localization accuracy while varying the percentage of the training set that is labeled with landmarks
  • We presented a new architecture and training procedure for semi-supervised landmark localization
  • We present results on two toy datasets and four real datasets, with hands and faces, and report a new state of the art on two datasets in the wild; e.g., with only 5% of labeled images we outperform the previous state of the art trained on the AFLW dataset
  • In addition, we developed an architecture that improves landmark estimation using auxiliary attributes such as class labels, by backpropagating errors through the landmark localization components of the model
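The soft-argmax mentioned in the contributions replaces the non-differentiable argmax over a predicted heatmap with a softmax-weighted average of pixel coordinates, so gradients from a downstream classification loss can flow back into the localizer. A minimal numpy sketch; the temperature `beta` and the toy heatmap are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def soft_argmax(heatmap, beta=10.0):
    """Differentiable landmark coordinate from a 2-D score map.

    Softmax over all pixels turns the map into a distribution;
    the expected (x, y) under it approaches the hard argmax as
    beta grows, yet stays smooth in the map's values.
    """
    h, w = heatmap.shape
    flat = beta * heatmap.ravel()
    p = np.exp(flat - flat.max())
    p /= p.sum()                       # softmax over pixel locations
    ys, xs = np.mgrid[0:h, 0:w]
    x = float((p * xs.ravel()).sum())  # expected column index
    y = float((p * ys.ravel()).sum())  # expected row index
    return x, y

# A map peaked at row 2, col 3 yields coordinates near (3, 2).
hm = np.zeros((5, 5)); hm[2, 3] = 5.0
print(soft_argmax(hm))
```

Because the output is an expectation rather than an index lookup, the same operation works for unlabeled images: the class loss applied to the predicted coordinates still produces gradients for the heatmap.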
Methods
  • To validate the proposed model, the authors begin with two toy datasets in Sections 4.1 and 4.2, in order to verify to what extent the class labels can be used to guide the landmark localization regardless of the complexity of the dataset.
  • Images in the Shapes dataset consist of a white triangle and a white square on black background, with randomly sampled size, location, and orientation.
  • The model is trained with only the cross-entropy cost on the class label, without labeled landmarks or the unsupervised ELT cost
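The unsupervised ELT (equivariant landmark transformation) cost mentioned above exploits the fact that landmarks should move with the image: applying a known transform to the image and then localizing should agree with localizing first and transforming the predicted coordinates. A hedged sketch with a hypothetical `model` callable and a pure translation as the transform (the paper uses a richer family of transformations):

```python
import numpy as np

def elt_cost(model, image, shift):
    """Equivariance penalty: localize-then-shift vs shift-then-localize.

    `model` maps an image to an (L, 2) array of (x, y) landmark
    coordinates; `shift` is an integer (dy, dx) translation. No
    ground-truth landmarks are needed, so this cost also applies
    to unlabeled images.
    """
    dy, dx = shift
    moved = np.roll(image, (dy, dx), axis=(0, 1))   # T(image)
    lm_then_t = model(image) + np.array([dx, dy])   # T(model(image))
    t_then_lm = model(moved)                        # model(T(image))
    return float(((lm_then_t - t_then_lm) ** 2).mean())

# A perfectly equivariant toy "model" (brightest pixel, as x, y)
# incurs zero cost.
toy = lambda im: np.array([[np.argmax(im.max(0)), np.argmax(im.max(1))]], float)
img = np.zeros((8, 8)); img[2, 3] = 1.0
print(elt_cost(toy, img, (1, 2)))   # 0.0
```

A model whose predictions do not follow the image under the transform pays a squared-distance penalty, which is what pushes the localizer toward consistent landmark placement on unlabeled data.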
Results
  • The authors present results on two toy datasets and four real datasets, with hands and faces, and report a new state of the art on two datasets in the wild; e.g., with only 5% of labeled images the authors outperform the previous state of the art trained on the AFLW dataset.
  • With only 5% of labeled data the method outperforms the previous state-of-the-art methods.
  • The authors achieve new state-of-the-art performance on 300W and AFLW, public benchmark datasets for fiducial points in the wild
Conclusion
  • The authors presented a new architecture and training procedure for semi-supervised landmark localization.
  • In addition, the authors developed an architecture that improves landmark estimation using auxiliary attributes such as class labels, by backpropagating errors through the landmark localization components of the model.
  • Experiments show that these techniques achieve high accuracy with far fewer labeled landmark training examples in landmark localization tasks for hands and faces.
  • The authors achieve new state-of-the-art performance on 300W and AFLW, public benchmark datasets for fiducial points in the wild
Tables
  • Table1: Error of different architectures on Blocks dataset. The error is reported in pixel space. An error of 1 indicates 1 pixel distance to the target landmark location. The first 4 rows show the results of Seq-MT architecture, as shown in Fig. 2. The 5th and 6th rows show results of Comm-MT, depicted in Fig. 5. The last two rows show the results of Heatmap-MT, depicted in Fig. 6. The results are averaged over five seeds
  • Table2: Performance of architectures on HGR1 hands dataset. The error is Euclidean distance normalized by wrist width. Results are shown as percent; lower is better
  • Table3: Performance of different architectures on Multi-PIE dataset. The error is Euclidean distance normalized by eye-centers (as a percent; lower is better). We do not apply ELT cost on the examples that provide GT landmarks
  • Table4: Comparison of recent models on their training conditions. RAR and Lv et al. [21] initialize their models with pre-trained parameters. TCDCN uses 20,000 extra labeled images. Finally, RAR adds manual samples by occluding images with sunglasses, medical masks, phones, etc., to make the model robust to occlusion. Like RCN, Seq-MT and RCN+ both use an explicit validation set for hyperparameter (HP) selection and therefore a smaller training set. Neither uses any extra data, whether through pre-trained models or explicit external data
  • Table5: Comparison with other SOTA models (as a percent; lower is better). (left) Performance of different architectures on 300W test-set using 100% labeled landmarks. The error is Euclidean distance normalized by ocular distance. (right-top) Comparison with four other multi-tasking approaches and RCN. For these comparisons, we have implemented the specific architectures proposed in those papers. Error is as in Sections 4.3 and 4.4. (right-bottom) Comparison of different architectures on AFLW test set. The error is Euclidean distance normalized by face size
  • Table6: Performance of different architectures on 300W testset. The error is Euclidean distance normalized by ocular distance (eye-centers). Error is shown as a percent; lower is better
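The errors in the tables above are all normalized Euclidean distances; only the normalizer differs per dataset (wrist width for HGR1, eye-center distance for Multi-PIE and 300W, face size for AFLW). A small sketch of the metric, assuming predictions and ground truth arrive as (L, 2) coordinate arrays:

```python
import numpy as np

def normalized_error(pred, gt, normalizer):
    """Mean per-landmark Euclidean distance, divided by a dataset-
    specific length (e.g. inter-ocular distance), as a percent."""
    dists = np.linalg.norm(pred - gt, axis=1)   # per-landmark distance
    return 100.0 * dists.mean() / normalizer

gt = np.array([[10.0, 10.0], [30.0, 10.0]])     # e.g. two eye centers
pred = gt + np.array([[3.0, 4.0], [0.0, 0.0]])  # one landmark off by 5 px
iod = np.linalg.norm(gt[0] - gt[1])             # inter-ocular distance = 20
print(normalized_error(pred, gt, iod))          # 12.5
```

Normalizing by a per-face length makes the numbers comparable across image resolutions and face scales, which is why the tables can report a single percentage per dataset.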
Funding
  • This work was partially funded by NVIDIA’s NVAIL program
References
  • R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
  • A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In CVPR, pages 3444–3451, 2013.
  • P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2930–2940, 2013.
  • X. Burgos-Artizzu, P. Perona, and P. Dollar. Robust face landmark estimation under occlusion. In ICCV, pages 1513– 1520, 2013.
  • X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In IJCV, 107(2):177–190, 2014.
  • O. Chapelle and M. Wu. Gradient descent optimization of smoothed information retrieval metrics. Information retrieval, 13(3):216–235, 2010.
  • N. Dardas, Q. Chen, N. D. Georganas, and E. M. Petriu. Hand gesture recognition using bag-of-features and multiclass support vector machine. In Haptic Audio-Visual Environments and Games (HAVE), 2010 IEEE International Symposium on, pages 1–5. IEEE, 2010.
  • D. Datcu and S. Lukosch. Free-hands interaction in augmented reality. In Proceedings of the 1st symposium on Spatial user interaction, pages 33–40. ACM, 2013.
  • T. Devries, K. Biswaranjan, and G. W. Taylor. Multi-task learning of facial landmarks and expression. In Computer and Robot Vision (CRV), 2014 Canadian Conference on, pages 98–103. IEEE, 2014.
  • R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
  • B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5743–5752, 2016.
  • K. Hu, S. Canavan, and L. Yin. Hand pointing estimation for human computer interaction based on two orthogonal-views. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 3760–3763. IEEE, 2010.
  • S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, C. Gulcehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari, et al. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International conference on multimodal interaction, pages 543–550. ACM, 2013.
  • M. Kawulok, J. Kawulok, J. Nalepa, and B. Smolka. Selfadaptive algorithm for segmenting skin regions. EURASIP Journal on Advances in Signal Processing, 2014(1):170, 2014.
  • M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV Workshops (Benchmarking Facial Image Analysis Technologies), 2011.
  • S. Laine and T. Aila. Temporal ensembling for semisupervised learning. In International Conference on Learning Representation, 2017.
  • V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In European Conference on Computer Vision, pages 679–692.
  • J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • J. Lv, X. Shao, J. Xing, C. Cheng, and X. Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • K. A. F. Mora and J.-M. Odobez. Gaze estimation from multimodal kinect data. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 25–30. IEEE, 2012.
  • J. Nalepa and M. Kawulok. Fast and accurate hand shape classification. In International Conference: Beyond Databases, Architectures and Structures, pages 364–373.
  • R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
  • A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
  • S. Ren, X. Cao, Y. Wei, and J. Sun. Face alignment at 3000 fps via regressing local binary features. In CVPR, pages 1685–1692, 2014.
  • C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCV Workshop, pages 397–403, 2013.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
  • A. Sinha, C. Choi, and K. Ramani. Deephand: Robust hand pose estimation by completing a matrix imputed with deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4150–4158, 2016.
  • S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive markerless articulated hand motion tracking using RGB and depth data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2456–2463, 2013.
  • X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 824–832, 2015.
  • Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in neural information processing systems, pages 1988–1996, 2014.
  • Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.
  • D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3d articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3786– 3793, 2014.
  • J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
  • J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5):169, 2014.
  • C. Wan, T. Probst, L. Van Gool, and A. Yao. Crossing nets: Combining gans and vaes with a shared latent space for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 680–689, 2017.
  • W. Wang, S. Tulyakov, and N. Sebe. Recurrent convolutional face alignment. In Asian Conference on Computer Vision, pages 104–120, 2016.
  • J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655.
  • S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim. Robust facial landmark detection via recurrent attentiverefinement networks. In European Conference on Computer Vision, pages 57–72.
  • X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In CVPR, pages 532–539, 2013.
  • F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representation, 2016.
  • X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas. Posefree facial landmark fitting via optimized part mixtures and cascaded deformable shape model. In ICCV, pages 1944– 1951, 2013.
  • X. Yu, F. Zhou, and M. Chandraker. Deep deformation network for object landmark localization. In European Conference on Computer Vision, pages 52–70.
  • S. Yuan, G. Garcia-Hernando, B. Stenger, T.-K. Kim, et al. Depth-based 3d hand pose estimation: From current achievements to future goals. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • C. Zhang and Z. Zhang. Improving multiview face detection with multi-task deep convolutional neural networks. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 1036–1041. IEEE, 2014.
  • J. Zhang, S. Shan, M. Kan, and X. Chen. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In ECCV, pages 1–16. 2014.
  • X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Appearancebased gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4511–4520, 2015.
  • Z. Zhang, P. Luo, C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, pages 94– 108, 2014.
  • Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with auxiliary attributes. In PAMI, 2015.
  • Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE transactions on pattern analysis and machine intelligence, 38(5):918–930, 2016.
  • J. J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. In International Conference on Learning Representation - Workshop Track, 2016.
  • S. Zhu, C. Li, C. C. Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In CVPR, pages 4998–5006, 2015.
  • S. Zhu, C. Li, C.-C. Loy, and X. Tang. Unconstrained face alignment via cascaded compositional learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, pages 2879– 2886, 2012.