Learning What to Learn for Video Object Segmentation

Abstract:

Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture […]

Introduction
  • Semi-supervised Video Object Segmentation (VOS) is the problem of performing pixel-wise classification of a set of target objects in a video sequence.
  • VOS is an extremely challenging problem, since the target objects are only defined by a reference segmentation in the first video frame, with no other prior information assumed.
  • The VOS method must utilize this very limited information about the target in order to perform segmentation in the subsequent frames.
  • While most state-of-the-art VOS approaches employ similar image feature extractors and segmentation heads, the advances in how to capture and utilize target information have led to much improved performance [14,32,24,28].
Highlights
  • Semi-supervised Video Object Segmentation (VOS) is the problem of performing pixel-wise classification of a set of target objects in a video sequence.
  • VOS is an extremely challenging problem, since the target objects are only defined by a reference segmentation in the first video frame, with no other prior information assumed.
  • Contributions: Our main contributions are listed as follows. (i) We propose a novel VOS architecture, based on an optimization-based few-shot learner. (ii) We go beyond standard few-shot learning approaches, to learn what the learner should learn in order to maximize segmentation accuracy. (iii) Our learner predicts the target model parameters in an efficient and differentiable manner, enabling end-to-end training. (iv) We utilize our learned mask representation to design a light-weight bounding box initialization module, allowing our approach to generate target segmentation masks in the weakly supervised setting.
  • We introduce a trainable convolutional neural network Eθ(y) that takes a ground-truth mask y as input and predicts the ground truth used by the internal few-shot learner.
  • We may back-propagate the error measured between the final segmentation output ỹt = Sθ(It, Tτ(Fθ(It))) and the ground truth yt on a test frame It (see the objective written out after this list). This requires the internal learner (2) to be efficient and differentiable w.r.t. both the underlying features x and the parameters of the label generator Eθ and weight predictor Wθ. We address these open questions to achieve an efficient and end-to-end trainable VOS architecture.
  • We present a novel VOS approach by integrating an optimization-based few-shot learner.
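The few-shot objective (2) referenced in these highlights can be written out explicitly. The following is a reconstruction from the description above, in the paper's notation; the exact weighting, normalization, and regularization terms are assumptions of this sketch:

    % Internal few-shot objective over the training set D = {(x_k, y_k)}:
    % the target module T_tau regresses the labels produced by the label
    % generator E_theta, weighted by the importance weights W_theta.
    L(\tau) = \frac{1}{2} \sum_{(x_k, y_k) \in \mathcal{D}}
              \bigl\| W_\theta(y_k) \cdot \bigl( T_\tau(x_k) - E_\theta(y_k) \bigr) \bigr\|^2
              + \frac{\lambda}{2} \| \tau \|^2

    % Outer (meta) objective on a test frame I_t: back-propagate the
    % segmentation error through the optimized target model tau^*.
    \min_\theta \ \ell\Bigl( S_\theta\bigl( I_t,\, T_{\tau^*}(F_\theta(I_t)) \bigr),\ y_t \Bigr),
    \qquad \tau^* = \arg\min_\tau L(\tau)

Because τ* is obtained with a small number of unrolled, differentiable optimizer steps rather than treated as a black box, gradients of the outer loss reach Eθ and Wθ, which is what makes "learning what to learn" end-to-end trainable.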
Methods
  • The authors present the method for video object segmentation (VOS). First, the authors describe the few-shot learning formulation for VOS in Sec. 3.1.
  • Sec. 3.3 details the target module and the internal few-shot learner.
  • Sec. 3.7 describes how the approach can be extended to perform VOS with only a bounding box initialization.
  • On DAVIS 2017, the approach is on par with Siam R-CNN with a J&F score of 70.6.
  • These results demonstrate that the approach can readily generalize to the box-initialization setting, thanks to the flexible internal target representation.
Results
  • The authors' approach significantly outperforms STM with a relative improvement of over 2.6%, achieving an overall G-score of 81.5.
Conclusion
  • The authors present a novel VOS approach by integrating an optimization-based few-shot learner. The authors' internal learner is differentiable, ensuring an end-to-end trainable VOS architecture.
  • The authors propose to learn what the few-shot learner should learn.
  • This is achieved by designing neural network modules that predict the ground-truth label and importance weights of the few-shot objective.
  • This allows the target model to predict a rich target representation, guiding the VOS network to generate accurate segmentation masks.
Objectives
  • To train the end-to-end network architecture, the authors aim to simulate the inference procedure employed by the approach, described in Section 3.5 (see the sketch below).
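As a rough illustration of this training-by-simulation idea, the following is a minimal PyTorch-style sketch: a simplified linear target module is fitted on the first frame by a few unrolled gradient steps, then applied to the remaining frames, and the segmentation loss is back-propagated through the inner optimization. All module shapes, step counts, learning rates, and names are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FewShotVOS(nn.Module):
        def __init__(self, feat_ch=64, model_ch=16):
            super().__init__()
            self.backbone = nn.Conv2d(3, feat_ch, 3, padding=1)      # stand-in for F_theta
            self.label_gen = nn.Conv2d(1, model_ch, 3, padding=1)    # E_theta: what to learn
            self.weight_pred = nn.Conv2d(1, model_ch, 3, padding=1)  # W_theta: importance weights
            self.seg_head = nn.Conv2d(model_ch, 1, 3, padding=1)     # stand-in for S_theta
            self.feat_ch, self.model_ch = feat_ch, model_ch

        def inner_loss(self, tau, x, e, w, reg=1e-2):
            # Weighted least-squares few-shot objective on the target module
            # T_tau, here a single 3x3 convolution with parameters tau.
            r = F.conv2d(x, tau, padding=1) - e
            return (w * r * r).mean() + reg * (tau * tau).sum()

        def learn_tau(self, x, e, w, steps=5, lr=0.1):
            # Unrolled gradient descent: differentiable w.r.t. x, e and w, so
            # test-frame errors reach E_theta and W_theta during training.
            tau = x.new_zeros(self.model_ch, self.feat_ch, 3, 3).requires_grad_()
            for _ in range(steps):
                g = torch.autograd.grad(self.inner_loss(tau, x, e, w), tau,
                                        create_graph=self.training)[0]
                tau = tau - lr * g
            return tau

        def forward(self, first_frame, first_mask, test_frames):
            x0 = self.backbone(first_frame)
            tau = self.learn_tau(x0, self.label_gen(first_mask),
                                 self.weight_pred(first_mask).abs())
            scores = F.conv2d(self.backbone(test_frames), tau, padding=1)
            return self.seg_head(scores)

    # One meta-training step on a sampled mini-sequence: learn on frame 0,
    # segment frames 1..T-1, and back-propagate the segmentation loss.
    net = FewShotVOS()
    frames, masks = torch.rand(4, 3, 64, 64), torch.rand(4, 1, 64, 64)
    logits = net(frames[:1], masks[:1], frames[1:])
    F.binary_cross_entropy_with_logits(logits, masks[1:]).backward()

The authors describe their inner learner as efficient rather than plain fixed-step gradient descent, but the key property illustrated here is the same: the inner solution stays differentiable with respect to the label generator and weight predictor.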
Tables
  • Table1: Ablative analysis of our approach on a validation set consisting of 300 videos sampled from the YouTube-VOS 2019 training set. We analyze the impact of end-to-end training, the label generator module and the weight predictor
  • Table2: State-of-the-art comparison on the large-scale YouTube-VOS 2018 validation dataset. Our approach outperforms all previous methods, both when using additional training data and when training only on the YouTube-VOS 2018 train split
  • Table3: State-of-the-art comparison on the DAVIS 2017 validation dataset. Our approach is almost on par with the best-performing method STM, while significantly outperforming all previous methods when using only the DAVIS 2017 training data
  • Table4: State-of-the-art comparison with bounding box initialization on YouTube-VOS 2018 and DAVIS 2017 validation. Our approach outperforms existing methods on YouTube-VOS, while achieving a J&F score on par with the state-of-the-art on DAVIS
  • Table5: Comparison of our approach with the recently introduced STM [24] on the large-scale YouTube-VOS 2019 validation dataset. Results are reported in terms of mean Jaccard (J) and boundary (F) scores for object classes that are seen and unseen in the training set, along with the overall mean (G). Our approach outperforms STM with a large margin of +1.8 points in terms of the overall G score
  • Table6: Impact of the weights used for initializing the backbone feature extractor. We compare a network using Mask-RCNN weights for initializing the backbone feature extractor with a network using ImageNet pre-trained weights. The results are reported over a validation set of 300 videos sampled from the YouTube-VOS 2019 training set, in terms of mean Jaccard J score
  • Table7: Impact of the segmentation loss employed during training. We compare a network trained using the Lovász [2] loss function, with a network trained using the binary cross-entropy loss
  • Table8: Comparison of a network trained using the shorter training strategy (see Section D) with the network trained using the long training strategy. In contrast to the shorter training strategy, the backbone feature extractor is also trained when using the long training strategy, while employing a larger batch size. The long training, however, requires 8 times more GPU hours than the shorter training
  • Table9: Impact of different evaluation modes during inference. We compare a version in which the network operates on a local search region with a version in which the network operates on the full image. The search region in the first version is obtained using the estimate of the target mask in the previous frame. Operating on a local search region allows the network to better handle small objects, leading to an improvement of +1.0 points in J score over the baseline operating on the full image (see the sketch below)
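To make the two evaluation modes in Table 9 concrete, the following is a minimal sketch of how a local search region could be cropped from the previous-frame mask estimate. The expansion factor and confidence threshold are illustrative assumptions, not the authors' exact settings.

    import torch

    def search_region(frame, prev_mask, expand=2.0):
        # frame: (C, H, W) image tensor; prev_mask: (H, W) soft mask in [0, 1].
        ys, xs = torch.nonzero(prev_mask > 0.5, as_tuple=True)
        if ys.numel() == 0:
            return frame  # no confident target pixels: fall back to the full image
        # Bounding box of the previous estimate, expanded around its center.
        cy, cx = ys.float().mean(), xs.float().mean()
        h = (ys.max() - ys.min() + 1).float() * expand
        w = (xs.max() - xs.min() + 1).float() * expand
        y0 = max(0, int(cy - h / 2)); y1 = min(frame.shape[-2], int(cy + h / 2))
        x0 = max(0, int(cx - w / 2)); x1 = min(frame.shape[-1], int(cx + w / 2))
        return frame[..., y0:y1, x0:x1]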
Related work
  • In recent years, progress within video object segmentation has surged, leading to rapid performance improvements. Benchmarks such as DAVIS [26] and YouTube-VOS [38] have had a significant impact on this development. Target Models in VOS: Early works mainly adapted semantic segmentation networks to the VOS task through online fine-tuning [27,5,21,37]. However, this strategy easily leads to overfitting to the initial target appearance and impractically long run-times. More recent methods [34,13,32,23,36,24,17] therefore integrate target-specific appearance models into the segmentation architecture. In addition to improved run-times, many of these methods can also benefit from full end-to-end learning, which has been shown to have a crucial impact on performance [32,14,24]. Generally, these works train a target-agnostic segmentation network that is conditioned on a target model. The latter captures information about the target object, deduced from the initial image-mask pair. The generated target-aware representation is then provided to the target-agnostic segmentation network, which outputs the final prediction. Crucially, in order to achieve end-to-end training of the entire network, the target model needs to be differentiable.
Funding
  • Acknowledgments: This work was partly supported by the ETH Zurich Fund (OK), a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and Nvidia
Study subjects and analysis
  • We set η = 0.9 and ensure the weights sum to one. We keep at most Kmax = 32 samples in the few-shot training dataset D, removing the oldest when the limit is reached. We always keep the first frame, since it has the reference target mask y0 (see the sketch below).
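A minimal sketch of this sample-memory bookkeeping (the list layout, with the reference frame pinned at index 0, is an assumption of this illustration rather than the authors' exact data structure):

    ETA, K_MAX = 0.9, 32

    def update_memory(memory, new_sample):
        # memory: list of (feature, label) samples, oldest first; index 0 is
        # always the first frame with the reference target mask y0.
        memory.append(new_sample)
        if len(memory) > K_MAX:
            del memory[1]  # drop the oldest sample that is not the reference
        # Exponentially decaying importance: the newest sample gets weight 1,
        # older samples eta, eta^2, ..., then normalize so the weights sum to one.
        w = [ETA ** (len(memory) - 1 - i) for i in range(len(memory))]
        total = sum(w)
        return memory, [wi / total for wi in w]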

References
  • Behl, H.S., Najafi, M., Arnab, A., Torr, P.H.S.: Meta learning deep visual words for fast video object segmentation. In: NeurIPS 2019 Workshop on Machine Learning for Autonomous Driving (2019)
  • Berman, M., Rannen Triki, A., Blaschko, M.B.: The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4413–4421 (2018)
  • Bertinetto, L., Henriques, J.F., Torr, P., Vedaldi, A.: Meta-learning with differentiable closed-form solvers. In: International Conference on Learning Representations (2019)
  • Bhat, G., Danelljan, M., Van Gool, L., Timofte, R.: Learning discriminative model prediction for tracking. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6182–6191 (2019)
  • Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5320–5329. IEEE (2017)
  • Choi, J., Kwon, J., Lee, K.M.: Deep meta learning for real-time target-aware visual tracking. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 911–920 (2019)
  • Cohen, I., Medioni, G.: Detecting and tracking moving objects for video surveillance. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. vol. 2, pp. 319–325. IEEE (1999)
  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR (2009)
  • Erdelyi, A., Barat, T., Valet, P., Winkler, T., Rinner, B.: Adaptive cartooning for privacy protection in camera networks. In: 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). pp. 44–49. IEEE (2014)
  • Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1126–1135. JMLR.org (2017)
  • He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: IEEE International Conference on Computer Vision (ICCV). pp. 2980–2988 (2017)
  • He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV (2015)
  • Hu, Y.T., Huang, J.B., Schwing, A.G.: VideoMatch: Matching based video object segmentation. In: European Conference on Computer Vision. pp. 56–73. Springer (2018)
  • Johnander, J., Danelljan, M., Brissman, E., Khan, F.S., Felsberg, M.: A generative appearance model for end-to-end video object segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2015)
  • Lee, K., Maji, S., Ravichandran, A., Soatto, S.: Meta-learning with differentiable convex optimization. In: CVPR (2019)
  • Lin, H., Qi, X., Jia, J.: AGSS-VOS: Attention guided single-shot video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3949–3957 (2019)
  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
  • Liu, Y., Liu, L., Zhang, H., Rezatofighi, H., Reid, I.: Meta learning with differentiable closed-form solver for fast video object segmentation. arXiv preprint arXiv:1909.13046 (2019)
  • Luiten, J., Voigtlaender, P., Leibe, B.: PReMVOS: Proposal-generation, refinement and merging for video object segmentation. In: Asian Conference on Computer Vision. pp. 565–580. Springer (2018)
  • Maninis, K.K., Caelles, S., Chen, Y., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2018)
  • Massa, F., Girshick, R.: maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark (2018), accessed: 04/09/2019
  • Oh, S.W., Lee, J.Y., Sunkavalli, K., Kim, S.J.: Fast video object segmentation by reference-guided mask propagation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7376–7385. IEEE (2018)
  • Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
  • Park, E., Berg, A.C.: Meta-tracker: Fast and robust online adaptation for visual object trackers. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 569–585 (2018)
  • Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Computer Vision and Pattern Recognition (2016)
  • Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2663–2672 (2017)
  • Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., Felsberg, M.: Learning fast and robust target models for video object segmentation. In: CVPR (2020)
  • Ros, G., Ramos, S., Granados, M., Bakhtiary, A., Vazquez, D., Lopez, A.M.: Vision-based offline-online perception paradigm for autonomous driving. In: IEEE Winter Conference on Applications of Computer Vision. pp. 231–238. IEEE (2015)
  • Saleh, K., Hossny, M., Nahavandi, S.: Kangaroo vehicle collision detection using deep semantic segmentation convolutional neural network. In: International Conference on Digital Image Computing: Techniques and Applications (DICTA). pp. 1–7. IEEE (2016)
  • Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. In: BMVC (2017)
  • Voigtlaender, P., Leibe, B.: FEELVOS: Fast end-to-end embedding learning for video object segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
  • Voigtlaender, P., Luiten, J., Torr, P.H., Leibe, B.: Siam R-CNN: Visual tracking by re-detection. arXiv preprint arXiv:1911.12836 (2019)
  • Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., Murphy, K.: Tracking emerges by colorizing videos. In: European Conference on Computer Vision. pp. 402–419. Springer (2018)
  • Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: A unifying approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1328–1338 (2019)
  • Wang, Z., Xu, J., Liu, L., Zhu, F., Shao, L.: RANet: Ranking attention network for fast video object segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3978–3987 (2019)
  • Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., Huang, T.: YouTube-VOS: Sequence-to-sequence video object segmentation. In: European Conference on Computer Vision. pp. 603–619. Springer (2018)
  • Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.: YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327 (2018)
  • Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative feature network for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1857–1866 (2018)