RVOS: End-to-End Recurrent Network for Video Object Segmentation

CVPR, 2019.

Keywords: RNN, recurrent network, online learning, online video object segmentation, state-of-the-art techniques

Abstract:

Multiple object video object segmentation is a challenging task, especially for the zero-shot case, where no object mask is given at the initial frame and the model has to find the objects to be segmented along the sequence. In our work, we propose a Recurrent network for multiple object Video Object Segmentation (RVOS) that is fully end-to-end trainable. […]

Code: https://imatge-upc.github.io/rvos/
Introduction
  • Video object segmentation (VOS) aims at separating the foreground from the background given a video sequence.
  • This task has raised a lot of interest in the computer vision community since the appearance of benchmarks [21] that have given access to annotated datasets and standardized metrics.
  • New benchmarks [22, 33] that address multi-object segmentation and provide larger datasets have become available, leading to more challenging tasks.
Highlights
  • Video object segmentation (VOS) aims at separating the foreground from the background given a video sequence
  • We present quantitative results for zero-shot learning for two benchmarks: DAVIS-2017 [22] and YouTube-VOS [33]
  • Our model can be adapted to one-shot and zero-shot scenarios, and we present the first quantitative results for zero-shot video object segmentation on the DAVIS-2017 and YouTube-VOS benchmarks [22, 33]
  • We propose a model based on an encoder-decoder architecture to solve two different tasks of the video object segmentation problem: one-shot and zero-shot VOS (a sketch of this architecture follows this list)
  • In this work we have presented a fully end-to-end trainable model for multiple-object video object segmentation (VOS) with a recurrence module operating over the spatial and temporal domains
  • We give the first results for zero-shot VOS on both benchmarks and, for one-shot VOS, we outperform state-of-the-art techniques that do not make use of online learning
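To ground these claims, the following is a minimal, self-contained PyTorch sketch of an encoder-decoder VOS model with spatio-temporal recurrence, in the spirit of the paper but not the authors' implementation: the toy encoder, the ConvGRU cell (the paper builds on ConvLSTMs [31]), the layer sizes, and the fixed object count are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Small convolutional GRU used as the recurrent unit in this sketch.
    (The paper uses ConvLSTMs; a ConvGRU keeps the example short.)"""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde


class RecurrentVOS(nn.Module):
    """Hedged sketch of an encoder-decoder recurrent VOS model.
    All layer sizes are illustrative assumptions, not the authors' values."""
    def __init__(self, feat=32, hid=32, max_objects=4):
        super().__init__()
        self.max_objects = max_objects
        # Toy per-frame encoder standing in for a ResNet backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # One recurrent cell shared across objects and frames: its input is
        # the frame feature concatenated with the previous object's hidden
        # state (spatial recurrence), and its state carries over frames
        # (temporal recurrence).
        self.cell = ConvGRUCell(feat + hid, hid)
        self.mask_head = nn.Conv2d(hid, 1, 1)  # one binary mask per object

    def forward(self, video):  # video: (T, 3, H, W)
        T, _, H, W = video.shape
        hid = self.cell.cand.out_channels
        # h_temporal[i] holds object i's hidden state from the previous frame.
        h_temporal = [video.new_zeros(1, hid, H, W) for _ in range(self.max_objects)]
        masks = []  # per frame: (max_objects, 1, H, W)
        for t in range(T):
            f = self.encoder(video[t:t + 1])           # (1, feat, H, W)
            h_spatial = video.new_zeros(1, hid, H, W)  # previous object, same frame
            frame_masks = []
            for i in range(self.max_objects):
                x = torch.cat([f, h_spatial], 1)
                h = self.cell(x, h_temporal[i])  # temporal recurrence
                h_temporal[i] = h
                h_spatial = h                    # spatial recurrence
                frame_masks.append(torch.sigmoid(self.mask_head(h)))
            masks.append(torch.cat(frame_masks, 0))
        return masks


# Usage: segment up to 4 objects in a toy 5-frame clip.
model = RecurrentVOS()
clip = torch.rand(5, 3, 32, 32)
out = model(clip)
print(len(out), out[0].shape)  # 5 frames, (4, 1, 32, 32) masks each
```

For one-shot VOS the masks given at the first frame would initialize the recurrence, while for zero-shot VOS the model must discover the objects on its own; this sketch only illustrates the recurrence pattern.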
Methods
  • The experiments are carried out for the two different tasks of VOS: one-shot and zero-shot.
  • In both cases, the authors analyze how important the spatial and the temporal hidden states are (a formalization of these two recurrence paths follows this list).
  • Evaluation is performed on the YouTube-VOS validation set and on the DAVIS-2017 test-dev set.
  • Both YouTube-VOS and DAVIS-2017 videos include multiple objects and have a similar duration (3-6 seconds).
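To make the ablated variants concrete, here is a hedged formalization in our own notation (the symbols and exact gating are assumptions, not taken verbatim from the paper): let $x_t$ be the encoder features of frame $t$ and $h_{t,i}$ the decoder hidden state for object instance $i$ at frame $t$. The spatial path recurses over the object index within a frame, the temporal path over frames, and the ST variant combines both:

```latex
% Hedged notation: RNN(input, previous state) denotes one recurrent step
% (e.g., a ConvLSTM cell); [a; b] denotes channel-wise concatenation.
\begin{align*}
  \text{spatial only:}         \quad & h_{t,i} = \mathrm{RNN}\!\left([x_t;\, h_{t,i-1}],\; 0\right) \\
  \text{temporal only:}        \quad & h_{t,i} = \mathrm{RNN}\!\left(x_t,\; h_{t-1,i}\right) \\
  \text{spatio-temporal (ST):} \quad & h_{t,i} = \mathrm{RNN}\!\left([x_t;\, h_{t,i-1}],\; h_{t-1,i}\right)
\end{align*}
```

Tables 1 and 5 compare these three choices for one-shot and zero-shot VOS, respectively.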
Results
  • The authors' model achieves remarkable performance without needing fine-tuning for each test sequence, making it the fastest method.
  • The authors' model outperforms the other state-of-the-art techniques [3, 20, 30, 34] on the seen categories.
  • The authors give the first results for zero-shot VOS on both benchmarks and, for one-shot VOS, outperform state-of-the-art techniques that do not make use of online learning.
Conclusion
  • In this work the authors have presented a fully end-to-end trainable model for multiple-object video object segmentation (VOS) with a recurrence module operating over the spatial and temporal domains.
  • The model has been designed for both one-shot and zero-shot VOS and tested on the YouTube-VOS and DAVIS-2017 benchmarks.
  • The authors give the first results for zero-shot VOS on both benchmarks and, for one-shot VOS, outperform state-of-the-art techniques that do not make use of online learning.
Tables
  • Table 1: Ablation study on spatial and temporal recurrence in the decoder for one-shot VOS on the YouTube-VOS dataset. Models have been trained using an 80%-20% partition of the training set and evaluated on the validation set. "+" means that the model has been trained using the inferred masks (see the sketch after this list)
  • Table 2: Comparison against state-of-the-art VOS techniques for one-shot VOS on the YouTube-VOS validation set. OL refers to online learning. The table is split into two parts, depending on whether the techniques use online learning or not
  • Table 3: Analysis of the proposed model RVOS-Mask-ST+ depending on the number of instances in one-shot VOS
  • Table 4: Comparison against state-of-the-art VOS techniques for one-shot VOS on the DAVIS-2017 test-dev set. OL refers to online learning. RVOS-Mask-ST+ (pre) is the model trained on YouTube-VOS, and RVOS-Mask-ST+ (ft) is the same model after fine-tuning on DAVIS-2017. The table is split into two parts, depending on whether the techniques use online learning or not
  • Table 5: Ablation study on spatial and temporal recurrence in the decoder for zero-shot VOS on the YouTube-VOS dataset. The models have been trained using an 80%-20% partition of the training set and evaluated on the validation set
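For the "+" setting in Table 1 (training with inferred masks), a hedged sketch of one training step is shown below: the mask fed to the model at frame t is its own detached prediction from frame t-1 rather than the ground truth. The model signature, single-object simplification, and hyperparameters are our assumptions, not the authors' code.

```python
import torch

def train_step_with_inferred_masks(model, frames, gt_masks, loss_fn, opt):
    """One sketched training step for the "+" setting (assumes T >= 2).

    frames:   (T, 3, H, W) video clip
    gt_masks: (T, 1, H, W) ground-truth masks (single object for brevity)
    model:    assumed signature model(frame, prev_mask) -> mask logits
    """
    opt.zero_grad()
    prev_mask = gt_masks[0:1]          # the first frame's mask is given
    total_loss = 0.0
    for t in range(1, frames.shape[0]):
        pred = model(frames[t:t + 1], prev_mask)    # assumed model signature
        total_loss = total_loss + loss_fn(pred, gt_masks[t:t + 1])
        prev_mask = torch.sigmoid(pred).detach()    # inferred mask, no grad
    total_loss.backward()
    opt.step()
    return float(total_loss)
```

Training on inferred rather than ground-truth masks exposes the model to its own errors, which matches how it is used at test time.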
Related work
  • Deep learning techniques for the object segmentation task have gained attention in the research community in recent years [3, 5, 7, 8, 9, 10, 13, 14, 20, 26, 27, 28, 29, 30, 31, 34]. In great measure, this is due to the emergence of new challenges and segmentation datasets, from the Berkeley Video Segmentation Dataset (2011) [1], SegTrack (2013) [15], and the Freiburg-Berkeley Motion Segmentation Dataset (2014) [19], to more accurately and densely labeled ones such as DAVIS (2016-2017) [21, 22], and to the latest segmentation dataset, YouTube-VOS (2018) [32], which provides the largest amount of annotated videos to date.

    ¹ https://davischallenge.org/challenge2019/unsupervised.html

    Video object segmentation. Considering the temporal dimension of video sequences, we differentiate between algorithms that aim to model the temporal dimension of an object segmentation through a video sequence, and those without temporal modeling that predict object segmentations at each frame independently.

    For segmentation without temporal modeling, one-shot VOS has been handled with online learning, where the first annotated frame of the video sequence is used to fine-tune a pretrained network and segment the objects in the remaining frames [3] (a minimal sketch of this online fine-tuning loop follows). Some approaches have built on top of this idea, either by updating the network online with additional highly confident predictions [30], or by using the instance segments of the different objects in the scene as prior knowledge and blending them with the segmentation output [17]. Others have explored data augmentation strategies for video by applying transformations to images and object segments [12], tracking object parts to obtain region-of-interest segmentation masks [4], or meta-learning approaches to quickly adapt the network to the object mask given in the first frame [34].
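As a concrete illustration of this online-learning recipe, here is a hedged sketch with a toy stand-in network and assumed hyperparameters (our own code, not that of [3] or [30]): fine-tune on the single annotated first frame, then run the adapted model on the rest of the sequence.

```python
import torch
import torch.nn as nn

def online_finetune(model: nn.Module, first_frame: torch.Tensor,
                    first_mask: torch.Tensor, steps: int = 100,
                    lr: float = 1e-4):
    """Fine-tune a pretrained segmentation model on the single annotated
    first frame of a video (the online-learning recipe of methods like [3]).

    first_frame: (1, 3, H, W) image tensor.
    first_mask:  (1, 1, H, W) binary ground-truth mask.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        logits = model(first_frame)        # (1, 1, H, W) mask logits
        loss = loss_fn(logits, first_mask)
        loss.backward()
        opt.step()
    model.eval()
    return model

# Usage with a toy stand-in single-object segmentation network:
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))
frame = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
net = online_finetune(net, frame, mask, steps=5)
# The adapted net is then run on every remaining frame of the sequence.
```

This per-sequence fine-tuning is the main cost that online-learning methods pay at test time, and it is precisely what RVOS avoids.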
Funding
  • This research was supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund (TIN2015-66951-C2-2-R, TIN2015-65316-P & TEC2016-75976-R), the BSC-CNS Severo Ochoa SEV-2015-0493 and LaCaixa-Severo Ochoa International Doctoral Fellowship programs, the 2017 SGR 1414 grant and the Industrial Doctorates 2017-DI-064 & 2017-DI-028 from the Government of Catalonia.
References
  • [1] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
  • [2] Linchao Bao, Baoyuan Wu, and Wei Liu. CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5977–5986, 2018.
  • [3] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 221–230, 2017.
  • [4] Jingchun Cheng, Yi-Hsuan Tsai, Wei-Chih Hung, Shengjin Wang, and Ming-Hsuan Yang. Fast and accurate online video object segmentation via tracking parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7415–7424, 2018.
  • [5] Jingchun Cheng, Yi-Hsuan Tsai, Shengjin Wang, and Ming-Hsuan Yang. SegFlow: Joint learning for video object segmentation and optical flow. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 686–695, 2017.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [7] Yuan-Ting Hu, Jia-Bin Huang, and Alexander Schwing. MaskRNN: Instance level video object segmentation. In Advances in Neural Information Processing Systems (NIPS), pages 325–334, 2017.
  • [8] Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 786–802, 2018.
  • [9] Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2126, 2017.
  • [10] Varun Jampani, Raghudeep Gadde, and Peter V. Gehler. Video propagation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 451–461, 2017.
  • [11] Won-Dong Jang and Chang-Su Kim. Online video object segmentation via convolutional trident network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5849–5858, 2017.
  • [12] Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. Lucid data dreaming for multiple object tracking. arXiv preprint arXiv:1703.09554, 2017.
  • [13] Yeong Jun Koh and Chang-Su Kim. Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7417–7425, 2017.
  • [14] Dong Lao and Ganesh Sundaramoorthi. Extending layered models to 3D motion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
  • [15] Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2192–2199, 2013.
  • [16] Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, and C.-C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6526–6535, 2018.
  • [17] Kevis-Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [18] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6819–6828, 2018.
  • [19] Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1187–1200, 2014.
  • [20] Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2663–2672, 2017.
  • [21] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 724–732, 2016.
  • [22] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbelaez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • [23] Mengye Ren and Richard S. Zemel. End-to-end instance segmentation with recurrent attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6656–6664, 2017.
  • [24] Bernardino Romera-Paredes and Philip Hilaire Sean Torr. Recurrent instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 312–329, 2016.
  • [25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [26] Amaia Salvador, Miriam Bellver, Manel Baradad, Ferran Marques, Jordi Torres, and Xavier Giró-i-Nieto. Recurrent neural networks for semantic instance segmentation. arXiv preprint arXiv:1712.00617, 2017.
  • [27] Hongmei Song, Wenguan Wang, Sanyuan Zhao, Jianbing Shen, and Kin-Man Lam. Pyramid dilated deeper ConvLSTM for video salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 715–731, 2018.
  • [28] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 531–539, 2017.
  • [29] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4481–4490, 2017.
  • [30] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
  • [31] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NIPS), pages 802–810, 2015.
  • [32] Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. YouTube-VOS: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 585–601, 2018.
  • [33] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
  • [34] Linjie Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.