Polar Relative Positional Encoding for Video-Language Segmentation

IJCAI, pp. 948-954, 2020.

Keywords:
action video segmentation, nearby object, visual question answering, natural language, pink dotted dress
Weibo:
We proposed a novel Polar Relative Positional Encoding mechanism along with a Polar Attention Module for video-language segmentation

Abstract:

In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in video frames. To accurately denote a target object, the given sentence usually refers to multiple attributes, such as nearby objects with spatial relations ...

Introduction
  • Given a video and a natural language description, the model is asked to generate pixel-level segmentation maps that segment the target object or actor in the frames of interest according to the description (a minimal interface sketch follows this list).
  • It is a very challenging task.
  • The ability to exploit spatial relations is important for recognizing the correct target.
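    To make the task setup concrete, the following is a minimal sketch of the expected input/output interface. The function name and tensor shapes are our assumptions for illustration and are not taken from the paper.

        import numpy as np

        def segment_from_sentence(frames: np.ndarray, sentence: str) -> np.ndarray:
            """Video-language segmentation, interface only (no model inside).

            frames:   (T, H, W, 3) uint8 RGB frames of interest.
            sentence: natural-language description of the target actor or object.
            returns:  (T, H, W) boolean masks, one segmentation map per frame.
            """
            t, h, w, _ = frames.shape
            # Placeholder output; a real model fuses visual and language features
            # before predicting per-pixel foreground probabilities.
            return np.zeros((t, h, w), dtype=bool)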
Highlights
  • In this paper, we tackle the video-language segmentation task
  • Given a video and a natural language description, the model is asked to generate pixel-level segmentation maps that segment the target object or actor in the frames of interest according to the description
  • We proposed a novel Polar Relative Positional Encoding mechanism along with a Polar Attention Module for video-language segmentation (an illustrative sketch follows this list)
  • We demonstrated the importance of spatial relations described in the sentence and the effectiveness of our proposed method
  • Existing methods only considered a short snippet around the frames of interest
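    For intuition only, below is a minimal sketch of how relative offsets between positions on a feature grid can be quantized into polar (distance, angle) bins and used to index a learned attention bias. The bin counts, the log-spaced distance quantization, and the bias-style injection into attention logits are our assumptions for illustration; the paper's actual Polar Relative Positional Encoding and Polar Attention Module may be formulated differently.

        import numpy as np

        def polar_relative_bins(h, w, num_dist_bins=8, num_angle_bins=8):
            """Quantize the relative offset between every pair of grid positions
            into polar coordinates: a distance bin and an angle bin."""
            ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
            coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float32)  # (N, 2)

            dy = coords[:, None, 0] - coords[None, :, 0]   # (N, N) row offsets
            dx = coords[:, None, 1] - coords[None, :, 1]   # (N, N) column offsets

            dist = np.sqrt(dx ** 2 + dy ** 2)
            angle = np.arctan2(dy, dx)                     # in [-pi, pi]

            # Log-spaced distance bins (an arbitrary choice) and uniform angle bins.
            max_dist = np.sqrt(h ** 2 + w ** 2)
            dist_bin = np.minimum(
                (np.log1p(dist) / np.log1p(max_dist) * num_dist_bins).astype(int),
                num_dist_bins - 1,
            )
            angle_bin = ((angle + np.pi) / (2 * np.pi) * num_angle_bins).astype(int) % num_angle_bins
            return dist_bin, angle_bin                     # each (N, N)

        # A learned table indexed by these bins could then be added to the attention
        # logits, e.g. logits = q @ k.T / sqrt(d) + bias_table[dist_bin, angle_bin].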
Methods
  • There are 3,036 training videos and 746 testing videos.
  • There are 5,359 training video-sentence pairs and 1,295 testing video-sentence pairs.
  • The metrics used to evaluate models on this dataset are precision at different IoU thresholds: 0.5, 0.6, 0.7, 0.8, and 0.9.
  • mAP, overall IoU, and mean IoU are also used for evaluation.
  • Overall IoU measures the total intersection area of all test data over the total union area.
  • Mean IoU averages the IoU over all test samples (see the metric sketch after this list).
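    As a reference, here is a small sketch of how these metrics can be computed from per-sample binary masks. The mAP convention used below (precision averaged over IoU thresholds 0.50:0.05:0.95) is a common choice but an assumption on our part; the benchmark's exact protocol may differ.

        import numpy as np

        def binary_iou(pred, gt):
            """IoU between two boolean masks of the same shape."""
            inter = np.logical_and(pred, gt).sum()
            union = np.logical_or(pred, gt).sum()
            return inter / union if union > 0 else 1.0

        def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
            """preds, gts: lists of boolean masks, one pair per test sample."""
            ious = np.array([binary_iou(p, g) for p, g in zip(preds, gts)])

            # Precision@K: fraction of samples whose IoU exceeds threshold K.
            precision = {t: float((ious > t).mean()) for t in thresholds}

            # Mean IoU: average of per-sample IoUs (each sample weighted equally).
            mean_iou = float(ious.mean())

            # Overall IoU: total intersection over total union across all samples
            # (large objects therefore contribute more than small ones).
            total_inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
            total_union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
            overall_iou = float(total_inter / total_union)

            # mAP: precision averaged over IoU thresholds 0.50:0.05:0.95 (assumed convention).
            m_ap = float(np.mean([(ious > t).mean() for t in np.linspace(0.5, 0.95, 10)]))

            return precision, mean_iou, overall_iou, m_ap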
Results
  • On A2D Sentences, the method outperforms the previous state-of-the-art method by a large margin, achieving an 11.4% absolute improvement in mAP.
Conclusion
  • The authors proposed a novel Polar Relative Positional Encoding mechanism along with a Polar Attention Module for video-language segmentation.
  • The authors demonstrated the importance of spatial relations described in the sentence and the effectiveness of the proposed method.
  • Many challenges remain in this task.
  • Existing methods only considered a short snippet around the frames of interest.
  • Long-term action relations still remain unexplored.
  • The authors leave this part as future work.
Tables
  • Table 1: Ablation studies on the A2D Sentences dataset
  • Table 2: Comparison with state-of-the-art methods on the A2D Sentences dataset
  • Table 3: Comparison with state-of-the-art methods on the J-HMDB Sentences dataset. * indicates that the method uses RGB+Flow visual input
Related work
  • 2.1 Action Recognition and Localization in Videos

    Action recognition is a fundamental research area in computer vision. Two-stream networks [Simonyan and Zisserman, 2014] and 3D ConvNets [Carreira and Zisserman, 2017; Tran et al., 2015] are the most popular models for video feature learning. At finer granularities, temporal localization [Jiang et al., 2014], spatio-temporal localization [Gu et al., 2018], and segmentation [Perazzi et al., 2016] are also important tasks for video analysis. Some recent works [Anne Hendricks et al., 2017; Gao et al., 2017] also try to localize video clips temporally according to a given natural language description.

    In this paper, we tackle the video-language segmentation task. The model needs to not only recognize the target object and its action but also extract visual information and relations described in the sentence.
Funding
  • This paper is supported by NSFC (61625107, 61751209)
Reference
  • [Anderson et al., 2018] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • [Anne Hendricks et al., 2017] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, 2017.
  • [Antol et al., 2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
  • [Bello et al., 2019] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In ICCV, 2019.
  • [Carreira and Zisserman, 2017] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • [De Vries et al., 2017] Harm De Vries, Florian Strub, Jeremie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C. Courville. Modulating early visual processing by language. In NeurIPS, 2017.
  • [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • [Gao et al., 2017] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In ICCV, 2017.
  • [Gavrilyuk et al., 2018] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees G. M. Snoek. Actor and action video segmentation from a sentence. In CVPR, 2018.
  • [Gu et al., 2018] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
  • [Hu et al., 2016a] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, 2016.
  • [Hu et al., 2016b] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In CVPR, 2016.
  • [Huang et al., 2019] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music Transformer: Generating music with long-term structure. In ICLR, 2019.
  • [Jhuang et al., 2013] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards understanding action recognition. In ICCV, 2013.
  • [Jiang et al., 2014] Yu-Gang Jiang, Jingen Liu, A. Roshan Zamir, George Toderici, Ivan Laptev, Mubarak Shah, and Rahul Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
  • [Johnson et al., 2017] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • [Li et al., 2017] Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Tracking by natural language specification. In CVPR, 2017.
  • [McIntosh et al., 2018] Bruce McIntosh, Kevin Duarte, Yogesh S. Rawat, and Mubarak Shah. Multi-modal capsule routing for actor and action video segmentation conditioned on natural language queries. arXiv preprint arXiv:1812.00303, 2018.
  • [Parmar et al., 2019] Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
  • [Perazzi et al., 2016] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
  • [Shaw et al., 2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL, 2018.
  • [Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
  • [Tran et al., 2015] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [Wang et al., 2018] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
  • [Wang et al., 2019] Hao Wang, Cheng Deng, Junchi Yan, and Dacheng Tao. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In ICCV, 2019.
  • [Xu et al., 2015] Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J. Corso. Can humans fly? Action understanding with multiple classes of actors. In CVPR, 2015.