Team Efficient Multi-Moments in Time Challenge 2019 Technical Report

Abstract
In this technical report, we briefly introduce the solution of our team 'Efficient' for the Multi-Moments in Time challenge at ICCV 2019. We first conduct several experiments with the popular image-based action recognition methods TSN, TRN, and TSM. Then a novel Temporal Interlacing Network (TIN) is proposed for fast and accurate recognition. In addition, the SlowFast network and its variants are explored. Finally, we ensemble all of the above models and achieve 67.22% on the validation set and 60.77% on the test set, which ranks 1st on the final leaderboard.

1. Image-Based Models

In this work, we have experimented with different 2D models, including TSN [13], TSM [9], TRN [15], and TIN. These methods all use 2D convolution kernels instead of 3D convolution kernels to capture temporal information. Their parameter counts and FLOPs are small compared to 3D-based models, but most of them do not match the accuracy of 3D networks. Minimal code sketches of the core temporal operations of these models, and of the SlowFast inputs of Sec. 2, are collected at the end of this report.

1.1. Temporal Segment Network

Temporal Segment Network (TSN [13]) is a framework for video-based action recognition. TSN samples a fixed number of sparse segments from each video to model long-term temporal structure; the final video-level prediction is the average of the logits of the individual clips. In this competition, we experimented with TSN using 5 evenly sampled segments per video.

1.2. Temporal Relational Network

Temporal Relational Network (TRN [15]) is a recognition framework that can model and reason about temporal dependencies between different segments of a video. The model is also designed to reason at multiple time scales. However, it did not work well in our attempts.

1.3. Temporal Shift Module

The Temporal Shift Module (TSM [9]) proposes an operator that shifts part of the channels along the temporal dimension. This operator helps the network fuse temporal information among neighboring frames. We experimented with the model using different backbones and input sequence lengths T.

1.4. Temporal Interlacing Network

In this work, we propose a Temporal Interlacing Network (TIN), which uses a network to model the relation between the shift distance and the specific input data. Whereas TSM can only shift the channels along the temporal dimension by +1 or -1, the differentiable module we designed can infer a suitable displacement for each channel group, as well as suitable weights for the feature map along the temporal dimension. Our proposed module has almost the same FLOPs and parameters as the original TSM model. Moreover, in our experiments comparing TSM and TIN, TIN obtained about 1%–2% better performance under the same training and testing configuration.

2. SlowFast-Based Models

In this part, we conduct experiments on the SlowFast [3] network. The SlowFast network has two pathways: a slow path that captures appearance content and a fast path that captures motion information. For details about the architecture, please refer to the original publication [3]. For this challenge, we train several SlowFast models and variants. Note that only RGB input is used, because optical flow extraction costs too much computation and storage. The models we select are:

(a) SlowFast, with Slow path 8 × 8 and Fast path 32 × 2, taking an input clip of 64 consecutive frames.
(b) Fast path only, 32 × 2, with no channel reduction. This model is quite heavy: it requires over 4× the computation of the standard SlowFast network.
(c) Slow path only, 8 × 8, which keeps only the slow path to capture appearance content.
(d) SlowFast, with Slow path 11 × 8 and Fast path 44 × 2. Since most videos have around 90 frames, this model is designed to capture information from the whole video.
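Below we collect minimal, illustrative code sketches of the operations described above. They are simplified renderings under stated assumptions, not the exact implementations used in our experiments. First, the sparse sampling and logit averaging of TSN (Sec. 1.1); the 5-segment setting matches the report, while the function names and the toy class count are our own.

```python
import torch

def sample_segment_indices(num_frames, num_segments=5):
    # Split the video into equal-length segments and take the centre frame of
    # each, so a fixed number of sparse frames spans the whole video.
    seg = num_frames / num_segments
    return [int(seg * i + seg / 2) for i in range(num_segments)]

def video_prediction(segment_logits):
    # segment_logits: (num_segments, num_classes), one row per sampled frame.
    # The video-level prediction is the average of the per-segment logits.
    return segment_logits.mean(dim=0)

print(sample_segment_indices(150))                 # -> [15, 45, 75, 105, 135]
print(video_prediction(torch.randn(5, 10)).shape)  # toy 10-class example -> (10,)
```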
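Next, a single-scale sketch of the temporal relation reasoning in TRN (Sec. 1.2): every temporally ordered k-frame tuple of per-frame features is scored by a small MLP and the scores are averaged. The full model combines several such heads, one per scale k; the hidden size here is illustrative.

```python
import itertools
import torch
import torch.nn as nn

class TemporalRelation(nn.Module):
    def __init__(self, feat_dim, num_classes, k=3):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(k * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, feats):
        # feats: (T, feat_dim) per-frame features from a 2D backbone.
        # Score every temporally ordered k-frame tuple, then average the scores.
        scores = [self.mlp(feats[list(idx)].reshape(-1))
                  for idx in itertools.combinations(range(feats.size(0)), self.k)]
        return torch.stack(scores).mean(dim=0)
```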
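The core of TSM (Sec. 1.3) is a parameter-free shift: a fraction of the channels is moved one step forward in time, another fraction one step backward, and the rest stay in place. A sketch of the operator, following the description in the TSM paper:

```python
import torch

def temporal_shift(x, n_segment, fold_div=8):
    # x: (N*T, C, H, W) activations inside a 2D CNN, T frames per clip.
    nt, c, h, w = x.size()
    x = x.view(nt // n_segment, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # 1/8 of channels take features from frame t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # 1/8 of channels take features from frame t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels stay in place
    return out.view(nt, c, h, w)
```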
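The report does not spell out the architecture of TIN (Sec. 1.4), so the following is only a hypothetical sketch of the stated idea: instead of TSM's fixed ±1 shift, a small head predicts a bounded fractional offset per channel group, and the shift is applied by linearly interpolating between neighbouring frames so that the offset remains differentiable. The temporal weighting branch mentioned in Sec. 1.4 is omitted, and all names and sizes are our own.

```python
import torch
import torch.nn as nn

class DifferentiableShift(nn.Module):
    # Hypothetical sketch: learn a fractional temporal offset per channel group.
    def __init__(self, channels, n_segment, groups=4):
        super().__init__()
        self.n_segment = n_segment
        self.groups = groups
        self.offset_head = nn.Linear(channels, groups)  # one offset per group

    def forward(self, x):
        # x: (N*T, C, H, W) activations, T = self.n_segment frames per clip.
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        x = x.view(n, self.n_segment, c, h, w)
        ctx = x.mean(dim=(1, 3, 4))                     # pooled context, (N, C)
        offset = torch.tanh(self.offset_head(ctx))      # (N, groups), in (-1, 1)
        xg = x.view(n, self.n_segment, self.groups, c // self.groups, h, w)
        left = torch.roll(xg, shifts=1, dims=1)         # frame t-1 (wraps at ends for brevity)
        right = torch.roll(xg, shifts=-1, dims=1)       # frame t+1 (wraps at ends for brevity)
        a = offset.view(n, 1, self.groups, 1, 1, 1)
        # Interpolate towards t+1 for positive offsets and t-1 for negative ones,
        # so gradients flow back into the offset head.
        shifted = torch.where(a >= 0,
                              (1 - a) * xg + a * right,
                              (1 + a) * xg - a * left)
        return shifted.view(nt, c, h, w)
```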
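Finally, the frame sampling behind model (a) in Sec. 2. In the SlowFast notation, T × τ denotes T frames sampled with temporal stride τ, so Slow 8 × 8 and Fast 32 × 2 can both be drawn from the same clip of 64 consecutive frames:

```python
import torch

def slowfast_inputs(clip):
    # clip: (C, 64, H, W), 64 consecutive frames as in model (a).
    fast = clip[:, ::2]   # Fast path 32 x 2: 32 frames, stride 2
    slow = clip[:, ::8]   # Slow path 8 x 8: 8 frames, stride 8
    return slow, fast

slow, fast = slowfast_inputs(torch.randn(3, 64, 224, 224))
print(slow.shape, fast.shape)  # (3, 8, 224, 224) (3, 32, 224, 224)
```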