Convolutional Transformer with Similarity-based Boundary Prediction for Action Segmentation.

Action classification has made great progress, but segmenting and recognizing actions in long videos remains a challenging problem. Recently, Transformer-based models with strong sequence modeling ability have succeeded in many sequence modeling tasks. However, the lack of inductive bias and the difficulty of handling long video sequences limit the application of the Transformer to action segmentation. To explore the Transformer's potential for this task, we replace specific linear layers in the vanilla Transformer with dilated temporal convolutions, and we employ a sparse attention mechanism to reduce the time and space complexity of processing long video sequences. Moreover, training the model with only a frame-wise classification loss treats frames at action boundaries the same as frames in the middle of actions, so the learned features are not sensitive to boundaries. We therefore propose a new local log-context attention module to predict whether each frame lies at the beginning, middle, or end of an action. Since boundary frames resemble their neighboring frames of different classes, our similarity-based boundary prediction helps the model learn more discriminative features. Extensive experiments on three datasets demonstrate the effectiveness of our method.
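The abstract's key architectural change is swapping some linear layers in the Transformer block for dilated temporal convolutions, which inject a temporal inductive bias and enlarge the receptive field over frame sequences. Below is a minimal NumPy sketch of a same-padded dilated 1D convolution over a sequence of frame features; the function name, shapes, and layer placement are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def dilated_temporal_conv(x, w, dilation):
    """Same-padded dilated 1D convolution over time.

    x: (T, C) frame features; w: (K, C, C) kernel of size K.
    Illustrative sketch of a dilated temporal convolution layer of
    the kind the abstract says replaces linear layers in the
    Transformer block (exact placement may differ in the paper).
    """
    T, C = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, C))
    for k in range(K):
        # each kernel tap looks `k * dilation` frames ahead in the padded input
        out += xp[k * dilation : k * dilation + T] @ w[k]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))     # 8 frames, 4 channels
w = rng.standard_normal((3, 4, 4))  # kernel size 3
y = dilated_temporal_conv(x, w, dilation=2)
print(y.shape)  # (8, 4): sequence length is preserved
```

With kernel size K and dilation d, each output frame aggregates a temporal window of (K - 1) * d + 1 frames, so stacking layers with growing dilation covers long videos without the quadratic cost of full self-attention.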
Computer Vision, Video Action Segmentation, Transformer, Temporal Convolutional Neural Network