Graduate University of the Chinese Academy of Sciences


Abstract
The Informedia group participated in three tasks this year: Multimedia Event Detection (MED), Semantic Indexing (SIN) and Surveillance Event Detection (SED). All of these tasks consist of three main steps: feature extraction, detector training and fusion. In the feature extraction step, we extracted a wide range of low-level features, high-level features and text features. In particular, we used the Spatial Pyramid Matching technique to represent low-level visual local features such as SIFT and MoSIFT, so that the location information of the feature points is preserved. In the detector training step, besides the traditional SVM, we proposed a Sequential Boosting SVM classifier to deal with the large-scale unbalanced classification problem. In the fusion step, to take advantage of the different features, we tried three fusion methods: early fusion, late fusion and double fusion, where double fusion is a combination of early fusion and late fusion. The experimental results demonstrate that double fusion is consistently better than, or at least comparable to, early fusion and late fusion.

1 Multimedia Event Detection (MED)

1.1 Feature Extraction

In order to cover all aspects of a video, we extracted a wide variety of visual and audio features, as listed in Table 1.

Table 1: Features used for the MED task.
• Low-level visual features: SIFT [19], Color SIFT [19], Transformed Color Histogram [19], Motion SIFT [3], STIP [9]
• Low-level audio features: Mel-Frequency Cepstral Coefficients
• High-level visual features: PittPatt Face Detection [12], Semantic Indexing Concepts [15]
• High-level audio features: Acoustic Scene Analysis
• Text features: Optical Character Recognition, Automatic Speech Recognition

1.1.1 SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH)

These three features describe the gradient and color information of a static image. We used the Harris-Laplace detector for corner detection; for more details, please see [19]. Instead of extracting features from all frames of all videos, we first ran shot-break detection and extracted features only from the keyframe of each shot. The shot-break detection algorithm measures the color histogram difference between adjacent frames, and a shot boundary is declared when the difference exceeds a threshold. For the 16,507 training videos we extracted 572,881 keyframes, and for the 32,061 testing videos we extracted 1,035,412 keyframes. Given the keyframes, we extracted the three features with the executable provided by [19]. From the raw feature files, a 4096-word codebook was built with the K-Means clustering algorithm. Using the codebook, any region of an image can be represented as a 4096-dimensional bag-of-words vector. Following the Spatial Pyramid Matching [10] technique, we extract 8 regions from each keyframe image and compute a bag-of-words vector for each region, yielding an 8 × 4096 = 32768-dimensional bag-of-words vector. The 8 regions are constructed as follows:
• The whole image as one region.
• The image split into 4 quadrants, each quadrant being a region.
• The image split horizontally into 3 equally sized rectangles, each rectangle being a region.

Since these feature vectors describe individual keyframes while a video consists of many keyframes, we represent a whole video by averaging the feature vectors of its keyframes. The resulting features are then passed to a classifier.
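The paper does not specify the histogram representation, distance measure or threshold used for the shot-break detection in Section 1.1.1, so the following is only a minimal sketch of the general idea, assuming OpenCV for video decoding and illustrative values for the bin count and threshold.

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path, threshold=0.4, bins=32):
    """Flag frames whose color histogram differs strongly from the previous
    frame; each flagged frame starts a new shot. The threshold and bin count
    are illustrative values, not the ones used in the paper."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Per-channel color histogram, normalized to sum to 1.
        hist = np.concatenate([
            cv2.calcHist([frame], [c], None, [bins], [0, 256]).ravel()
            for c in range(3)
        ])
        hist /= hist.sum() + 1e-8
        if prev_hist is not None:
            # Large histogram difference between adjacent frames => shot boundary.
            if np.abs(hist - prev_hist).sum() > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```

A keyframe can then be taken from each detected shot (for example, its middle frame); the paper does not state which frame of a shot was chosen.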
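Likewise, a minimal sketch of the 1 + 4 + 3 spatial-pyramid bag-of-words pooling and the keyframe averaging described in Section 1.1.1. The function names and the interface (precomputed keypoint locations and codeword assignments) are assumptions made for illustration, not part of the original system.

```python
import numpy as np

def spm_bow(points, words, width, height, vocab=4096):
    """Spatial-pyramid bag-of-words for one keyframe.
    points: (N, 2) NumPy array of (x, y) keypoint locations.
    words:  (N,) NumPy int array of codeword indices in [0, vocab).
    Returns an 8 * vocab dimensional vector: whole image, 2x2 quadrants,
    and 3 equal horizontal stripes."""
    x, y = points[:, 0], points[:, 1]
    col = np.minimum((2 * x / width).astype(int), 1)      # left / right half
    row = np.minimum((2 * y / height).astype(int), 1)     # top / bottom half
    stripe = np.minimum((3 * y / height).astype(int), 2)  # horizontal thirds
    regions = [np.ones(len(words), dtype=bool)]                               # whole image
    regions += [(col == c) & (row == r) for r in range(2) for c in range(2)]  # 4 quadrants
    regions += [stripe == s for s in range(3)]                                # 3 stripes
    hists = [np.bincount(words[m], minlength=vocab) for m in regions]
    return np.concatenate(hists).astype(np.float32)       # 8 * 4096 = 32768 dims

def video_feature(keyframe_features):
    """Average the per-keyframe vectors into one video-level vector."""
    return np.mean(keyframe_features, axis=0)
```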
1.1.2 Motion SIFT (MoSIFT)

Motion SIFT [3] is a motion-based feature that combines information from SIFT and optical flow. The algorithm first extracts SIFT points and, for each point, checks whether there is sufficiently large optical flow nearby. If the optical flow magnitude exceeds a threshold, a 256-dimensional descriptor is computed for that point: the first 128 dimensions are the SIFT descriptor and the remaining 128 dimensions describe the optical flow around the point. We computed the optical flow between neighboring frames but, for speed reasons, extracted Motion SIFT only for every third frame. From the raw features, a 4096-word codebook is built and, using the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.3 Space-Time Interest Points (STIP)

Space-Time Interest Points are computed with the code from [9]. From the raw features, a 4096-word codebook is built and, using the same process as for SIFT, a 32768-dimensional vector is created for classification.

1.1.4 Semantic Indexing (SIN)

We applied detectors for the 346 semantic concepts of Semantic Indexing 11 to the MED keyframes; for details on how the models for the 346 concepts were created, please refer to Section 2. Given the prediction scores of each concept on each keyframe, we compute a 346-dimensional feature representing the video, where the value of each dimension is the mean of that concept's prediction scores over all keyframes of the video. We tried different score merging techniques, including mean and max pooling, and mean pooling gave the best performance. These features are then passed to a classifier.
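A minimal sketch of the concept-score pooling described in Section 1.1.4, assuming the per-keyframe prediction scores are already available as a NumPy array; the function name and interface are illustrative.

```python
import numpy as np

def video_concept_feature(keyframe_scores, pooling="mean"):
    """Pool per-keyframe concept scores into one 346-dimensional video feature.
    keyframe_scores: (num_keyframes, 346) array of concept prediction scores.
    The paper reports that mean pooling worked better than max pooling."""
    if pooling == "mean":
        return keyframe_scores.mean(axis=0)
    if pooling == "max":
        return keyframe_scores.max(axis=0)
    raise ValueError(f"unknown pooling: {pooling}")
```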