Weakly supervised learning of object segmentations from web-scale video

ECCV Workshops (1), pp. 198–208, 2012.

Keywords: pre-trained object detector, pixel-level object mask, object segmentation, pixel-level segmentation, pixel-level annotation

Abstract:

We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors.

Introduction
  • The authors believe that internet videos, with their potentially noisy tags, can provide sufficient weak supervision to learn models of visual concepts.
  • The authors' goal is to learn models that can perform pixel-level spatiotemporal segmentation of objects (e.g., “dog”) when trained only using video-level tags.
  • The labels are video-level, while the evaluation is on a spatiotemporal segmentation task with pixel-level error metrics, such as the precision/recall of pixel masks for a concept, measured on a set of manually annotated ground-truth videos.
  • The proposed methods should be capable of scaling, both in the number of training videos and the number of object classes that the authors recognize.
Highlights
  • We are motivated by the question: What could a computer learn about the real world solely from watching large quantities of internet video? We believe that internet videos, with their potentially noisy tags, can provide sufficient weak supervision to learn models of visual concepts
  • The energy function is efficiently minimized using [22] for each frame in the test video (a minimal graph-cut sketch follows this list)
  • We present both qualitative and quantitative evaluations of our method on a large corpus of partially ground-truthed internet video
  • Our dataset consists of full-length internet videos that are several minutes in length and contain multiple shots
  • A set of test videos from different classes has been manually annotated, yielding a ground-truth set of approximately 50,000 frames from which precision/recall curves are generated
  • Our current framework implicitly uses segment-level loss whereas the evaluation is at the pixel level; directly optimizing the latter is worth exploring
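Below is a minimal sketch of the per-frame refinement step. It assumes the PyMaxflow library as a stand-in for the graph-cut solver of [22] (with binary labels a single min-cut suffices), per-pixel foreground probabilities derived from the segment-level classifier scores, and an illustrative uniform smoothness weight; none of this is the authors' implementation.

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def refine_frame(fg_prob, smoothness=2.0):
    """Graph-cut refinement of one frame's per-pixel foreground probabilities.

    fg_prob: HxW array in (0, 1), e.g. segment-classifier scores splatted
             onto pixels. Returns a boolean HxW mask (True = foreground).
    """
    eps = 1e-6
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(fg_prob.shape)
    # Pairwise term: uniform 4-connected smoothness penalty.
    g.add_grid_edges(nodes, smoothness)
    # Unary terms as negative log-likelihoods. With this assignment,
    # nodes that end up in the sink segment pay the source capacity
    # (-log fg_prob), so the sink segment is the foreground.
    g.add_grid_tedges(nodes, -np.log(fg_prob + eps), -np.log(1.0 - fg_prob + eps))
    g.maxflow()
    return g.get_grid_segments(nodes)
```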
Methods
  • The authors process each of the videos in the training set as follows to ensure uniformity. First, the authors scale each video to a consistent width of 240 pixels, maintaining its original aspect ratio (see the rescaling sketch after this list).
  • The authors perform video stabilization [11, 13] to reduce camera motion that could corrupt motion features and shapes of spatiotemporal segments.
  • The authors distribute the job of video stabilization, spatiotemporal segmentation and feature extraction for each video to different machines using the MapReduce framework.
  • The authors are able to process their 20,000 videos using a cluster of 5,000 nodes in less than 30 hours.
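As a concrete illustration of the rescaling step, here is a minimal OpenCV sketch; the paper does not specify its implementation, so the use of cv2 and the function name are assumptions.

```python
import cv2  # OpenCV

def rescaled_frames(video_path, target_width=240):
    """Yield the frames of a video rescaled to a fixed width,
    preserving the original aspect ratio (illustrative only)."""
    cap = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            h, w = frame.shape[:2]
            new_h = int(round(h * target_width / float(w)))
            yield cv2.resize(frame, (target_width, new_h),
                             interpolation=cv2.INTER_AREA)
    finally:
        cap.release()
```

Each per-video job (rescaling, stabilization, segmentation, feature extraction) is independent of the others, which is what makes the MapReduce-style fan-out across machines straightforward.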
Results
  • The authors present both qualitative and quantitative evaluations of the method on a large corpus of partially ground-truthed internet video.
  • Additional results examining the role of different features, the type of video over-segmentation, and comparisons with other weakly supervised classifiers are omitted here due to space limitations.
  • A set of test videos from different classes has been manually annotated, yielding a ground-truth set of approximately 50,000 frames from which precision/recall curves are generated (a minimal sketch of this evaluation follows this list).
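A pixel-level precision/recall curve can be computed as follows. This is a minimal sketch assuming binary NumPy ground-truth masks and a real-valued per-pixel score map; the function name and threshold sweep are illustrative, not the authors' code.

```python
import numpy as np

def pixel_precision_recall(scores, gt_mask, threshold):
    """Precision/recall of a thresholded per-pixel score map
    against a binary ground-truth mask for one concept."""
    pred = scores >= threshold
    tp = np.logical_and(pred, gt_mask).sum()
    precision = tp / max(pred.sum(), 1)   # guard against empty predictions
    recall = tp / max(gt_mask.sum(), 1)
    return precision, recall

# Illustrative usage with stand-in data; in the evaluation, the scores
# and masks would come from the learned model and the ~50,000
# annotated frames. Sweeping the threshold traces out the curve.
scores = np.random.rand(180, 240)
gt_mask = np.zeros((180, 240), dtype=bool)
gt_mask[60:120, 80:160] = True
curve = [pixel_precision_recall(scores, gt_mask, t)
         for t in np.linspace(0.0, 1.0, 21)]
```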
Conclusion
  • This paper proposes the idea of learning spatiotemporal object models, with minimal supervision, from large quantities of weakly and noisily tagged video.
  • Since the authors are the first to tackle this problem at large scale, they conduct an evaluation of several computationally scalable approaches to weakly supervised learning.
  • The authors' current framework implicitly uses a segment-level loss whereas the evaluation is at the pixel level; directly optimizing the latter is worth exploring.
  • The authors plan to use the object segmentation masks as strongly supervised training data for traditional object detectors in both the image and video domains.
Summary
  • Objectives:

    The authors aim to learn concept models from raw, full-length internet videos containing multiple scenes and several topics.
  • Given a list of segment “seeds” for a video, that is, the segments scored highly by the weakly supervised classifiers, the goal is to refine these into object masks using both appearance and spatial consistency; a sketch of such a seeding classifier follows this list.
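As one hypothetical illustration of a computationally scalable seeding baseline, the snippet below trains a linear segment-level classifier by naively transferring video-level tags to segments (every segment of a tagged video is treated as a noisy positive). It uses scikit-learn's LinearSVC, which wraps LIBLINEAR [20]; the function name, feature layout, and tag-transfer rule are assumptions for illustration, not the authors' exact method.

```python
import numpy as np
from sklearn.svm import LinearSVC  # wraps LIBLINEAR [20]

def train_seed_classifier(segment_features, segment_video_ids, video_tags, concept):
    """Train a segment-level classifier for one concept from video-level tags.

    segment_features:  (n_segments, n_dims) array of per-segment descriptors
    segment_video_ids: length-n_segments list mapping each segment to its video
    video_tags:        dict video_id -> set of tags
    concept:           tag to learn, e.g. "dog"
    """
    # Noisy label transfer: every segment inherits its video's tag.
    y = np.array([concept in video_tags[v] for v in segment_video_ids], dtype=int)
    clf = LinearSVC(C=1.0)
    clf.fit(segment_features, y)
    return clf

# Segments whose decision value exceeds a threshold become the object
# "seeds" that the graph-cut stage then refines into pixel masks.
```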
Tables
  • Table 1: Summary of weakly supervised internet video dataset
Related work
  • The area of learning visual concepts from weakly supervised video is still in its infancy. Ramanan et al. [1] construct a single part-based animal model from video. Ommer et al. [2] learn from controlled, hand-recorded video and classify at the frame level. Ali et al. [3] build an appearance model from a single video. Leistner et al. [4] employ weak video primarily to regularize detectors trained using images. Our work is closest in spirit to recent work by Prest et al. [5], which trains on a combination of fully annotated images and manually curated labeled video; the task we address is more extreme as we learn exclusively under weak supervision from raw video with noisy labels.
References
  • Ramanan, D., Forsyth, D., Barnard, K.: Building models of animals from video. PAMI 28 (2006)
  • Ommer, B., Mader, T., Buhmann, J.: Seeing the objects behind the dots: Recognition in videos from a moving camera. IJCV 83 (2009)
  • Ali, K., Hasler, D., Fleuret, F.: FlowBoost—Appearance learning from sparsely annotated video. In: CVPR. (2011)
  • Leistner, C., Godec, M., Schulter, S., Saffari, A., Werlberger, M., Bischof, H.: Improving classifiers with unlabeled weakly-related videos. In: CVPR. (2011)
  • Prest, A., Leistner, C., Civera, J., Schmid, C., Ferrari, V.: Learning object class detectors from weakly annotated video. In: CVPR. (2012)
  • Kalal, Z., Matas, J., Mikolajczyk, K.: P-N Learning: Bootstrapping binary classifiers by structural constraints. In: CVPR. (2010)
  • Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: ICCV. (2007)
  • Niebles, J.C., Han, B., Ferencz, A., Fei-Fei, L.: Extracting moving people from internet videos. In: ECCV. (2008)
  • Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities. In: ICCV. (2011)
  • Xiao, J., Shah, M.: Motion layer extraction in the presence of occlusion using graph cuts. PAMI 27 (2005) 1644–1659
  • Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: ECCV. (2010)
  • Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: CVPR. (2011)
  • Grundmann, M., Kwatra, V., Essa, I.: Auto-directed video stabilization with robust L1 optimal camera paths. In: CVPR. (2011)
  • Zha, Z.J., Hua, X.S., Mei, T., Wang, J., Qi, G.J., Wang, Z.: Joint multi-label multi-instance learning for image classification. In: CVPR. (2008)
  • Viola, P., Platt, J., Zhang, C.: Multiple instance boosting for object detection. In: NIPS. (2005)
  • Chen, Y., Bi, J., Wang, J.: MILES: Multiple-instance learning via embedded instance selection. PAMI 28 (2006) 1931–1947
  • Ren, X., Gu, C.: Figure-ground segmentation improves handled object recognition in egocentric video. In: CVPR. (2010)
  • Duchenne, O., Laptev, I., Sivic, J., Bach, F., Ponce, J.: Automatic annotation of human actions in video. In: ICCV. (2009)
  • Liu, D., Hua, G., Chen, T.: A hierarchical visual model for video object summarization. PAMI 32 (2010) 2178–2190
  • Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. JMLR 9 (2008) 1871–1874
  • Duchi, J., Singer, Y.: Boosting with structural sparsity. In: ICML. (2009)
  • Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. PAMI 23 (2001) 1222–1239
  • Ojala, T., et al.: Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In: ICPR. (1994)
  • Wang, X., Han, T.: An HOG-LBP human detector with partial occlusion handling. In: ICCV. (2009)
  • Chaudhry, R., et al.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems. In: CVPR. (2009)