BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR, pp. 2633-2642, 2020.

Keywords: identity switch, autonomous driving, diverse dataset, multiple visual domain, multiple object tracking

Abstract:

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving. Researchers are usually constrained to study a small set of problems on one dataset, while real-world computer vision applications require performing tasks of various complexities.

Introduction
  • Large-scale annotated visual datasets, such as ImageNet [8] and COCO [19], have been the driving force behind recent advances in supervised learning tasks in computer vision.
  • Typical deep learning models can require millions of training examples to achieve state-of-the-art performance for a task [17, 28, 16].
  • For autonomous driving applications, however, leveraging the power of deep learning is not as straightforward, due to the lack of comprehensive datasets.
  • Existing datasets for autonomous driving [15, 7, 24] are limited in one or more significant aspects, including the scene variation, the richness of annotations, and the geographic distribution.
  • Models trained on existing datasets tend to overfit specific domain characteristics [26].
Highlights
  • Diverse, large-scale annotated visual datasets, such as ImageNet [8] and COCO [19], have been the driving force behind recent advances in supervised learning tasks in computer vision
  • We further provide a multiple object tracking and segmentation (MOTS) dataset with 90 videos
  • We aim to improve the performance on the task of multiple object tracking and segmentation by leveraging the diversity from the detection set with 70K images from 70K videos, the multiple object tracking set with 278K frames from 1,400 videos, and the instance segmentation set with 7K images from 7K videos
  • We report instance segmentation AP and multi-object tracking and segmentation accuracy (MOTSA), precision (MOTSP), and other metrics used by [31] in Table 9; the main metric definitions are recalled after this list
  • We presented BDD100K, a large-scale driving video dataset with extensive annotations for heterogeneous tasks
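For reference, the MOTS metrics reported in Table 9 follow the definitions of [31]. In brief, with TP, FP, and IDS denoting true-positive masks, false positives, and identity switches, M the set of ground-truth masks, and the soft true-positive count summing mask IoUs over matched hypotheses, the metrics are (a recap of [31]'s definitions, not a new formulation):

```latex
\mathrm{MOTSA} = \frac{|TP| - |FP| - |IDS|}{|M|}, \qquad
\mathrm{MOTSP} = \frac{\widetilde{TP}}{|TP|}, \qquad
\mathrm{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|},
\quad \text{where } \widetilde{TP} = \sum_{h \in TP} \mathrm{IoU}\big(h, c(h)\big)
```

and c(h) is the ground-truth mask matched to hypothesis h.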
Results
  • The authors aim to improve the performance on the task of MOTS by leveraging the diversity from the detection set with 70K images from 70K videos, the MOT set with 278K frames from 1,400 videos, and the instance segmentation set with 7K images from 7K videos.
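The joint training described above mixes annotation sets with different label structures. The sketch below illustrates one way such training could be wired up: mini-batches are drawn in round-robin fashion from the detection, MOT, and instance segmentation splits, and only the loss terms each batch supervises are back-propagated. This is a minimal sketch under assumed interfaces (the task-conditioned model, the data loaders, and the schedule are hypothetical), not the authors' released implementation.

```python
import itertools


def joint_train(model, optimizer, det_loader, mot_loader, seg_loader, steps=10000):
    """Minimal sketch of joint training over heterogeneous annotation sets.

    Each loader yields (images, targets) with its own label structure:
    boxes for detection, boxes plus track IDs for MOT, and masks for
    instance segmentation. The (hypothetical) model returns a dict of
    loss terms for the heads supervised by the current batch.
    """
    iterators = {
        "det": itertools.cycle(det_loader),
        "mot": itertools.cycle(mot_loader),
        "seg": itertools.cycle(seg_loader),
    }
    schedule = itertools.cycle(["det", "mot", "seg"])  # simple round-robin over tasks

    for _ in range(steps):
        task = next(schedule)
        images, targets = next(iterators[task])
        losses = model(images, targets, task=task)  # e.g. {"box": ..., "mask": ...}
        total_loss = sum(losses.values())
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
```

The sampling ratio between the sets is a natural knob here; a strict round-robin is only the simplest choice.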
Conclusion
  • The authors presented BDD100K, a large-scale driving video dataset with extensive annotations for heterogeneous tasks.
  • The authors built a benchmark for heterogeneous multitask learning where the tasks have various prediction structures and serve different aspects of a complete driving system.
  • The results presented interesting findings about allocating the annotation budgets in multitask learning.
  • The authors hope the work can foster future studies on heterogeneous multitask learning and shed light on this important direction.
Tables
  • Table 1: Lane marking statistics. Our lane marking annotations are significantly richer and more diverse
  • Table 2: MOT dataset statistics for the training and validation sets. Our dataset has more sequences, frames, and identities, as well as more box annotations
  • Table 3: Comparisons with other MOTS and VOS datasets
  • Table 4: Domain discrepancy experiments with object detection. We take the images from one domain and report testing results in AP on the same domain or the opposite domain. We observe significant domain discrepancies, especially between daytime and nighttime (a sketch of this evaluation protocol follows the table list)
  • Table 5: Evaluation results of homogeneous multitask learning on lane marking and drivable area segmentation. We train lane marking, drivable area segmentation, and the joint training of both on training splits with 10K, 20K, and the full 70K images
  • Table 6: Evaluation results for instance segmentation when jointly trained with the object detection set. Additional localization supervision can improve instance segmentation significantly
  • Table 7: Evaluation results for multiple object tracking cascaded with object detection. AP is the detection metric. Even though the tracking set has many more boxes, the model can still benefit from the diverse instance examples in the detection set
  • Table 8: Evaluation results for semantic segmentation. We explore joint training of segmentation with different tasks. Detection can improve the overall accuracy of segmentation, although their output structures are different. However, although lane marking and drivable area improve the segmentation of road and sidewalk, the overall accuracy drops
  • Table 9: MOTS evaluation results. Both instance segmentation AP and MOTS evaluation metrics are reported. Instance segmentation tracking is very hard to label, but we are able to use object detection, tracking, and instance segmentation annotations to improve segmentation tracking accuracy significantly
  • Table 10: Comparisons of the number of pedestrians with other datasets. The statistics are based on the training set of each dataset
  • Table 11: Annotations of the BDD100K MOT dataset by category
  • Table 12: Annotations of the BDD100K MOTS dataset by category
  • Table 13: Full evaluation results of the domain discrepancy experiments with object detection
  • Table 14: Full evaluation results of the individual lane marking task and of the joint training of lane marking and drivable area detection. We report ODS-F scores at thresholds τ = 1, 2, 10 pixels for direction and continuity, as well as for each category
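The cross-domain protocol behind Tables 4 and 13 is simple: train a detector on images from a single domain (for example daytime) and report AP on held-out images from the same domain and from the opposite domain (for example nighttime). A minimal sketch, with hypothetical train_fn and eval_fn callables standing in for detector training and AP evaluation:

```python
def domain_discrepancy(splits, train_fn, eval_fn):
    """Cross-domain AP matrix (sketch).

    splits:   e.g. {"daytime": (train_set, val_set), "nighttime": (train_set, val_set)}
    train_fn: trains a detector on one domain's training set and returns it
    eval_fn:  returns detection AP of a detector on a validation set
    """
    results = {}
    for source, (train_set, _) in splits.items():
        detector = train_fn(train_set)  # train on a single domain only
        for target, (_, val_set) in splits.items():
            # evaluate on the same domain and on the opposite domain
            results[(source, target)] = eval_fn(detector, val_set)
    return results
```

A large gap between results[("daytime", "daytime")] and results[("daytime", "nighttime")] is what the tables report as domain discrepancy.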
Related work
  • Visual datasets are necessary for numerous recognition tasks in computer vision. Especially with the advent of deep learning methods, large-scale visual datasets, such as [8, 36, 40, 24], are essential for learning high-level image representations. They are general-purpose and include millions of images with image-level categorical labels. These large datasets are useful in learning representations for image recognition, but most of the complex visual understanding tasks in the real world require more fine-grained recognition such as object localization and segmentation [11]. Our proposed dataset provides these multi-granularity annotations for more in-depth visual reasoning. In addition, we provide these annotations in the context of videos, which provides an additional dimension of visual information. Although large video datasets exist [5, 2, 29], they are usually restricted to image-level labels.
  • Driving datasets have received increasing attention in recent years, due to the popularity of autonomous vehicle technology. The goal is to understand the challenges of computer vision systems in the context of self-driving. Some of the datasets focus on particular objects such as pedestrians [9, 39]. Cityscapes [7] provides instance-level semantic segmentation on sampled frames of videos collected by their own vehicle. RobotCar [20] and KITTI [15] also provide data from multiple sources such as LiDAR scans. Because it is very difficult to collect data that covers a broad range of time and location, the data diversity of these datasets is limited. For a vehicle perception system to be robust, it needs to learn from a variety of road conditions in numerous cities. Our data was collected from the same original source as the videos in [33]. However, the primary contribution of our paper is the video annotations with benchmarks on heterogeneous tasks. Mapillary Vistas [24] provides fine-grained annotations for user-uploaded data, which is much more diverse with respect to location. However, these images are one-off frames that are not placed in the context of videos with temporal structure. Like Vistas, our data is crowdsourced; however, our dataset is collected solely from drivers, with each annotated image corresponding to a video sequence, which enables interesting applications for modeling temporal dynamics.
Findings
  • Our experiments show that special training strategies are needed for existing models to perform such heterogeneous tasks
  • Our experiments present many new findings, made possible by the diverse set of tasks on a single dataset
  • The dataset aims to provide large-scale, diverse driving videos with comprehensive annotations that can expose the challenges of street-scene understanding
  • Image tagging classification results using DLA-34 are provided in Figure 4
  • Bounding box annotations of 10 categories are provided for each of the reference frames of the 100K videos
References
  • Robust Vision Challenge. http://www.robustvision.net/.
  • S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
  • D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In CVPR, 2018.
  • M. Aly. Real time detection of lane markers in urban streets. In Intelligent Vehicles Symposium, pages 7–12, 2008.
  • F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
  • R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 304–311. IEEE, 2009.
  • P. Dollar and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In ICCV, 2017.
  • J. Fritsch, T. Kuhnl, and A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In 16th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1693–1700. IEEE, 2013.
  • A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • S. Lee, J. Kim, J. S. Yoon, S. Shin, O. Bailo, N. Kim, T.-H. Lee, H. S. Hong, S.-H. Han, and I. S. Kweon. VPGNet: Vanishing point guided network for lane and road marking detection and recognition. In IEEE International Conference on Computer Vision (ICCV), pages 1965–1973. IEEE, 2017.
  • T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
  • B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
  • A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • T. M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research, 1980.
  • G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In International Conference on Computer Vision (ICCV), 2017.
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.
  • S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. arXiv preprint arXiv:1912.04838, 2019.
  • P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7942–7951, 2019.
  • T. Wu and A. Ranganathan. A practical system for road marking detection and recognition. In Intelligent Vehicles Symposium, pages 25–30, 2012.
  • H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. arXiv preprint, 2017.
  • N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
  • F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
  • A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • S. Zhang, R. Benenson, and B. Schiele. CityPersons: A diverse dataset for pedestrian detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
  • Y. Zhou, X. Wang, J. Jiao, T. Darrell, and F. Yu. Learning saliency propagation for semi-supervised instance segmentation. In Computer Vision and Pattern Recognition (CVPR), 2020.