Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
IEEE Transactions on Intelligent Transportation Systems (2023)
Abstract
Recognizing the activities causing distraction in real-world driving
scenarios is critical for ensuring the safety and reliability of both drivers
and pedestrians on the roadways. Conventional computer vision techniques are
typically data-intensive and require a large volume of annotated training data
to detect and classify various distracted driving behaviors, thereby limiting
their efficiency and scalability. We aim to develop a generalized framework
that showcases robust performance with access to limited or no annotated
training data. Recently, vision-language models have offered large-scale
visual-textual pretraining that can be adapted to task-specific learning like
distracted driving activity recognition. Vision-language pretraining models,
such as CLIP, have shown significant promise in learning natural
language-guided visual representations. This paper proposes a CLIP-based driver
activity recognition approach that identifies driver distraction from
naturalistic driving images and videos. CLIP's visual embeddings support both
zero-shot transfer and task-specific fine-tuning, which we use to classify
distracted activities from driving video data. Our results show that both the
zero-shot transfer and the video-based CLIP frameworks achieve state-of-the-art
performance in predicting the driver's state on two public datasets. We propose
both frame-based and video-based frameworks built on top of CLIP's visual
representation for distracted-driving detection and classification tasks and
report their results.
Keywords
Distracted driving, computer vision, CLIP, vision-language model, zero-shot transfer, embedding
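
To illustrate the zero-shot transfer idea the abstract describes, below is a minimal sketch of CLIP-based distracted-activity classification on a single frame, using the Hugging Face `transformers` CLIP API. The class names and prompt template are illustrative assumptions, not the paper's exact label set or prompts.

```python
# Minimal sketch: zero-shot driver-state classification with CLIP.
# Class names and prompts below are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical driver-state classes (stand-ins for a naturalistic driving dataset's labels).
CLASS_NAMES = [
    "driving safely with both hands on the wheel",
    "texting on a phone while driving",
    "talking on a phone while driving",
    "drinking a beverage while driving",
    "reaching behind while driving",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def classify_frame(image_path: str) -> str:
    """Return the most likely driver state for one video frame via image-text similarity."""
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a photo of a driver {c}" for c in CLASS_NAMES]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the frame's similarity to each text prompt.
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    return CLASS_NAMES[int(probs.argmax())]

# Example usage on a single frame (the path is a placeholder):
# print(classify_frame("frame_0001.jpg"))
```

For a video-based variant, one plausible extension (not necessarily the paper's exact design) is to average the per-frame image embeddings over a clip, or to train a light temporal head on top of them, before comparing against the text embeddings.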