A team of researchers from the University of Wisconsin-Madison, NVIDIA, the University of Michigan, and Stanford University have developed a new vision-language model (VLM) called Dolphins. It is a conversational driving assistant that can process multimodal inputs to provide informed driving instructions. Dolphins are designed to address the complex driving scenarios faced by autonomous vehicles (AVs) and exhibit human-like features such as rapid learning, adaptation, error recovery, and interpretability during interactive conversations.

LLMs like DriveLikeHuman and GPT-Driver lack rich visual features for autonomous driving. Dolphins combine LLM reasoning with visual understanding, excelling in in-context learning and handling varied video inputs. Inspired by Flamingo’s multimodal in-context learning, Dolphins aligns with works enhancing instruction comprehension in multimodal language models through text-image interleaved datasets.

The study addresses the challenge of achieving full autonomy in vehicular systems, aiming to design AVs with human-like understanding and responsiveness in complex scenarios. Current data-driven and modular autonomous driving systems face various integration and performance issues. Dolphins, a VLM tailored for AVs, demonstrates advanced understanding, instant learning, and error recovery. Emphasizing interpretability for trust and transparency, Dolphins reduce the disparity between existing autonomous systems and human-like driving capabilities.

Dolphins use OpenFlamingo and GCoT to enhance reasoning. They ground VLMs in the AV context and develop fine-grained capabilities using real and synthetic AV datasets. They also create a multimodal in-context instruction tuning dataset for detailed conversation tasks.

Dolphins excel in solving diverse autonomous vehicle tasks with human-like capabilities such as instant adaptation and error recovery. They pinpoint precise driving locations, assess traffic status, and understand road agent behaviors. The model’s fine-grained capabilities result from being grounded in a general image dataset and fine-tuned within the specific context of autonomous driving. A multimodal in-context instruction tuning dataset contributes to their training and evaluation.

Dolphins showcase impressive holistic understanding and human-like reasoning in intricate driving scenarios. As a conversational driving assistant, it handles various AV tasks, excelling in interpretability and rapid adaptation. It acknowledges computational challenges, particularly in achieving high frame rates on edge devices and managing power consumption. Proposing customized and distilled model versions suggests a promising direction to balance computational demands with power efficiency. Continuous exploration and innovation are deemed essential for unlocking the full potential of AVs empowered by advanced AI capabilities like Dolphins.

Further exploration recommends computational efficiency, particularly in achieving high frame rates on edge devices and reducing power consumption for running advanced models in vehicles. Proposing the development of customized and distilled versions of VLMs, such as Dolphins, suggests a potential solution to balance computational demands with power efficiency. Emphasizing the critical role of VLMs in enabling autonomous driving and unlocking full AI potential in AVs.


Check out the Paper and ProjectAll credit for this research goes to the researchers of this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter..


Hello, My name is Adnan Hassan. I am a consulting intern at Marktechpost and soon to be a management trainee at American Express. I am currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I am passionate about technology and want to create new products that make a difference.




Source link