Monocular estimation of metric depth in general scenes would benefit applications such as autonomous driving and mobile robotics, but it has proven difficult to achieve. One challenge is that indoor and outdoor datasets have drastically different RGB and depth distributions. Another is the inherent scale ambiguity in images when the camera intrinsics are unknown. As a result, most existing monocular depth models either work only in indoor or only in outdoor settings, or estimate only scale-invariant depth when trained on both.
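
The scale ambiguity can be illustrated with a simple pinhole camera model. The sketch below is only an illustration, not code from the study: the function names, the 640-pixel image width, and the two hypothetical cameras are assumptions chosen for the example.

```python
import math


def projected_width_px(object_width_m, depth_m, focal_px):
    """Pinhole projection: width in pixels of a fronto-parallel object."""
    return focal_px * object_width_m / depth_m


def horizontal_fov_deg(image_width_px, focal_px):
    """Horizontal field of view implied by a focal length given in pixels."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_px)))


# Two hypothetical cameras looking at the same 2 m wide object.
near_wide = projected_width_px(object_width_m=2.0, depth_m=5.0, focal_px=500.0)
far_tele = projected_width_px(object_width_m=2.0, depth_m=10.0, focal_px=1000.0)

# Same width in the image, although the object is twice as far away
# in the second case.
print(near_wide, far_tele)  # 200.0 200.0

# The two cameras differ in field of view, which is the extra signal that
# tells the two cases apart.
print(horizontal_fov_deg(640, 500.0), horizontal_fov_deg(640, 1000.0))
```

Without knowledge of the focal length (or, equivalently, the field of view), the two situations produce the same image measurement, so metric depth cannot be recovered from pixels alone.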

Current metric depth models are typically trained on a single dataset collected with fixed camera intrinsics, such as an RGBD camera for indoor images or RGB+LIDAR for outdoor scenes. These datasets are usually limited to either indoor or outdoor scenes. Such models sacrifice generalizability to sidestep the problems caused by differences between indoor and outdoor depth distributions. They also generalize poorly to out-of-distribution data and overfit to the camera intrinsics of the training dataset.

The most common way to combine indoor and outdoor data in a single model is to estimate depth that is invariant to scale and shift rather than metric depth (e.g., MiDaS). Normalizing the depth distributions in this way brings the indoor and outdoor depth distributions closer together and removes the scale ambiguity caused by cameras with different intrinsics. More recently, training joint indoor-outdoor models that estimate metric depth has attracted considerable attention as a way to unify these approaches. ZoeDepth attaches two domain-specific heads to MiDaS, one for indoor and one for outdoor scenes, to convert scale-invariant depth to metric depth.
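
To make "invariant to scale and shift" concrete, here is a minimal sketch of how such a prediction is usually compared with ground truth: a scale and a shift are fitted by least squares before any error is measured. The function name and the toy values are assumptions made for illustration, and the sketch does not reproduce the MiDaS or ZoeDepth code.

```python
def align_scale_shift(pred, target):
    """Least-squares scale s and shift t minimizing sum((s * p + t - g) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pg = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, target))
    scale = cov_pg / var_p
    shift = mean_g - scale * mean_p
    return scale, shift


# Toy relative prediction and the metric ground truth it should map to.
relative_pred = [0.0, 0.5, 1.0, 2.0]
metric_gt = [1.0, 2.0, 3.0, 5.0]  # equals 2 * relative_pred + 1

scale, shift = align_scale_shift(relative_pred, metric_gt)
aligned = [scale * p + shift for p in relative_pred]
print(scale, shift)  # approximately 2.0 and 1.0
```

The fit needs ground-truth depth for the test image, which is why a scale-and-shift-invariant model cannot by itself report distances in meters.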

A new study from Google Research and Google DeepMind investigates denoising diffusion models for zero-shot metric depth estimation and, through several key innovations, achieves state-of-the-art performance. Specifically, field-of-view (FOV) augmentation is used during training to improve generalization to different camera intrinsics, and FOV conditioning is used during both training and inference to resolve the intrinsic scale ambiguity, which yields a further performance gain. The researchers also recommend representing depth in the log domain to make better use of the model's representation capacity. This allocates model capacity more evenly between indoor and outdoor scenes and improves indoor performance.
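
The effect of a log-domain depth representation can be sketched as follows. The depth range (0.5 m to 80 m), the mapping to [-1, 1], and the 10 m cutoff used to stand in for "indoor" depths are assumptions chosen for illustration and may differ from the exact settings used in the study.

```python
import math

D_MIN, D_MAX = 0.5, 80.0  # assumed depth range in meters (illustrative)


def _clamp(depth_m):
    return min(max(depth_m, D_MIN), D_MAX)


def encode_linear(depth_m):
    """Map depth linearly to [-1, 1]."""
    return 2.0 * (_clamp(depth_m) - D_MIN) / (D_MAX - D_MIN) - 1.0


def encode_log(depth_m):
    """Map log-depth linearly to [-1, 1]."""
    unit = math.log(_clamp(depth_m) / D_MIN) / math.log(D_MAX / D_MIN)
    return 2.0 * unit - 1.0


def decode_log(code):
    """Invert encode_log for codes in [-1, 1]."""
    unit = (code + 1.0) / 2.0
    return D_MIN * math.exp(unit * math.log(D_MAX / D_MIN))


def share_of_range(encode, lo_m, hi_m):
    """Fraction of the [-1, 1] output range used by depths in [lo_m, hi_m]."""
    return (encode(hi_m) - encode(lo_m)) / 2.0


# Share of the output range spent on depths up to 10 m.
print(share_of_range(encode_linear, 0.5, 10.0))  # roughly 0.12
print(share_of_range(encode_log, 0.5, 10.0))     # roughly 0.59
```

Under these assumed settings, a linear encoding spends only about an eighth of its output range on depths below 10 m, while the log encoding spends more than half, which is one way to read the claim that capacity is shared more evenly between indoor and outdoor scenes.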

The researchers also found that using v-parameterization for the denoising network significantly speeds up inference. The final model, DMD (Diffusion for Metric Depth), outperforms ZoeDepth, a recently proposed metric depth model. DMD is a simple yet effective approach to zero-shot metric depth estimation on general scenes. Specifically, when fine-tuned on the same data, DMD produces substantially lower relative depth error than ZoeDepth on all eight out-of-distribution datasets. Enlarging the training dataset improves the results further.
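
For readers unfamiliar with v-parameterization, the sketch below shows the standard definition from the diffusion literature on scalar values: instead of predicting the noise or the clean signal, the network predicts v = alpha * eps - sigma * x. The cosine noise schedule and the function names are assumptions for illustration; the article does not state which schedule DMD uses.

```python
import math


def alpha_sigma(t):
    """Assumed cosine schedule with alpha**2 + sigma**2 == 1 for t in [0, 1]."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def add_noise(x, eps, t):
    """Forward process: noisy sample z_t = alpha * x + sigma * eps."""
    alpha, sigma = alpha_sigma(t)
    return alpha * x + sigma * eps


def v_target(x, eps, t):
    """Training target under v-parameterization."""
    alpha, sigma = alpha_sigma(t)
    return alpha * eps - sigma * x


def x_from_v(z, v, t):
    """Recover the clean signal from the noisy sample and v."""
    alpha, sigma = alpha_sigma(t)
    return alpha * z - sigma * v


def eps_from_v(z, v, t):
    """Recover the noise from the noisy sample and v."""
    alpha, sigma = alpha_sigma(t)
    return sigma * z + alpha * v


# Toy scalar example: a clean value, a noise draw, and a timestep.
x, eps, t = 0.3, -1.2, 0.7
z = add_noise(x, eps, t)
v = v_target(x, eps, t)
print(x_from_v(z, v, t), eps_from_v(z, v, t))  # approximately 0.3 and -1.2
```

Because both the clean signal and the noise can be recovered from v at every noise level, the target stays well behaved at the extremes of the schedule. The sketch does not by itself demonstrate the inference speedup reported by the researchers.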

DMD sets a new state of the art in zero-shot metric depth estimation, with a relative error that is 25% lower on indoor datasets and 33% lower on outdoor datasets than ZoeDepth. It is also efficient, thanks to its use of v-parameterization for diffusion.
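
As a rough guide to reading these numbers, the sketch below computes a mean absolute relative depth error and the relative reduction between two error values. It assumes that "relative error" refers to the commonly used absolute relative error and that "25% lower" is a reduction relative to the baseline; the depth values are toy numbers and are not results from the study.

```python
def abs_rel_error(pred_m, gt_m):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred_m, gt_m) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    return (baseline_error - model_error) / baseline_error


# Toy depths in meters; 0.0 marks a pixel without ground truth.
gt = [2.0, 4.0, 10.0, 0.0]
pred = [2.2, 3.6, 11.0, 5.0]

error = abs_rel_error(pred, gt)
print(error)  # approximately 0.1, since each valid pixel is off by 10%

# Toy error values: going from 0.10 to 0.08 is a 20% relative reduction.
print(relative_reduction(0.10, 0.08))
```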




Dhanshree Shenwai is a computer science engineer with experience at FinTech companies in the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier.



