Boosting Guided Depth Super-Resolution Through Large Depth Estimation Model and Alignment-then-Fusion Strategy
Yuan-Lin Zhang1,2,#    Xin-Ni Jiang2,#    Chun-Le Guo1,*    Xiong-Xin Tang2,*    
Guo-Qing Wang3    Wei Li4    Xun Liu4    Chong-Yi Li1,*   
1Nankai University   2Chinese Academy of Sciences   3University of Electronic Science and Technology of China   4Beijing Institute of Space Mechanics and Electricity  

🔥 Powerful Super-Resolution Reconstruction Ability 🔥

We introduce a novel approach that utilizes a pseudo-depth map generated by a large pre-trained monocular depth estimation model, combined with an alignment-then-fusion strategy, to fully tap the potential of GDSR.
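
As a rough illustration of how such a pseudo-depth map could be obtained, here is a minimal sketch using the Hugging Face `transformers` depth-estimation pipeline with a Depth Anything V2 checkpoint; the checkpoint id, output handling, and normalization below are assumptions for illustration, not the paper's actual preprocessing code.

```python
# Sketch: generate a pseudo-depth map from an RGB image with a pre-trained
# monocular depth estimation model via the Hugging Face pipeline.
# The checkpoint id and post-processing are illustrative assumptions.
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint
)

rgb = Image.open("scene_rgb.png").convert("RGB")
result = depth_estimator(rgb)

# The pipeline returns a depth image already resized to the RGB resolution,
# so the pseudo-depth can serve directly as an HR guidance map.
pseudo_depth = np.array(result["depth"], dtype=np.float32)
pseudo_depth = (pseudo_depth - pseudo_depth.min()) / (
    pseudo_depth.max() - pseudo_depth.min() + 1e-8
)
Image.fromarray((pseudo_depth * 255).astype(np.uint8)).save("pseudo_depth.png")
```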

Abstract

Guided Depth Super-Resolution (GDSR) presents two primary challenges: the resolution gap between Low-Resolution (LR) depth maps and High-Resolution (HR) RGB images, and the modality gap between depth and RGB data. In this study, we leverage the powerful zero-shot capabilities of large pre-trained monocular depth estimation models to address these issues. Specifically, we use the output of monocular depth estimation as pseudo-depth to mitigate both gaps. The pseudo-depth map is aligned with the resolution of the RGB image and offers more detailed boundary information than the LR depth map, particularly at larger scales. Furthermore, pseudo-depth provides valuable relative positional information about objects, serving as a critical scene prior that enhances edge alignment and reduces texture over-transfer. However, effectively bridging the cross-modal differences between the guidance inputs (RGB and pseudo-depth) and the LR depth remains a significant challenge. To tackle this, we analyze the modality gap from three key perspectives: distribution misalignment, geometrical misalignment, and texture inconsistency. Based on these insights, we propose an alignment-then-fusion strategy and introduce a novel and efficient Dynamic Dual-Aligned and Aggregation Network (D2A2). By leveraging large pre-trained monocular depth estimation models, our approach achieves state-of-the-art performance on multiple benchmark datasets, excelling particularly in the challenging ×16 GDSR task.
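
The exact D2A2 architecture is not spelled out on this page; the following is a hypothetical PyTorch sketch of the alignment-then-fusion idea, in which the HR guidance (RGB plus pseudo-depth) is first aligned to the upsampled LR depth features and only then fused. All module names, layer sizes, and the modulation-based alignment are illustrative assumptions, not the actual D2A2 modules.

```python
# Hypothetical sketch of an "align-then-fuse" block for GDSR.
# Layer choices and the alignment mechanism are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignThenFuse(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.depth_enc = nn.Conv2d(1, channels, 3, padding=1)   # upsampled LR depth
        self.guide_enc = nn.Conv2d(4, channels, 3, padding=1)   # RGB (3) + pseudo-depth (1)
        # predict per-pixel scale/shift to align guidance features to depth statistics
        self.align = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, lr_depth, rgb, pseudo_depth):
        # 1) close the resolution gap: bring LR depth to the guidance resolution
        up_depth = F.interpolate(lr_depth, size=rgb.shape[-2:], mode="bicubic",
                                 align_corners=False)
        d = self.depth_enc(up_depth)
        g = self.guide_enc(torch.cat([rgb, pseudo_depth], dim=1))
        # 2) align: modulate guidance features conditioned on the depth features
        scale, shift = self.align(torch.cat([d, g], dim=1)).chunk(2, dim=1)
        g_aligned = g * torch.sigmoid(scale) + shift
        # 3) fuse and predict an HR residual on top of the upsampled depth
        return up_depth + self.fuse(torch.cat([d, g_aligned], dim=1))

# usage sketch for the x16 setting
net = AlignThenFuse()
lr_depth = torch.rand(1, 1, 32, 32)      # low-resolution depth
rgb = torch.rand(1, 3, 512, 512)         # HR RGB guidance
pseudo = torch.rand(1, 1, 512, 512)      # pseudo-depth from a large MDE model
hr_depth = net(lr_depth, rgb, pseudo)    # -> (1, 1, 512, 512)
```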

Motivation

Our motivation is to address the resolution and modality gaps in GDSR by leveraging a large pre-trained monocular depth estimation model. As shown on the left side of the figure, pseudo-depth generated from the RGB image with DepthAnythingv2 effectively suppresses irrelevant texture information while preserving rich boundary details with minimal artifacts (red box). The right side of the figure illustrates how the mutual information between depth, pseudo-depth, RGB, and the GT changes across scales (see the sketch below for one way to estimate it). It shows that pseudo-depth retains scale-invariant information, thereby addressing the resolution gap. Furthermore, pseudo-depth encapsulates more depth-relevant information than RGB does, which helps bridge the modality gap.
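
The mutual-information comparison referenced above can be approximated with a simple histogram-based estimate; the bin count and the example variable names below are illustrative choices, not the paper's measurement protocol.

```python
# Rough histogram-based mutual information between two single-channel maps
# (e.g., GT depth vs. pseudo-depth, or GT depth vs. RGB luminance).
# Binning is an illustrative assumption, not the paper's protocol.
import numpy as np

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 64) -> float:
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of b
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# e.g., compare how much information each guidance carries about the GT depth:
# mi_pseudo = mutual_information(gt_depth, pseudo_depth)
# mi_rgb    = mutual_information(gt_depth, rgb_gray)
```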

Visual Comparisons



BibTeX









Contact

Feel free to contact us at xinnijiang@mail.nankai.edu.cn

© Xin-Ni Jiang | Last updated: Jan. 2024