REFINING CLIP'S SPATIAL AWARENESS: A VISUAL-CENTRIC PERSPECTIVE

  • Congpei Qiu
  • Yanhao Wu
  • Wei Ke
  • Xiuxiu Bai
  • Tong Zhang

Research output: Conference contribution, peer-reviewed

Abstract

Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from a notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates this degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on the intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
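
As a rough illustration of what distilling spatial correlations could look like, the sketch below compares patch-to-patch similarity matrices from a frozen teacher (standing in for the original CLIP) and a student (standing in for the RLA fine-tuned model). The abstract does not specify the loss, so the cosine similarity, the temperature-scaled softmax, the cross-entropy objective, and every function name here are assumptions made for illustration, not the authors' implementation.

```python
import math

def l2_normalize(vec):
    # Scale a feature vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def spatial_correlation(features):
    # N x N matrix of cosine similarities between all pairs of patch features.
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def softmax(row, temperature):
    # Numerically stable softmax over one row of a correlation matrix.
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def correlation_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    # Assumed objective: mean cross-entropy between the teacher's and the
    # student's row-wise correlation distributions.
    s_corr = spatial_correlation(student_feats)
    t_corr = spatial_correlation(teacher_feats)
    loss = 0.0
    for s_row, t_row in zip(s_corr, t_corr):
        p_teacher = softmax(t_row, temperature)
        p_student = softmax(s_row, temperature)
        loss -= sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    return loss / len(s_corr)

# Toy patch features: three patches with two channels each (made-up values).
teacher = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

matched = correlation_distillation_loss(teacher, teacher)
drifted = correlation_distillation_loss(student, teacher)
print(matched < drifted)
```

Under these assumptions the loss is smallest when the student reproduces the teacher's correlation structure, which is the property a correlation-preserving distillation term relies on.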

Original language: English
Title of host publication: 13th International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 5479-5505
Number of pages: 27
ISBN (electronic): 9798331320850
Publication status: Published - 2025
Externally published
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore, Singapore
Duration: 24 Apr 2025 → 28 Apr 2025

Publication series

Name: 13th International Conference on Learning Representations, ICLR 2025

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
City: Singapore
Period: 24/04/25 → 28/04/25
