UMRetail: A Unified Multimodal Dataset for Hyper-Dense Shelves in Smart Retail

  • University of Coimbra
  • Guangzhou College of Commerce
  • Macao Polytechnic University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Vision-language models and multi-task learning are advancing scene understanding toward unified multimodal frameworks. However, retail datasets are fragmented: most target a single task, and heterogeneous annotation protocols and semantic granularity impede joint training, inference, and fair benchmarking. We present UMRetail, a unified multimodal dataset of real-world retail shelves with human-verified annotations. It comprises 17,697 high-resolution images covering 3,812 product types and provides instance-level segmentation masks, product detection bounding boxes, shelf-vacancy labels, and hierarchical product descriptions (short, medium, long) ranging from concise names to detailed specifications. These harmonized cross-task labels enable integrated training and consistent evaluation for detection, segmentation, and vacancy detection. Experiments show that UMRetail's labels provide a reliable basis for rigorous evaluation: YOLOv11 Medium achieves state-of-the-art edge-device product detection (mAP 0.551, mAP50 0.806); UMRetail-MTArch raises image-to-text retrieval R@1 by 143.1% over zero-shot Chinese-CLIP and reaches 74.55% Top-1 accuracy in zero-shot classification over 3,812 classes, about 5.5 times that of Chinese-CLIP (13.43%) and 31 times that of CLIP (ViT-B, 2.39%). This establishes UMRetail as a research-deployment bridge for retail scene perception.
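The comparative figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses just the percentages quoted above, and the names `top1`, `ratio`, and `relative_gain_to_factor` are hypothetical helpers, not part of any UMRetail release. The abstract does not state the baseline R@1 for Chinese-CLIP, so only the multiplicative factor implied by the 143.1% gain is computed, not an absolute value.

```python
# Top-1 zero-shot classification accuracy (percent) as quoted in the abstract.
top1 = {
    "UMRetail-MTArch": 74.55,
    "Chinese-CLIP": 13.43,
    "CLIP (ViT-B)": 2.39,
}


def ratio(model: str, baseline: str) -> float:
    """Return how many times the baseline's Top-1 accuracy the model reaches."""
    return top1[model] / top1[baseline]


def relative_gain_to_factor(gain_percent: float) -> float:
    """Convert a relative gain in percent into a multiplicative factor."""
    return 1.0 + gain_percent / 100.0


print(round(ratio("UMRetail-MTArch", "Chinese-CLIP"), 2))   # 5.55
print(round(ratio("UMRetail-MTArch", "CLIP (ViT-B)"), 2))   # 31.19
print(round(relative_gain_to_factor(143.1), 3))             # 2.431
```

Read this way, the "5.5 times" and "31 times" statements are rounded ratios of the quoted accuracies, and a 143.1% relative gain in R@1 corresponds to roughly 2.4 times the baseline value.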

Original language: English
Title of host publication: Proceedings - 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 336-341
Number of pages: 6
ISBN (Electronic): 9798331556297
DOIs
Publication status: Published - 2025
Event: 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025 - Bogota, Colombia
Duration: 19 Dec 2025 → 21 Dec 2025

Publication series

Name: Proceedings - 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025

Conference

Conference: 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025
Country/Territory: Colombia
City: Bogota
Period: 19/12/25 → 21/12/25

Keywords

  • Dense Product Detection
  • Fine-grained Retail
  • Hierarchical Text Description
  • Instance Segmentation
  • Multimodal Dataset
