TY - GEN
T1 - UMRetail
T2 - 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025
AU - Chen, Bidong
AU - Li, Lingui
AU - Paiva, Rui Pedro
AU - Guo, Jielong
AU - Cai, Jianxiu
AU - Yang, Xu
AU - Wang, Yapeng
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Vision-language models and multi-task learning are advancing scene understanding toward unified multimodal frameworks. However, retail datasets are fragmented: most target a single task, and heterogeneous annotation protocols and semantic granularity impede joint training, inference, and fair benchmarking. We present UMRetail, a unified multimodal dataset of real-world retail shelves with human-verified annotations. It comprises 17,697 high-resolution images covering 3,812 product types and provides instance-level segmentation masks, product detection bounding boxes, shelf-vacancy labels, and hierarchical product descriptions at three lengths (short, medium, long), ranging from concise names to detailed specifications. These harmonized, cross-task labels enable integrated training and consistent evaluation for detection, segmentation, and vacancy detection. Experimental results demonstrate that UMRetail's rich labels provide a reliable basis for rigorous evaluation: YOLOv11 Medium achieves state-of-the-art edge-device product detection (mAP 0.551, mAP50 0.806); UMRetail-MTArch raises image-to-text retrieval R@1 by 143.1% over zero-shot Chinese-CLIP and reaches 74.55% Top-1 accuracy in zero-shot classification over 3,812 classes, about 5.5 times that of Chinese-CLIP (13.43%) and 31 times that of CLIP (ViT-B, 2.39%). This establishes UMRetail as a research-deployment bridge for retail scene perception.
KW - Dense Product Detection
KW - Fine-grained Retail
KW - Hierarchical Text Description
KW - Instance Segmentation
KW - Multimodal Dataset
UR - https://www.scopus.com/pages/publications/105035374889
U2 - 10.1109/ICVRV67992.2025.00065
DO - 10.1109/ICVRV67992.2025.00065
M3 - Conference contribution
AN - SCOPUS:105035374889
T3 - Proceedings - 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025
SP - 336
EP - 341
BT - Proceedings - 2025 International Conference on Virtual Reality and Visualization, ICVRV 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 December 2025 through 21 December 2025
ER -