TY - GEN
T1 - Towards Further Comprehension on Referring Expression with Rationale
AU - Li, Rengang
AU - Fan, Baoyu
AU - Li, Xiaochuan
AU - Zhang, Runze
AU - Guo, Zhenhua
AU - Zhao, Kun
AU - Zhao, Yaqian
AU - Gong, Weifeng
AU - Wang, Endong
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
AB - Referring Expression Comprehension (REC) is an important research branch of visual grounding, in which the goal is to localize the relevant object in an image given a textual expression that exactly describes a specific object. However, existing REC tasks focus on filtering text content and locating image objects, and are evaluated by the precision of the predicted detection boxes. This may allow models to achieve good performance while bypassing genuine multimodal comprehension. In this paper, we study how to enable an artificial agent to understand referring expressions more deeply and propose a more comprehensive task, called Further Comprehension on Referring Expression (FREC). The task comprises three sub-tasks: 1) correcting an erroneous text expression based on visual information; 2) generating a rationale for the input expression; 3) localizing the proper object based on the corrected expression. Accordingly, we construct a new dataset for this task, named Further-RefCOCOs, on top of the RefCOCO, RefCOCO+, and RefCOCOg benchmark datasets, and make it publicly available. We then design a novel end-to-end pipeline that performs these sub-tasks simultaneously. Experimental results demonstrate the validity of the proposed pipeline. We believe this work will motivate more researchers to explore this direction and promote the development of visual grounding.
KW - computational linguistics
KW - multimodal learning
KW - referring expression comprehension
KW - visual grounding
UR - http://www.scopus.com/inward/record.url?scp=85151164827&partnerID=8YFLogxK
DO - 10.1145/3503161.3548417
M3 - Conference contribution
AN - SCOPUS:85151164827
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 4336
EP - 4344
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 30th ACM International Conference on Multimedia, MM 2022
Y2 - 10 October 2022 through 14 October 2022
ER -