TY - JOUR
T1 - Inexactly Matched Referring Expression Comprehension With Rationale
AU - Li, Xiaochuan
AU - Fan, Baoyu
AU - Zhang, Runze
AU - Zhao, Kun
AU - Guo, Zhenhua
AU - Zhao, Yaqian
AU - Li, Rengang
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Referring Expression Comprehension (REC) is a multimodal task that aims to locate an object in an image given a text description. Existing REC tasks rest on the basic assumption that the given text expression and the image exactly match each other. In real-world scenarios, however, such an exact match is not guaranteed: illegible objects in the image or ambiguous phrases in the text can significantly degrade the performance of conventional REC models. To overcome these limitations, we consider a more practical and comprehensive REC task in which the given image and its referring text expression may be inexactly matched. Our models aim to correct such inexact matches and supply corresponding interpretations. We refer to this task as Further REC (FREC). The task is divided into three subtasks: 1) correcting the erroneous text expression using visual information, 2) generating the rationale for the input expression, and 3) localizing the proper object based on the corrected expression. We introduce three new datasets for FREC: Further-RefCOCOs, Further-Copsref, and Further-Talk2Car, built on existing REC datasets including RefCOCO and Talk2Car. We develop a novel pipeline architecture that executes the three subtasks simultaneously in an end-to-end fashion, together with an elastic masked language modeling (EMLM) training head that rectifies text errors of uncertain length. Our experimental results demonstrate the validity of the proposed pipeline. We hope this work sparks more research on inexactly matched REC.
AB - Referring Expression Comprehension (REC) is a multimodal task that aims to locate an object in an image given a text description. Existing REC tasks rest on the basic assumption that the given text expression and the image exactly match each other. In real-world scenarios, however, such an exact match is not guaranteed: illegible objects in the image or ambiguous phrases in the text can significantly degrade the performance of conventional REC models. To overcome these limitations, we consider a more practical and comprehensive REC task in which the given image and its referring text expression may be inexactly matched. Our models aim to correct such inexact matches and supply corresponding interpretations. We refer to this task as Further REC (FREC). The task is divided into three subtasks: 1) correcting the erroneous text expression using visual information, 2) generating the rationale for the input expression, and 3) localizing the proper object based on the corrected expression. We introduce three new datasets for FREC: Further-RefCOCOs, Further-Copsref, and Further-Talk2Car, built on existing REC datasets including RefCOCO and Talk2Car. We develop a novel pipeline architecture that executes the three subtasks simultaneously in an end-to-end fashion, together with an elastic masked language modeling (EMLM) training head that rectifies text errors of uncertain length. Our experimental results demonstrate the validity of the proposed pipeline. We hope this work sparks more research on inexactly matched REC.
KW - Referring expression comprehension
KW - computational linguistics
KW - multimodal learning
UR - http://www.scopus.com/inward/record.url?scp=85181568979&partnerID=8YFLogxK
U2 - 10.1109/TMM.2023.3318289
DO - 10.1109/TMM.2023.3318289
M3 - Article
AN - SCOPUS:85181568979
SN - 1520-9210
VL - 26
SP - 3937
EP - 3950
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -