Referring Expression Comprehension (REC) is a multimodal task that aims to localize an object in an image given a text description. Existing REC work typically assumes that the given text expression and the image match each other exactly. In real-world scenarios, however, this assumption does not always hold: illegible objects in the image or ambiguous phrases in the text can significantly degrade the performance of conventional REC models. To overcome these limitations, we consider a more practical and comprehensive REC task in which the given image and its referring expression may be inexactly matched. Our models aim to correct such inexact matches and supply corresponding interpretations. We refer to this task as <italic>Further REC (FREC)</italic>. FREC comprises three subtasks: 1) correcting the erroneous text expression using visual information, 2) generating a rationale for the input expression, and 3) localizing the proper object based on the corrected expression. We introduce three new datasets for FREC: <italic>Further-RefCOCOs</italic>, <italic>Further-Copsref</italic>, and <italic>Further-Talk2Car</italic>, built on existing REC datasets including RefCOCO and Talk2Car. We develop a novel pipeline architecture that executes the three subtasks simultaneously in an end-to-end fashion, together with an elastic masked language modeling (EMLM) training head that rectifies text errors of uncertain length. Our experimental results demonstrate the validity of the proposed pipeline. We hope this work sparks further research on inexactly matched REC.
- Computational Linguistics
- Multimodal Learning
- Referring Expression Comprehension
- Task analysis