AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

Rengang Li, Cong Xu, Zhenhua Guo, Baoyu Fan, Runze Zhang, Wei Liu, Yaqian Zhao, Weifeng Gong, Endong Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Visual Question Answering (VQA) serves as a proxy for evaluating the scene understanding of an intelligent agent by answering questions about images. Most VQA benchmarks to date focus on questions that can be answered by understanding the visual content of the scene, such as simple counting and visual attributes, or on slightly more challenging questions that require additional encyclopedic knowledge. However, humans have a remarkable capacity to reason about dynamic interactions in a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that investigates deep scene understanding when the agent takes a certain action. For this task, a model needs not only to answer action-related questions but also to locate the objects involved in the interaction, to ensure that it truly comprehends the action. Accordingly, we build a new dataset based on Visual Genome and the ATOMIC knowledge graph, comprising more than 19,000 manually annotated questions, and will make it publicly available. We also annotate the reasoning path that leads to the answer for each question. Based on the dataset, we further propose a novel method, called ARE, that can comprehend the interaction and explain the reason based on a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.
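The abstract states that a model must both answer an action-related question and locate the objects involved in the interaction. The following is a minimal sketch of how such a joint check could be expressed. It is not the paper's data format or evaluation metric: the `Sample` fields, the box layout `(x1, y1, x2, y2)`, the exact-match answer comparison, and the IoU threshold of 0.5 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    """Hypothetical record for one annotated question (not the released schema)."""
    question: str
    answer: str
    box: Box  # object involved in the interaction
    reasoning_path: List[str] = field(default_factory=list)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(sample: Sample, pred_answer: str, pred_box: Box,
               iou_threshold: float = 0.5) -> bool:
    """Count a prediction only if the answer matches AND the object is localized.

    The threshold and exact-match rule are assumptions, not the paper's protocol.
    """
    answer_ok = pred_answer.strip().lower() == sample.answer.strip().lower()
    return answer_ok and iou(sample.box, pred_box) >= iou_threshold
```

Requiring both conditions reflects the stated motivation: a correct answer paired with a wrongly located object would not show that the model comprehends the action.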

Original language: English
Title of host publication: MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 5274-5282
Number of pages: 9
ISBN (Electronic): 9781450392037
Publication status: Published - 10 Oct 2022
Externally published: Yes
Event: 30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal
Duration: 10 Oct 2022 to 14 Oct 2022

Publication series

Name: MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference: 30th ACM International Conference on Multimedia, MM 2022
Country/Territory: Portugal
City: Lisboa
Period: 10/10/22 to 14/10/22

Keywords

  • dataset
  • vision and language
  • visual question answer
