Visual Question Answering (VQA) serves as a proxy for evaluating the scene understanding of an intelligent agent by requiring it to answer questions about images. Most VQA benchmarks to date focus on questions that can be answered by understanding the visual content of the scene, such as simple counting and visual attributes, along with somewhat more challenging questions that require extra encyclopedic knowledge. However, humans have a remarkable capacity to reason about dynamic interactions in a scene, which goes beyond the literal content of an image and has not been investigated so far. In this paper, we propose Agent Interaction Visual Question Answering (AI-VQA), a task that probes deep scene understanding by asking what happens if an agent takes a certain action. For this task, a model must not only answer action-related questions but also locate the objects involved in the interaction, to guarantee that it truly comprehends the action. Accordingly, we construct a new dataset based on Visual Genome and the ATOMIC knowledge graph, comprising more than 19,000 manually annotated questions, and will make it publicly available. In addition, we provide an annotation of the reasoning path used to derive the answer for each question. Based on this dataset, we further propose a novel method, called ARE, that comprehends the interaction and explains the reasoning using a given event knowledge base. Experimental results show that our proposed method outperforms the baseline by a clear margin.