Visual question answering (VQA), which aims to answer natural-language questions about a given image, is still in its infancy. Current approaches lack the flexibility and generalizability to handle diverse questions without additional training. It is therefore desirable to explore explainable VQA (or X-VQA), which provides natural-language explanations of its reasoning in addition to answers. This requires integrating computer vision, natural language processing, and knowledge representation, and it is a highly challenging task. By exploring X-VQA, this project advances and enriches fundamental research in computer vision, image understanding, visual semantic analysis, machine learning, and knowledge representation. It also facilitates a wide range of applications, including visual chatbots, visual retrieval and recommendation, and human-computer interaction. This research further contributes to education through curriculum development, student training, and knowledge dissemination, and it includes outreach to K-12 students through participation and research opportunities.
The major goal of this research is to develop a novel computational model, with a solid theoretical foundation and effective methods, to enable X-VQA that explains its visual reasoning. This challenging task involves many fundamental aspects and must integrate vision, language, learning, and knowledge. This project focuses on: (1) A unified computational model of X-VQA and its theoretical foundation. The model integrates domain knowledge and visual observations for reasoning, addressing which hidden facts can be inferred from incomplete and inaccurate visual observations and how; how visual observations, hidden facts, and domain knowledge can be represented for efficient question answering; and how question answering can be made scalable. The study of these critical issues creates the foundation for X-VQA; (2) A new model for question-driven, task-oriented visual observation. It is inefficient to collect all visual observations before answering a question; vision needs to be question-driven and task-oriented. This project pursues a new model of the interaction among questions, visual reasoning, and visual observation that automatically steers attention to the question-related aspects of an image; (3) An innovative approach to self-questioning for training X-VQA agents. Training solely on question-answer data is not viable for X-VQA, as such data provide neither explanations for nor insights into the answers. This project pursues a novel self-questioning approach in which VQA agents can also generate and ask questions. It investigates how self-questioning can be combined with reinforcement learning, and how it can handle diverse questions to improve the scalability of X-VQA; and (4) A solid case study on X-VQA.
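As a rough illustration of thrust (3), the following minimal Python sketch shows one way a self-questioning loop could be interleaved with a reinforcement-learning-style update; all names here (VQAAgent, generate_question, answer_with_explanation, train) are hypothetical placeholders for illustration only and are not part of the project's actual design.

```python
# Minimal sketch only: all class/function names are hypothetical illustrations,
# not the project's actual API or method.
import math
import random

class VQAAgent:
    """Toy agent that both asks and answers questions about a scene (a dict of facts)."""

    def __init__(self):
        self.weight = 0.0  # single scalar standing in for learned parameters

    def generate_question(self, scene):
        # Self-questioning: the agent poses a question about some attribute of the scene.
        attr = random.choice(list(scene.keys()))
        return f"What is the {attr}?", attr

    def answer_with_explanation(self, scene, attr):
        # Question-driven observation: only the queried attribute is inspected.
        # The answer is correct with probability sigmoid(weight), so learning matters.
        p_correct = 1.0 / (1.0 + math.exp(-self.weight))
        if random.random() < p_correct:
            return scene[attr], f"The {attr} was read from the attended image region."
        return "unknown", f"The agent failed to ground the {attr} in the image."

def train(agent, scenes, episodes=200, lr=0.1):
    # Reinforcement-learning-flavored loop: the agent asks itself a question,
    # answers it, and is rewarded when the answer is consistent with the scene.
    for _ in range(episodes):
        scene = random.choice(scenes)
        _, attr = agent.generate_question(scene)
        answer, _ = agent.answer_with_explanation(scene, attr)
        reward = 1.0 if answer == scene[attr] else 0.0
        agent.weight += lr * (reward - 0.5)  # toy policy-gradient-style update
    return agent

if __name__ == "__main__":
    toy_scenes = [{"color": "red", "shape": "cube"}, {"color": "blue", "shape": "sphere"}]
    trained = train(VQAAgent(), toy_scenes)
    print("learned weight:", round(trained.weight, 3))
```

In this toy setting the reward comes from consistency between the self-posed question and the scene's facts, standing in for the richer explanation-aware rewards the project would investigate.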
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.