Al-Zoghby, A., Saleh, A., Awad, W. (2024). A Survey on Visual Question Answering Methodologies. The Egyptian Journal of Language Engineering, 11(1), 57-65. doi: 10.21608/ejle.2024.244720.1058
A Survey on Visual Question Answering Methodologies
1 Department of Computer Science, Faculty of Computers and Information Science, Damietta University, Damietta, Egypt
2 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Damietta University, New Damietta, Egypt
3 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Damietta University, Damietta, Egypt
Abstract
Visual question answering (VQA) is essential for many human tasks, yet it poses significant obstacles at the core of artificial intelligence because it is inherently a multimodal problem. This article summarizes the challenges in multimodal architectures that the recent surge in research has brought to light; keeping these challenges in view is necessary for improving the design of VQA systems. We then survey the rapid recent developments in methods for answering questions about images. Providing the correct response to a natural-language question about an input image is a difficult multimodal task: a system must not only extract features from both modalities (text and image) but also attend to the relations between them. The task attracts many deep learning researchers because of deep learning's outstanding contributions to text, speech, and vision technologies (images and videos) in fields such as welfare, robotics, security, and medicine.
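The two requirements named above, extracting features from each modality and attending to the relation between them, can be illustrated with a minimal sketch. This is not any particular method from the surveyed literature; the feature matrices, dimensions, and the element-wise fusion step are all illustrative assumptions standing in for learned encoders and a trained answer classifier.

```python
import numpy as np

# Illustrative VQA fusion sketch (hypothetical; not a surveyed method).
# Assume features are already extracted: a matrix of image-region
# features and a matrix of question word embeddings, both of width d.
rng = np.random.default_rng(0)
d = 8
image_regions = rng.normal(size=(5, d))   # 5 visual region features
question_words = rng.normal(size=(3, d))  # 3 word-embedding vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Question-guided attention over image regions: pool the question into
# one vector, score every region against it, and weight the regions.
question_vec = question_words.mean(axis=0)           # (d,)
scores = image_regions @ question_vec / np.sqrt(d)   # (5,) region scores
weights = softmax(scores)                            # attention weights
attended_image = weights @ image_regions             # (d,) attended visual vector

# Fuse the two modalities (element-wise product, a common simple choice)
# into a joint vector that an answer classifier would consume.
joint = attended_image * question_vec
print(weights.sum(), joint.shape)
```

The attention weights sum to one, so the attended visual vector is a convex combination of region features emphasized by the question, which is the "relation between modalities" the abstract refers to.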