Al-Zoghby, A., Saleh, A., Awad, W. (2024). A Survey on Visual Question Answering Methodologies. The Egyptian Journal of Language Engineering, 11(1), 57-65. doi: 10.21608/ejle.2024.244720.1058
A Survey on Visual Question Answering Methodologies
1 Department of Computer Science, Faculty of Computers and Information Science, Damietta University, Damietta, Egypt
2 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Damietta University, New Damietta, Egypt
3 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Damietta University, Damietta, Egypt
Abstract
Visual question answering (VQA) is essential for many human tasks, yet it poses significant obstacles at the core of artificial intelligence because it is inherently a multimodal problem. This article summarizes the challenges in multimodal architectures that the recent surge in research has brought to light; keeping these challenges in view is necessary for improving the design of VQA systems. We then survey the rapid recent developments in methods for answering questions about images. Providing the correct response to a natural-language question about an input image is a difficult multimodal task: a system must not only extract features from both modalities (text and image) but also attend to the relations between them. The task attracts many deep learning researchers because of deep learning's outstanding contributions to text, speech, and vision technologies (images and videos) in fields such as welfare, robotics, security, and medicine.
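The two requirements named above, extracting features from each modality and attending to the relation between them, can be illustrated with a minimal sketch. This is not any particular method from the surveyed literature; the feature matrices, dimensions, and the element-wise fusion step are all illustrative assumptions standing in for learned encoders and a trained answer classifier.

```python
import numpy as np

# Illustrative VQA fusion sketch (hypothetical; not a surveyed method).
# Assume features are already extracted: a matrix of image-region
# features and a matrix of question word embeddings, both of width d.
rng = np.random.default_rng(0)
d = 8
image_regions = rng.normal(size=(5, d))   # 5 visual region features
question_words = rng.normal(size=(3, d))  # 3 word-embedding vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Question-guided attention over image regions: pool the question into
# one vector, score every region against it, and weight the regions.
question_vec = question_words.mean(axis=0)           # (d,)
scores = image_regions @ question_vec / np.sqrt(d)   # (5,) region scores
weights = softmax(scores)                            # attention weights
attended_image = weights @ image_regions             # (d,) attended visual vector

# Fuse the two modalities (element-wise product, a common simple choice)
# into a joint vector that an answer classifier would consume.
joint = attended_image * question_vec
print(weights.sum(), joint.shape)
```

The attention weights sum to one, so the attended visual vector is a convex combination of region features emphasized by the question, which is the "relation between modalities" the abstract refers to.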