A Unified Multimodal Framework for Joint Visual Question Answering and Image Captioning
DOI:
https://doi.org/10.59088/7m3hce68
Keywords:
Multimodal Learning, Visual Question Answering, Image Captioning, Vision–Language Models, Cross-Modal Attention, Joint Learning
Abstract
Recent advances in vision–language models have significantly improved performance on multimodal tasks such as Visual Question Answering (VQA) and image captioning. However, most existing approaches address these tasks independently, which leads to redundant model architectures and limited cross-task knowledge transfer. In this paper, we propose a unified multimodal framework that jointly learns VQA and image captioning within a single architecture. The proposed model combines a shared vision–language encoder with task-specific decoding heads, enabling efficient parameter sharing and improved generalization across tasks. To enhance cross-modal alignment, we introduce a cross-attention mechanism that jointly models interactions between visual features, questions, and captions. In addition, we design a multi-task learning objective that balances generative and discriminative training signals. We evaluate the proposed framework on the VQA v2 and MSCOCO benchmarks. Experimental results show that our approach improves VQA accuracy by 1.7% and the captioning CIDEr score by 4.2 points, while reducing model parameters by approximately 30% compared to separate task-specific models. Furthermore, the unified model is more robust and generalizes better because it leverages complementary information across tasks. These findings highlight the effectiveness of joint multimodal learning for efficient and scalable vision–language understanding.
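The multi-task objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual loss, so everything here is an assumption: VQA is treated as answer classification (the discriminative signal), captioning as per-token negative log-likelihood (the generative signal), and the two are mixed with a hypothetical weight `lam`. The function names are illustrative and do not come from the paper.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one example, given raw logits and a target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def vqa_loss(answer_logits, answer_idx):
    """Discriminative term: classify the answer from a fixed answer vocabulary."""
    return softmax_cross_entropy(answer_logits, answer_idx)


def caption_loss(token_logits, token_ids):
    """Generative term: mean negative log-likelihood over caption tokens."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(token_logits, token_ids)]
    return sum(losses) / len(losses)


def joint_loss(answer_logits, answer_idx, token_logits, token_ids, lam=0.5):
    """Assumed weighted sum of the two task losses; lam is a hypothetical knob."""
    return lam * vqa_loss(answer_logits, answer_idx) + (1.0 - lam) * caption_loss(
        token_logits, token_ids
    )
```

In this sketch, `lam` closer to 1 emphasizes the VQA term and `lam` closer to 0 emphasizes captioning; how the paper actually balances the two signals is not stated in the abstract.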
License
Copyright (c) 2026 Wu Zhan, Anwar Saif

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
This publication is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA). This is a human-readable summary of (and not a substitute for) the license. You are free to: (a) Share: copy and redistribute the material in any medium or format; (b) Adapt: remix, transform, and build upon the material for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms. Under the following terms:

