A Unified Multimodal Framework for Joint Visual Question Answering and Image Captioning

Wu Lingyi; Anwar Saif

doi:10.59088/7m3hce68

Keywords:

Multimodal Learning, Visual Question Answering, Image Captioning, Vision–Language Models, Cross-Modal Attention, Joint Learning

Abstract

Recent advances in vision–language models have significantly improved performance on multimodal tasks such as Visual Question Answering (VQA) and image captioning. However, most existing approaches address these tasks independently, which leads to redundant model architectures and limited cross-task knowledge transfer. In this paper, we propose a unified multimodal framework that jointly learns VQA and image captioning within a single architecture. The proposed model combines a shared vision–language encoder with task-specific decoding heads, enabling efficient parameter sharing and improved generalization across tasks. To enhance cross-modal alignment, we introduce a cross-attention mechanism that jointly models interactions between visual features, questions, and captions. In addition, we design a multi-task learning objective that balances generative and discriminative training signals. We evaluate the proposed framework on the VQA v2 and MSCOCO benchmarks. Experimental results show that our approach improves VQA accuracy by 1.7% and the captioning CIDEr score by 4.2, while reducing model parameters by approximately 30% compared to separate task-specific models. Furthermore, the unified model demonstrates improved robustness and generalization by leveraging complementary information across tasks. These findings highlight the effectiveness of joint multimodal learning for efficient and scalable vision–language understanding.
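
The abstract does not give the form of the multi-task objective. As a minimal sketch of one common way to balance a discriminative VQA signal against a generative captioning signal, the Python below combines an answer-classification cross-entropy with a per-token caption negative log-likelihood through a single weight. The function names, the weight `lam`, and the weighted-sum form itself are assumptions made for illustration and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def vqa_loss(answer_logits, answer_index):
    """Discriminative term: cross-entropy over a fixed answer vocabulary."""
    return -log_softmax(answer_logits)[answer_index]


def caption_loss(token_logits, token_indices):
    """Generative term: mean negative log-likelihood of the caption tokens."""
    nll = [-log_softmax(step)[tok] for step, tok in zip(token_logits, token_indices)]
    return sum(nll) / len(nll)


def joint_loss(answer_logits, answer_index, token_logits, token_indices, lam=0.5):
    """Weighted sum of both terms; `lam` is a hypothetical balancing weight."""
    l_vqa = vqa_loss(answer_logits, answer_index)
    l_cap = caption_loss(token_logits, token_indices)
    return lam * l_vqa + (1.0 - lam) * l_cap
```

In this sketch, `lam = 1.0` reduces to VQA-only training and `lam = 0.0` to captioning-only training. In a model with a shared encoder, both terms would send gradients into the same encoder parameters, which is where cross-task transfer would come from.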

Published

2026-03-25

Issue

Vol. 5 No. 1 (2026)

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This is a human-readable summary of (and not a substitute for) the CC BY-SA 4.0 license. You are free to: (a) Share: copy and redistribute the material in any medium or format; (b) Adapt: remix, transform, and build upon the material for any purpose, even commercially. The licensor cannot revoke these freedoms as long as you follow the license terms.

How to Cite

A Unified Multimodal Framework for Joint Visual Question Answering and Image Captioning. (2026). Peta International Journal of Social Science and Humanity, 5(1), 1-14. https://doi.org/10.59088/7m3hce68
