Visual Question Answering (VQA) is a challenging task that demands not only accurate alignment between images and language, but also multi-step reasoning, contextual understanding, and the ability to incorporate external knowledge, especially in multi-turn settings where follow-up questions depend on previous dialogue. In this work, we present a novel framework for generating knowledge-grounded, multi-turn VQA datasets, which has been integrated into the IBM Granite-Vision development pipeline. The main novelty of our method is the generation of multi-turn conversations using large language models (LLMs), heavily supported by symbolic reasoning over knowledge graphs (KGs): we leverage structured and unstructured knowledge sources drawn from Wikipedia articles, their associated images, and the Wikidata KG. By combining structured and unstructured knowledge sources, our approach advances VQA beyond shallow perception tasks toward deeper, knowledge- and entity-aware reasoning. We demonstrate the effectiveness of this approach by using it to fine-tune and evaluate existing vision-language models (beyond the Granite-Vision models), and we share insights about the complexity of the task and the nature of available benchmarks.
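To make the idea of KG-supported conversation generation concrete, the sketch below shows one plausible way such a pipeline could be wired together: pull facts about the depicted entity from the public Wikidata SPARQL endpoint, then prompt an LLM to produce a multi-turn dialogue grounded in those facts and a Wikipedia excerpt. This is an illustrative sketch, not the paper's released code; the function names, the prompt wording, and the `llm_call` hook are assumptions, and only the Wikidata SPARQL endpoint and its JSON result format are real APIs.

```python
"""Illustrative sketch: ground a multi-turn VQA conversation in Wikidata facts.

Not the authors' implementation. The caller supplies `llm_call`, any
callable mapping a prompt string to generated text.
"""
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def fetch_entity_facts(qid: str, limit: int = 20) -> list[tuple[str, str]]:
    """Return (property label, value label) pairs for a Wikidata entity, e.g. 'Q243'."""
    query = f"""
    SELECT ?propLabel ?valLabel WHERE {{
      wd:{qid} ?p ?val .
      ?prop wikibase:directClaim ?p .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT {limit}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "vqa-grounding-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["propLabel"]["value"], r["valLabel"]["value"]) for r in rows]


def build_multiturn_prompt(caption: str, wiki_excerpt: str,
                           facts: list[tuple[str, str]], turns: int = 3) -> str:
    """Assemble a prompt asking the LLM for a grounded multi-turn Q&A dialogue."""
    fact_lines = "\n".join(f"- {p}: {v}" for p, v in facts)
    return (
        "You are generating a visual question answering dialogue about an image.\n"
        f"Image caption: {caption}\n"
        f"Article excerpt: {wiki_excerpt}\n"
        f"Knowledge-graph facts:\n{fact_lines}\n"
        f"Write {turns} question/answer turns. Follow-up questions must refer to "
        "earlier turns, and every answer must be supported by the facts or the "
        "excerpt above."
    )


def generate_conversation(llm_call, caption: str, wiki_excerpt: str,
                          qid: str, turns: int = 3) -> str:
    """Fetch KG facts for the entity and generate one grounded dialogue."""
    facts = fetch_entity_facts(qid)
    prompt = build_multiturn_prompt(caption, wiki_excerpt, facts, turns)
    return llm_call(prompt)
```

In such a setup, the symbolic side (the SPARQL query over the KG) constrains what the LLM is allowed to assert, which is one way to keep generated answers entity-aware rather than purely perceptual.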