VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Hyeongcheol Park1, MinHyuk Jang1, Ha Dam Baek1, Gyusam Chang1,
Jiyoung Seo1, Jiwan Park1, Hogun Park2, Sangpil Kim1

1Korea University 2Sungkyunkwan University

Overview

VAT-KG Overview. An overview of the VAT-KG construction pipeline, which involves four stages: (1) Multimodal Alignment Filtering, which ensures correlation across modalities; (2) Knowledge-Intensive Recaptioning, which transforms the base text into rich, knowledge-intensive captions based on meta information; (3) Multimodal Triplet Grounding, which aligns triplets with their corresponding multimodal context; and (4) Cross-Modal Description Alignment, which retrieves fine-grained descriptions from external knowledge bases and matches them to each multimodal triplet.
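To make the four stages concrete, below is a minimal, library-agnostic sketch of how they could be chained over a raw multimodal corpus. The dataclasses and the callables `is_aligned`, `recaption`, `extract_triplets`, and `describe` are hypothetical stand-ins for the paper's components (the alignment filters, an MLLM recaptioner, a triplet extractor, and an external knowledge-base retriever), not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

@dataclass
class Sample:
    video_path: str
    audio_path: str
    caption: str                      # base caption from the source corpus
    meta: dict = field(default_factory=dict)

@dataclass
class Triplet:
    head: str
    relation: str
    tail: str
    sample: Sample                    # multimodal context grounding the triplet
    head_desc: str = ""               # concept-level descriptions from an external KB
    tail_desc: str = ""

def build_vatkg(samples: Iterable[Sample],
                is_aligned: Callable[[Sample], bool],
                recaption: Callable[[Sample], str],
                extract_triplets: Callable[[str], List[Tuple[str, str, str]]],
                describe: Callable[[str], str]) -> List[Triplet]:
    """Chain the four construction stages over a raw multimodal corpus."""
    graph: List[Triplet] = []
    for s in samples:
        # (1) Multimodal Alignment Filtering: keep only samples whose video,
        #     audio, and text are mutually consistent.
        if not is_aligned(s):
            continue
        # (2) Knowledge-Intensive Recaptioning: rewrite the base caption into a
        #     richer, knowledge-intensive one using the sample's meta information.
        s.caption = recaption(s)
        # (3) Multimodal Triplet Grounding: extract (head, relation, tail) triplets
        #     and link each one back to its multimodal context.
        triplets = [Triplet(h, r, t, sample=s) for h, r, t in extract_triplets(s.caption)]
        # (4) Cross-Modal Description Alignment: attach fine-grained concept
        #     descriptions retrieved from an external knowledge base.
        for tr in triplets:
            tr.head_desc, tr.tail_desc = describe(tr.head), describe(tr.tail)
        graph.extend(triplets)
    return graph
```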

Abstract

Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval-Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their coverage and leaves knowledge outdated or incomplete, and they typically support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.




VAT-KG Dataset (100K)

Dataset | Year | Text / Image / Audio / Video / Concept centric | Downstream task | Data source
IMGpedia | 2017 | ✔︎✔︎ | Link-prediction | Wikimedia Commons, DBpedia
ImageGraph | 2017 | ✔︎✔︎ | Local Ranking | FB15k
MMKG | 2019 | ✔︎✔︎ | Link-prediction, Reasoning | FB15k, DB15k, YAGO15k, Search Engine
Richpedia | 2020 | ✔︎✔︎ | Retrieval | Wikidata, Wikimedia, Search Engine
VisualSem | 2020 | ✔︎✔︎ | Retrieval | BabelNet
MarKG | 2023 | ✔︎✔︎ ✔︎ | Link-prediction, Reasoning | Wikipedia, Search Engine
AspectMMKG | 2023 | ✔︎✔︎ ✔︎ | Entity aspect linking | Wikipedia, Search Engine
VCTKG | 2023 | ✔︎✔︎ ✔︎ | Link-prediction | ConceptNet, WordNet
TIVA-KG | 2023 | ✔︎✔︎✔︎ | Link-prediction | Wikipedia, Search Engine
UKnow | 2024 | ✔︎✔︎✔︎ ✔︎ | Reasoning, Retrieval | Wikipedia, News
M2ConceptBase | 2024 | ✔︎✔︎✔︎ ✔︎ | VQA | Image-text Corpora, Encyclopedia, LLM
VAT-KG | 2025 | ✔︎✔︎✔︎✔︎ ✔︎ | AQA, VQA, AVQA | Video-Audio-Text Corpora, Encyclopedia, LLM
Comparison of VAT-KG with existing MMKGs.


Filtering stage | InternVid-FLT (10%) | AudioCaps | AVQA | VALOR-32k | Total
Original | 1,000,000 | 93,726 | 40,150 | 28,823 | 1,162,699
Audio Tagging | 389,965 | 86,578 | 36,144 | 22,521 | 535,208
Audio-Text | 15,490 | 77,964 | 27,864 | 15,250 | 136,568
Video-Text | 13,941 | 70,167 | 25,077 | 13,725 | 124,295
Final | 12,464 | 59,808 | 24,999 | 12,947 | 110,218
Data sample counts at each stage of the filtering process.
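The drop in counts from one row to the next corresponds to successive alignment checks applied to each clip. The snippet below is a hedged sketch of what such per-sample filtering could look like; `tag_audio`, `embed_audio`, `embed_video`, and `embed_text` are hypothetical wrappers around an audio tagger and a shared multimodal encoder, and the thresholds are illustrative rather than the published values.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_filters(sample: dict, embed_audio, embed_video, embed_text,
                   tag_audio, tau_at: float = 0.3, tau_vt: float = 0.3) -> bool:
    """Return True if a clip survives all three alignment filters."""
    # Stage 1 (Audio Tagging): drop clips whose audio carries no meaningful
    # acoustic event (e.g. silence or pure noise).
    if not tag_audio(sample["audio"]):
        return False
    text_emb = embed_text(sample["caption"])
    # Stage 2 (Audio-Text): the audio must be semantically close to the caption.
    if cosine(embed_audio(sample["audio"]), text_emb) < tau_at:
        return False
    # Stage 3 (Video-Text): the video must also match the caption.
    if cosine(embed_video(sample["video"]), text_emb) < tau_vt:
        return False
    return True
```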

Statistics

Statistics of VAT-KG. (a) VAT-KG contains diverse concepts that are represented through varied multimodal data. (b) Concept-level descriptions linked to VAT-KG concepts are sufficiently comprehensive and informative. (c) VAT-KG ensures diversity across categories.

MMKGs

(a) A non-concept-centric case involving various modalities. (b) A concept-centric case with limited modalities. (c) Our proposed VAT-KG, which is concept-centric and covers four modalities.


Example Triplet in VAT-KG

A multimodal triplet from VAT-KG, composed of video, audio, and text. Each head and tail is linked to a detailed concept-level description.

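For illustration, a hypothetical record following this schema might look as below; the field names and all concrete values here are our own illustration and are not taken from the released dataset files.

```python
# Hypothetical VAT-KG-style record: a (head, relation, tail) triplet grounded in a
# video-audio clip, with concept-level descriptions attached to head and tail.
example_triplet = {
    "head": "acoustic guitar",
    "relation": "produces",
    "tail": "strumming sound",
    "multimodal_context": {
        "video": "clips/example.mp4",      # placeholder path to the grounding video segment
        "audio": "clips/example.wav",      # placeholder path to the corresponding audio track
        "caption": "A street musician strums an acoustic guitar near a busy crosswalk.",
    },
    "head_description": "Illustrative concept description of 'acoustic guitar' retrieved "
                        "from an external knowledge base.",
    "tail_description": "Illustrative concept description of 'strumming sound' retrieved "
                        "from an external knowledge base.",
}
```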


Multimodal RAG Framework

An overview of the Multimodal RAG Framework. Given a query from any modality (audio, video, or text), (1) Modality-Agnostic Retrieval retrieves up to five semantically relevant triplets from VAT-KG based on embedding similarity; (2) the Retrieval Checker filters out misaligned triplets using the text encoder of the same multimodal foundation model; and (3) Augmented Generation is performed with MLLMs that support audio-visual understanding.
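A minimal sketch of the three steps is given below, assuming precomputed, L2-normalized triplet embeddings (`kg_embs`) and hypothetical callables `embed` (a modality-agnostic encoder), `embed_text` (its text encoder), and `generate` (an audio-visual MLLM); `k` and the checker threshold `tau` are illustrative values, not the framework's published settings.

```python
import numpy as np

def retrieve_and_answer(query, modality, kg_embs, kg_triplets,
                        embed, embed_text, generate, k: int = 5, tau: float = 0.25):
    """Sketch of modality-agnostic retrieval, retrieval checking, and generation."""
    # (1) Modality-Agnostic Retrieval: embed the query (audio, video, or text)
    #     into the shared space and take the top-k most similar triplets.
    q = embed(query, modality)
    q = q / np.linalg.norm(q)
    scores = kg_embs @ q                      # kg_embs assumed L2-normalized, shape (N, d)
    top = np.argsort(-scores)[:k]
    # (2) Retrieval Checker: re-embed each candidate's textual form with the
    #     text encoder of the same model and drop misaligned triplets.
    kept = []
    for i in top:
        text = " ".join(kg_triplets[i]["triplet"])      # "head relation tail"
        t = embed_text(text)
        if float(np.dot(q, t / np.linalg.norm(t))) >= tau:
            kept.append(kg_triplets[i])
    # (3) Augmented Generation: condition the audio-visual MLLM on the
    #     verified triplets and their concept-level descriptions.
    context = "\n".join(f"{t['triplet']}: {t['description']}" for t in kept)
    return generate(query, modality, context)
```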


Question-Answering Examples



Overall Performance

Overall performance. We report M.J. (Model-as-Judge) scores; higher is better. Note that M2ConceptBase does not contain audio and thus cannot be applied to the Audio-QA task. VAT-KG yields the highest performance improvements (highlighted in bold).
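For reference, a hedged sketch of a single Model-as-Judge scoring step is shown below; the prompt wording, the 0-5 scale, and the `judge` callable are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
def model_as_judge_score(question: str, reference: str, prediction: str, judge) -> float:
    """Ask a judge model to rate a predicted answer against the reference answer."""
    prompt = (
        "Rate how well the prediction answers the question compared to the "
        "reference, on a scale from 0 (completely wrong) to 5 (perfect). "
        "Reply with a single number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    # Take the first numeric token in the judge's reply as the score.
    for token in judge(prompt).split():
        try:
            return float(token)
        except ValueError:
            continue
    return 0.0
```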