VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Hyeongcheol Park1, MinHyuk Jang1, Ha Dam Baek1, Gyusam Chang1,
Jiyoung Seo1, Jiwan Park1, Hogun Park2, Sangpil Kim1

1Korea University 2Sungkyunkwan University

Overview

VAT-KG Overview. An overview of the VAT-KG construction pipeline, which involves four stages: (1) Multimodal Alignment Filtering, which ensures correlation across modalities; (2) Knowledge-Intensive Recaptioning, which transforms the base text into rich, knowledge-intensive captions based on meta information; (3) Multimodal Triplet Grounding, which aligns triplets with their corresponding multimodal context; and (4) Cross-Modal Description Alignment, which retrieves fine-grained descriptions from external knowledge bases and matches them to each multimodal triplet.
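To make the four stages concrete, below is a minimal, library-agnostic sketch of how they could be chained over a raw multimodal corpus. The dataclasses and the callables `is_aligned`, `recaption`, `extract_triplets`, and `describe` are hypothetical stand-ins for the paper's components (the alignment filters, an MLLM recaptioner, a triplet extractor, and an external knowledge-base retriever), not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

@dataclass
class Sample:
    video_path: str
    audio_path: str
    caption: str                      # base caption from the source corpus
    meta: dict = field(default_factory=dict)

@dataclass
class Triplet:
    head: str
    relation: str
    tail: str
    sample: Sample                    # multimodal context grounding the triplet
    head_desc: str = ""               # concept-level descriptions from an external KB
    tail_desc: str = ""

def build_vatkg(samples: Iterable[Sample],
                is_aligned: Callable[[Sample], bool],
                recaption: Callable[[Sample], str],
                extract_triplets: Callable[[str], List[Tuple[str, str, str]]],
                describe: Callable[[str], str]) -> List[Triplet]:
    """Chain the four construction stages over a raw multimodal corpus."""
    graph: List[Triplet] = []
    for s in samples:
        # (1) Multimodal Alignment Filtering: keep only samples whose video,
        #     audio, and text are mutually consistent.
        if not is_aligned(s):
            continue
        # (2) Knowledge-Intensive Recaptioning: rewrite the base caption into a
        #     richer, knowledge-intensive one using the sample's meta information.
        s.caption = recaption(s)
        # (3) Multimodal Triplet Grounding: extract (head, relation, tail) triplets
        #     and link each one back to its multimodal context.
        triplets = [Triplet(h, r, t, sample=s) for h, r, t in extract_triplets(s.caption)]
        # (4) Cross-Modal Description Alignment: attach fine-grained concept
        #     descriptions retrieved from an external knowledge base.
        for tr in triplets:
            tr.head_desc, tr.tail_desc = describe(tr.head), describe(tr.tail)
        graph.extend(triplets)
    return graph
```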

Abstract

Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval-Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their coverage and leaves knowledge outdated or incomplete, and they typically support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.




VAT-KG Dataset (100K)

Dataset | Year | Text / Image / Audio / Video / Concept centric | Downstream task | Data source
IMGpedia | 2017 | ✔︎✔︎ | Link-prediction | Wikimedia Commons, DBpedia
ImageGraph | 2017 | ✔︎✔︎ | Local Ranking | FB15k
MMKG | 2019 | ✔︎✔︎ | Link-prediction, Reasoning | FB15k, DB15k, YAGO15k, Search Engine
Richpedia | 2020 | ✔︎✔︎ | Retrieval | Wikidata, Wikimedia, Search Engine
VisualSem | 2020 | ✔︎✔︎ | Retrieval | BabelNet
MarKG | 2023 | ✔︎✔︎ ✔︎ | Link-prediction, Reasoning | Wikipedia, Search Engine
AspectMMKG | 2023 | ✔︎✔︎ ✔︎ | Entity aspect linking | Wikipedia, Search Engine
VCTKG | 2023 | ✔︎✔︎ ✔︎ | Link-prediction | ConceptNet, WordNet
TIVA-KG | 2023 | ✔︎✔︎✔︎ | Link-prediction | Wikipedia, Search Engine
UKnow | 2024 | ✔︎✔︎✔︎ ✔︎ | Reasoning, Retrieval | Wikipedia, News
M2ConceptBase | 2024 | ✔︎✔︎✔︎ ✔︎ | VQA | Image-text Corpora, Encyclopedia, LLM
VAT-KG | 2025 | ✔︎✔︎✔︎✔︎ ✔︎ | AQA, VQA, AVQA | Video-Audio-Text Corpora, Encyclopedia, LLM
Comparison of VAT-KG with existing MMKGs.


Filtering stage | InternVid-FLT (10%) | AudioCaps | AVQA | VALOR-32k | Total
Original | 1,000,000 | 93,726 | 40,150 | 28,823 | 1,162,699
Audio Tagging | 389,965 | 86,578 | 36,144 | 22,521 | 535,208
Audio-Text | 15,490 | 77,964 | 27,864 | 15,250 | 136,568
Video-Text | 13,941 | 70,167 | 25,077 | 13,725 | 124,295
Final | 12,464 | 59,808 | 24,999 | 12,947 | 110,218
Data sample counts at each stage of the filtering process.
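The drop in counts from one row to the next corresponds to successive alignment checks applied to each clip. The snippet below is a hedged sketch of what such per-sample filtering could look like; `tag_audio`, `embed_audio`, `embed_video`, and `embed_text` are hypothetical wrappers around an audio tagger and a shared multimodal encoder, and the thresholds are illustrative rather than the published values.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_filters(sample: dict, embed_audio, embed_video, embed_text,
                   tag_audio, tau_at: float = 0.3, tau_vt: float = 0.3) -> bool:
    """Return True if a clip survives all three alignment filters."""
    # Stage 1 (Audio Tagging): drop clips whose audio carries no meaningful
    # acoustic event (e.g. silence or pure noise).
    if not tag_audio(sample["audio"]):
        return False
    text_emb = embed_text(sample["caption"])
    # Stage 2 (Audio-Text): the audio must be semantically close to the caption.
    if cosine(embed_audio(sample["audio"]), text_emb) < tau_at:
        return False
    # Stage 3 (Video-Text): the video must also match the caption.
    if cosine(embed_video(sample["video"]), text_emb) < tau_vt:
        return False
    return True
```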

Statistics

Statistics of VAT-KG. (a) VAT-KG contains diverse concepts that are represented through varied multimodal data. (b) Concept-level descriptions linked to VAT-KG concepts are sufficiently comprehensive and informative. (c) VAT-KG ensures diversity across categories.

MMKGs

(a) A non-concept-centric case involving various modalities. (b) A concept-centric case with limited modalities. (c) Our proposed VAT-KG, which is concept-centric and covers four modalities.


Example Triplet in VAT-KG

A multimodal triplet from VAT-KG, composed of video, audio, and text. Each head and tail is linked to a detailed concept-level description.

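For illustration, a hypothetical record following this schema might look as below; the field names and all concrete values here are our own illustration and are not taken from the released dataset files.

```python
# Hypothetical VAT-KG-style record: a (head, relation, tail) triplet grounded in a
# video-audio clip, with concept-level descriptions attached to head and tail.
example_triplet = {
    "head": "acoustic guitar",
    "relation": "produces",
    "tail": "strumming sound",
    "multimodal_context": {
        "video": "clips/example.mp4",      # placeholder path to the grounding video segment
        "audio": "clips/example.wav",      # placeholder path to the corresponding audio track
        "caption": "A street musician strums an acoustic guitar near a busy crosswalk.",
    },
    "head_description": "Illustrative concept description of 'acoustic guitar' retrieved "
                        "from an external knowledge base.",
    "tail_description": "Illustrative concept description of 'strumming sound' retrieved "
                        "from an external knowledge base.",
}
```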


Multimodal RAG Framework

An overview of the Multimodal RAG Framework. Given a query from any modality (audio, video, or text), (1) Modality-Agnostic Retrieval retrieves up to five semantically relevant triplets from VAT-KG based on embedding similarity; (2) the Retrieval Checker filters out misaligned triplets using the text encoder of the same multimodal foundation model; and (3) Augmented Generation is performed with MLLMs that support audio-visual understanding.
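A minimal sketch of the three steps is given below, assuming precomputed, L2-normalized triplet embeddings (`kg_embs`) and hypothetical callables `embed` (a modality-agnostic encoder), `embed_text` (its text encoder), and `generate` (an audio-visual MLLM); `k` and the checker threshold `tau` are illustrative values, not the framework's published settings.

```python
import numpy as np

def retrieve_and_answer(query, modality, kg_embs, kg_triplets,
                        embed, embed_text, generate, k: int = 5, tau: float = 0.25):
    """Sketch of modality-agnostic retrieval, retrieval checking, and generation."""
    # (1) Modality-Agnostic Retrieval: embed the query (audio, video, or text)
    #     into the shared space and take the top-k most similar triplets.
    q = embed(query, modality)
    q = q / np.linalg.norm(q)
    scores = kg_embs @ q                      # kg_embs assumed L2-normalized, shape (N, d)
    top = np.argsort(-scores)[:k]
    # (2) Retrieval Checker: re-embed each candidate's textual form with the
    #     text encoder of the same model and drop misaligned triplets.
    kept = []
    for i in top:
        text = " ".join(kg_triplets[i]["triplet"])      # "head relation tail"
        t = embed_text(text)
        if float(np.dot(q, t / np.linalg.norm(t))) >= tau:
            kept.append(kg_triplets[i])
    # (3) Augmented Generation: condition the audio-visual MLLM on the
    #     verified triplets and their concept-level descriptions.
    context = "\n".join(f"{t['triplet']}: {t['description']}" for t in kept)
    return generate(query, modality, context)
```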


Question-Answering Examples



Overall Performance

Overall performance. We report M.J. (Model-as-Judge) scores; higher is better. Note that M2ConceptBase does not contain audio and thus cannot be applied to the Audio-QA task. VAT-KG yields the highest performance improvements (highlighted in bold).
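For reference, a hedged sketch of a single Model-as-Judge scoring step is shown below; the prompt wording, the 0-5 scale, and the `judge` callable are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
def model_as_judge_score(question: str, reference: str, prediction: str, judge) -> float:
    """Ask a judge model to rate a predicted answer against the reference answer."""
    prompt = (
        "Rate how well the prediction answers the question compared to the "
        "reference, on a scale from 0 (completely wrong) to 5 (perfect). "
        "Reply with a single number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    # Take the first numeric token in the judge's reply as the score.
    for token in judge(prompt).split():
        try:
            return float(token)
        except ValueError:
            continue
    return 0.0
```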