Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models

Abstract

Because of the natural gap between Knowledge Graph (KG) structures and natural language, effectively integrating the holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant problem. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. First, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align with the format of language sentences. We then design KG instruction-following data by treating these learned codes as features that are fed directly to LLMs, thereby achieving seamless integration. Experimental results demonstrate that SSQR outperforms existing unsupervised quantization methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also achieve superior performance on KG link prediction and triple classification tasks, using only 16 tokens per entity instead of the thousands required by conventional prompting methods.
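
As a rough illustration of the idea of turning entity representations into a short sequence of discrete tokens, the sketch below uses plain nearest-neighbour vector quantization. It is not the SSQR method itself, whose encoder and training objective are not described in this abstract. Only the figure of 16 codes per entity comes from the abstract; the codebook size, embedding width, random stand-in vectors, the `<kg_code_N>` token naming, and the prompt wording are all assumptions made for the example.

```python
import random

NUM_CODES_PER_ENTITY = 16  # from the abstract: 16 tokens per entity
CODEBOOK_SIZE = 32         # hypothetical; not stated in the abstract
DIM = 4                    # hypothetical embedding width


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def quantize_entity(vectors, codebook):
    """Map each of an entity's vectors to one discrete code index."""
    return [nearest_code(v, codebook) for v in vectors]


def codes_to_tokens(codes):
    """Render code indices as special-token strings (hypothetical naming)."""
    return [f"<kg_code_{c}>" for c in codes]


def build_link_prediction_prompt(head, relation, tokens):
    """Assemble an instruction-style query; the wording is illustrative only."""
    return (
        f"Entity: {head}\n"
        f"Quantized representation: {''.join(tokens)}\n"
        f"Question: which entity completes ({head}, {relation}, ?)"
    )


rng = random.Random(0)
# Random codebook standing in for a learned one.
codebook = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
# Stand-in for the output of a trained KG encoder: 16 vectors for one entity.
entity_vectors = [
    [rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_CODES_PER_ENTITY)
]

codes = quantize_entity(entity_vectors, codebook)
tokens = codes_to_tokens(codes)
prompt = build_link_prediction_prompt("example_entity", "example_relation", tokens)
print(prompt)
```

In a setup like this, the entity contributes a fixed 16 tokens to the prompt regardless of how many neighbours it has in the graph, which is the contrast the abstract draws with prompting methods that serialize graph context as text.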
