I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token

Abstract

Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we propose a novel calibration method that can be used to combat hallucinations. We add a special [IDK] ("I don't know") token to the model's vocabulary and introduce an objective function that shifts probability mass to the [IDK] token for incorrect predictions. This approach allows the model to express uncertainty in its output explicitly. We evaluate our proposed method across multiple model architectures and factual downstream tasks. We find that models trained with our method are able to express uncertainty in places where they would previously make mistakes while suffering only a small loss of encoded knowledge. We further perform extensive ablation studies of multiple variations of our approach and provide a detailed analysis of the precision-recall tradeoff of our method.
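The abstract does not state the objective function, so the following is only a minimal sketch of one way an objective could shift probability mass to [IDK] on incorrect predictions. It assumes a soft-target cross-entropy in which a fixed fraction of the target mass moves from the gold token to [IDK] whenever the model's top prediction is wrong. The function names, the `shift` parameter, and the fixed-fraction rule are illustrative assumptions, not the formulation used in this work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def idk_loss(logits, gold_index, idk_index, shift=0.5):
    """Cross-entropy against a target that may include [IDK] (hypothetical).

    Assumption: if the top non-[IDK] prediction equals the gold token, the
    target is one-hot on the gold token. Otherwise `shift` of the target
    mass is placed on [IDK] and the remainder stays on the gold token.
    """
    probs = softmax(logits)
    # The model's prediction: most probable token, ignoring [IDK] itself.
    predicted = max(
        (i for i in range(len(probs)) if i != idk_index),
        key=lambda i: probs[i],
    )
    idk_mass = 0.0 if predicted == gold_index else shift
    loss = -(1.0 - idk_mass) * math.log(probs[gold_index])
    if idk_mass > 0.0:
        loss -= idk_mass * math.log(probs[idk_index])
    return loss


# Toy vocabulary of four tokens; index 3 plays the role of [IDK].
example_logits = [2.0, 0.5, 0.1, 0.0]
loss_when_correct = idk_loss(example_logits, gold_index=0, idk_index=3)
loss_when_wrong = idk_loss(example_logits, gold_index=1, idk_index=3)
```

Under this sketch, a correct prediction is trained exactly as in standard language modeling, while an incorrect one is rewarded for placing probability on [IDK]. That is the qualitative behavior the abstract describes.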
