Expect the Unexpected: FailSafe Long Context QA for Finance

Abstract

We propose FailSafeQA, a new long-context financial benchmark designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the upload of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but struggled to sustain robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA
