Expect the Unexpected: FailSafe Long Context QA for Finance
February 10, 2025
Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh
cs.AI
Abstract
We propose a new long-context financial benchmark, FailSafeQA, designed to
test the robustness and context-awareness of LLMs against six variations in
human-interface interactions in LLM-based query-answer systems within finance.
We concentrate on two case studies: Query Failure and Context Failure. In the
Query Failure scenario, we perturb the original query to vary in domain
expertise, completeness, and linguistic accuracy. In the Context Failure case,
we simulate the uploads of degraded, irrelevant, and empty documents. We employ
the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained
rating criteria to define and calculate Robustness, Context Grounding, and
Compliance scores for 24 off-the-shelf models. The results suggest that
although some models excel at mitigating input perturbations, they must balance
robust answering with the ability to refrain from hallucinating. Notably,
Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained
strong baseline performance but encountered challenges in sustaining robust
predictions in 17% of test cases. On the other hand, the most robust model,
OpenAI o3-mini, fabricated information in 41% of tested cases. The results
demonstrate that even high-performing models have significant room for
improvement and highlight the role of FailSafeQA as a tool for developing LLMs
optimized for dependability in financial applications. The dataset is available
at: https://huggingface.co/datasets/Writer/FailSafeQA
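
To make the evaluation setup concrete, here is a minimal sketch of how per-case judge ratings might be rolled up into per-model Robustness and Context Grounding scores. The abstract does not give the scoring formulas, so everything specific here is an assumption for illustration: the case identifiers, the 1-6 rating scale, the pass threshold of 4, and the split that treats the baseline, the three query perturbations, and the degraded document as "should answer" cases while treating irrelevant and empty documents as "should refrain" cases.

```python
from collections import defaultdict

# Hypothetical case labels. The abstract names the variations but not these identifiers.
ROBUSTNESS_CASES = {
    "baseline",
    "misspelled_query",      # linguistic accuracy
    "incomplete_query",      # completeness
    "out_of_domain_query",   # domain expertise
    "degraded_context",      # degraded document upload
}
GROUNDING_CASES = {"irrelevant_context", "empty_context"}

PASS_THRESHOLD = 4  # assumed cut-off on an assumed 1-6 judge rating scale


def aggregate(ratings):
    """Turn (model, case, rating) triples into per-model pass rates.

    Returns {model: {"robustness": float | None, "context_grounding": float | None}}.
    A score is None when the model has no ratings in that group.
    """
    buckets = defaultdict(lambda: {"robustness": [], "context_grounding": []})
    for model, case, rating in ratings:
        passed = rating >= PASS_THRESHOLD
        if case in ROBUSTNESS_CASES:
            buckets[model]["robustness"].append(passed)
        elif case in GROUNDING_CASES:
            buckets[model]["context_grounding"].append(passed)
        else:
            raise ValueError(f"unknown case label: {case!r}")
    return {
        model: {
            name: (sum(flags) / len(flags) if flags else None)
            for name, flags in groups.items()
        }
        for model, groups in buckets.items()
    }


# Made-up ratings for a placeholder model, purely to exercise the function.
example_ratings = [
    ("model_a", "baseline", 6),
    ("model_a", "misspelled_query", 5),
    ("model_a", "incomplete_query", 4),
    ("model_a", "degraded_context", 3),
    ("model_a", "irrelevant_context", 5),
    ("model_a", "empty_context", 2),
]
scores = aggregate(example_ratings)
print(scores["model_a"])
```

Keeping the two groups separate reflects the trade-off the abstract describes: a model can pass most "should answer" cases while still failing the cases where the correct behavior is to decline.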