Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

November 9, 2024
Authors: Xiaojun Wu, Junxi Liu, Huanyi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, Jian Guo
cs.AI

Abstract

As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing financial benchmarks often suffer from limited language and task coverage, as well as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative Chinese and English datasets across eight core financial NLP tasks. Developed from extensive open-source data collection and industry-specific demands, the benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations on specific tasks. This research not only provides a practical evaluation tool for financial large language models but also guides the development and optimization of future research. The source code for Golden Touchstone and the model weights of Touchstone-GPT are publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.
