Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

Abstract

As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing financial benchmarks often suffer from limited language and task coverage, as well as from low-quality datasets and poor adaptability to LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets in both Chinese and English across eight core financial NLP tasks. Developed from extensive open-source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research not only provides financial large language models with a practical evaluation tool but also guides the development and optimization of future research. The source code for Golden Touchstone and the model weights of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.

