SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
December 5, 2024
Authors: Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
cs.AI
Abstract
Table extraction from document images is a challenging AI problem, and
labelled data for many content domains is difficult to come by. Existing table
extraction datasets often focus on scientific tables due to the vast amount of
academic articles that are readily available, along with their source code.
However, there are significant layout and typographical differences between
tables found across scientific, financial, and other domains. Current datasets
often lack the words, and their positions, contained within the tables, instead
relying on unreliable OCR to extract these features for training modern machine
learning models on natural language processing tasks. Therefore, there is a
need for a more general method of obtaining labelled data. We present
SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our
hope is that our method of generating these synthetic tables is transferable to
other domains. To demonstrate the effectiveness of our dataset in training
models to extract information from table images, we create FinTabQA, a layout
large language model trained on an extractive question-answering task. We test
our model on real-world financial tables, compare it to a
state-of-the-art generative model, and discuss the results. We make the dataset,
model, and dataset generation code publicly available.