SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction

December 5, 2024
Authors: Ethan Bradley, Muhammad Roman, Karen Rafferty, Barry Devereux
cs.AI

Abstract

Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables because of the vast number of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words contained within the tables and their positions, instead relying on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model using real-world financial tables, compare it to a state-of-the-art generative model, and discuss the results. We make the dataset, model, and dataset generation code publicly available.
