BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
December 5, 2024
Authors: Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubbam Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandanna Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar
cs.AI
Abstract
Multimodal AI has the potential to significantly enhance
document-understanding tasks, such as processing receipts, understanding
workflows, extracting data from documents, and summarizing reports. Code
generation tasks that require long-structured outputs can also be enhanced by
multimodality. Despite this, the use of such models in commercial applications is often
constrained by limited access to training data and by restrictive licensing, which
hinders open access. To address these limitations, we introduce BigDocs-7.5M, a
high-quality, open-access dataset comprising 7.5 million multimodal documents
across 30 tasks. We use an efficient data curation process to ensure our data
is high-quality and license-permissive. Our process emphasizes accountability,
responsibility, and transparency through filtering rules, traceable metadata,
and careful content analysis. Additionally, we introduce BigDocs-Bench, a
benchmark suite with 10 novel tasks where we create datasets that reflect
real-world use cases involving reasoning over Graphical User Interfaces (GUI)
and code generation from images. Our experiments show that training with
BigDocs-Bench improves average performance by up to 25.8% over closed-source
GPT-4o in document reasoning and structured output tasks such as
Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a
preference for outputs from models trained on BigDocs over GPT-4o. This
suggests that BigDocs can help both academics and the open-source community
utilize and improve AI tools to enhance multimodal capabilities and document
reasoning. The project is hosted at https://bigdocs.github.io.