BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
December 5, 2024
Authors: Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar
cs.AI
Abstract
Multimodal AI has the potential to significantly enhance
document-understanding tasks, such as processing receipts, understanding
workflows, extracting data from documents, and summarizing reports. Code
generation tasks that require long, structured outputs can also be enhanced by
multimodality. Despite this, the use of such models in commercial applications
is often constrained by scarce training data and restrictive licensing, which
hinders open access. To address these limitations, we introduce BigDocs-7.5M, a
high-quality, open-access dataset comprising 7.5 million multimodal documents
across 30 tasks. We use an efficient data curation process to ensure our data
is high-quality and license-permissive. Our process emphasizes accountability,
responsibility, and transparency through filtering rules, traceable metadata,
and careful content analysis. Additionally, we introduce BigDocs-Bench, a
benchmark suite with 10 novel tasks where we create datasets that reflect
real-world use cases involving reasoning over Graphical User Interfaces (GUI)
and code generation from images. Our experiments show that training with
BigDocs-Bench improves average performance by up to 25.8% over closed-source
GPT-4o in document reasoning and structured output tasks such as
Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a
preference for outputs from models trained on BigDocs over GPT-4o. This
suggests that BigDocs can help both academics and the open-source community
utilize and improve AI tools to enhance multimodal capabilities and document
reasoning. The project is hosted at https://bigdocs.github.io.