Towards Best Practices for Open Datasets for LLM Training

January 14, 2025
Authors: Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang, Leandro von Werra, Mitchell Baker, Julie Belião, Kasia Chmielinski, Marzieh Fadaee, Lisa Gutermuth, Hynek Kydlíček, Greg Leppert, EM Lewis-Jong, Solana Larsen, Shayne Longpre, Angela Oduor Lungati, Cullen Miller, Victor Miller, Max Ryabinin, Kathleen Siminyu, Andrew Strait, Mark Surman, Anna Tumadóttir, Maurice Weber, Rebecca Weiss, Lee White, Thomas Wolf
cs.AI

Abstract

Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in jurisdictions such as the EU and Japan, this is allowed under certain restrictions, while in the United States the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend, among both corporate and public interest actors, towards minimizing the information shared about training datasets. This trend harms the broader ecosystem: it hinders transparency, accountability, and innovation by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
