Towards Best Practices for Open Datasets for LLM Training
January 14, 2025
Authors: Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang, Leandro von Werra, Mitchell Baker, Julie Belião, Kasia Chmielinski, Marzieh Fadaee, Lisa Gutermuth, Hynek Kydlíček, Greg Leppert, EM Lewis-Jong, Solana Larsen, Shayne Longpre, Angela Oduor Lungati, Cullen Miller, Victor Miller, Max Ryabinin, Kathleen Siminyu, Andrew Strait, Mark Surman, Anna Tumadóttir, Maurice Weber, Rebecca Weiss, Lee White, Thomas Wolf
cs.AI
Abstract
Many AI companies are training their large language models (LLMs) on data
without the permission of the copyright owners. The permissibility of doing so
varies by jurisdiction: in the EU and Japan, for example, this is allowed
under certain restrictions, while in the United States, the legal landscape is
more ambiguous. Regardless of the legal status, concerns from creative
producers have led to several high-profile copyright lawsuits, and the threat
of litigation is commonly cited as a reason for the recent trend towards
minimizing the information shared about training datasets by both corporate and
public interest actors. This trend towards limiting information about training
data harms transparency, accountability, and innovation in the broader
ecosystem by denying researchers, auditors, and impacted individuals access to
the information needed to understand AI models.
While this could be mitigated by training language models on open access and
public domain data, at the time of writing, there are no such models (trained
at a meaningful scale) due to the substantial technical and sociological
challenges in assembling the necessary corpus. These challenges include
incomplete and unreliable metadata, the cost and complexity of digitizing
physical records, and the diverse set of legal and technical skills required to
ensure relevance and responsibility in a quickly changing landscape. Building
towards a future where AI systems can be trained on openly licensed data that
is responsibly curated and governed requires collaboration across legal,
technical, and policy domains, along with investments in metadata standards,
digitization, and fostering a culture of openness.