Bridging the Data Provenance Gap Across Text, Speech and Video
December 19, 2024
Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
cs.AI
Abstract
Progress in AI is driven largely by the scale and quality of training data.
Despite this, there is a deficit of empirical analysis examining the attributes
of well-established datasets beyond text. In this work we conduct the largest
and first-of-its-kind longitudinal audit across modalities--popular text,
speech, and video datasets--from their detailed sourcing trends and use
restrictions to their geographical and linguistic representation. Our manual
analysis covers nearly 4000 public datasets between 1990-2024, spanning 608
languages, 798 sources, 659 organizations, and 67 countries. We find that
multimodal machine learning applications have overwhelmingly turned to
web-crawled, synthetic, and social media platforms, such as YouTube, for their
training sets, eclipsing all other sources since 2019. Secondly, tracing the
chain of dataset derivations, we find that while less than 33% of datasets are
restrictively licensed, over 80% of the source content in widely-used text,
speech, and video datasets carries non-commercial restrictions. Finally, counter
to the rising number of languages and geographies represented in public AI
training datasets, our audit demonstrates that measures of relative geographical
and multilingual representation have failed to significantly improve their coverage
since 2013. We believe the breadth of our audit enables us to empirically
examine trends in data sourcing, restrictions, and Western-centricity at an
ecosystem level, and that visibility into these questions is essential to
progress in responsible AI. As a contribution to ongoing improvements in
dataset transparency and responsible use, we release our entire multimodal
audit, allowing practitioners to trace data provenance across text, speech, and
video.
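The finding that permissively licensed datasets often inherit non-commercial restrictions from their sources can be illustrated by propagating restrictions along a dataset derivation chain. The sketch below is purely illustrative: the `Dataset` structure, its field names, and the example records are hypothetical and do not reflect the paper's actual audit schema.

```python
# Hypothetical sketch: propagating use restrictions along a dataset
# derivation chain. A dataset's own license may be permissive while an
# upstream source still carries non-commercial terms.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    name: str
    license_restrictive: bool                      # restriction on this dataset's own license
    sources: List["Dataset"] = field(default_factory=list)  # upstream datasets it derives from


def inherits_noncommercial(ds: Dataset) -> bool:
    """True if the dataset's own license is restrictive, or if any
    upstream source carries a restriction (checked recursively)."""
    if ds.license_restrictive:
        return True
    return any(inherits_noncommercial(src) for src in ds.sources)


# Illustrative chain: a permissively licensed dataset built on
# crawled platform content whose terms are non-commercial.
raw = Dataset("video-platform-crawl", license_restrictive=True)
derived = Dataset("curated-clips", license_restrictive=False, sources=[raw])

print(inherits_noncommercial(derived))  # True: the restriction is inherited
```

Under this model, license audits that look only at a dataset's own declared license (the "less than 33%" figure) understate the restrictions that surface once the full derivation chain is traced (the "over 80%" figure).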