Bridging the Data Provenance Gap Across Text, Speech and Video
December 19, 2024
Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
cs.AI
Abstract
Progress in AI is driven largely by the scale and quality of training data.
Despite this, there is a deficit of empirical analysis examining the attributes
of well-established datasets beyond text. In this work we conduct the largest
and first-of-its-kind longitudinal audit across modalities--popular text,
speech, and video datasets--from their detailed sourcing trends and use
restrictions to their geographical and linguistic representation. Our manual
analysis covers nearly 4000 public datasets between 1990-2024, spanning 608
languages, 798 sources, 659 organizations, and 67 countries. We find that
multimodal machine learning applications have overwhelmingly turned to
web-crawled, synthetic, and social media platforms, such as YouTube, for their
training sets, eclipsing all other sources since 2019. Secondly, tracing the
chain of dataset derivations, we find that while less than 33% of datasets are
restrictively licensed, over 80% of the source content in widely-used text,
speech, and video datasets carries non-commercial restrictions. Finally, counter
to the rising number of languages and geographies represented in public AI
training datasets, our audit demonstrates that measures of relative geographical and
multilingual representation have failed to significantly improve their coverage
since 2013. We believe the breadth of our audit enables us to empirically
examine trends in data sourcing, restrictions, and Western-centricity at an
ecosystem level, and that visibility into these questions is essential to
progress in responsible AI. As a contribution to ongoing improvements in
dataset transparency and responsible use, we release our entire multimodal
audit, allowing practitioners to trace data provenance across text, speech, and
video.