WikiNER-fr-gold: A Gold-Standard NER Corpus
October 29, 2024
Authors: Danrun Cao, Nicolas Béchet, Pierre-François Marteau
cs.AI
Abstract
In this article we address the quality of the WikiNER corpus, a multilingual
Named Entity Recognition corpus, and provide a consolidated version of it.
The annotation of WikiNER was produced in a semi-supervised manner, i.e., no
manual verification was carried out a posteriori. Such a corpus is called
silver-standard. In this paper we propose WikiNER-fr-gold, a revised version
of the French portion of WikiNER. Our corpus consists of a random 20% sample
of the original French sub-corpus (26,818 sentences with 700k tokens). We
start by summarizing the entity types included in each category in order to
define an annotation guideline, and then we proceed to revise the corpus.
Finally, we present an analysis of the errors and inconsistencies observed in
the WikiNER-fr corpus, and we discuss potential future work directions.