Paper title
NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages
Authors
Abstract
At the center of the underlying issues that halt the advancement of Indonesian natural language processing (NLP) research, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their datasets. Furthermore, the few public datasets that we have are scattered across different platforms, which makes performing reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest aggregation of datasheets with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and bring NLP practitioners together to collaborate.
