Paper title
Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout
Paper authors
Paper abstract
State-of-the-art solutions for Natural Language Processing (NLP) can capture a broad range of contexts, such as sentence-level context or, for short documents, document-level context. However, these solutions still struggle with longer, real-world documents in which information is encoded in the spatial structure of the document: page elements such as tables, forms, headers, openings, or footers; complex page layouts; or multiple pages. To encourage progress on deeper and more complex Information Extraction (IE), we introduce a new task (named Kleister) with two new datasets. Using both textual and structural layout features, an NLP system must find the most important information about various types of entities in long formal documents. We propose a Pipeline method as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa). Moreover, we checked the most popular PDF processing tools for text extraction (pdf2djvu, Tesseract, and Textract) in order to analyze the behavior of an IE system in the presence of errors introduced by these tools.
