Paper Title
TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis
Paper Authors
Abstract
Twitter is a well-known microblogging social site where users express their views and opinions in real time. As a result, tweets tend to contain valuable information. With the advancement of deep learning in natural language processing, extracting meaningful information from tweets has attracted growing interest among natural language researchers. Applying existing language representation models to extract information from Twitter often does not produce good results. Moreover, there are no existing language representation models for text analysis specific to the social media domain. Hence, in this article, we introduce two TweetBERT models, which are domain-specific language representation models pre-trained on millions of tweets. We show that the TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks, by more than 7% on each Twitter dataset. We also provide an extensive analysis by evaluating seven BERT models on 31 different datasets. Our results validate our hypothesis that continuing to train language models on a Twitter corpus improves performance on Twitter tasks.
