Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent progress in language model pre-training has improved Named Entity Recognition (NER).
  • NER has mainly been tested in well-formatted documents.
  • Social media adds complexity due to its noisy and dynamic nature.
  • A new NER dataset, TweetNER7, was constructed for Twitter.
  • Language model baselines were provided and an analysis was performed.
  • Three temporal aspects were analyzed: short-term degradation, fine-tuning strategies, and self-labeling.
  • TweetNER7 was released publicly.

Paper Content

Introduction

  • Named Entity Recognition (NER) is a longstanding NLP task
  • A common and successful type of NER system is obtained by fine-tuning pre-trained language models
  • NER models based on LM fine-tuning achieve over 90% F1 score on standard NER datasets
  • Specialized domains such as financial news, biochemical, or biomedical still pose additional challenges
  • Social media is one of the most challenging domains for NER
  • Social media texts are generally more noisy and less formal
  • Social media texts also exhibit (quick) temporal shifts in their semantics
  • Recent approaches have been proposed to deal with temporal shifts in social media
  • Proposed new NER dataset for Twitter (TweetNER7)
  • 11,382 annotated tweets in total, spanning seven entity types
  • Baseline results with language model fine-tuning show the difficulty of TweetNER7
  • Temporal analysis with different strategies including self-labeling
  • CoNLL2003 and OntoNotes5 are widely used NER datasets
  • WikiAnn and MultiNERD are multilingual NER datasets
  • FIN is an NER dataset of financial news
  • BioNLP2004 and BioCreative are constructed from scientific documents
  • BTC is a pioneering NER dataset for social media
  • WNUT2017 contains unseen entities from social media
  • Tweebank-NER annotates TweeBank with entity labels
  • TTC (Temporal Twitter Corpus) is a temporal NER dataset for Twitter
  • TweetNER7 is a new NER dataset with seven general entity types

Data collection

  • TweetNER7 is an NER dataset built by annotating a collection of tweets.
  • The tweet collection is from September 2019 to August 2021.
  • The tweets were filtered using weekly trending topics and other types of filtering.
  • The tweets were split into two periods: September 2019 to August 2020 and September 2020 to August 2021.
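
The two-period split described above can be sketched as a simple date filter. This is a minimal illustration, not the authors' pipeline: the period boundaries come from the summary, while the `(text, date)` record layout and the function names are assumptions made for the example.

```python
from datetime import date
from typing import Optional

# Period boundaries as stated in the summary (inclusive on both ends).
PERIOD_2020 = (date(2019, 9, 1), date(2020, 8, 31))
PERIOD_2021 = (date(2020, 9, 1), date(2021, 8, 31))


def assign_period(created: date) -> Optional[str]:
    """Return "2020" or "2021" for a tweet date, or None if out of range."""
    if PERIOD_2020[0] <= created <= PERIOD_2020[1]:
        return "2020"
    if PERIOD_2021[0] <= created <= PERIOD_2021[1]:
        return "2021"
    return None


def split_by_period(tweets):
    """Group (text, date) pairs by period, dropping out-of-range tweets."""
    splits = {"2020": [], "2021": []}
    for text, created in tweets:
        period = assign_period(created)
        if period is not None:
            splits[period].append(text)
    return splits


# Hypothetical records, used only to exercise the function.
tweets = [
    ("hypothetical tweet A", date(2019, 12, 24)),
    ("hypothetical tweet B", date(2020, 9, 1)),
    ("hypothetical tweet C", date(2022, 1, 1)),  # outside both periods
]
splits = split_by_period(tweets)
```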

Dataset annotation

  • Conducted manual annotation on Amazon Mechanical Turk
  • Split tweets into two periods: September 2019 to August 2020 and September 2020 to August 2021
  • Collected 36,000 annotations in total
  • Employed seven labels: person, location, corporation, creative work, group, product, and event
  • Pre-processed tweets before annotation
  • Quality control by taking agreement into account
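
The summary states only that annotator agreement was taken into account; it does not give the exact rule. One plausible reading is a vote threshold over identical spans, sketched below. The `(start_token, end_token, label)` span format and the `min_votes=2` threshold are assumptions for illustration.

```python
from collections import Counter


def aggregate_by_agreement(annotations, min_votes=2):
    """Keep spans that at least `min_votes` annotators marked identically.

    `annotations` holds one collection of spans per annotator. Each span is
    a (start_token, end_token, label) tuple. The threshold is an assumption.
    """
    # set(spans) ensures each annotator votes at most once per span.
    votes = Counter(span for spans in annotations for span in set(spans))
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Three hypothetical annotators labeling the same tweet.
annotator_spans = [
    {(0, 2, "person"), (5, 6, "location")},
    {(0, 2, "person")},
    {(0, 2, "person"), (5, 6, "event")},
]
agreed = aggregate_by_agreement(annotator_spans)
```

In this example only the `person` span survives, because the annotators disagree on the type of the second span.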

Statistics

  • Dataset contains 11,382 annotated tweets
  • Covers a wide range of entity types
  • Includes recent tweets from 2019-2021
  • Distribution of tweets is uniform over time
  • Uneven distribution of instances per year

Baseline results

  • Introduced baselines with language model fine-tuning on TweetNER7 in a temporal-shift setup
  • Used BERT, RoBERTa, BERTweet, and TimeLMs
  • Evaluated models using micro/macro F1 score and type-ignored F1 score
  • Used two-phase grid search to find best combination of hyperparameters
  • RoBERTa LARGE performs best across metrics
  • Overall metrics lower than standard NER datasets
  • TimeLM 2020 performs worse than other RoBERTa models
  • BERTweet performs better on 2020 test set
  • Person entity type has highest F1 score (around 80%)
  • Creative work and location have lowest F1 scores (around 40% and 60%)
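
The three metrics named above can be illustrated with a small span-level scorer. This is a sketch under stated assumptions rather than the paper's evaluation code: spans are `(sentence_id, start, end, label)` tuples, macro F1 averages over every label seen in gold or predictions, and type-ignored F1 is assumed to score span boundaries only, with the entity type dropped.

```python
from collections import defaultdict


def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def span_scores(gold, pred):
    """Micro, macro, and type-ignored F1 over (sent_id, start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: [0, 0, 0])  # label -> [tp, fp, fn]
    for span in pred:
        counts[span[3]][0 if span in gold else 1] += 1
    for span in gold - pred:
        counts[span[3]][2] += 1
    # Micro: pool the counts of all labels before computing F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: average the per-label F1 scores.
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    # Type-ignored: compare boundaries only, dropping the label.
    gold_b = {s[:3] for s in gold}
    pred_b = {s[:3] for s in pred}
    tp = len(gold_b & pred_b)
    ignored = f1(tp, len(pred_b) - tp, len(gold_b) - tp)
    return {"micro": micro, "macro": macro, "type_ignored": ignored}


# Hypothetical example: one span has the right boundaries but the wrong type.
gold = [(0, 0, 2, "person"), (0, 5, 6, "location"), (1, 1, 3, "creative work")]
pred = [(0, 0, 2, "person"), (0, 5, 6, "event"), (1, 1, 3, "creative work")]
scores = span_scores(gold, pred)
```

The mistyped span counts as both a false positive and a false negative for the typed metrics, but as a hit for the type-ignored one, which is why the latter is the highest of the three here.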

Temporal analysis

  • Compare temporal vs. random splits
  • Compare joint vs. continuous fine-tuning
  • Explore self-labeling as a solution to temporal shifts

Short-term temporal effect

  • Performance on TweetNER7 is tested without temporal shift
  • Training and validation sets are randomly sampled from September 2019 to August 2021
  • Test set is not changed
  • F1 scores improve on the 2021 test set and decrease on the 2020 test set
  • Benefit of having a human annotated training set from the test period is highlighted
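
The random-split setting can be sketched as pooling the data from both periods and resampling the training and validation sets, leaving the test set untouched. The function name, the fixed seed, and the way the validation size is passed in are assumptions for illustration.

```python
import random


def random_resplit(pool_2020, pool_2021, valid_size, seed=0):
    """Pool both periods, shuffle, and cut into new (train, valid) sets.

    The test set is not passed in because it stays unchanged.
    """
    pooled = list(pool_2020) + list(pool_2021)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pooled)
    return pooled[valid_size:], pooled[:valid_size]


# Hypothetical tweet identifiers from the two periods.
train, valid = random_resplit(["a", "b", "c"], ["d", "e", "f"], valid_size=2)
```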

Continuous vs. joint fine-tuning

  • The previous experiments showed that results differ depending on whether the training data comes from the same time period as the test data.
  • Aim of this analysis is to explore strategies to improve the original model.
  • Employed a continuous finetuning scheme, fine-tuning LMs on the 2020-set and then continuing on the 2021-set.
  • Table 7 shows results of all strategies for different language models.
  • Continuous fine-tuning provides the best results in terms of micro F1 and type-ignored F1 on the 2021 test set.
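
The difference between the two schedules can be made concrete with a small sketch. It assumes that joint fine-tuning means a single run on the concatenation of the 2020-set and the 2021-set; `fine_tune` is a hypothetical callback standing in for a real training routine, and the toy `record_run` below only logs what each run saw.

```python
def joint_fine_tuning(model, fine_tune, set_2020, set_2021):
    """Fine-tune once on the concatenation of both periods."""
    return fine_tune(model, list(set_2020) + list(set_2021))


def continuous_fine_tuning(model, fine_tune, set_2020, set_2021):
    """Fine-tune on the 2020-set first, then continue on the 2021-set."""
    model = fine_tune(model, list(set_2020))
    return fine_tune(model, list(set_2021))


def record_run(model, data):
    """Toy stand-in for training: the "model" is the list of runs it has seen."""
    return model + [tuple(data)]


joint = joint_fine_tuning([], record_run, ["a", "b"], ["c"])
continuous = continuous_fine_tuning([], record_run, ["a", "b"], ["c"])
```

With the toy callback, `joint` records one run over all three examples, while `continuous` records two consecutive runs, the second seeing only the newer data.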

Self-labeling

  • The strategies compared so far assume that a human-annotated training set from the test period is available
  • Improvements can be obtained when the time gap between training and test data is reduced
  • Requiring a large amount of human resources to annotate newer tweets is not practical
  • An alternative is to rely on tweets distantly annotated by the already fine-tuned model
  • Reproduced experiments on the TweetNER7 dataset, focusing on short-term temporal shift
  • Self-labeling does not help to mitigate temporal-shift in TweetNER7
  • Analysis of self-labeled tweets to find ratio of correct predictions within the retrieved predictions
  • Most frequent predictions are usually the same as the original predictions
  • Second most frequent predictions are on average the correct ones
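
The self-labeling idea and the frequency analysis above can be sketched in a few lines. This is an illustration under assumptions, not the procedure used in the paper: `predict` is a hypothetical stand-in for the fine-tuned model, `toy_predict` is a trivial tagger used only to run the example, and the label counts are invented.

```python
from collections import Counter


def self_label(predict, labeled, unlabeled):
    """Extend the human-labeled set with the model's own predictions."""
    pseudo = [(text, predict(text)) for text in unlabeled]
    return list(labeled) + pseudo


def rank_labels(predictions):
    """Labels predicted for one mention across retrieved tweets, most frequent first."""
    return [label for label, _ in Counter(predictions).most_common()]


def toy_predict(text):
    """Hypothetical tagger: marks the token "Paris" as a location."""
    return ["B-location" if tok == "Paris" else "O" for tok in text.split()]


labeled = [("Hello London", ["O", "B-location"])]
augmented = self_label(toy_predict, labeled, ["Visit Paris"])

# Invented predictions for a single mention across retrieved tweets.
ranked = rank_labels(
    ["product", "product", "corporation", "product", "corporation", "group"]
)
```

Comparing the first and second entries of `ranked` against the model's original prediction and the gold label is the kind of check the last two bullets describe.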

Conclusion

  • Constructed TweetNER7, a new NER dataset for Twitter
  • 11,382 tweets annotated with seven entity types
  • Tweets distributed uniformly over time from September 2019 to August 2021
  • Leveraged weekly trending topics to query tweets
  • Established baselines on TweetNER7 by fine-tuning standard and Twitter-specific LMs
  • Performed targeted temporal-related analyses
  • Self-labeling alone is not enough to mitigate temporal shift
  • Designed to study short-term temporal-shift
  • Future work to extend to languages other than English
  • Future work to add data from other social media platforms