Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent progress in language model pre-training has improved Named Entity Recognition (NER).
- NER has mainly been tested in well-formatted documents.
- Social media adds complexity due to its noisy and dynamic nature.
- A new NER dataset, TweetNER7, was constructed for Twitter.
- Language model baselines were provided and an analysis was performed.
- Three temporal aspects were analyzed: short-term degradation, fine-tuning strategies, and self-labeling.
- TweetNER7 was released publicly.
Paper Content
Introduction
- Named Entity Recognition (NER) is a longstanding NLP task
- Common and successful type of NER system is achieved by fine-tuning pre-trained language models
- LM finetuning based NER models achieve over 90% F1 score in standard NER datasets
- Specialized domains such as financial news, biochemical, or biomedical still pose additional challenges
- Social media is one of the most challenging domains for NER
- Social media texts are generally more noisy and less formal
- Presence of (quick) temporal shifts in the text semantics
- Recent approaches to deal with temporal shifts in social media
- Proposed new NER dataset for Twitter (TweetNER7)
- 11,382 annotated tweets in total, spanning seven entity types
- Baseline results with language model finetuning show difficulty of TweetNER7
- Temporal analysis with different strategies including self-labeling
Related work
- CoNLL2003 and OntoNotes5 are widely used NER datasets
- WikiAnn and MultiNERD are multilingual NER datasets
- FIN is a NER dataset of financial news
- BioNLP2004 and BioCreative are constructed from scientific documents
- BTC is a pioneering NER dataset for social media
- WNUT2017 contains unseen entities from social media
- Twee-BankNER dataset annotates TweeBank with entity labels
- TTC is a temporal Twitter Corpus NER dataset
- TweetNER7 is a new NER dataset with seven general entity types
Data collection
- A computer science paper is discussing a NER dataset that annotates a tweet collection.
- The tweet collection is from September 2019 to August 2021.
- The tweets were filtered using weekly trending topics and other types of filtering.
- The tweets were split into two periods: September 2019 to August 2020 and September 2020 to August 2021.
Dataset annotation
- Conducted manual annotation on Amazon Mechanical Turk
- Split tweets into two periods: September 2019 to August 2020 and September 2020 to August 2021
- Collected 36,000 annotations in total
- Employed seven labels: person, location, corporation, creative work, group, product, and event
- Pre-processed tweets before annotation
- Quality control by taking agreement into account
Statistics
- Dataset contains 10k+ annotations
- Covers a wide range of entity types
- Includes recent tweets from 2019-2021
- Distribution of tweets is uniform over time
- Uneven distribution of instances per year
Baseline results
- Introduced baselines with language model fine-tuning on TweetNER7 in temporal-shift setup
- Used BERT, RoBERTa, BERTweet, and TimeLMs
- Evaluated models using micro/macro F1 score and type-ignored F1 score
- Used two-phase grid search to find best combination of hyperparameters
- RoBERTa LARGE best across metrics
- Overall metrics lower than standard NER datasets
- TimeLM 2020 performs worse than other RoBERTa models
- BERTweet performs better on 2020 test set
- Person entity type has highest F1 score (around 80%)
- Creative work and location have lowest F1 scores (around 40% and 60%)
Temporal analysis
- Compare temporal vs. random splits
- Compare joint vs. continuous fine-tuning
- Explore self-labeling as a solution to temporal shifts
Short-term temporal effect
- TweetNER7 performance is tested without temporal-shift
- Training and validation sets are randomly sampled from September 2019 to August 2021
- Test set is not changed
- F1 scores on 2021 test set are improved, F1 scores on 2020 test set are decreased
- Benefit of having a human annotated training set from the test period is highlighted
Continuous vs. joint fine-tuning
- Previous experiments showed differences between training and testing on the time period or not.
- Aim of this analysis is to explore strategies to improve the original model.
- Employed a continuous finetuning scheme, fine-tuning LMs on the 2020-set and then continuing on the 2021-set.
- Table 7 shows results of all strategies for different language models.
- Continuous fine-tuning provides best results in terms of micro F1 and type-ignored F1 in the 2021 test sets.
Self-labeling
- Compared different strategies when a human-annotated training dataset from the test period was considered
- Improvements can be obtained when the time between training and test data is reduced
- Not practical to require a large amount of human resources to annotate newer tweets
- Alternative approach to rely on distantly annotated tweets by the already fine-tuned model
- Reproduced experiments in TweetNER7 dataset focusing on short-term temporal shift
- Self-labeling does not help to mitigate temporal-shift in TweetNER7
- Analysis of self-labeled tweets to find ratio of correct predictions within the retrieved predictions
- Most frequent predictions are usually the same as the original predictions
- Second most frequent predictions are on average the correct ones
Conclusion
- Constructed TweetNER7, a new NER dataset for Twitter
- 11,382 tweets annotated with seven entity types
- Tweets distributed uniformly over time from September 2019 to August 2021
- Leveraged weekly trending topics to query tweets
- Established baselines on TweetNER7 by fine-tuning standard and Twitter-specific LMs
- Performed targeted temporal-related analyses
- Self-labeling not enough to mitigate temporal-shift
- Designed to study short-term temporal-shift
- Future work to extend to languages other than English
- Future work to add data from other social media platforms