Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Models for textual entailment have been applied to settings like fact-checking and question answering.
- We propose WiCE, a new dataset for verifying claims in text.
- WiCE is built on real-world claims and evidence from Wikipedia.
- Annotations are over sub-sentence units of the hypothesis, decomposed automatically by GPT-3.
- Real claims in WiCE involve challenging verification problems.
Paper Content
Introduction
- Textual entailment and natural language inference are longstanding problems in NLP
- SNLI dataset uses NLI to evaluate domain-general approaches to semantic representation
- NLI is used for validating answers from QA systems, evaluating generated summaries, and understanding knowledge-grounded dialog
- NLI is used for attribution and factual consistency
- NLI datasets target short premises
- WICE dataset is for verification of real-world claims in Wikipedia with fine-grained annotations
- CLAIM-SPLIT decomposes complex claims into simpler independent sub-claims
- WICE is a challenging problem
- Off-the-shelf models perform poorly at the claim level
- Chunk-level processing is a strong starting point for future systems
Background
Past nli datasets
- NLI datasets involve single-sentence premises and hypotheses
- Some datasets involve multi-sentence premises, but they are often short paragraphs
- NLI datasets favor inference types like hypernymy that are not necessarily matched to the attribution problem
- Fact checking datasets focus on the multihop verification problem, but only a small number of claims require multiple sentences as evidence
Hypothesis decomposition
- Breaking down complex hypotheses into smaller units has been studied
- Pyramid method proposed one way to decompose a summary into semantic content units
- Frameworks have looked at breaking statements down into propositions
- Factual consistency in summarization has looked at entailment of sub-sentence units
- Question-answer pairs used to isolate specific pieces of information
- SuperPAL and PropSegmEnt propose methods of retrieving sub-sentences for factuality evaluation
- WICE dataset collects real world claims from Wikipedia grounded in cited documents
- WICE provides annotation of entailment labels, supporting sentences, and non-supported tokens
Dataset preprocessing
- Uses same base set of Wikipedia claims as SIDE dataset
- Automatically parses cited articles’ HTML to extract article text
- Uses GPT-3 to decompose claims into sub-claims
- Filters out trivially entailed claims using RoBERTa-large
- Retains 16.3% of claims after filtering
Dataset collection
- Task interface presents evidence sentences and corresponding claim to crowd workers
- Annotation done at sub-claim level
- Entailment classification: annotator provides entailment label (SUPPORTED, PARTIALLY-SUPPORTED, NOT-SUPPORTED)
- Supporting sentences: if entailment label is SUPPORTED or PARTIALLY-SUPPORTED, annotator selects subset of evidence sentences
- Unsupported tokens in sub-claim: if entailment label is PARTIALLY-SUPPORTED, annotator highlights tokens not supported by evidence sentences
- Workers recruited using Amazon Mechanical Turk
- Inter-annotator agreement: Krippendorff’s α = 0.62
- Aggregating worker annotations: majority vote for entailment label, union of supporting sentences
- Claim-level labels from sub-claim annotations: union of sub-claim level supporting sentences
Tasks
- Intended to support three tasks: entailment, sentence retrieval, and non-supported tokens
- Given a claim and evidence articles, models must return a label, select a group of supporting sentences, and return a group of tokens not supported by the evidence
Dataset statistics
- Average of 3.0 sub-claims per claim
- Approximately 5.9K sub-claim level examples
- 56% of sub-claims supported by evidence sentences
- 33% of claims supported by evidence sentences
- 1.9 evidence sentences per sub-claim, 3.1 per claim
- 90% of claims and 64% of sub-claims require multiple supporting sentences
- 374 partially supported sub-claims in training data
- 12.8 tokens per sub-claim in training data
- 3.3 non-supported tokens per sub-claim in dev set
Analysis of phenomena
- WICE dataset contains contextualized claims that require multiple supporting sentences for verification
- WICE verification problems are categorized into compression, compression w/ decontextualization, paraphrasing, require calculation, require inference, and require background knowledge
- 50 randomly sampled claims in the development set of three datasets were manually analyzed
- Table 4 shows the estimated distribution of verification types in WICE, FEVER, and VitaminC
- CLAIM-SPLIT decomposes claims into sub-claims to provide a fine-grained view and improve annotation agreement
- Human annotation study was conducted to evaluate the performance of CLAIM-SPLIT
- 7.7% of the claims fail the completeness criterion and 2.3% fail the correctness criterion
- CLAIM-SPLIT is tested on VitaminC and PAWS datasets
- T5-large and T5-3B models are fine-tuned on ANLI dataset
- Table 5 shows that using CLAIM-SPLIT and aggregating sub-claim scores improves performance on the entailment classification task
- CLAIM-SPLIT is effective at simplifying the entailment classification problem and improving performance
- Better prompts and aggregation methods could lead to further improvement
Experiments
- How well do existing entailment models do in off-the-shelf settings when using the “stretching” paradigm?
- How much can fine-tuning on the dataset improve accuracy?
- Would retrieving relevant context sentences improve accuracy further?
Experimental setup
- Benchmarked performance of baseline entailment models on WICE dataset
- Considered multiple combinations of transformer encoders and entailment datasets
- Defaulted to T5-3B as it performed best
- Evaluated models using F1 score and accuracy
- Used “stretching” technique to make document-level judgments
- Fine-tuned T5 models on WICE dataset
- Off-the-shelf results showed sentence-level models performed less well than chunk-level models
- Fine-tuned results improved from off-the-shelf performance but still lower than human performance
- Used weighted sum of prediction probability for SUPPORTED and PARTIALLY-SUPPORTED classes as retrieval score
- Retrieval strategy with automatic retrieval not better than MAX strategy
- Oracle retrieved sentences substantially improved performance
Task: supporting sentence retrieval
- Performed intrinsic evaluation of supporting sentence retrieval task
- Retrieve sentences with retrieval scores larger than a threshold τ
- Take maximum F1 score over reference supporting sentences as score for each data
- Threshold used for test set is τ that returns maximum F1 score on dev set
- Investigated performance of retrieving correct sentences
- Model with context improves retrieval performance
- Performance lower than human performance
Conclusion
- WICE is a new dataset constructed from claims on Wikipedia
- WICE contains a variety of real-world entailment phenomena distinct from prior annotated datasets
- Decomposing sentences into sub-claims can be a valuable preprocessing step
- WICE consists of sentences with either 1 or 2 cited articles
- Performance of finetuning on WICE is saturating with current training procedure and models
- Filtering by using the baseline entailment classification model removes 45% of the data
- Median work time for each HIT was about 5 minutes
- Annotation interface includes cited websites and a paragraph and sentence in a Wikipedia article
- Supported and partially supported sub-claims have almost the same number of supporting sentences
- Word overlap between claims and evidence in WICE is competitively low
- Claims in different entailment labels have similar word overlap distributions
- GPT-3 used to split sentences into sub-claims
- Human annotators used to detect mistakes in dev and test sets
- Mistakes caused by removing first or intermediate clauses, parentheses, and “and”
- Mistakes caused by over-splitting and multiple decomposed sentences
- PyTorch and Hugging Face Transformers libraries used in implementation
- 4 NVIDIA Quadro RTX 8000 used in experiments
- Publicly available pretrained models in Hugging Face Hub used for evaluation
- Sentence- and chunk-level data used for training
- Chunks split into overlapping tokens shorter than 256
- Oracle train data includes chunks of all supporting sentences