Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Models for textual entailment have been applied to settings like fact-checking and question answering.
We propose WiCE, a new dataset for verifying claims in text.
WiCE is built on real-world claims and evidence from Wikipedia.
Annotations are over sub-sentence units of the hypothesis, decomposed automatically by GPT-3.
Real claims in WiCE involve challenging verification problems.

Paper Content

Introduction

Textual entailment and natural language inference are longstanding problems in NLP
SNLI dataset uses NLI to evaluate domain-general approaches to semantic representation
NLI is used for validating answers from QA systems, evaluating generated summaries, and understanding knowledge-grounded dialog
NLI is used for attribution and factual consistency
NLI datasets target short premises
WICE dataset is for verification of real-world claims in Wikipedia with fine-grained annotations
CLAIM-SPLIT decomposes complex claims into simpler independent sub-claims
WICE poses a challenging verification problem
Off-the-shelf models perform poorly at the claim level
Chunk-level processing is a strong starting point for future systems

Background

Past NLI datasets

NLI datasets involve single-sentence premises and hypotheses
Some datasets involve multi-sentence premises, but they are often short paragraphs
NLI datasets favor inference types like hypernymy that are not necessarily matched to the attribution problem
Fact checking datasets focus on the multihop verification problem, but only a small number of claims require multiple sentences as evidence

Hypothesis decomposition

Breaking down complex hypotheses into smaller units has been studied
Pyramid method proposed one way to decompose a summary into semantic content units
Frameworks have looked at breaking statements down into propositions
Factual consistency in summarization has looked at entailment of sub-sentence units
Question-answer pairs used to isolate specific pieces of information
SuperPAL and PropSegmEnt propose methods of retrieving sub-sentences for factuality evaluation
WICE dataset collects real world claims from Wikipedia grounded in cited documents
WICE provides annotation of entailment labels, supporting sentences, and non-supported tokens

Dataset preprocessing

Uses same base set of Wikipedia claims as SIDE dataset
Automatically parses cited articles’ HTML to extract article text
Uses GPT-3 to decompose claims into sub-claims
Filters out trivially entailed claims using RoBERTa-large
Retains 16.3% of claims after filtering
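
The filtering step can be pictured with a minimal sketch. It assumes that "trivially entailed" means some single evidence sentence already entails the claim with high probability; the source does not spell out the criterion. The function name `filter_trivial_claims`, the `entail_prob` callable (standing in for the RoBERTa-large entailment model), and the `threshold` value are all hypothetical.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, List[str]]  # (claim, evidence sentences)


def filter_trivial_claims(
    examples: List[Example],
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Example]:
    """Keep only claims that no single evidence sentence entails confidently."""
    kept = []
    for claim, evidence_sentences in examples:
        # Highest entailment probability over all (sentence, claim) pairs.
        best = max(
            (entail_prob(sentence, claim) for sentence in evidence_sentences),
            default=0.0,
        )
        if best < threshold:
            kept.append((claim, evidence_sentences))
    return kept
```

In practice `entail_prob` would wrap a trained entailment model; any function that maps a (premise, hypothesis) pair to a probability fits this sketch.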

Dataset collection

Task interface presents evidence sentences and corresponding claim to crowd workers
Annotation done at sub-claim level
Entailment classification: annotator provides entailment label (SUPPORTED, PARTIALLY-SUPPORTED, NOT-SUPPORTED)
Supporting sentences: if entailment label is SUPPORTED or PARTIALLY-SUPPORTED, annotator selects subset of evidence sentences
Unsupported tokens in sub-claim: if entailment label is PARTIALLY-SUPPORTED, annotator highlights tokens not supported by evidence sentences
Workers recruited using Amazon Mechanical Turk
Inter-annotator agreement: Krippendorff’s α = 0.62
Aggregating worker annotations: majority vote for entailment label, union of supporting sentences
Claim-level labels from sub-claim annotations: union of sub-claim level supporting sentences
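
The aggregation rules above (majority vote for the label, union for the supporting sentences) can be sketched as follows. The function names and the tuple layout are hypothetical, and the tie-breaking behaviour is an assumption, since the source does not say how ties are resolved.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

WorkerAnnotation = Tuple[str, Set[int]]  # (entailment label, supporting sentence ids)


def aggregate_workers(annotations: List[WorkerAnnotation]) -> Tuple[str, Set[int]]:
    """Combine several workers' annotations for one sub-claim."""
    labels = [label for label, _ in annotations]
    # Majority vote; ties fall to the label seen first (an assumption).
    majority_label = Counter(labels).most_common(1)[0][0]
    # Union of every worker's supporting sentences.
    supporting: Set[int] = set()
    for _, sentence_ids in annotations:
        supporting |= set(sentence_ids)
    return majority_label, supporting


def claim_level_sentences(sub_claim_sentences: Iterable[Set[int]]) -> Set[int]:
    """Claim-level supporting sentences: union over all sub-claims."""
    merged: Set[int] = set()
    for sentence_ids in sub_claim_sentences:
        merged |= set(sentence_ids)
    return merged
```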

Tasks

Intended to support three tasks: entailment classification, supporting sentence retrieval, and detection of non-supported tokens
Given a claim and evidence articles, models must return a label, select a group of supporting sentences, and return a group of tokens not supported by the evidence

Dataset statistics

Average of 3.0 sub-claims per claim
Approximately 5.9K sub-claim level examples
56% of sub-claims supported by evidence sentences
33% of claims supported by evidence sentences
1.9 evidence sentences per sub-claim, 3.1 per claim
90% of claims and 64% of sub-claims require multiple supporting sentences
374 partially supported sub-claims in training data
12.8 tokens per sub-claim in training data
3.3 non-supported tokens per sub-claim in dev set

Analysis of phenomena

WICE dataset contains contextualized claims that require multiple supporting sentences for verification
WICE verification problems are categorized into six types: compression, compression with decontextualization, paraphrasing, calculation, inference, and background knowledge
50 randomly sampled claims from the development set of each of three datasets were manually analyzed
Table 4 shows the estimated distribution of verification types in WICE, FEVER, and VitaminC
CLAIM-SPLIT decomposes claims into sub-claims to provide a fine-grained view and improve annotation agreement
Human annotation study was conducted to evaluate the performance of CLAIM-SPLIT
7.7% of the claims fail the completeness criterion and 2.3% fail the correctness criterion
CLAIM-SPLIT is tested on VitaminC and PAWS datasets
T5-large and T5-3B models are fine-tuned on ANLI dataset
Table 5 shows that using CLAIM-SPLIT and aggregating sub-claim scores improves performance on the entailment classification task
CLAIM-SPLIT is effective at simplifying the entailment classification problem and improving performance
Better prompts and aggregation methods could lead to further improvement
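
To make "aggregating sub-claim scores" concrete, here is a minimal sketch of two plausible aggregation functions. The source does not say which function is used, so both `aggregate_min` and `aggregate_harmonic_mean` are illustrative assumptions, as is the convention that each score is an entailment probability in [0, 1].

```python
from typing import Sequence


def aggregate_min(scores: Sequence[float]) -> float:
    """A claim is only as well supported as its weakest sub-claim."""
    if not scores:
        raise ValueError("need at least one sub-claim score")
    return min(scores)


def aggregate_harmonic_mean(scores: Sequence[float]) -> float:
    """Softer alternative: still pulled down sharply by a low sub-claim score."""
    if not scores:
        raise ValueError("need at least one sub-claim score")
    if any(s == 0.0 for s in scores):
        return 0.0  # avoid division by zero; one unsupported sub-claim sinks the claim
    return len(scores) / sum(1.0 / s for s in scores)
```

Both functions reflect the intuition that a claim should count as supported only when all of its sub-claims are; a plain arithmetic mean would let well-supported sub-claims mask an unsupported one.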

Experiments

How well do existing entailment models do in off-the-shelf settings when using the “stretching” paradigm?
How much can fine-tuning on the dataset improve accuracy?
Would retrieving relevant context sentences improve accuracy further?

Experimental setup

Benchmarked performance of baseline entailment models on WICE dataset
Considered multiple combinations of transformer encoders and entailment datasets
Defaulted to T5-3B as it performed best
Evaluated models using F1 score and accuracy
Used “stretching” technique to make document-level judgments
Fine-tuned T5 models on WICE dataset
Off-the-shelf results showed sentence-level models performed worse than chunk-level models
Fine-tuning improved on off-the-shelf performance, but results remained below human performance
Used weighted sum of prediction probability for SUPPORTED and PARTIALLY-SUPPORTED classes as retrieval score
Classifying with automatically retrieved sentences was not better than the MAX strategy
Oracle retrieved sentences substantially improved performance
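
The document-level judgment and the retrieval score described above can be sketched roughly as follows. Two things are assumptions: that the MAX strategy takes the highest chunk-level score as the document-level score, and the weight `w_partial`, whose value the source does not give. All names are hypothetical.

```python
from typing import Dict, Sequence


def retrieval_score(probs: Dict[str, float], w_partial: float = 0.5) -> float:
    """Weighted sum of the SUPPORTED and PARTIALLY-SUPPORTED probabilities."""
    # w_partial is a hypothetical weight chosen for illustration.
    return probs["SUPPORTED"] + w_partial * probs["PARTIALLY-SUPPORTED"]


def stretch_max(chunk_scores: Sequence[float]) -> float:
    """Document-level score under MAX: the best score over all chunks."""
    if not chunk_scores:
        raise ValueError("need at least one chunk score")
    return max(chunk_scores)


# Usage: score each chunk against the claim, then aggregate.
chunk_probs = [
    {"SUPPORTED": 0.1, "PARTIALLY-SUPPORTED": 0.2, "NOT-SUPPORTED": 0.7},
    {"SUPPORTED": 0.6, "PARTIALLY-SUPPORTED": 0.3, "NOT-SUPPORTED": 0.1},
]
document_score = stretch_max([retrieval_score(p) for p in chunk_probs])
```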

Task: supporting sentence retrieval

Performed intrinsic evaluation of supporting sentence retrieval task
Retrieve sentences with retrieval scores larger than a threshold τ
For each example, the score is the maximum F1 over the reference sets of supporting sentences
The threshold τ used on the test set is the one that maximizes F1 on the dev set
Investigated performance of retrieving correct sentences
Model with context improves retrieval performance
Performance lower than human performance
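
The evaluation procedure above (threshold the retrieval scores, take the best F1 over the reference sets, tune τ on the dev set) can be sketched like this. The function names and the data layout are hypothetical, and averaging F1 over dev examples is an assumption about how the dev-set F1 is computed.

```python
from typing import List, Sequence, Set, Tuple


def f1(predicted: Set[int], gold: Set[int]) -> float:
    """Set-level F1 between predicted and gold sentence ids."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def retrieve(scores: Sequence[float], tau: float) -> Set[int]:
    """Indices of sentences whose retrieval score is larger than tau."""
    return {i for i, score in enumerate(scores) if score > tau}


def example_f1(scores: Sequence[float], references: List[Set[int]], tau: float) -> float:
    """Best F1 over all reference sets of supporting sentences."""
    predicted = retrieve(scores, tau)
    return max(f1(predicted, reference) for reference in references)


def pick_tau(dev: List[Tuple[Sequence[float], List[Set[int]]]],
             candidates: Sequence[float]) -> float:
    """Choose the threshold with the highest mean F1 on the dev set."""
    return max(
        candidates,
        key=lambda tau: sum(example_f1(s, r, tau) for s, r in dev) / len(dev),
    )
```

The threshold returned by `pick_tau` on the dev set would then be applied unchanged to the test set.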

Conclusion

WICE is a new dataset constructed from claims on Wikipedia
WICE contains a variety of real-world entailment phenomena distinct from prior annotated datasets
Decomposing sentences into sub-claims can be a valuable preprocessing step
WICE consists of sentences with either 1 or 2 cited articles
Performance of finetuning on WICE is saturating with current training procedure and models
Filtering by using the baseline entailment classification model removes 45% of the data
Median work time for each HIT was about 5 minutes
Annotation interface includes cited websites and a paragraph and sentence in a Wikipedia article
Supported and partially supported sub-claims have almost the same number of supporting sentences
Word overlap between claims and evidence in WICE is competitively low
Claims in different entailment labels have similar word overlap distributions
GPT-3 used to split sentences into sub-claims
Human annotators used to detect mistakes in dev and test sets
Mistakes caused by removing first or intermediate clauses, parentheses, and “and”
Mistakes caused by over-splitting and multiple decomposed sentences
PyTorch and Hugging Face Transformers libraries used in implementation
4 NVIDIA Quadro RTX 8000 used in experiments
Publicly available pretrained models in Hugging Face Hub used for evaluation
Sentence- and chunk-level data used for training
Evidence split into overlapping chunks of fewer than 256 tokens
Oracle train data includes chunks of all supporting sentences
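
As an illustration of the chunking mentioned above, the following minimal sketch splits a token list into overlapping windows of fewer than 256 tokens. The sliding-window scheme and the `stride` value are assumptions; the source states only that chunks overlap and stay under 256 tokens.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(tokens: Sequence[T], max_len: int = 255,
                      stride: int = 128) -> List[List[T]]:
    """Split tokens into overlapping windows of at most max_len tokens."""
    # max_len=255 keeps every chunk shorter than 256; stride is hypothetical.
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the input
        start += stride
    return chunks
```

A stride smaller than `max_len` makes consecutive chunks share tokens, so a supporting sentence that straddles one chunk boundary still appears whole in a neighbouring chunk.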
