Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- 73% of papers do not perform any human evaluation on model-generated summaries
- LongEval is a set of guidelines for human evaluation of faithfulness in long-form summaries
- LongEval addresses challenges of high inter-annotator agreement, minimizing annotator workload, and automated alignment between summary and source snippets
- Switching to finer granularity of judgment reduces inter-annotator variance in faithfulness scores
- Partial annotation of fine-grained units highly correlates with scores from a full annotation workload
Paper Content
Introduction
- Human evaluation is labor-intensive, expensive and difficult to design
- Large number of judged examples needed to draw statistically significant conclusions
- Human evaluation is especially challenging for long sequences of generated text
- Survey of 162 publications and preprints on long-form summarization
- 73% of papers do not perform human evaluation on long-form summaries
- Lack of standardization in design decisions can impact inter-annotator agreement
- Human evaluation is expensive, difficult and time-consuming
- LONGEVAL guidelines for human evaluation of faithfulness in long-form summarization
- Empirical evaluation of LONGEVAL on two long-form summarization datasets
- Dataset with 3-way fine-grained human faithfulness judgments for 120 summaries
Survey of human evaluation practices
- Human evaluation of long-form summaries is rarely done
- Most studies do not follow reproducible practices
- Human evaluation setups lack standardization
- Human evaluation is challenging and expensive
- LONGEVAL guidelines proposed to improve efficiency and standardization
- Experiments conducted on two long-form summarization datasets
- COARSE annotations have lower inter-annotator agreement than FINE
- FINE annotations should be preferred for long-form summaries
- Partial annotation proposed to reduce annotator workload
- Partial annotation has high correlation to full annotation
Related work
- Recent work has focused on automatic evaluation methods for summarization
- Human evaluation is the gold standard for developing automatic metrics
- Pyramid method is a notable effort in this space
- Efficient Pyramid-like protocols have been used to collect large-scale datasets
- Focus on faithfulness and operate in a reference-free setting
- Focus on long-form summarization tasks like SQuALITY and PubMed
- Faithfulness in summarization differs from fact verification in three ways
Conclusion
- We present the LONGEVAL guidelines for standardized human evaluation of long-form summarization.
- FINE-grained annotations have lower inter-annotator variance than COARSE-grained annotations.
- Partially annotating a summary reduces annotator workload while maintaining accuracy.
- Highlighting hints in the source document has limited usefulness for evaluating long-form summaries.
- Experiments conducted on other aspects of summarization evaluation like salience and coherence.
- Variables kept constant among experiments on a dataset, but modifying them could change the results.
- Non-uniform weighing of FINE units may be a good strategy.
- Human evaluation data collected with FINE and COARSE annotation methods.
- FINE annotations lead to narrower confidence intervals than COARSE annotations.
- Lower variation in FINE-grained annotations means higher agreement.
- Little difference in annotator performance with different types of source document highlight hints.
- 44 papers conducting human evaluation of long-form summarization.
- 10 papers use non-experts while 17 papers use expert annotators.
- Recommend hiring freelancers on Upwork or experts who are well-versed with the domain for annotation.