Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • In a survey of 162 long-form summarization papers, 73% do not perform any human evaluation on model-generated summaries
  • LongEval is a set of guidelines for human evaluation of faithfulness in long-form summaries
  • LongEval addresses three challenges: achieving high inter-annotator agreement, minimizing annotator workload, and deciding whether automated alignment between summary and source snippets helps annotators
  • Switching to finer granularity of judgment reduces inter-annotator variance in faithfulness scores
  • Partial annotation of fine-grained units highly correlates with scores from a full annotation workload
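
The granularity claim above can be illustrated with a small sketch. It assumes that a fine-grained score is the fraction of summary units an annotator marks as faithful, and that a coarse judgment is a single 1-5 rating for the whole summary. Both the scoring functions and the judgments below are hypothetical and constructed for illustration; they are not data from the paper.

```python
from statistics import pstdev


def fine_score(unit_judgments):
    """Fraction of summary units one annotator judged faithful (True)."""
    return sum(unit_judgments) / len(unit_judgments)


# Hypothetical per-unit judgments from three annotators on one 5-unit summary.
fine_judgments = [
    [True, True, False, True, True],
    [True, True, False, True, False],
    [True, True, False, True, True],
]

# Hypothetical whole-summary ratings (1-5) from three annotators.
coarse_ratings = [4, 2, 5]

fine_scores = [fine_score(j) for j in fine_judgments]
# Rescale the 1-5 ratings to the 0-1 range so the two spreads are comparable.
coarse_scores = [(r - 1) / 4 for r in coarse_ratings]

# Spread across annotators: lower means the annotators agree more closely.
fine_spread = pstdev(fine_scores)
coarse_spread = pstdev(coarse_scores)

print(f"fine spread:   {fine_spread:.3f}")
print(f"coarse spread: {coarse_spread:.3f}")
```

The toy numbers only show how the two spreads would be compared; whether fine-grained judgments actually vary less is an empirical question that the paper's experiments address.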

Paper Content

Introduction

  • Human evaluation is labor-intensive, expensive and difficult to design
  • A large number of judged examples is needed to draw statistically significant conclusions
  • Human evaluation is especially challenging for long sequences of generated text
  • Survey of 162 publications and preprints on long-form summarization
  • 73% of papers do not perform human evaluation on long-form summaries
  • Lack of standardization in design decisions can impact inter-annotator agreement
  • Human evaluation is expensive, difficult and time-consuming
  • LongEval guidelines for human evaluation of faithfulness in long-form summarization
  • Empirical evaluation of LongEval on two long-form summarization datasets
  • Dataset with 3-way fine-grained human faithfulness judgments for 120 summaries

Survey of human evaluation practices

  • Human evaluation of long-form summaries is rarely done
  • Most studies do not follow reproducible practices
  • Human evaluation setups lack standardization
  • Human evaluation is challenging and expensive
  • LongEval guidelines proposed to improve efficiency and standardization
  • Experiments conducted on two long-form summarization datasets
  • COARSE annotations have lower inter-annotator agreement than FINE
  • FINE annotations should be preferred for long-form summaries
  • Partial annotation proposed to reduce annotator workload
  • Partial annotation has high correlation to full annotation
  • Recent work has focused on automatic evaluation methods for summarization
  • Human evaluation is the gold standard for developing automatic metrics
  • Pyramid method is a notable effort in this space
  • Efficient Pyramid-like protocols have been used to collect large-scale datasets
  • Focus on faithfulness and operate in a reference-free setting
  • Focus on long-form summarization tasks like SQuALITY and PubMed
  • Faithfulness in summarization differs from fact verification in three ways
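
The partial-annotation idea in the list above can be sketched as follows: score each summary from a random subset of its units, then check how well those scores track the scores from annotating every unit. The uniform random sampling, the 50% fraction, the helper names, and the simulated judgments are all assumptions made for illustration, not the paper's exact protocol.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_score(unit_judgments, fraction, rng):
    """Score a summary from a uniformly sampled subset of its units."""
    k = max(1, round(fraction * len(unit_judgments)))
    sampled = rng.sample(unit_judgments, k)
    return sum(sampled) / k


rng = random.Random(0)

# Simulated data: 30 summaries of 20 units each, with faithfulness
# rates spread between 0.3 and 1.0 (hypothetical, not from the paper).
summaries = []
for i in range(30):
    rate = 0.3 + 0.7 * i / 29
    summaries.append([rng.random() < rate for _ in range(20)])

full_scores = [sum(s) / len(s) for s in summaries]
partial_scores = [partial_score(s, 0.5, rng) for s in summaries]

r = pearson(full_scores, partial_scores)
print(f"correlation between partial and full scores: {r:.3f}")
```

Annotating half of the units halves the per-summary workload in this sketch; the correlation indicates how much ranking information is lost by doing so.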

Conclusion

  • We present the LongEval guidelines for standardized human evaluation of long-form summarization.
  • FINE-grained annotations have lower inter-annotator variance than COARSE-grained annotations.
  • Partially annotating a summary reduces annotator workload while maintaining accuracy.
  • Highlighting hints in the source document has limited usefulness for evaluating long-form summaries.
  • Similar experiments on other aspects of summarization evaluation, such as salience and coherence, are left to future work.
  • Many variables were kept constant across experiments on a dataset; modifying them could change the results.
  • Non-uniform weighting of FINE units may be a good strategy.
  • Human evaluation data was collected with both FINE and COARSE annotation methods.
  • FINE annotations lead to narrower confidence intervals than COARSE annotations.
  • Lower variation in FINE-grained annotations means higher agreement.
  • Little difference in annotator performance with different types of source document highlight hints.
  • Of the surveyed papers, 44 conduct human evaluation of long-form summarization.
  • 10 papers use non-expert annotators, while 17 papers use expert annotators.
  • The authors recommend hiring freelancers on Upwork or experts who are well-versed in the domain for annotation.
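
The remark above about non-uniform weighting of FINE units can be made concrete with a minimal sketch. The summary only says such weighting may be a good strategy and does not specify how weights would be chosen, so the function name and the importance weights below are hypothetical.

```python
def weighted_fine_score(unit_judgments, weights):
    """Weighted fraction of faithful units; weights are assumed non-negative."""
    if len(unit_judgments) != len(weights):
        raise ValueError("need exactly one weight per unit")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    faithful_weight = sum(w for judged, w in zip(unit_judgments, weights) if judged)
    return faithful_weight / total


# One annotator's judgments on a four-unit summary (hypothetical).
judgments = [True, False, True, True]

# Uniform weights reproduce the plain fraction of faithful units.
uniform_score = weighted_fine_score(judgments, [1, 1, 1, 1])

# Hypothetical importance weights: the first unit counts three times as much.
importance_score = weighted_fine_score(judgments, [3, 1, 1, 1])

print(uniform_score, importance_score)
```

With uniform weights the score reduces to the unweighted fraction of faithful units, so the weighted form is a strict generalization of it.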