Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

73% of surveyed long-form summarization papers do not perform any human evaluation on model-generated summaries
LongEval is a set of guidelines for human evaluation of faithfulness in long-form summaries
LongEval addresses three challenges: achieving high inter-annotator agreement, minimizing annotator workload, and automatically aligning summary and source snippets
Switching to finer granularity of judgment reduces inter-annotator variance in faithfulness scores
Partial annotation of fine-grained units highly correlates with scores from a full annotation workload
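
The last two points can be illustrated with a minimal sketch. It assumes a FINE score is the fraction of summary units an annotator judges faithful, and that partial annotation means judging only a random subset of those units. The function names, the sampling scheme, and the toy judgments are hypothetical illustrations, not the paper's code or data.

```python
import random


def fine_score(unit_judgments):
    """Fraction of units judged faithful (True) in one summary."""
    if not unit_judgments:
        raise ValueError("need at least one judged unit")
    return sum(unit_judgments) / len(unit_judgments)


def partial_fine_score(unit_judgments, fraction, seed=0):
    """FINE score computed from a random subset of the units.

    `fraction` is the share of units shown to the annotator (assumed
    sampling scheme: uniform without replacement).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(unit_judgments)))
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sample = rng.sample(list(unit_judgments), k)
    return fine_score(sample)


# Made-up toy judgments for one summary split into 8 units.
judgments = [True, True, False, True, True, True, False, True]

full = fine_score(judgments)               # all 8 units judged
half = partial_fine_score(judgments, 0.5)  # only 4 units judged
print(f"full={full:.2f} partial={half:.2f}")
```

Over many summaries, the partial scores could then be correlated with the full scores to check how much annotation effort can be saved.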

Human evaluation is labor-intensive, expensive and difficult to design
A large number of judged examples is needed to draw statistically significant conclusions
Human evaluation is especially challenging for long sequences of generated text
Survey of 162 publications and preprints on long-form summarization
73% of papers do not perform human evaluation on long-form summaries
Lack of standardization in design decisions can impact inter-annotator agreement
Human evaluation is expensive, difficult and time-consuming
LONGEVAL guidelines for human evaluation of faithfulness in long-form summarization
Empirical evaluation of LONGEVAL on two long-form summarization datasets
Dataset with 3-way fine-grained human faithfulness judgments for 120 summaries

We present the LONGEVAL guidelines for standardized human evaluation of long-form summarization.
FINE-grained annotations have lower inter-annotator variance than COARSE-grained annotations.
Partially annotating a summary reduces annotator workload while maintaining accuracy.
Highlighting hints in the source document has limited usefulness for evaluating long-form summaries.
The experiments focus on faithfulness; other aspects of summarization evaluation, such as salience and coherence, were not studied.
Several variables were held constant across the experiments on each dataset; modifying them could change the results.
Non-uniform weighting of FINE units may be a good strategy.
Human evaluation data collected with FINE and COARSE annotation methods.
FINE annotations lead to narrower confidence intervals than COARSE annotations.
Lower variation in FINE-grained annotations means higher agreement.
Little difference in annotator performance with different types of source document highlight hints.
44 papers conducting human evaluation of long-form summarization.
10 papers use non-experts while 17 papers use expert annotators.
The authors recommend hiring freelancers on Upwork or experts who are well versed in the domain for annotation.
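
Two of the ideas above, variation across annotators and non-uniform weighting of FINE units, can be sketched as follows. This is an illustration under assumptions: spread is measured as the population standard deviation of per-annotator scores, and the weights (for example, unit length) are a hypothetical choice that the notes above do not specify. All names and toy values are made up.

```python
from statistics import pstdev


def weighted_fine_score(unit_judgments, weights=None):
    """Weighted fraction of faithful units; uniform weights by default."""
    if weights is None:
        weights = [1.0] * len(unit_judgments)
    if len(weights) != len(unit_judgments):
        raise ValueError("one weight per unit is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    faithful = sum(w for judged, w in zip(unit_judgments, weights) if judged)
    return faithful / total


def annotator_spread(scores):
    """Population standard deviation of per-annotator scores.

    Lower spread means the annotators agree more closely.
    """
    return pstdev(scores)


# Made-up toy judgments: three annotators, same summary, four units.
annotators = [
    [True, True, False, True],
    [True, True, True, True],
    [True, False, False, True],
]
scores = [weighted_fine_score(j) for j in annotators]
print("per-annotator scores:", scores)
print("spread:", round(annotator_spread(scores), 4))

# Hypothetical weights that count the first unit twice as much.
print("weighted:", weighted_fine_score(annotators[0], [2, 1, 1, 1]))
```

The same `annotator_spread` function could be applied to COARSE ratings rescaled to the [0, 1] range, which would allow the two annotation methods to be compared on equal footing.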