Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract 73% of papers do not perform any human evaluation on model-generated summaries LongEval is a set of guidelines for human evaluation of faithfulness in long-form summaries LongEval addresses challenges of high inter-annotator agreement, minimizing annotator workload, and automated alignment between summary and source snippets Switching to finer granularity of judgment reduces inter-annotator variance in faithfulness scores Partial annotation of fine-grained units highly correlates with scores from a full annotation workload Paper Content Introduction Human evaluation is labor-intensive, expensive and difficult to design Large number of judged examples needed to draw statistically significant conclusions Human evaluation is especially challenging for long sequences of generated text Survey of 162 publications and preprints on long-form summarization 73% of papers do not perform human evaluation on long-form summaries Lack of standardization in design decisions can impact inter-annotator agreement Human evaluation is expensive, difficult and time-consuming LONGEVAL guidelines for human evaluation of faithfulness in long-form summarization Empirical evaluation of LONGEVAL on two long-form summarization datasets Dataset with 3-way fine-grained human faithfulness judgments for 120 summaries Survey of human evaluation practices Human evaluation of long-form summaries is rarely done Most studies do not follow reproducible practices Human evaluation setups lack standardization Human evaluation is challenging and expensive LONGEVAL guidelines proposed to improve efficiency and standardization Experiments conducted on two long-form summarization datasets COARSE annotations have lower inter-annotator agreement than FINE FINE annotations should be preferred for long-form summaries Partial annotation proposed to reduce annotator workload Partial annotation has high correlation to full annotation Related work Recent work has focused on automatic evaluation methods for summarization Human evaluation is the gold standard for developing automatic metrics Pyramid method is a notable effort in this space Efficient Pyramid-like protocols have been used to collect large-scale datasets Focus on faithfulness and operate in a reference-free setting Focus on long-form summarization tasks like SQuALITY and PubMed Faithfulness in summarization differs from fact verification in three ways Conclusion We present the LONGEVAL guidelines for standardized human evaluation of long-form summarization....