Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Real-world applications of language models involve human-LM interaction. HALIE is a new framework to evaluate human-LM interaction. HALIE captures interactive process, subjective experience, and preference. Five tasks are designed to capture different forms of interaction. Non-interactive performance does not always result in better human-LM interaction. Paper Content Introduction Language models have advanced and can be used for a wide range of tasks Evaluation of language models is currently non-interactive Most benchmarks focus on non-interactive evaluation HALIE framework expands on non-interactive evaluation by considering the full process, first-person experience, and preference beyond quality LMs are already used interactively to brainstorm, paraphrase, reformulate, autocomplete, and write code Goal is to augment human capabilities rather than automate them Social dialogue Dialogue is a popular mode of interaction for language models We evaluate human-LM interaction in the context of open-ended dialogue about social situations Task: given a social scenario, users converse with a system until they choose to finish System logic: user input, possible actions, updating dialogue history User study: 189 crowd workers, 10 scenarios, survey questions Results: instruction tuning improves performance on most quality metrics, but not specificity Users may prefer to interact with a more specific LM Question answering Question answering is a task in NLP Users can query a system multiple times to answer a question System consists of multiple-choice question, user input, and system output 342 crowd workers recruited to answer questions with and without assistance from an LM Users with LM assistance generally outperformed an LM alone Count number of queries needed to answer each question as a proxy measurement for efficiency TextDavinci achieved highest accuracy while requiring least effort TextBabbage performed better than Davinci on most metrics Instruction tuned models were perceived most favorably in survey evaluation Crossword puzzles Crossword puzzles have been studied as a challenging task for AI systems Solving a crossword puzzle is a generative task requiring open-ended responses to clues Crossword puzzle task provides additional structure, whereby a user can check whether a candidate answer satisfies the lexical constraints of the puzzle Clues are often not straightforward and a user might need to reformulate the query to find the desired information System logic includes a state of a crossword puzzle, selected clue, user letters entered in the puzzle, dialogue history, and user input User study recruited 350 workers on Amazon Mechanical Turk, split across four language models and five puzzles Survey questions asked users to rank different qualities of the AI assistant on a 5-point Likert scale Results show that users significantly preferred Text-Davinci over other models with respect to helpfulness Misinformation was particularly pernicious using TextBabbage Short prompts exacerbate misinformation and toxicity Users demonstrate diverse engagement behavior Text summarization Text summarization is a long-standing problem in NLP We focus on human-LM interaction for single-document summarization System provides previous human-edited summaries as examples to the system to improve future summaries Task is to edit model-generated summary to be consistent, relevant, and coherent 964 documents randomly selected from XSum dataset 39 crowd workers recruited on Amazon Mechanical Turk Summary-level questions ask consistency, relevance, and coherence of the original and edited summaries Session-level questions evaluate users’ overall perceptions of the summarization system 100 documents randomly sampled and assessed by 3 different evaluators Metaphor generation Metaphors are used to communicate complex or abstract ideas Creating metaphors requires divergent, lateral thinking Prior work designed metaphor generation tools to help with ideation Task is to write metaphorical sentences that evoke a given metaphor System logic consists of seed metaphor, user sentence history, user input, and system suggestions 32 workers recruited on Amazon Mechanical Turk to come up with metaphorical sentences using the system 10 minutes given to each user to come up with as many sentences as possible Evaluation criteria from Gero and Chilton (2019) used Framework Introduce HALIE framework for evaluating human-LM interaction Describe tasks and system construction for studying human-LM interaction Use interaction traces to represent interaction process Propose dimensions and metrics for evaluating human-LM interaction Solving tasks interactively Studying human-LM interaction in the context of tasks Five tasks studied: social dialogue, question answering, crossword puzzles, text summarization, and metaphor generation Tasks span from goal-oriented to open-ended High coverage on common usages of LMs reported by Ouyang et al....