SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Link to paper

The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract

- Present SODA: a million-scale, high-quality social dialogue dataset
- Train COSMO: a generalizable conversation agent
- Dialogues in SODA are more consistent, specific, and natural than those in prior datasets
- COSMO is more natural and consistent than the best-performing dialogue models
- Data, models, and code are made public

Paper Content

Introduction

- Progress on open-domain social dialogue agents has been hindered by a lack of diversity, scale, and quality in training corpora
- Most dialogue agents are trained on large amounts of unfiltered conversations or on highly curated/specialized crowdsourced dialogues
- Issues of unnaturalness, toxicity, incoherence, blandness, and lack of commonsense remain
- Introduce SODA, a million-scale dialogue dataset covering a wide variety of social interactions
- SODA is the largest publicly available open-domain conversation dataset
- Human evaluation shows SODA surpasses existing human-authored dialogue corpora
- Propose CO3, a framework for distilling conversations from large pre-trained language models
- CO3 adds context information to social commonsense knowledge step by step
- The COSMO conversation model trained on SODA outperforms existing dialogue models

Background

- Conversation is a form of social interaction
- Narratives and scripts are abstracted from social experiences
- Social experiences form our knowledge for explaining everyday events and inferring the mental states of others
- Attribution in social psychology has been studied in NLP as social commonsense

Commonsense knowledge graph

- Start with a commonsense knowledge graph
- Knowledge is represented by symbolic triples
- Use Atomic 10x as the knowledge graph
- Retrieve triples with social commonsense relations
- Prompt a pre-trained language model (PLM) to rewrite the commonsense knowledge into a narrative
- PLMs are known for their writing capabilities, especially in narratives

From narrative to conversation

- Inferring who is speaking in the dialogue is easier when the narrative contains person variables....
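
The knowledge-graph steps summarized above (start from symbolic triples, keep the ones with social commonsense relations, then prompt a PLM to turn each into a narrative) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Triple` class, the `TEMPLATES` table, the helper names, the example triples, and the prompt wording are all hypothetical, and no language model is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str      # event, possibly containing a person variable such as "PersonX"
    relation: str  # commonsense relation label
    tail: str      # inferred cause, effect, or mental state


# Assumption: an illustrative subset of social relation labels with
# hand-written templates. These are NOT the templates used in the paper.
TEMPLATES = {
    "xNeed": "{head}. Before that, PersonX needed {tail}.",
    "xWant": "{head}. Afterwards, PersonX wants {tail}.",
    "xReact": "{head}. As a result, PersonX feels {tail}.",
}


def retrieve_social(triples):
    """Keep only triples whose relation has a social template above."""
    return [t for t in triples if t.relation in TEMPLATES]


def to_sentence(triple):
    """Verbalize one symbolic triple as plain sentences."""
    return TEMPLATES[triple.relation].format(head=triple.head, tail=triple.tail)


def narrative_prompt(triple):
    """Build a text prompt asking a PLM to expand the triple into a narrative."""
    # Hypothetical instruction wording, chosen only for this sketch.
    instruction = "Rewrite this story with more specific details in two or three sentences:"
    return f"{to_sentence(triple)}\n{instruction}"


# Hypothetical example graph: one social triple and one non-social triple.
graph = [
    Triple("PersonX moves a step closer to the goal", "xNeed", "to take the first step"),
    Triple("oven", "ObjectUse", "bake a cake"),  # non-social, filtered out below
]

social = retrieve_social(graph)
prompt = narrative_prompt(social[0])
print(prompt)
```

The prompt string would then be sent to whichever PLM is used. Keeping the person variable (`PersonX`) in the text until this point matches the note above that person variables make it easier to infer who is speaking.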