Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Investigated how humans dub video content from one language to another
- Leveraged a novel corpus of 319.57 hours of video from 54 professionally produced titles
- Challenged assumptions made in qualitative and machine-learning literature on dubbing
- Argued for the importance of vocal naturalness and translation quality over isometric and lip-sync constraints
- Found influence of source-side audio on human dubs beyond the words of the translation

Paper Content

Introduction
- Considerable attention has been paid to the dubbing of video content from one language to another
- Human dubbing has been studied from a qualitative perspective
- Machine-learning practitioners have taken up the task of building multimodal systems for automatic dubbing
- Human dubbing involves a sequence of contributors, each with control over different aspects of the process
- A data-driven examination of the way humans actually perform this task is missing
- Human dubbing is a form of “constrained translation”
- Questions about isochrony, isometry, speech tempo, lip sync, translation quality, and source influence are explored
- Insights are provided on research directions to address weaknesses in current automatic dubbing approaches

Related work

Qualitative
- Dubbing is a type of constrained translation
- Dubs need to match the original video track
- Dubs need to maintain isochronic, phonetic, and kinesic synchrony
- Dubs need to be intelligible to the target language and culture
- Dubs should sound natural
- Dubs should preserve the semantic meaning of the source
- Dubbing is a form of non-literal translation called “transcreation”
- Scholars have investigated the role of power, ideology, identity, and similar considerations in dubbing

Automatic dubbing
- Automatic dub generation has been explored with a variety of constraints
- Lip sync constraints have been integrated into dub generation
- Adjusting mouth movements in the original video to match a dubbed audio track has been explored
- Isometric machine translation has been used to produce a translation with similar length to the input
- Controlling speaking rate in automatic dubbing systems to achieve prosodic alignment has been studied
- Time-boundary relaxation has been used to control speaking rate and speech fluency
- Integrating pause constraints directly into MT has been examined
- End-to-end dubbing has been explored

Empirical studies
- Studies have attempted to examine human dubbing through a quantitative lens
- Di Giovanni and Romero-Fresco found that audiences may not be as sensitive to lip sync as traditionally believed
- Karakanta et al....