Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Building a dataset, BEAT, with 76 hours of multi-modal data from 30 speakers 32 million frame-level emotion and semantic relevance annotations Correlation of conversational gestures with facial expressions, emotions, and semantics Proposing a baseline model, Cascaded Motion Network (CaMN) Introducing a metric, Semantic Relevance Gesture Recall (SRGR) BEAT is the largest motion capture dataset for investigating human gestures Paper Content Related work Mo-cap and pseudo-label conversational gesture datasets exist Most common mo-cap dataset is 4-hour Trinity dataset Datasets for talking-face generation exist, but cannot be used for gesture synthesis Semantic or emotion-aware motion synthesis studied in action recognition and sign-language analysis/synthesis Baseline models for conversational gesture synthesis exist Efforts to improve performance of baseline models by input/output representation selection, adversarial training, and generative modeling techniques Probabilistic gesture generation enables generating diversity based on noise Beat: body-expression-audio-text dataset Dataset acquisition process described Text, emotion, and semantic relevance information annotation introduced Correlation between conversational gestures and emotions analyzed using BEAT Distribution of semantic relevance shown Motion capture system based on 16 synchronized cameras recording motion at 120 Hz Facial capture system uses ARKit with a depth camera on iPhone 12 Pro Audio recorded in 48KHz stereo Data acquisition BEAT is divided into conversation and self-talk sessions Speaker’s gestures are divided into four categories Topics are selected from 20 predefined topics Self-talk sessions consist of 120 1-minute recordings 8 emotions are covered in the dataset Proportion of languages and accents is strictly controlled Mainly English data, with some Chinese, Spanish and Japanese 30 speakers from different ethnicities Speakers asked to read answers proficiently and show natural, personal, daily style of conversational gestures Professional speaker instructs them to elicit corresponding emotion correctly Data annotation Used an Automatic Speech Recognizer (ASR) to obtain initial text for conversation session Used Montreal Forced Aligner (MFA) for temporal alignment of text with audio Confirmed 8-class emotion label of self-talk Annotators watched video with audio and gestures to perform frame-level annotation 600 annotators from Amazon Mechanical Turk (AMT) scored semantic relevance on a scale of 0-10 Data analysis BEAT collection and annotation enables analysis of correlations between conversational gestures and other modalities....