TY - JOUR
T1 - Impact of Annotation Modality on Label Quality and Model Performance in the Automatic Assessment of Laughter In-the-Wild
AU - Vargas-Quiros, Jose
AU - Cabrera-Quiros, Laura
AU - Oertel, Catharine
AU - Hung, Hayley
N1 - Publisher Copyright:
IEEE
PY - 2023
Y1 - 2023
N2 - Although laughter is known to be a multimodal signal, it is primarily annotated from audio. It is unclear how laughter labels may differ when annotated from modalities like video, which capture body movements and are relevant in in-the-wild studies. In this work we ask whether annotations of laughter are congruent across modalities, and compare the effect that labeling modality has on machine learning model performance. We compare annotations and models for laughter detection, intensity estimation, and segmentation, using a challenging in-the-wild conversational dataset with a variety of camera angles, noise conditions, and voices. Our study with 48 annotators revealed evidence for incongruity in the perception of laughter and its intensity between modalities, mainly due to lower recall in the video condition. Our machine learning experiments compared the performance of modern unimodal and multimodal models for different combinations of input modalities, training, and testing label modalities. In addition to the same input modalities rated by annotators (audio and video), we trained models with body acceleration inputs, which are robust to cross-contamination, occlusion, and perspective differences. Our results show that the performance of models with body movement inputs does not suffer when trained with video-acquired labels, despite their lower inter-rater agreement.
AB - Although laughter is known to be a multimodal signal, it is primarily annotated from audio. It is unclear how laughter labels may differ when annotated from modalities like video, which capture body movements and are relevant in in-the-wild studies. In this work we ask whether annotations of laughter are congruent across modalities, and compare the effect that labeling modality has on machine learning model performance. We compare annotations and models for laughter detection, intensity estimation, and segmentation, using a challenging in-the-wild conversational dataset with a variety of camera angles, noise conditions, and voices. Our study with 48 annotators revealed evidence for incongruity in the perception of laughter and its intensity between modalities, mainly due to lower recall in the video condition. Our machine learning experiments compared the performance of modern unimodal and multimodal models for different combinations of input modalities, training, and testing label modalities. In addition to the same input modalities rated by annotators (audio and video), we trained models with body acceleration inputs, which are robust to cross-contamination, occlusion, and perspective differences. Our results show that the performance of models with body movement inputs does not suffer when trained with video-acquired labels, despite their lower inter-rater agreement.
KW - Action recognition
KW - annotation
KW - Annotations
KW - Cameras
KW - continuous annotation
KW - Face recognition
KW - Labeling
KW - laughter
KW - laughter detection
KW - laughter intensity
KW - Machine learning
KW - mingling datasets
KW - Physiology
KW - Task analysis
UR - http://www.scopus.com/inward/record.url?scp=85161034458&partnerID=8YFLogxK
U2 - 10.1109/TAFFC.2023.3269003
DO - 10.1109/TAFFC.2023.3269003
M3 - Article
AN - SCOPUS:85161034458
SN - 1949-3045
SP - 1
EP - 17
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
ER -