Quantifying Cognitive Load from Voice using Transformer-Based Models and a Cross-Dataset Evaluation. Hecker, Pascal; Kappattanavar, Arpita M.; Schmitt, Maximilian; Moontaha, Sidratul; Wagner, Johannes; Eyben, Florian; Schuller, Björn W.; Arnrich, Bert (2022). 337–344.
Cognitive load is frequently induced in laboratory setups to measure responses to stress, and its impact on voice has been studied in the field of computational paralinguistics. One dataset on this topic was provided in the Computational Paralinguistics Challenge (ComParE) 2014, and therefore offers great comparability. Recently, transformer-based deep learning architectures established a new state of the art and are gradually finding their way into the audio domain. In this context, we investigate the performance of popular transformer architectures in the audio domain on the ComParE 2014 dataset, and the impact of different pre-training and fine-tuning setups on these models. Further, we recorded a small custom dataset, designed to be comparable with the ComParE 2014 one, to assess cross-corpus model generalisability. We find that the transformer models outperform the challenge baseline, the challenge winner, and more recent deep learning approaches. Models based on the ‘large’ architecture perform well on the task at hand, while models based on the ‘base’ architecture perform at chance level. Fine-tuning on related domains (such as ASR or emotion) before fine-tuning on the targets yields no higher performance than models pre-trained only in a self-supervised manner. The generalisability of the models between datasets is more intricate than expected, as seen in unexpectedly low performance on the small custom dataset, and we discuss potential ‘hidden’ underlying discrepancies between the datasets. In summary, transformer-based architectures outperform previous attempts to quantify cognitive load from voice. This is promising, in particular for healthcare-related problems in computational paralinguistics applications, since datasets are sparse in that realm.
Voice Analysis for Neurological Disorder Recognition - A Systematic Review and Perspective on Emerging Trends. Hecker, Pascal; Steckhan, Nico; Eyben, Florian; Schuller, Björn W.; Arnrich, Bert in Frontiers in Digital Health (2022). 4
Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive and large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects to effectively obtain relevant information from participants. Therefore, we performed a systematic review to provide a high-level overview of practices across various neurological disorders and highlight emerging trends. PRISMA-based literature searches were conducted through PubMed, Web of Science, and IEEE Xplore to identify publications in which original (i.e., newly recorded) datasets were collected. Disorders of interest were psychiatric as well as neurodegenerative disorders, such as bipolar disorder, depression, and stress, as well as amyotrophic lateral sclerosis, Alzheimer's, and Parkinson's disease, and speech impairments (aphasia, dysarthria, and dysphonia). Of the 43 retrieved studies, Parkinson's disease is represented most prominently with 19 discovered datasets. Free speech and read speech tasks are most commonly used across disorders. Besides popular feature extraction toolkits, many studies utilise custom-built feature sets. Correlations of acoustic features with psychiatric and neurodegenerative disorders are presented. In terms of analysis, statistical analysis for significance of individual features is commonly used, as well as predictive modeling approaches, especially with support vector machines and a small number of artificial neural networks. An emerging trend and recommendation for future studies is to collect data in everyday life to facilitate longitudinal data collection and to capture the behavior of participants more naturally. Another emerging trend is to record modalities in addition to voice, which can potentially increase analytical performance.