Cognitive Load Quantification through Voice Analysis
Within the EatMaps project, we focused on predicting cognitive load from voice. Participants performed several tasks under different levels of cognitive load, and voice analysis was used to measure acoustic parameters of their speech in order to predict the level of cognitive load they experienced.
Contact: Pascal Hecker
Funded by the Federal Ministry for Economic Affairs and Energy.
A Cross-Dataset Study to Analyse Cognitive Load
Transformer-based deep learning models have emerged as a powerful tool for detecting cognitive load from voice, offering significant promise for healthcare applications where data scarcity remains a critical limitation. In this project, we demonstrated that wav2vec 2.0 architectures achieve state-of-the-art performance on established benchmarks while requiring substantially less training data than traditional deep learning methods typically do.
As part of the EatMaps project, we also recorded a custom dataset specifically designed to assess model generalisability across different laboratory settings and participant populations. It comprises voice recordings from twelve participants performing cognitive-load-inducing tasks, including reading span exercises and Stroop tests at varying difficulty levels, accompanied by subjective workload assessments using the NASA Task Load Index (NASA-TLX) questionnaire. The custom dataset contains 319 audio files totalling 33 minutes of speech and serves as a controlled comparison point to the established Cognitive Load with Speech and EGG (CLSE) dataset. All recordings were made under professional conditions with consistent audio specifications, enabling a rigorous cross-corpus evaluation of model robustness and generalisability.
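The NASA-TLX asks participants to rate six subscales: mental demand, physical demand, temporal demand, performance, effort, and frustration. The sketch below shows one way per-participant workload and its spread across participants could be summarised. It assumes the unweighted "raw TLX" variant (the plain mean of the six ratings) on a 0–100 scale; the project description does not state which scoring variant was used. The names `SUBSCALES`, `raw_tlx`, and `inter_participant_sd` are hypothetical helpers, and the ratings are invented placeholders, not data from the study.

```python
from statistics import mean, stdev

# The six NASA-TLX subscales; each rating is assumed to lie on a 0-100 scale.
SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted ("raw") TLX, assumed here: the mean of the six subscale ratings."""
    for name in SUBSCALES:
        if name not in ratings:
            raise ValueError(f"missing subscale: {name}")
        if not 0 <= ratings[name] <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {ratings[name]}")
    return mean(ratings[name] for name in SUBSCALES)


def inter_participant_sd(scores: list[float]) -> float:
    """Sample standard deviation of per-participant workload scores."""
    return stdev(scores)


if __name__ == "__main__":
    # Hypothetical ratings for three participants in one task condition.
    participants = [
        dict(mental_demand=80, physical_demand=10, temporal_demand=60,
             performance=40, effort=70, frustration=40),
        dict(mental_demand=50, physical_demand=0, temporal_demand=30,
             performance=20, effort=40, frustration=10),
        dict(mental_demand=90, physical_demand=20, temporal_demand=80,
             performance=60, effort=90, frustration=80),
    ]
    scores = [raw_tlx(p) for p in participants]
    print("raw TLX per participant:", scores)
    print("inter-participant SD:", round(inter_participant_sd(scores), 1))
```

A larger standard deviation for the same task condition means participants experienced that condition very differently, which makes a single per-condition load label less reliable.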
A key finding of this work concerns the difficulty of achieving consistent model performance across datasets, even when experimental protocols are intentionally designed to be comparable. While the transformer models achieved an Unweighted Average Recall (UAR) of 72.9% on the CLSE benchmark, their performance on the custom dataset fell to chance level. This suggests a potential mismatch between the assigned labels and the degree of cognitive load actually induced, rather than differences in acoustic recording conditions. The finding highlights the importance of precise task implementation and careful validation of workload induction mechanisms when collecting new datasets, as subtle variations in experimental design can fundamentally affect model transferability. Subjective workload ratings in our dataset varied considerably between participants (standard deviation of 15–16 points), and a visual analysis of the model embeddings showed that CLSE samples separated by cognitive load class, whereas samples from the custom dataset clustered together regardless of class. This indicates dataset-specific rather than universal acoustic-cognitive relationships.
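Unweighted Average Recall is the mean of the per-class recalls, so every class counts equally regardless of how many samples it contains. For K classes, a classifier that guesses at random or always predicts one class scores about 1/K, which is what "chance level" refers to. The minimal plain-Python sketch below computes UAR over the classes present in the ground truth, which is the same quantity scikit-learn reports as `balanced_accuracy_score` with its default `adjusted=False`. The function name and the three-class labels are hypothetical illustrations, not the project's code or data.

```python
from collections import Counter
from collections.abc import Hashable, Sequence


def unweighted_average_recall(y_true: Sequence[Hashable],
                              y_pred: Sequence[Hashable]) -> float:
    """Mean of the per-class recalls, over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    support = Counter(y_true)  # number of true samples per class
    # Number of correctly predicted samples per true class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / n for label, n in support.items()]
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Hypothetical, imbalanced three-class labels (not data from the study).
    y_true = ["low"] * 6 + ["medium"] * 3 + ["high"] * 3
    # A degenerate classifier that always predicts the majority class.
    y_pred = ["low"] * len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(f"accuracy = {accuracy:.3f}")  # 0.500
    print(f"UAR      = {unweighted_average_recall(y_true, y_pred):.3f}")  # 0.333
```

Always predicting the majority class reaches 50% accuracy on this imbalanced toy set but only 1/3 UAR, i.e. chance level for three classes. This insensitivity to class imbalance is why UAR is a common choice in computational paralinguistics.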
These findings underscore the value of our data collection efforts for computational paralinguistics research and have important implications for future dataset development in speech-based mental wellbeing and healthcare applications. The custom dataset's contribution extends beyond immediate model performance: it provides a concrete case study showing that future data collection should prioritise rigorous standardisation of cognitive load induction procedures, objective validation of workload levels, and systematic documentation of methodological differences from existing corpora.
Hecker, P., Kappattanavar, A. M., Schmitt, M., Moontaha, S., Wagner, J., Eyben, F., Schuller, B. W., & Arnrich, B. (2022). Quantifying cognitive load from voice using transformer-based models and a cross-dataset evaluation. In *2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)* (pp. 337–344).