TY - CONF
T1 - Hybrid Network Feature Extraction for Depression Assessment from Speech
AU - Zhao, Ziping
AU - Li, Qifei
AU - Cummins, Nicholas
AU - Liu, Bin
AU - Wang, Haishuai
AU - Tao, Jianhua
AU - Schuller, Bjoern
PY - 2020/10
Y1 - 2020/10
N2 - A fast-growing area of mental health research is the search for speech-based objective markers for conditions such as depression. One vital challenge in the development of speech-based depression severity assessment systems is the extraction of depression-relevant features from speech signals. In order to deliver more comprehensive feature representation, we herein explore the benefits of a hybrid network that encodes depression-related characteristics in speech for the task of depression severity assessment. The proposed network leverages self-attention networks (SAN) trained on low-level acoustic features and deep convolutional neural networks (DCNN) trained on 3D Log-Mel spectrograms. The feature representations learnt in the SAN and DCNN are concatenated and average pooling is exploited to aggregate complementary segment-level features. Finally, support vector regression is applied to predict a speaker's Beck Depression Inventory-II score. Experiments based on a subset of the Audio-Visual Depressive Language Corpus, as used in the 2013 and 2014 Audio/Visual Emotion Challenges, demonstrate the effectiveness of our proposed hybrid approach.
UR - http://www.scopus.com/inward/record.url?scp=85098217319&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2396
DO - 10.21437/Interspeech.2020-2396
M3 - Conference paper
VL - 2020-October
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4956
EP - 4960
BT - Proceedings INTERSPEECH 2020
PB - International Speech Communication Association (ISCA)
ER -