Abstract
Acoustic modelling for automatic dysarthric speech recognition (ADSR) is a challenging task. Data deficiency is a major problem, and the substantial differences between typical and dysarthric speech complicate transfer learning. In this paper, we build acoustic models using the raw magnitude spectra of the source and filter components. The proposed multi-stream model consists of convolutional and recurrent layers. It allows for fusing the vocal tract and excitation components at different levels of abstraction and after per-stream pre-processing. We show that such multi-stream processing leverages these two information streams and helps the model normalise the speaker attributes and speaking style, which potentially leads to better handling of dysarthric speech with its large inter-speaker and intra-speaker variability. We compare the proposed system with various features, study the training dynamics, explore the usefulness of data augmentation and provide an interpretation of the learned convolutional filters. On the widely used TORGO dysarthric speech corpus, the proposed approach yields up to 1.7% absolute WER reduction for dysarthric speech compared with the MFCC baseline. Our best model reaches WERs of 40.6% and 11.8% for dysarthric and typical speech, respectively.
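To make the described architecture concrete, below is a minimal PyTorch sketch of a two-stream acoustic model in the spirit of the abstract. It is not the paper's implementation: the class names (`StreamFrontEnd`, `TwoStreamAcousticModel`), layer sizes, and target dimension are illustrative assumptions, and while the paper fuses the streams at different levels of abstraction, this sketch fuses once, after the per-stream convolutional front-ends, for brevity.

```python
# Hypothetical two-stream acoustic model: one convolutional front-end per
# raw magnitude spectrum (vocal-tract "filter" stream and excitation
# "source" stream), fusion by concatenation, then a recurrent layer over
# time and a per-frame classifier.
import torch
import torch.nn as nn


class StreamFrontEnd(nn.Module):
    """Per-stream convolutional pre-processing over (batch, 1, time, freq)."""

    def __init__(self, out_channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # pool along frequency only
        )

    def forward(self, x):
        return self.conv(x)


class TwoStreamAcousticModel(nn.Module):
    def __init__(self, n_freq=257, n_targets=40, hidden=256):
        super().__init__()
        self.filter_stream = StreamFrontEnd()
        self.source_stream = StreamFrontEnd()
        # channels x pooled frequency bins, concatenated over both streams
        feat_dim = 2 * 32 * (n_freq // 2)
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_targets)

    def forward(self, filter_spec, source_spec):
        # inputs: (batch, time, freq) raw magnitude spectra of each component
        f = self.filter_stream(filter_spec.unsqueeze(1))  # (B, C, T, F')
        s = self.source_stream(source_spec.unsqueeze(1))
        fused = torch.cat([f, s], dim=1)  # fuse after per-stream conv
        b, c, t, fr = fused.shape
        fused = fused.permute(0, 2, 1, 3).reshape(b, t, c * fr)
        out, _ = self.rnn(fused)
        return self.classifier(out)  # per-frame logits


# Usage with dummy inputs (4 utterances, 100 frames, 257 frequency bins):
model = TwoStreamAcousticModel()
filt = torch.randn(4, 100, 257).abs()  # e.g. smoothed spectral envelope
src = torch.randn(4, 100, 257).abs()   # e.g. residual/excitation magnitude
logits = model(filt, src)              # shape: (4, 100, 40)
```

Keeping a separate front-end per stream lets each branch learn pre-processing suited to its component (slowly varying formant structure versus harmonic excitation detail) before the fusion point, which is one plausible reading of the "per-stream pre-processing" described above.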
| Original language | English |
| --- | --- |
| Title of host publication | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing |
| Publisher | IEEE |
| Number of pages | 5 |
| Publication status | Accepted/In press - 7 May 2022 |