Raw waveform acoustic modelling has recently received increasing attention. Unlike task-blind hand-crafted features, which may discard useful information, representations learned directly from the raw waveform are task-specific and can potentially retain all task-relevant information. In the context of automatic dysarthric speech recognition (ADSR), raw waveform acoustic modelling is under-explored owing to data scarcity. Parametric CNNs can mitigate this problem because they have notably fewer parameters and require less training data than conventional non-parametric CNNs. In this paper, we explore the usefulness of raw waveform acoustic modelling using various parametric CNNs for ADSR. Additionally, we investigate the properties of the learned filters and monitor the training dynamics of various models. Furthermore, we study the effectiveness of data augmentation and of multi-stream acoustic modelling, which combines non-parametric and parametric CNNs fed by hand-crafted and raw waveform features. Experimental results on the widely used TORGO dysarthric database show that the parametric CNNs significantly outperform the non-parametric CNNs on dysarthric speech (up to 2.7% and 1.8% absolute error reduction), reaching word error rates (WERs) as low as 35.9% and 11.9% for dysarthric and typical speech, respectively. Multi-stream acoustic modelling further improves performance, resulting in WERs as low as 33.2% and 10.3% for dysarthric and typical speech, respectively.
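The claim that parametric CNNs need far fewer parameters can be illustrated with a minimal sketch. It assumes a sinc-based band-pass parameterisation (in the style of SincNet) as one example of a parametric first layer; the abstract does not say which parametric forms are used, and the filter count, kernel size, sample rate, and initial cut-offs below are hypothetical values chosen for illustration only.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_bandpass(low_hz, band_hz, kernel_size, sample_rate):
    """Build one band-pass FIR kernel from two values: low cut-off and bandwidth."""
    high_hz = low_hz + band_hz
    centre = (kernel_size - 1) / 2
    kernel = []
    for i in range(kernel_size):
        t = (i - centre) / sample_rate  # time in seconds, centred on zero
        # Ideal band-pass = difference of two ideal low-pass responses.
        h = 2 * high_hz * sinc(2 * high_hz * t) - 2 * low_hz * sinc(2 * low_hz * t)
        # Hamming window to reduce ripple caused by truncating the sinc.
        w = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(h * w)
    return kernel

# Hypothetical layer configuration (not taken from the paper).
num_filters, kernel_size, sample_rate = 64, 129, 16000

# Hypothetical initial cut-offs, spaced linearly for simplicity.
lows = [50.0 + 100.0 * k for k in range(num_filters)]
bands = [100.0] * num_filters
bank = [sinc_bandpass(lo, bw, kernel_size, sample_rate) for lo, bw in zip(lows, bands)]

free_params_conventional = num_filters * kernel_size  # every filter tap is learned
free_params_parametric = num_filters * 2              # only (low_hz, band_hz) per filter
print(free_params_conventional, free_params_parametric)  # 8256 128
```

In this sketch the full kernels are still 129 taps long, but only two values per filter would be trained, which is the sense in which a parametric first layer has fewer free parameters than a conventional one of the same size.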
|Title of host publication
|Proceedings of INTERSPEECH 2022
|ISCA-INST SPEECH COMMUNICATION ASSOC
|Published - 2022