Raw acoustic-articulatory multimodal dysarthric speech recognition

Zhengjun Yue*, Erfan Loweimi, Zoran Cvetkovic, Jon Barker, Heidi Christensen

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable, and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR exploits data from other modalities to facilitate the task when the acoustic modality alone proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data remains under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling that combines real dysarthric articulatory information with acoustic features, especially raw signal representations, which are more informative than classic hand-crafted features and enable learning representations tailored to dysarthric ASR. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with counterpart systems built on hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed using a statistical space distribution indicator called Maximum Articulator Motion Range (MAMR). Additionally, mutual information analysis is used to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and the interpretation of the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically found optimal fusion level achieves a notable performance gain, yielding up to 7.6% and 12.8% relative word error rate (WER) reduction for dysarthric and typical speech, respectively.
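The abstract mentions a mutual information analysis of the articulatory features' phonetic information content. As a minimal sketch of how such a per-dimension analysis could be carried out (an illustrative assumption, not the authors' actual pipeline), the snippet below estimates the mutual information between each articulatory feature dimension and frame-level phone labels with scikit-learn; the file names and array shapes are hypothetical.

# Minimal sketch (assumptions, not the authors' code): estimate how much
# phonetic information each articulatory feature dimension carries by
# computing its mutual information with frame-level phone labels.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical inputs:
#   X: (n_frames, n_articulatory_dims) EMA trajectories (e.g. tongue, lip, jaw positions)
#   y: (n_frames,) integer phone labels from a forced alignment
X = np.load("articulatory_features.npy")  # hypothetical file name
y = np.load("phone_labels.npy")           # hypothetical file name

# Nonparametric MI estimate (in nats) per feature dimension; larger values
# suggest dimensions that are more informative about phone identity.
mi = mutual_info_classif(X, y, discrete_features=False, random_state=0)

for dim, value in sorted(enumerate(mi), key=lambda kv: -kv[1]):
    print(f"articulatory dim {dim:2d}: MI ~ {value:.3f} nats")

Ranking dimensions by such an estimate is one way the reported feature-selection insights could be reproduced in spirit on other articulatory datasets.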
Original language: English
Article number: 101839
Number of pages: 35
Journal: Computer Speech and Language
Volume: 95
Early online date: 10 Jun 2025
DOIs
Publication status: E-pub ahead of print - 10 Jun 2025
