TY - JOUR
T1 - Raw acoustic-articulatory multimodal dysarthric speech recognition
AU - Yue, Zhengjun
AU - Loweimi, Erfan
AU - Cvetkovic, Zoran
AU - Barker, Jon
AU - Christensen, Heidi
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/6/10
Y1 - 2025/6/10
N2 - Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR utilises the data from other modalities to facilitate the task when a single acoustic modality proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data is under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling using real dysarthric speech articulatory information in combination with acoustic features, especially raw signal representations which are more informative than classic features, leading to learning representations tailored to dysarthric ASR. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with similar systems with hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed by using a statistical space distribution indicator called Maximum Articulator Motion Range (MAMR). Additionally, we used mutual information analysis to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically found optimal fusion level achieves a notable performance gain, leading to up to 7.6% and 12.8% relative word error rate (WER) reduction for dysarthric and typical speech, respectively.
AB - Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR utilises the data from other modalities to facilitate the task when a single acoustic modality proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data is under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling using real dysarthric speech articulatory information in combination with acoustic features, especially raw signal representations which are more informative than classic features, leading to learning representations tailored to dysarthric ASR. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with similar systems with hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed by using a statistical space distribution indicator called Maximum Articulator Motion Range (MAMR). Additionally, we used mutual information analysis to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically found optimal fusion level achieves a notable performance gain, leading to up to 7.6% and 12.8% relative word error rate (WER) reduction for dysarthric and typical speech, respectively.
UR - http://www.scopus.com/inward/record.url?scp=105009087227&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2025.101839
DO - 10.1016/j.csl.2025.101839
M3 - Article
SN - 0885-2308
VL - 95
JO - COMPUTER SPEECH AND LANGUAGE
JF - COMPUTER SPEECH AND LANGUAGE
M1 - 101839
ER -