TY - JOUR
T1 - Code-switching input for machine translation: a case study of Vietnamese–English data
AU - Nguyen, Li
AU - Mayeux, Oliver
AU - Yuan, Zheng
N1 - Publisher Copyright: © 2023 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
PY - 2023/6/27
Y1 - 2023/6/27
AB - Multilingualism presents both a challenge and an opportunity for Natural Language Processing, with code-switching representing a particularly interesting problem for computational models trained on monolingual datasets. In this paper, we explore how code-switched data affects the task of Machine Translation, a task which only recently has started to tackle the challenge of multilingual data. We test three Machine Translation systems on data from the Canberra Vietnamese–English Codeswitching Natural Speech Corpus (CanVEC) and evaluate translation output using both automatic and human metrics. We find that, perhaps counter-intuitively, Machine Translation performs better on code-switching input than monolingual input. In particular, comparison of human and automatic evaluation suggests that codeswitching input may boost the semantic faithfulness of the translation output, an effect we term lexico-semantic enrichment. We also report two cases where this effect is most and least clear in Vietnamese–English, namely gender-neutral 3SG pronouns and interrogative constructions respectively. Overall, we suggest that Machine Translation, and Natural Language Processing more generally, ought to view multilingualism as an opportunity rather than an obstacle. Abbreviations: 1: First person; 2: Second person; 3: Third person; CLF: Classifier; COP: Copula; DET: Determiner; PL: Plural; POSS: Possessive marker; PRT: Particle; PST: Past tense; Q: Question marker; SG: Singular.
UR - http://www.scopus.com/inward/record.url?scp=85163681648&partnerID=8YFLogxK
U2 - 10.1080/14790718.2023.2224013
DO - 10.1080/14790718.2023.2224013
M3 - Article
JO - International Journal of Multilingualism
JF - International Journal of Multilingualism
ER -