DIALECT-DRIVEN ASR ERRORS: PHONETIC MISMATCH IN SOUTH ASIAN AMERICAN ENGLISH SPEECH
1Muhammad Ansar
MA Data and Discourse Studies, Department of History and Social Sciences, Technische Universität Darmstadt, Germany
Email: muhammad.ansar@stud.tu-darmstadt.de
ORCID: https://orcid.org/0009-0007-0649-2033

1*Anosh Rehman
Department of English Linguistics and Language Studies, University of Sargodha
Email: anoshhamza338@gmail.com
ORCID: https://orcid.org/0009-0001-0412-0160

2Hamza Nawaz Chaudhary
Department of CS, University of Sargodha
Email: hamnaw66@gmail.com

Official Link: https://jalt.com.pk/index.php/jalt/article/view/2124

Abstract
Automatic Speech Recognition (ASR) systems have reached near-human accuracy on Mainstream American English (MAE) but still make systematic errors on non-mainstream varieties. This paper examines how ASR errors arise in South Asian American English (SAAE). It argues that the errors stem from a systematic discrepancy between the phonetic realizations of SAAE speakers and the acoustic-phonetic distributions encoded in MAE-trained models (the Phonetic Mismatch Hypothesis). A convergent mixed-methods design combined controlled speech elicitation with quantitative error analysis. A corpus of 40 SAAE speech samples was compiled to reflect major segmental and suprasegmental features, such as variation in vowel quality, consonant cluster reduction, epenthesis, and prosodic transfer. A pretrained Whisper ASR model was evaluated against reference transcriptions, and Word Error Rate (WER) was calculated. A total of 170 errors were identified and classified as substitutions (82; 48.2%), deletions (52; 30.6%), and insertions (36; 21.2%). SAAE speech yielded a WER of about 43%, compared with about 6% for MAE speech, and a fine-tuned adaptation condition brought partial improvement (WER ≈ 18%).
Types of errors were not randomly distributed across phonetic features: substitutions were driven by vowel changes and consonant replacements; deletions were associated with consonant clusters; and most insertions stemmed from prosodic and rhythmic variation, specifically syllable-timed rhythm and epenthesis. These findings support the Phonetic Mismatch Hypothesis, which attributes ASR errors to linguistic behavior rather than to failures of the system. The study contributes a phonologically grounded description of ASR bias and proposes that training and evaluation models factor in dialect-specific phonetic knowledge.

Keywords: Asian American English, ASR bias, phonetic variation, speech recognition errors, dialect mismatch, word error rate, linguistic equity, corpus, computational linguistics.

1. Introduction
ASR systems are now central to human-computer interaction and underpin applications such as virtual assistants, transcription systems, and voice-controlled interfaces. Deep learning has advanced rapidly, but the resulting systems are not equally effective across groups of speakers. Error rates tend to be higher for speakers of non-mainstream dialects, which raises concerns about the fairness, accessibility, and linguistic bias of speech technologies (Koenecke et al., 2020).

Asian American English (AAE) is a varied group of English varieties shaped by multilingualism and exposure to the mother tongue. Although ASR systems are typically trained on large amounts of Mainstream American English (MAE), they tend not to generalize to other dialects, including AAE (Errattahi et al., 2018). The existing literature has treated this issue predominantly as a computational limitation and has focused on model adaptation and on scaling up datasets. Less attention has been paid to the phonetic processes underlying recognition errors.
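The Word Error Rate and the substitution, deletion, and insertion breakdown reported in the abstract can be illustrated with a minimal sketch. It assumes the conventional definition WER = (S + D + I) / N, where N is the number of words in the reference transcription, and a plain word-level edit-distance alignment. The function names `count_errors` and `error_shares` and the example sentences are hypothetical and are not taken from the study's corpus or tooling.

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        """WER = (S + D + I) / N, with N the number of reference words."""
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words if self.ref_words else 0.0


def count_errors(reference: str, hypothesis: str) -> ErrorCounts:
    """Align two transcripts word by word and count S, D, and I errors."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # cell[i][j] = (edits, S, D, I) for aligning ref[:i] with hyp[:j]
    cell = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cell[i][0] = (i, 0, i, 0)  # only deletions possible
    for j in range(1, len(hyp) + 1):
        cell[0][j] = (j, 0, 0, j)  # only insertions possible
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # match, no cost
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cell[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cell[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Keep the cheapest path; ties prefer the diagonal move.
            cell[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, s, d, n = cell[len(ref)][len(hyp)]
    return ErrorCounts(s, d, n, len(ref))


def error_shares(substitutions: int, deletions: int, insertions: int) -> dict:
    """Percentage share of each error type, rounded to one decimal place."""
    total = substitutions + deletions + insertions
    counts = {"substitutions": substitutions, "deletions": deletions,
              "insertions": insertions}
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


# Invented example: one substitution (desks -> desk) and one inserted word.
counts = count_errors("the desks are in the school",
                      "the desk are in the a school")
print(counts, round(counts.wer, 3))
# Error-type shares for the counts reported in the abstract (82, 52, 36).
print(error_shares(82, 52, 36))
```

When several alignments have the same minimal cost, the split between error types can depend on the tie-breaking rule, so published S/D/I counts may differ slightly between scoring tools.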
Recent ASR research has increasingly emphasized that recognition errors in English are not randomly distributed but are shaped by linguistic variation and by the composition of the training data. Even state-of-the-art end-to-end ASR systems, including those based on deep neural architectures, have been shown to lose performance on speech that departs from the norms of their training data (Zhang et al., 2020; Chan et al., 2022). In particular, differences in pronunciation, phonotactics, and prosody lead to predictable recognition errors, especially in spontaneous and accented speech. These findings suggest that ASR errors depend heavily on the probabilistic nature of the training data: models favor frequently represented linguistic patterns and perform worse on less frequently represented ones. As a result, recognition systems are more likely to misrecognize non-standard phonetic realizations and map them to the closest acoustic category, which further points to systematic bias in English ASR performance.

In line with this, research on varieties of South Asian English has revealed challenges for ASR systems related to phonetics, accent, and multilingual influence. For instance, ASR models trained on standard English corpora show significantly higher error rates on South Asian-accented speech because of differences in vowel realization, consonant production, and prosodic patterns (Psanadi, 2022). These studies also show that transfer learning and model adaptation can improve recognition accuracy, although they do not eliminate the underlying mismatch between speech input and training data.
This reinforces the view that ASR errors are not technical limitations but are rooted in linguistic diversity and unequal representation in the data. Consequently, it is increasingly clear that larger datasets alone will not improve ASR performance on global Englishes; a broader approach is needed that builds phonetic and sociolinguistic variability into the model structure.

This study shifts the analytical focus from system architecture to phonetic mismatch. It proposes that ASR errors arise because the acoustic-phonetic patterns of AAE are not systematically represented in models trained on MAE. The study offers a linguistically grounded explanation of ASR bias by relating specific phonetic properties to error types.

1.2 Research Objectives
The research aims to meet the following objectives:

1.3 Research Questions
The research aims to answer the following questions:
RQ1. Which phonetic characteristics of South Asian American English are systematically related to particular types of ASR errors (substitution, deletion, and insertion), and in what proportions?
