Robert Mac Auslan, Ph.D. Vice President, Academic Markets, Phonologics, Inc.
Joel Mac Auslan, Ph.D. Chief Technology Officer, Phonologics, Inc.
Linda J. Ferrier-Reid, Ph.D. Chief Linguistics Officer, Phonologics, Inc.
The goal of this paper is to describe the current version of an Automated Pronunciation Screening Test (APST) that analyzes digitized speech samples from Non-Native Speakers of English (NNS) who speak with a foreign accent, and provides a norm-referenced intelligibility score. The test, which has been under development for over ten years, determines whether a speaker has reached an acceptable level of intelligibility in standard American English for a given communication setting or whether s/he requires intelligibility training. Other tests generally used for this purpose rely almost exclusively on perceptual rating scales of different dimensions of speech intelligibility. Such rating scales are subjective, time-consuming, and unreliable.
The innovations in APST allow us to establish an objective standard of intelligibility in speakers of accented English. Our knowledge-based technology may also provide the basis for future diagnostic tests and intervention programs. The project uses data initially collected primarily on Chinese- and Spanish-accented speakers and later augmented with many Ukrainian, Vietnamese, and South Asian speakers. It employs correlations between the judgments of American English listeners listening to conversation and objective acoustic measures of the pronunciation of test items. Newer research tests the current version of the program using a small number of naïve listeners. This research utilizes the statistical findings of a previous NIH-funded study and, in an iterative design, establishes the validity of the current test.
Population of Non-Native Speakers of English
The foreign-born population of the United States exceeded 33 million in 2002, slightly more than the entire population of Canada, according to the U.S. Census Bureau’s latest American Community Survey (ACS). In particular, there is a large worldwide flow of medical professionals, the greatest movement being toward the Anglophone countries, with the US the largest magnet.
These global population flows are projected to increase over the next 20 years, because population ageing and changing technologies are likely to contribute to an increase in the demand for health workers, while workforce ageing will decrease the supply as the baby-boom generation of health workers reaches retirement age, according to the International Migration Outlook: SOPEMI 2007 Edition.
Students studying in US colleges and universities
Foreign enrollment in U.S. universities and colleges increased by 3% in fall 2009 to 586,000, rising 2% for non-science/engineering (S&E) fields (to 327,000) and 4% for science and engineering (to 259,000) (table 1). The increase in S&E enrollment was larger than in recent years, but for the 2006–09 period, S&E students accounted for a steady 44% of total foreign enrollment.
As described by Chen (2010, p. 183), “The term ‘foreign accent’ might be characterized as the subjective impression of a native listener or an advanced student of a foreign language. The precise nature of a foreign accent remains mostly unexplored even though it has intrigued an increasing number of second language acquisition researchers.”
We are interested in foreign accent to the extent that it reduces intelligibility. Reduced intelligibility may be caused by language interference (the substitution of features of one phonology for those of a second), by processing effects (the slowing or other effects of listening to speech in real time under increased processing load), or by irritation or prejudice against the perceived speaker group that may reduce attention to the message. The relative effects of these various factors are just beginning to be determined in the research literature.
A. Segmental effects
NNSs are often unintelligible due to a variety of interference effects between their native language (L1) and English (L2). These effects depend on the characteristics of the first language and affect multiple aspects of pronunciation of L2: syllable structure, phonemes, subphonemic characteristics, and suprasegmentals. In particular, NNSs often distort L2 by substituting familiar L1 phonemes for those in L2. They may also omit or insert phonemes at certain points in the syllable in order to produce a structure that conforms to a familiar pattern, or produce phonemes that occur in both languages but with acoustic parameters (such as voice onset time (VOT), vowel length, and stress) more appropriate to L1.
B. Interference effects
Interference also influences discrimination of sounds in L2. For example, the identification of vowels in L2 depends partly on the number and type of vowels in L1. The number and type of vowels in L1 also affect the capacity to respond to training (Flege, 1986; Best & Strange, 1992). Similarly, with respect to consonants, Flege and Wang (1989) found that Cantonese Chinese speakers performed better after training on discrimination of word-final /t/ and /d/, as in bat vs. bad, than speakers of the Shanghai dialect of Chinese or Mandarin speakers, because Cantonese, like English, permits unreleased unvoiced stops /p, t, k/ in final position. The Shanghai dialect of Chinese allows only final glottal stops and Mandarin allows no final obstruents at all. The performance of speakers of the Shanghai dialect was, as might be expected, better than that of Mandarin speakers.
Research on speech production (Flege, 1987, 1988) indicates that the phonetic space of bilingual speakers is restructured during L2 learning. This restructuring can negatively affect the production of L2 sounds, because the speaker sometimes classifies them as equivalent to ones in L1, a process referred to by Flege (1991) as identification. Conversely, sounds in L2 that are perceived as “new” or outside those in L1 are more likely to be produced correctly.
An example of a phoneme that can be assimilated to an English sound is Spanish /t/. Spanish /t/ is implemented as an unaspirated stop with short-lag VOT, similar to an English /d/; English /t/ is aspirated and has a long-lag VOT (Flege,1987, 1995). Most Spanish speakers produce English /t/ with “compromise” VOTs, which are between the values typical for English and Spanish, with the result that native English listeners often hear the /t/ as a /d/. An example of a “new” sound is the /y/ French vowel for English speakers. Flege (1987) found that native English speakers with different degrees of exposure to French produced this vowel with only slightly different F2 frequencies than a group of French monolinguals, suggesting that totally new sounds are less difficult to acquire.
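The short-lag/long-lag distinction described above lends itself to a simple classifier. The sketch below is illustrative only: the 30 ms and 60 ms thresholds are assumptions loosely based on the general VOT literature, not values used by APST.

```python
def classify_vot(vot_ms):
    """Classify a word-initial /t/ production by voice onset time (VOT).

    Thresholds are illustrative assumptions: short-lag (Spanish-like)
    below ~30 ms, long-lag (English-like, aspirated) above ~60 ms, and
    the region between treated as a "compromise" production of the kind
    Flege (1987) describes.
    """
    if vot_ms < 30:
        return "short-lag (Spanish-like /t/, may be heard as /d/)"
    if vot_ms > 60:
        return "long-lag (English-like aspirated /t/)"
    return "compromise VOT"
```

A speaker producing English /t/ with a 45 ms VOT would be flagged as a compromise production, consistent with the perceptual confusions described above.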
Phonotactic differences also exert an effect. For example, Chinese dialects differ markedly from English with respect to syllable structure (Cheng, 1987). Chinese generally prefers open syllables, which end in vowels or sonorant consonants such as nasals, whereas English frequently closes syllables with a final consonant. Chinese speakers of English frequently omit word-final consonants, particularly non-sonorants; e.g., producing /experti/ for expertise.
English includes a number of phonemes that rarely occur in other languages but are easily identified with phonemes that are common. For example, the dental fricatives in words like “this” and “three” are often identified with [d] and [t]. Consequently, NNSs with different first languages may all experience difficulty with such sounds. Compton (1983), using groups of 28-40 speakers each who spoke Spanish, Filipino, Cantonese, Mandarin, and Japanese, found that all had difficulty with these sounds in word-initial and -final positions. Cardosa (2011) found that Portuguese NNSs of English, as well as inserting an epenthetic /i/ vowel after a word-final voiced consonant, also perceived such insertions in a forced-choice phone identification task, suggesting a strong link between perception and production in this regard.
Such interference effects appear to be related to the age at which the speaker was exposed to L2 (Flege, 1990; Flege & Schmidt, 1995; Flege, Munro, & MacKay, 1995) and to the amount of conversational experience the speaker has had (Best & Strange, 1992). A more recent study by Huang and Jun (2011) indicates that the age of arrival in an English-speaking country (and hence years of exposure to the language) affects aspects of prosody differently.
In summary, the type and number of interference effects depend on the L1, the speaker’s discrimination of sounds in L2, his/her reclassification of L2 sounds in production according to the grid imposed by the phonology of L1, and the amount and age of exposure to L2. In addition, there is wide individual variation in the ability to discriminate and produce new sounds — an area still largely unexplored. From the perspective of this research, one task is to determine whether a given speaker’s phonemes deviate from standard American English productions enough to be labeled non-standard to a native listener. Such aberrant segmental productions will reduce the speaker’s overall intelligibility, as found by Goldhor (1989), who writes, “Phonetic distortion and speaker intelligibility are intimately connected.” This observation was key to our initial focus on phonetic measures.
C. Prosodic effects
Prosody covers a number of suprasegmental systems that affect intelligibility (Crystal, 1969). These include stress in polysyllabic words (lexical stress), sentence stress or accent, and intonation. English has a complex derivational system which results in alternations such as “SIMple~simPLIcity~SimplifiCAtion.” NNSs frequently produce such words with lexical stress assigned to the wrong syllable, a source of confusion to native listeners. Unfortunately, there are few simple rules to guide the learner of English; word stress patterns must generally be learned on a word-by-word basis (Flege & Bohn, 1989). To complicate matters, even among linguists there is no agreement on the number of stress levels that should be used to describe this phenomenon. A generally accepted description appears to be a ternary distinction: 1° stress vs. 2° stress vs. unstressed (Ladefoged, 2001).
The three most important acoustic correlates of stress are changes in pitch or fundamental frequency (F0), syllable duration, and loudness. From an articulatory perspective, there is greater articulatory effort associated with stressed than with unstressed syllables (Kent & Netsell, 1971). In addition, vowel duration is longer in stressed than unstressed position. (At the phonological level, this is related to vowel reduction, the centralization of unstressed vowels.) While these three acoustic factors interact in production (Couper-Kuhlen, 1986), it appears that listeners attend primarily to change in F0, secondly to increased duration, and finally to increased intensity (Lehiste, 1970). This suggests a focus on F0 change as a measurement of stress, with other factors coming into play more in words-within-sentences.
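The perceptual ordering above (F0 change first, then duration, then intensity) can be sketched as a weighted comparison across the syllables of a word. The weights and the per-utterance normalization below are illustrative assumptions, not measurements from the literature or from APST.

```python
def pick_stressed_syllable(syllables):
    """Guess which syllable carries primary stress from per-syllable
    acoustic measures.

    `syllables` is a list of dicts with keys 'f0_hz', 'dur_ms', and
    'intensity_db'. Each cue is normalized against its mean over the
    word so the cues are comparable, then combined with descending
    weights reflecting the perceptual ordering reported by Lehiste
    (1970): F0 weighted most, then duration, then intensity.
    """
    def score(s, norms):
        return (3.0 * s['f0_hz'] / norms['f0_hz']
                + 2.0 * s['dur_ms'] / norms['dur_ms']
                + 1.0 * s['intensity_db'] / norms['intensity_db'])

    n = len(syllables)
    norms = {k: sum(s[k] for s in syllables) / n
             for k in ('f0_hz', 'dur_ms', 'intensity_db')}
    scores = [score(s, norms) for s in syllables]
    return scores.index(max(scores))
```

For a correctly produced “SIMple,” the first syllable's higher F0 and longer duration would dominate, returning index 0; a NNS production with stress shifted to the second syllable would return index 1.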
There has been increasing research interest in prosody as it relates to the perception of foreign accent, particularly in the last twenty years. In a study of Korean- and Chinese-accented speech versus that of American English (AE) speakers, Baker, Baese-Gahot, Kim, Van Engen, and Bradlow (2011) found that “AE speech had shorter durations, greater within-speaker word duration variance, greater reduction of function words [such as prepositions and articles], and less between-speaker variance than non-native speech. However, both American English (AE) and non-native speakers showed sensitivity to lexical predictability by reducing second mentions and high-frequency words. Non-native speakers with more ‘native-like’ word durations, greater within-speaker word duration variance, and greater function word reduction were perceived as less accented.” This durational factor contributes to the perception by AE speakers that the speech of Chinese and Korean speakers is “jerky” or lacking in a feature we call “smoothness.”
D. Effects on the listener
Researchers have examined a number of listener variables, including intelligibility, speech naturalness, perception of foreign accent, and irritation. While many of these variables are clearly related, we focused most closely on the issue of intelligibility. It is our position (and that of the American Speech-Language-Hearing Association, 1994) that social and geographic dialects should be accepted as forms of linguistic variation. Intervention is required only to remedy deficiencies in intelligibility. For those individuals with reduced intelligibility, a strong dialect can be a handicapping condition. (Individuals may choose, however, to speak with strong dialects for whatever purpose.) Evaluations of foreign accent will, of course, frequently be associated with different levels of intelligibility and with affective responses, including irritation (Flege, et al. 1995), and social prejudice (Hewstone, 2002).
Intelligibility refers to the ability of a listener to recognize and understand a word, phrase or sentence of a normal speaker. (In this case, we use “normal” to refer to a speaker who is not hearing impaired and has no impairment of voice, fluency, or articulation.) Intelligibility is impacted by the social and linguistic context of the speech, as well as the clarity of the speech signal in relation to the level of background noise. In this instance we are mostly concerned with the decontextualized speech that a listener might hear on the telephone. Intelligibility can be defined as what is understood by the listeners of the phonetic realization of speech (Yorkston, Strand, & Kennedy, 1996). It is often measured by the number of phonemes that can be accurately transcribed from listening to recorded speech. It is often also rated on Likert scales.
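The phoneme-count measure described here can be sketched as an alignment between the target phoneme string and a listener's transcription. This is a simplified illustration of the idea, not APST's actual scoring; the phoneme symbols are arbitrary labels.

```python
from difflib import SequenceMatcher

def percent_phonemes_correct(target, transcribed):
    """Score intelligibility as the percentage of target phonemes that
    a listener transcribed correctly, using a longest-common-subsequence
    alignment so that insertions and deletions do not misalign the rest
    of the word.

    `target` and `transcribed` are lists of phoneme symbols.
    """
    matcher = SequenceMatcher(None, target, transcribed)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(target)
```

For “three” heard with the dental fricative identified as [t], as in the Compton (1983) findings above, `percent_phonemes_correct(['th', 'r', 'iy'], ['t', 'r', 'iy'])` yields roughly 66.7.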
In one study of dysarthric speech, in addition to rating four dimensions (articulation, prosody, voice quality, and nasality), the judges indicated the most dominant dimension affecting intelligibility for each patient. The judgments of overall intelligibility and each of the dimensions were compared. Articulation and prosody showed the strongest correlation with intelligibility, and nasality the lowest.
While voice quality and nasality are frequently disordered in dysarthric speech, as opposed to the speech of non-dysarthric NNSs, it is still perhaps significant that articulation and prosody were the major contributors to reduced intelligibility. This research also suggests that a complete evaluation of speech intelligibility should include an evaluation of prosody. It also suggests that intelligibility can be captured by a combination of a number of ratings of dimensions that make up intelligibility.
Intelligibility is also related to the familiarity of the listener with the speech pattern of the speaker (Kennedy & Trofimovich, 2008). (A well-known phenomenon is the miraculous improvement in intelligibility of a non-native speaker over time in the view of his/her teacher, when objective testing shows no real improvement!) We therefore select listeners, in our intelligibility testing, who are dialectally naïve in the sense of having limited exposure to dialected speech. (Given the large number of non-native speakers in the USA, this is a theoretical ideal that we can only approximate.)
A further important issue is that meaning, once decoded from an utterance, is then retrievable from the listener’s memory for a considerable period of time. It is this initial ability to abstract meaning that we are interested in, rather than the ability to subsequently recall the utterance itself. Hence, subjects in the native-speaker listener group of our research can be employed only once to listen to each utterance.
While our concern is with intelligibility, rather than directly with level of accent, or accentedness, we must assume they are related, both objectively in terms of the number and type of errors the speaker makes and subjectively in terms of the ease, speed, and accuracy with which a native listener decodes a word or utterance. It awaits a larger study than ours to determine the relation between the two.
A related variable is comprehensibility. We take it that intelligibility and comprehensibility both describe the ability of the listener to extract meaning from or understand speech but differ in scope and focus. Comprehensibility is usually used to refer to understanding of larger units of speech such as discourse; intelligibility is used to refer to smaller units — the sound, sentence, or word. Comprehensibility is generally used to refer to the perception that speech is understandable rather than an actual measurement of the amount that is understood. It also takes into account the semantic and linguistic context. Sentences and discourse tend to be evaluated for vocabulary, morphosyntactic correctness, and fluency, as well as for the intelligibility of single words. Words and sentences are generally judged for intelligibility in relation to their phonological or phonetic correctness at the levels of segments and prosodic features.
Since most real-life intelligibility “tests” occur in spontaneous-speech situations (lectures, conversations, air-traffic dialogues), APST’s results must correlate well with larger measures of intelligibility. A screening test like ours must be easy and quick to administer and take, so single words are sometimes appropriate test items. APST’s intelligibility score must also correlate with measures of conversational and sentential intelligibility.
Understanding is subjective and depends on a variety of contextual variables, including the speaker’s level of language proficiency. It can be measured and defined in a variety of ways. Measurements of comprehensibility have largely used rating scales and error hierarchies. Gynan (1985), in a study of Spanish-speaking English learners, found that morphosyntactic errors such as pluralization endings contributed less to comprehensibility than phonological factors, at least for more proficient speakers. Elsewhere, Gass and Veronis (1983) investigated the contribution to comprehensibility of familiarity with topic, with nonnative speech in general, with a particular accent, and with the accent of a particular speaker. Topic familiarity was most significantly related to comprehensibility, but all had an effect. Clearly, in situations such as classrooms or clinics, where listeners will not necessarily be familiar with the topic at hand, accent issues will be more prominent.
In another study that employed rating scales of a variety of features, including grammar, pronunciation, intonation, word-choice, and intelligibility, Fayer and Krasinski (1987) found that pronunciation and hesitations were the most distracting features to both non-native and native listeners. Anderson, Johnson, and Koehler (1992) used the SPEAK test to investigate the relationship between raters’ judgments of pronunciation and deviance in segmentals, prosody, and syllable structure. They found that, while all variables were significantly related to pronunciation ratings, prosody had the greatest effect.
In an important study, Munro and Derwing (1995) used a sentence-verification task to determine the effects of a foreign accent on sentence processing time in native speakers. Response latency times were longer when American English listeners were required to evaluate Mandarin-accented utterances than those produced by native English speakers. In a later study (2006), these authors studied the impact of functional load (FL) of segmental errors on perceptions of accentedness and comprehensibility. Thirteen native English listeners judged 23 Cantonese-accented sentences that exhibited various combinations of high and low FL errors. The high FL errors had relatively large effects on both perceptual scales, while the low FL errors had only a minimal impact on comprehensibility. The only cumulative effects of errors seen in the data occurred with high FL errors in the judgments of accentedness. While this research is helpful in suggesting that high-FL phonemes be included in our test, the study did not address intelligibility directly.
It appears, then, that phonological factors contribute to comprehension and distraction on the part of listeners, that prosody is an important component of these in addition to the functional load of the content phonemes, and that these factors impose a processing burden on the listener. More research is needed, however, to determine the relative weightings given to segments versus prosodic factors and to processing factors.
Potentially Fatal Miscommunications
Miscommunication can occur in any human interaction, as medical institutions know to their cost. There is regrettably little systematic data on miscommunications that occur in medical settings, such as during surgical procedures, but the results can clearly be of great consequence. Anecdotes of such miscommunications are, however, legion.
Likewise, according to the recent Federal Air Surgeon’s Medical Bulletin article entitled “Thee…Uhhmm…Ah…: ATC-Pilot Communications” by Mike Wayda, “When you produce these hesitations while speaking, you are using … ‘place holders,’ or ‘filled pauses’, a type of speech dysfluency especially common in pilot-controller exchanges. Until recently, such speech dysfluencies and other mistakes were not considered to be important; however, new research suggests that there is a correlation between miscommunications and mistakes, says CAMI scientist Dr. Veronika Prinzo-Roberts.”
Indeed, nothing underscores the subtle complexities of speech communication more strikingly than the miscommunications that occur among pilots, crew members, and air traffic controllers. Problems stemming from mistaking one word for another happen to even native speakers; the difficulties are only compounded when accented speech is involved.
There are many relevant examples. Homophony and, more commonly, near-homophony, in which different words or phrases sound exactly or nearly alike, can be just as problematic as prosody. Phonological confusion is possible, for example, because “left” can sound very much like “west”. An outbound pilot who was told to receive his clearance from the Air Traffic Control Center when he was “on the deck” misheard this as “off the deck” and proceeded with his takeoff, consequently finding himself head-on with an inbound aircraft. One wide-body airplane barely missed colliding with another after landing, because the pilot heard “Hold short” as “Oh sure” in response to asking the controller “May we cross” in reference to a runway. The words “to” and “two” (a stress issue) are especially problematic. In one incident, confusion between the differently stressed “two” and “to” led to a fatal accident when a pilot read back the instruction “Descend two four zero zero” as “O.K. Four zero zero” and then proceeded, without correction from the controller, to descend to four hundred, rather than twenty-four hundred, feet.
Existing Knowledge of Acoustic Features of Standard American English Speech
Since acoustic analysis methods became readily available in the 1960s, there has been a steady stream of research documenting features of standard American English speech in single words and sentences. These studies have examined phonetic features and their effects on perception with subjects reading the stimuli, as in (Clear) SpeechWorks. The alveolar stops /t,d/, for example, which are among the most frequently used and which are produced with different lag times in different languages, have been extensively researched in all word positions and many languages (Lisker & Abramson, 1964; Fischer-Jorgensen, 1954; Klatt, 1975; Zue, 1976; Sharf, 1962; Umeda, 1977; Zue & Laferriere, 1979).
Research on the acoustics of American English was further encouraged by the practical goal of developing speaker-independent automatic speech-recognition systems. The goal of achieving isolated word recognition provided the motivation for exact measurement of the statistical properties, and constraints on the phonemic properties, of single words. To give just a few examples of specific measurements from this literature: the average duration of a stressed vowel is about 130 msec; the average duration for unstressed vowels, including schwa, is about 70 msec; and the average duration for a consonant is 70 msec (Klatt, 1976). An acoustic correlate of emphatic or contrastive stress is an increase in the duration of a word by 10% to 20% (Coker, Umeda, & Browman, 1973).
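Norms of this kind suggest a simple screening check: flag segments whose measured durations fall far from the published averages. The norms below are the values quoted above (Klatt, 1976); the 50% tolerance is an illustrative assumption, not a value from APST.

```python
# Rough American English segment-duration norms quoted in the text
# (Klatt, 1976), in milliseconds.
NORMS_MS = {
    'stressed_vowel': 130,
    'unstressed_vowel': 70,
    'consonant': 70,
}

def duration_within_norm(segment_type, measured_ms, tolerance=0.5):
    """Return True if a segment's measured duration is within
    `tolerance` (a proportion of the norm) of the quoted average.
    The default tolerance of 0.5 is an illustrative assumption."""
    norm = NORMS_MS[segment_type]
    return abs(measured_ms - norm) <= tolerance * norm
```

An unstressed vowel stretched to 140 msec, twice the quoted average, would fail this check, consistent with the lack of vowel reduction often noted in NNS speech.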
Some researchers have used template-building techniques (see, for example, Blumstein (1986) and references contained therein). Labial consonants, for instance, are characterized by either a flat distribution of spectral energy between the release burst and onset of voicing, or by sustained spectral energy at low frequencies (about 1500 Hz). By contrast, dental and alveolar consonants evince greater concentration of energy in high frequencies (about 3500 Hz) at release. These researchers were dealing with the problem of variability in speech production across different speakers and with the careful measurement of contextual effects (Oshika, et al., 1975). However, speech recognition has the added problem of not knowing the target word the speaker intends. APST does not need to deal with this complex issue, since the words tested are from a short list and known in advance.
More recently, Chen (2010), in an extension of Chen and Chung (2008), investigated the difficulties encountered by Taiwanese learners with English speech timing patterns and identified critical variables that affected native listeners’ perceptions of foreign accents. Thirty Taiwanese learners and 10 native speakers of American English read an article. Six variables—syllable duration, vowel reduction, pause duration, linking, consonant cluster simplification, and speech rate—were acoustically analyzed in five sentences. In addition, ten English listeners rated these speech samples for degree of foreign accent. A regression analysis was used to analyze the relations among these variables.
Taiwanese learners displayed speech patterns on the six acoustic variables that differed markedly from those of native English speakers. The perceptual ratings of the six individual variables showed a very strong positive correlation with the overall ratings, suggesting that timing patterns form a holistic impression rather than a discrete component. Speech rate was the primary predictor determining native listeners’ perception of foreign accent. If this overall fluency variable was excluded, then vowel reduction and linking duration became the two most heavily weighted variables. The author proposed a temporal perception model to account for the effects of timing variables on native English listeners’ judgments of foreign accents. While this study did not examine intelligibility, it does highlight the need to incorporate measures of prosody into an analysis of foreign-accented speech.
Kang (2010) studied the relative salience of suprasegmental features in judgments of L2 comprehensibility and accentedness, examining international teaching assistant (ITA) classroom speech for a variety of suprasegmental features as well as native speakers’ judgments of that speech. Specifically, speech samples were acoustically analyzed for measures of speech rate, pauses, stress, and pitch range. Fifty-eight U.S. undergraduate students evaluated the ITAs’ oral performance and commented on their ratings. The results showed that suprasegmental features independently contributed to listeners’ perceptual judgments. Accent ratings were best predicted by pitch range and word stress measures, whereas comprehensibility scores were mostly associated with speaking rates. These last two studies highlight the significance of speech rate in the perception and understanding of NNS speech.
History of APST
The initial version of the test was a Macintosh system developed to screen the large numbers of non-native speakers who were to be employed as international teaching assistants or lab assistants at Northeastern University in Boston, MA. NNSs read the stimuli, and their versions were recorded onto the computer using a high-quality microphone, fitted appropriately and adjusted to the speaker’s volume. This first test version was scored by trained graduate students who, having achieved a reasonable degree of inter-scorer reliability (95% or better), listened to the speech samples and entered their scores into the computer program. The program then provided a summary score. This was used with standard TOEFL scores to determine whether the NNS should be allowed into the lab or classroom or whether s/he should receive accent modification training. This first version showed the need for a more objective and quickly scored version of the test.
A second automatic prototype was developed with funding from an NIH grant. In this Phase I study, we developed statistical speaker models of NNSs and listener models of American English (AE) listeners. Major conclusions from Phase I that underlie this current research are:
- Segmental errors of NNS speech in isolated words are major predictors of reduced intelligibility. See Figure 1 for an example of such errors, for the word pulled.
- Lexical stress does not significantly independently contribute to intelligibility ratings.
- Word intelligibility correlates with intelligibility of longer units of speech, particularly with an intelligibility rating carried out by listeners on American English speech.
- A test using only selected objective acoustic measures can distinguish highly intelligible native speech from less intelligible NNS speech.
Figure 1. Example of an isolated-word formant rule for one phoneme: formant pairs F1 (lowest frequency) and F2 (second lowest) for the /l/ of pulled. (Formants are characteristic frequencies for vowels and liquids; each phoneme has several of them. They arise from the differing positions of the tongue, lips, and jaw when producing these phonemes.) The line shows the division between pairs of measured formant values in productions judged correct by a phonetician (filled symbols) and those judged incorrect. Correct productions correspond to a 450-Hz or smaller difference between F1 and F2, shown by the dividing line.
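A rule of the kind shown in Figure 1 reduces to a one-line test on the measured formant values. The 450-Hz threshold is taken from the figure caption; everything else about the sketch (function name, units) is illustrative.

```python
def l_production_correct(f1_hz, f2_hz):
    """Isolated-word formant rule for the /l/ of "pulled", as in
    Figure 1: a production is classed as correct when the measured
    F1 and F2 differ by 450 Hz or less."""
    return abs(f2_hz - f1_hz) <= 450
```

A production with F1 = 450 Hz and F2 = 850 Hz (a 400-Hz difference) falls on the "correct" side of the dividing line; one with F2 raised to 1200 Hz does not.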
Figure 2. Empirical performance of an early version of APST. (Left) Predicted intelligibility on a 100-point scale, using only automatically detected segment-error rules, as in Figure 1. (Right) Predicted intelligibility of full APST, which combines segmental and other components of speech. As shown on the right, APST compares very favorably with listener evaluations of the same speakers. (Note: Listeners evaluated native speakers in this data set as having scores of 88 and above, NNSs as 79 and below.)
Further development of the APST system has been under the auspices of Speech Technology and Applied Research Corp. (S.T.A.R.). It is this newest version of the system that we test in this most recent work.
Description of Current System
Technologically, APST is built upon knowledge-based speech analysis (KBSA), physics and acoustics, and the physiology of human speech production, modification, and enhancement. The phrase knowledge-based refers to a system that is based on the careful study and analysis of the target; in this case, speech. Two key components of such a system include a knowledge base (derived from the expertise of a human “domain expert”) and inference mechanisms (a decision or classification engine).
In this research, Prof. Ferrier-Reid served as the domain expert. Her expertise was used to determine which things to test for (e.g., a word like river) – as well as key features within those items that might have special significance (the first syllable, or letter combination, ri- ). APST’s decision-classification system is a fuzzy logic system built on low-level acoustical measurements. APST uses KBSA to determine: phonetic (single-consonant or single-vowel), prosodic (phrase), and fluency (smoothness) components of intelligibility.
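A fuzzy-logic decision system of this general kind might grade each low-level acoustic measurement against an acceptable band and then aggregate the grades into a component score. The membership-function shape, the bands, and the averaging below are all hypothetical illustrations of the approach, not APST's actual rules.

```python
def fuzzy_membership(x, low, high):
    """Degree (0..1) to which a measurement x falls inside the
    acceptable band [low, high], ramping linearly down to zero over
    half a band-width on either side (a trapezoidal membership
    function, one common choice in fuzzy systems)."""
    half_band = (high - low) / 2.0
    if x <= low - half_band or x >= high + half_band:
        return 0.0
    if low <= x <= high:
        return 1.0
    if x < low:
        return (x - (low - half_band)) / half_band
    return ((high + half_band) - x) / half_band

def intelligibility_component(measurements, rules):
    """Combine per-measurement membership grades into one component
    score (0..100) by simple averaging; one illustrative aggregation
    choice among many.

    `measurements` maps measurement names to values; `rules` maps the
    same names to (low, high) acceptable bands. Both are hypothetical.
    """
    grades = [fuzzy_membership(x, *rules[name])
              for name, x in measurements.items()]
    return 100.0 * sum(grades) / len(grades)
```

A measurement squarely inside its band contributes a grade of 1.0; one just outside contributes a partial grade, so near-misses degrade the score gradually rather than abruptly.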
Our own research and a recent review of the literature lead us to ask the following research questions:
Will the current version of APST and the judgments of a small group of naïve Native English Listeners agree on whether preselected non‐native speakers are most intelligible, least intelligible, or in the middle of the intelligibility range?
Will APST results agree with 1) intelligibility ratings and 2) position rankings of a small group of naïve American English listeners who listen to digitized recordings of sentences produced by NNSs?
In this iterative design, we rely upon the findings of APST Phase I, showing that
- Sentence Intelligibility (i.e., the number of words in sentences heard correctly) correlated highly with rating scales of intelligibility (nine‐point anchored scale) in sentences by naïve listeners. (This finding gives us confidence in the use of Likert scales to measure intelligibility.)
- Phonetic intelligibility (i.e., the percentage of phonemes correct in single words, as assessed by a phonetician) also correlated highly with rating scales of intelligibility, supplying additional confidence in intelligibility rating scales.
- Listeners provided rankings of NNSs that can be used to sort speakers into most intelligible, least intelligible, and mid‐range. These speaker recordings will be used again in the current study.
Five native speakers of American English, consisting of four untrained listeners and one trained pronunciation coach, listened to four sets of recordings in order to evaluate (using the nine-point Likert scale) and rank (top/middle/bottom) their overall intelligibility. The recordings consisted of one Spanish male (SMJPA5) with a mid-range intelligibility score from the APST, two Chinese females, one with a very high score (CFJLO5) and the other with a very low score (CFXLO9), and one male native speaker with a very high APST score (EMSGA21). The APST ranking, from most to least intelligible, is thus EMSGA21, CFJLO5, SMJPA5, CFXLO9.
Results and Analysis
On both measures, the evaluators all rated the recording subjects in a fashion consistent with their APST scores.
Note: To calculate Mean and Median in the graphs below, T/M/B was converted to 0/1/2.
Using both the Likert scale and the T/M/B ranking system, we tested for:
H0: Random rating selection, independent across listeners and speakers
Using this null hypothesis, we were able to determine the likelihood that listeners would score subjects in a fashion consistent with the APST by chance. Specifically, it is straightforward to compute the probability that, for two speakers who are adjacent in APST scores, a listener would randomly assign increasing (4/9 chance), equal (1/9), or decreasing (4/9) Likert values on the 9-point scale. (Such a probability is computed from the binomial distribution. Note that there are three adjacent pairs: EMSGA21/CFJLO5, CFJLO5/SMJPA5, and SMJPA5/CFXLO9.)
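The pairwise probabilities quoted above (4/9, 1/9, 4/9) follow from direct enumeration of the 81 equally likely Likert pairs; a minimal check (variable names are ours):

```python
from fractions import Fraction
from itertools import product

# Under H0, a listener assigns each of two adjacent speakers an
# independent, uniform random value on the 9-point Likert scale.
pairs = list(product(range(1, 10), repeat=2))  # 81 equally likely pairs

p_increasing = Fraction(sum(a < b for a, b in pairs), len(pairs))
p_equal = Fraction(sum(a == b for a, b in pairs), len(pairs))
p_decreasing = Fraction(sum(a > b for a, b in pairs), len(pairs))

print(p_increasing, p_equal, p_decreasing)  # 4/9 1/9 4/9
```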
For the T/M/B ranking system we were able to reject the null hypothesis with p<0.003. For the Likert scale we were able to reject the null hypothesis with p<0.001. Our findings are all the more remarkable due to their exact correspondence on the Likert scale with the speaker ratings based upon their APST scores: Every listener scored every speaker in the same order as APST. Similarly, the T/M/B ranking system also corresponds as closely as possible to the APST scores. Of course, with four speakers and only three ranks, every listener is forced to assign identical ranks to at least one pair of speakers; but even then, the identically ranked speakers were adjacent in APST rank in every case.
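The exact test statistic behind the T/M/B p-value is not spelled out above, but the flavor of the calculation can be sketched by enumeration. The sketch below assumes (our assumption, for illustration) that "consistent with the APST order" means the five listeners' ranks are weakly increasing from the best to the worst APST speaker:

```python
from itertools import product

# Under H0, each listener independently assigns each of the 4 speakers
# a uniform random rank: T=0, M=1, B=2. We count the assignments whose
# order is consistent with the APST order, i.e. weakly increasing ranks
# from best (EMSGA21) to worst (CFXLO9) speaker.
RANKS = (0, 1, 2)
N_SPEAKERS = 4

all_assignments = list(product(RANKS, repeat=N_SPEAKERS))   # 3^4 = 81
consistent = [a for a in all_assignments
              if all(x <= y for x, y in zip(a, a[1:]))]     # 15 of them

p_one_listener = len(consistent) / len(all_assignments)     # 15/81, about 0.185
p_five_listeners = p_one_listener ** 5                      # about 2.2e-4
print(p_one_listener, p_five_listeners)
```

Note that weak monotonicity automatically places identically ranked speakers adjacent in APST order, matching the observation in the text.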