This article describes how the performance of a Dutch continuous speech recognizer was improved by modeling pronunciation variation. We propose a general procedure for modeling pronunciation variation. In short, it consists of adding pronunciation variants to the lexicon, retraining phone models and using language models to which the pronunciation variants have been added. First, within-word pronunciation variants were generated by applying a set of five optional phonological rules to the words in the baseline lexicon. Next, a limited number of cross-word processes were modeled, using two different methods. In the first approach, cross-word processes were modeled by directly adding the cross-word variants to the lexicon, and in the second approach this was done by using multi-words. Finally, the combination of the within-word method with the two cross-word methods was tested. The word error rate (WER) measured for the baseline system was 12.75%. Compared to the baseline, a small but statistically significant improvement of 0.68% in WER was measured for the within-word method, whereas both cross-word methods in isolation led to small, non-significant improvements. The combination of the within-word method and cross-word method 2 led to the best result: an absolute improvement of 1.12% in WER was found compared to the baseline, which is a relative improvement of 8.8% in WER. © 1999 Elsevier Science B.V. All rights reserved.
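The within-word procedure described above — expanding each baseline lexicon entry by optionally applying phonological rules to its phone transcription — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rule contexts below are simplified stand-ins for two of the processes named in the figures (schwa-insertion and schwa-deletion), and the SAMPA example word is an assumption.

```python
# Illustrative optional phonological rules, written as functions that
# return every transcription obtainable by applying the rule at one
# site. The real rules in the paper operate on Dutch SAMPA phone
# sequences; the contexts here are deliberately simplified.
def schwa_deletion(phones):
    # Optionally delete a schwa ("@" in SAMPA) anywhere it occurs.
    return [phones[:i] + phones[i + 1:]
            for i, p in enumerate(phones) if p == "@"]

def schwa_insertion(phones):
    # Optionally insert a schwa after /l/ or /r/ before certain
    # consonants (simplified, hypothetical context).
    out = []
    for i in range(len(phones) - 1):
        if phones[i] in ("l", "r") and phones[i + 1] in ("m", "k", "p", "f", "x"):
            out.append(phones[:i + 1] + ["@"] + phones[i + 1:])
    return out

RULES = [schwa_deletion, schwa_insertion]

def generate_variants(phones):
    """Apply the optional rules exhaustively, collecting all distinct
    pronunciation variants, including the canonical transcription."""
    variants = {tuple(phones)}
    frontier = [phones]
    while frontier:
        current = frontier.pop()
        for rule in RULES:
            for variant in rule(current):
                key = tuple(variant)
                if key not in variants:
                    variants.add(key)
                    frontier.append(variant)
    return sorted(" ".join(v) for v in variants)

# Dutch "melk" (milk), SAMPA /m E l k/: schwa-insertion yields the
# common variant /m E l @ k/ alongside the canonical form.
print(generate_variants(["m", "E", "l", "k"]))
```

In the paper's setup, the variants produced this way are added to the lexicon, the phone models are retrained on forced alignments that may select any variant, and the variants are also entered into the language model.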
Figures and Tables
Figure 1: Cumulative frequency of occurrence as a function of word frequency rank for the words in the VIOS training material.
Figure 1.1: Overview of a speech recognizer, shaded areas indicate where pronunciation variation modeling is incorporated in the work presented in this thesis.
Figure 1.2: Contamination of phone models caused by a mismatch between the acoustic signal and the corresponding transcription during training due to schwa-insertion.
Figure 1.3: Contamination of phone models caused by a mismatch between the acoustic signal and the corresponding transcription during training due to schwa-deletion.
Table 2 The words selected for cross-word method 1, their counts in the training material, baseline transcriptions and added cross-word variants
Table 2 Qualifications for κ-values > 0
Fig. 3. Difference in WER between the baseline result and results of adding variants of separate rules to the lexicon, sum of those results, and combination result of all rules.
Figure 3 Cohen’s κ for CSR ( ) and listeners (×) compared to various sets of reference transcriptions based on responses of eight listeners, and median κ for the sets of reference transcriptions for the CSR ( ) and the listeners ( )
Figure 3: Example of part of the lattice used to compute the average confusion.
Figure 4 Cohen’s κ for the listeners and the CSR compared to the sets of reference transcriptions (5 of 8) for the various phonological processes ( = CSR, × = listener, = median listeners,
Figure 4: Example of part of the lattice used to compute the word confusability scores, and an excerpt from a lexicon containing variants.
Table 4: Results for lexica generated using a data-derived approach, for the ICSI and Phicos systems.
Table 5 DWER for condition MMM compared to the baseline (SSS) for all methods
Table 5: Results of using confusability metric to remove variants from lexica for the ICSI system.
Table 6 Comparison between baseline test and final test condition: number of correct utterances, incorrect utterances, improvements and deteriorations (percentages between brackets)
Table 6: Overlap between variants generated using five phonological rules and variants obtained using data-derived methods.
Table 7 Type of change in utterances going from baseline condition to final test condition (percentages between brackets)
Table 7: Overlap between variants generated using five phonological rules which truly occur in the training material and variants generated using phone recognition or variants generated by the D-trees.
Table A.1: SAMPA phone symbols used for ASR, their corresponding IPA transcriptions and examples of Dutch words in which the sound occurs. Relevant sound is in bold type.
Table A.2: Mapped SAMPA phones.