Integrating Lattice-Free MMI Into End-to-End Speech Recognition

TASLP Volume 31 | 2023

By:

Jinchuan Tian; Jianwei Yu; Chao Weng; Yuexian Zou; Dong Yu

In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; and the on-the-fly decoding process in MBR-based methods requires pre-trained models and slows training. To this end, novel algorithms are proposed in this work to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems not only in the training stage but also in the decoding process. The proposed LF-MMI training and decoding methods show their effectiveness on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method: maintains the consistency between training and decoding; eschews the on-the-fly decoding process; and trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant performance improvements on various frameworks and datasets ranging from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on the Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.

In recent research on automatic speech recognition (ASR), great progress has been made due to advances in neural network architecture design [1], [2], [3], [4], [5] and end-to-end (E2E) frameworks [6], [7], [8], [9], [10], [11], [12]. Without compulsory forced alignment and external language model integration, end-to-end systems are also becoming increasingly popular due to their compact working pipeline. Today, E2E ASR systems have achieved state-of-the-art results on a wide range of ASR tasks [13], [14]. Currently, attention-based encoder-decoders (AEDs) [6], [7] and neural transducers (NTs) [8] are two of the most popular frameworks in E2E ASR. In general practice, training criteria like cross-entropy (CE), connectionist temporal classification (CTC) [15] and transducer loss [8] are adopted in AED and NT systems. However, all three of these E2E criteria try to directly maximize the posterior of the transcription given the acoustic features, but never attempt to consider competing hypotheses and optimize the model discriminatively.
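
For orientation, the contrast drawn above can be written out using the textbook form of the MMI criterion. This is a sketch in assumed notation, not the formulation given in the paper itself: O denotes the acoustic features of one utterance, W its reference transcription, and W' ranges over competing hypotheses; none of these symbols appear in the excerpt above.

```latex
% Likelihood-style E2E criteria (CE, CTC, transducer loss), viewed abstractly:
% raise the model's score for the reference transcription only.
\mathcal{J}_{\mathrm{E2E}} = \log P(W \mid O)

% Standard MMI criterion: raise the reference score relative to the
% summed score of all competing hypotheses W'.
\mathcal{J}_{\mathrm{MMI}}
  = \log \frac{p(O \mid W)\, P(W)}{\sum_{W'} p(O \mid W')\, P(W')}
```

In the usual lattice-free variant, the sum in the denominator is not collected from decoded lattices but is approximated with a compact denominator graph, which is consistent with the abstract's point that no on-the-fly decoding is needed during training.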

Read on IEEE Xplore

Tags:

IEEE TASLP Article
