The intersection of speech and language models offers unique opportunities and challenges. This talk walks through speech-language model research from NVIDIA NeMo. We cover several types of models, including the attention encoder-decoder model Canary-1B and LLM-based architectures such as SALM and BESTOW. In particular, we highlight the challenges in training and inference efficiency of such models and propose robust solutions based on 2D bucketing and the batch size OOMptimizer. Finally, we highlight the difficulty of preserving text-domain capabilities in speech-augmented training and present several possible solutions: EMMeTT, VoiceTextBlender, and Canary-Qwen-2.5B.
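As a rough illustration of the 2D bucketing idea mentioned above, the sketch below groups examples by both audio duration and target token count, and gives each bucket its own batch size. It is a minimal sketch in plain Python, not NeMo code: the bucket edges, the `budget` parameter, the toy cost formula, and all function names are assumptions made for illustration. In particular, `batch_size_for` is only a stand-in for a per-bucket batch size choice and does not reproduce the OOMptimizer presented in the talk.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket edges (assumed values, not taken from the talk).
DURATION_EDGES = [5.0, 10.0, 20.0, 40.0]  # max audio seconds per bucket
TOKEN_EDGES = [32, 64, 128, 256]          # max target tokens per bucket


def bucket_of(duration, num_tokens):
    """Return (duration_bucket, token_bucket), or None if the example is too long."""
    d = bisect_left(DURATION_EDGES, duration)
    t = bisect_left(TOKEN_EDGES, num_tokens)
    if d == len(DURATION_EDGES) or t == len(TOKEN_EDGES):
        return None
    return d, t


def batch_size_for(bucket, budget=400.0):
    """Toy cost model: buckets with longer audio or text get smaller batches."""
    d, t = bucket
    cost = DURATION_EDGES[d] + TOKEN_EDGES[t] / 8  # assumed per-example cost
    return max(1, int(budget // cost))


def make_batches(examples, budget=400.0):
    """Group examples by 2D bucket and emit a batch whenever a bucket fills up."""
    pending = defaultdict(list)
    batches = []
    for ex in examples:
        bucket = bucket_of(ex["duration"], ex["num_tokens"])
        if bucket is None:
            continue  # exceeds the largest bucket; skipped in this sketch
        pending[bucket].append(ex)
        if len(pending[bucket]) == batch_size_for(bucket, budget):
            batches.append(pending.pop(bucket))
    # Flush partially filled buckets at the end.
    batches.extend(batch for batch in pending.values() if batch)
    return batches
```

Because every batch draws from a single bucket, examples in a batch have similar lengths in both dimensions, which limits padding on the audio side and on the text side at the same time.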

Duration: 1:43:31

Foundational Speech Models and their Efficient Training with NVIDIA NeMo (video)
