Grand Challenges in Speech and Language Processing


By John H.L. Hansen (SLTC Liaison to E-Newsletter)

It is now less than two months to IEEE ICASSP-2011, so if you have not done so already, I encourage you to register and attend this premier conference. The organizers have done an excellent job putting together the program, with four plenary speakers (including our own Speech and Language Processing Technical Committee (SLTC) member Nelson Morgan from ICSI: "Does ASR have a PHD, or is it just Piled Higher and Deeper?") and 14 tutorials (taking place on Sunday and Monday, May 22-23). Texas Instruments and XILINX have also organized industry workshops on Sunday and Monday.

This month, I want to pass on some thoughts regarding research directions in speech and language processing. In my role as Department Head at the University of Texas at Dallas, I recently attended the Electrical and Computer Engineering Department Heads Conference (ECEDHA-2011) (this meeting is actually open to all Department Heads of EE/ECE from any country). At this meeting, Michael Lightner (past CEO and President of IEEE) served as moderator for a featured session entitled "New Energy Technologies: An Overview of the Next Generation". In the USA, and also worldwide, energy has rapidly emerged as a key area of focus, with many Electrical Engineering departments in the US moving to change their names to "Electrical, Computer, and Energy Engineering". This got me thinking about where the key challenges lie in speech. Therefore, in this installment, I wish to consider the topic of "Grand Challenges in Speech and Language Processing", which should be of interest to researchers and development engineers in the field of speech processing and language technology.

In the field of speech processing, many advancements have been made over the past thirty years that have helped shape speech communications, recognition, and various speech/language technologies. The general public often perceives speech recognition to be a solved problem, but most researchers and engineers know there are major impediments to present-day speech recognition being employed in everyday voice applications. Robustness to noise, communication channels, and handset and microphone mismatch clearly presents major obstacles for general use. However, are there clear "Grand Challenges" present and emerging in the field of speech and language processing? In the United States, the National Academy of Engineering has suggested a number of topics as "grand challenges", including: provide energy from fusion, secure cyberspace, reverse-engineer the brain, make solar energy economical, etc. DARPA has established its DARPA Grand Challenge, focused on an autonomous/driverless vehicle able to automatically navigate a route of more than 150 miles, with as many as 195 teams participating from 36 US states and 4 non-US countries. In the spirit of such Grand Challenges, are there areas in speech and language we might consider? I would like to suggest the following three: (i) Speech-to-Speech Translation, (ii) Speech Recognition for All Languages, and (iii) Thought-to-Speech Signal Production.

Multi-Lingual Speech-to-Speech Translation (S2S).
Recently, a number of efforts have emerged that demonstrate effective speech-to-speech translation. With more than 6,000 languages spoken in the world, reducing communication barriers between humans could (i) ease differences between peoples where military conflicts might arise, (ii) enable more effective rapid response by emergency personnel and caregivers in times of natural disaster, (iii) encourage closer cooperation in science and engineering, or (iv) simply help those traveling to new countries interact better with others. Mobile technology in the form of cell phones (Android, iPhone, etc.) has steadily improved the computing support available on mobile communication devices. It is clear that seamless S2S translation is something that would benefit all.

Assume we have Speaker A (a speaker of Language A) and Listener B (a speaker of Language B). The full bidirectional process requires (i) speech recognition in Language A for the original input speaker, (ii) machine translation of text from Language A to B, (iii) speech synthesis in Language B for the listener, (iv) speech recognition in Language B, (v) machine translation of text from Language B to A, and (vi) speech synthesis in Language A for the original speaker, who is now the listener. While a number of groups have been active in this area, IBM T.J. Watson recently demonstrated their handheld MASTOR (Multilingual Automatic Speech-to-Speech Translator) device at IEEE ICASSP-2010; MASTOR received several awards from DARPA and was one of the advancements cited when IBM T.J. Watson received the IEEE Corporate Innovation Recognition in 2009 "for long-term commitment to pioneering research, innovative development, and commercialization of speech recognition." This is the first S2S system that allows bidirectional (English-Mandarin) free-form speech input and output (other languages are also supported). While their solution is domain-specific, it represents one of the strategic advancements many would argue to be a "Grand Challenge" in speech processing.
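The six-step pipeline above amounts to composing three components per direction: ASR, machine translation, and synthesis. The sketch below is a minimal illustration of that composition only, not any real system: the recognize/translate/synthesize functions and the toy lexicon are hypothetical stand-ins for actual ASR, MT, and TTS engines.

```python
def recognize(audio, lang):
    # Hypothetical ASR stand-in: audio in `lang` -> text in `lang`.
    # For illustration we pretend the "audio" is already a transcript.
    return audio

def translate(text, src, tgt):
    # Hypothetical MT stand-in: a toy word-for-word lexicon in place
    # of a real statistical or neural translation engine.
    toy_lexicon = {("en", "zh"): {"hello": "nihao"},
                   ("zh", "en"): {"nihao": "hello"}}
    table = toy_lexicon.get((src, tgt), {})
    return " ".join(table.get(word, word) for word in text.split())

def synthesize(text, lang):
    # Hypothetical TTS stand-in: returns a placeholder waveform label.
    return "<%s speech: %s>" % (lang, text)

def s2s(audio, src, tgt):
    """One direction of the pipeline: ASR -> MT -> TTS (steps i-iii)."""
    text_src = recognize(audio, src)
    text_tgt = translate(text_src, src, tgt)
    return synthesize(text_tgt, tgt)

# Bidirectional use: Speaker A (English) to Listener B (Mandarin),
# then the reply back (steps iv-vi are the same pipeline reversed).
print(s2s("hello", "en", "zh"))  # <zh speech: nihao>
print(s2s("nihao", "zh", "en"))  # <en speech: hello>
```

The point of the sketch is the factoring: each direction is the same three-stage composition with the language pair swapped, which is why S2S systems can reuse one pipeline for both participants.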

Speech Recognition for All Languages.
Another topic that should be considered in the context of a "Grand Challenge" is speech recognition for all languages. It is estimated that more than 6,900 languages are spoken in the world, with countless dialects, languages with no written form, and languages considered "dying languages" because their numbers of speakers are dwindling to the point where the language will become extinct. Wikipedia lists 10 languages spoken by more than 100M speakers and another 12 with between 50M and 100M speakers, and it is estimated that 330 languages are spoken by more than 1M speakers. However, if one considers which languages enjoy the most effective working speech recognition platforms, the number might be fewer than 30. As such, there is a cultural, economic, and societal need to see speech recognition, as well as various forms of language technology (e.g., spoken document retrieval, dialect/language ID, automatic translation (see above), etc.), move to new and under-researched languages. IARPA recently announced its goal to focus on this topic in the BABEL program. Also, organizations such as CALICO (the Computer Assisted Language Instruction Consortium) have focused on this for language learning for a number of years. Advancements here would clearly represent one of the core "Grand Challenges" in speech and language technology.

Thought-to-Speech Signal Production.
One area that has long challenged speech scientists is the ability to tap directly into the thought processes of the brain in order to translate them into a speech signal. Mapping Broca's Area and Wernicke's Area, along with the speech articulators in the motor cortex, is not an easy task. For individuals who suffer from permanent paralysis or an inability to vocalize any speech, some have suggested the prospect of implanting a microelectrode array into the language areas of the brain, so that when the patient "thinks" of what they would like to say, that information is sensed and transmitted, perhaps wirelessly, to an external speech synthesis engine where artificial speech is produced. For subjects who know they will lose their ability to speak (e.g., due to pending surgery), collecting sufficient speech content beforehand allows them to keep their own voice (those of you who know the movie critic Roger Ebert have seen an example of how saved speech can help restore one's voice after a severe health crisis). In this area, there was a very interesting paper at the Interspeech-2009 conference that considered an artificial speech synthesizer controlled by a brain-computer interface. The subject had a neural prosthesis for speech restoration and was able to perform vowel production from thought to artificial synthesis.

While there are other topics in our field, these might represent a starting point. With this, I look forward to seeing all of you in Prague at the upcoming ICASSP-2011!
