Speech and Language Processing Technical Committee Newsletter

April 2011

Welcome to the Spring 2011 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter.

In this issue we are pleased to provide another installment of brief articles representing a diversity of views and backgrounds. This issue includes articles from 8 guest contributors, and our own staff reporters and editors.

This issue also features a new regular section: Award announcements. If you or a colleague has received an award, please let us know and we'll pass along the news.

We believe the newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions. You can submit job postings here, and reach us at speechnewseds [at] listserv (dot) ieee [dot] org.

Finally, to subscribe the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.

Jason Williams, Editor-in-chief
Pino Di Fabbrizio, Editor
Martin Russell, Editor
Chuck Wooters, Editor

From the SLTC and IEEE

From the IEEE SLTC chair

John Hansen

Grand Challenges in Speech & Language Processing

IEEE Signal Processing Society Newsletter

The IEEE Signal Processing Society, our parent organization, also produces a monthly newsletter, "Inside Signal Processing".

A Job Marketplace for Signal Processing Professionals

Alex Acero, SPS Director of Industrial Relations

Editor's note: The SLTC jobs page has long been one of the newsletter's most popular features. We recently worked with the IEEE Signal Processing Society leadership to help other technical committees start and grow their own job boards. This article by Alex Acero, which appeared in the March 2011 issue of "Inside Signal Processing eNewsletter", reports on progress.

Book announcements

Edited by Jason Williams

Award announcements

Includes the MIT Lincoln Laboratory 2010 Best Paper Award. Edited by Jason Williams.

Babel Program - Broad Agency Announcement (BAA)

Contributed by Mary Harper

The IARPA Babel BAA was released on April 07, 2011. A synopsis of this new speech program is given below, along with important dates and links.

Machine Learning for Speech and Language Processing - Symposium and Special Interest Group

Joseph Keshet and Geoffrey Zweig

The newly created Special Interest Group (SIG) on Machine Learning for Speech and Language Technology (SIGML) is organizing a first symposium, to be held June 27th in Bellevue, Washington.

Book Review: Johanna Drucker's "SpecLab: Digital Aesthetics and Projects in Speculative Computing"

Antonio Roque

This book review describes how Johanna Drucker's "SpecLab: Digital Aesthetics and Projects in Speculative Computing" approaches language processing from a humanist's perspective.

An Overview of the TransTac Evaluation and the IBM Speech-to-Speech System

Tara N. Sainath

Over the past few years, the DARPA Translation System for Tactical Use (TransTac) program has create a number of evaluations with the goal of advancing state of the art research in automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS). The speech-to-speech (S2S) translation group at IBM has developed a system which has performed quite well in these evaluations. In this article, we describe the TransTac evaluation and highlight various research directions in the IBM S2S Group.

Question Generation Workshop at AAAI Fall Symposia 2011

Svetlana Stoyanchev and Rashmi Prasad

This is a call for participation for the Fourth Question Generation workshop at AAAI Fall Symposia 2011, to be held Friday - Sunday, November 4-6 in Arlington, Virginia, adjacent to Washington, DC.

The Spoken Dialog Challenge 2010 and 2011

Jason D. Williams

The Spoken Dialog Challenge 2010 was an exercise to investigate how different spoken dialog systems perform on the same task. Dialog systems from different research teams fielded calls from test subjects, and real bus riders. The data will be made available to the community, and admission to the 2011 challenge is now open.

Speech science and technology for real life at INTERSPEECH 2011 in Florence

Piero Cosi, Renato De Mori, Roberto Pieraccini, and Giuseppe Di Fabbrizio

20 years after the second EUROSPEECH Conference, which was held in Genoa, INTERSPEECH returns to Italy, this time in the cradle of renaissance: Florence, on 27-31 August 2011.

From the SLTC Chair

John H. L. Hansen

SLTC Newsletter, April 2011

Welcome to the next installment of the SLTC Newsletter of 2011. It is now less than two months to IEEE ICASSP-2011, so I encourage if you have not done so to register and attend this premier conference. The organizers have done an excellent job in putting together the program with four plenary speakers (including our own SLTC member Nelson Morgan from ICSI: "Does ASR have a PHD, or is it just Piled Higher and Deeper?"), 14 Tutorials (taking place on Sunday and Monday [May 22-23]). Texas Instruments and XILINX also have organized industry workshops on Sunday and Monday as well.

This month, I wanted to pass on some thoughts regarding research directions in speech and language processing. In my role as Department Head for the Univ. of Texas at Dallas, I recently attended the Electrical and Computer Engineering Dept. Heads Conference (ECEDHA-2011) [1] (this meeting is actually open to all Department Heads of EE/ECE from any country). At this meeting, Michael Lightner (Past CEO & President of IEEE) served as Moderator for a feature session entitled: "New Energy Technologies: An Overview of the Next-Generation". In the USA, and also worldwide, energy has rapidly emerged as a key area of focus, with many Electrical Engineering Departments in the US moving to change their names to "Electrical, Computer, and Energy Engineering". While got me thinking about the where the key challenges are in speech? Therefore, in this installment, I wish to consider the topic of "Grand Challenges in Speech and Language Processing", which should be of interest to researchers and development engineers in the field of speech processing and language technology.

Grand Challenges in Speech & Language Processing

In the field of speech processing, many advancements have been made over the past thirty years which have helped shape speech communications, recognition, and various speech/language technologies. To the general public, many perceive the topic of speech recognition to be a solved problem, but most researchers and engineers know there are major impediments for present day speech recognition to be employed in everyday voice applications. Robustness to noise, communication channels, handsets and microphone-mismatch, clearly present major obstacles for general use. However, are there clear "Grand Challenges" present and emerging in the field of speech and language processing? In the United States, the National Academy of Engineering have suggested a number of topics associated with "grand challenges" including: provide energy from fusion, secure cyberspace, reverse-engineer the brain, make solar energy economical, etc. [2]. USA DARPA has established their DARPA Grand Challenge which is focused on an autonomous/driverless vehicle being able to automatically navigate a +150mile route with as many as 195 teams participating from 36 US States and 4 non-US countries. With the concept of Grand Challenges, are their areas in speech and language which we might consider? I would like to suggest the following three: (i) Speech-to-Speech Translation, (ii) Speech Recognition for All Languages, (iii) Thought-to-Speech Signal Production.

Multi-Lingual Speech-to-Speech Translation (S2S)

Recently, a number of efforts have emerged and have demonstrated effective speech-to-speech translation. With +6000 languages spoken in this world, the ability to reduce communication barriers between humans could help reduce (i) differences between peoples where military conflicts might arise, (ii) provide more effective rapid response by emergency and care givers in times of natural disasters, (iii) help encourage closer cooperation in science and engineering advancements, or (iv) just help those traveling to new countries better interact with others. Mobile technology in the form of cell-phones (Android, iPhone, etc.) have enabled improving computing support for mobile communication devices. It is clear that seamless S2S translation is something that would benefit all.

Assume we have Speaker A (who is a speaker of Language A) and Listener B (who is a speaker of Language B). This process requires (i) speech recognition in Language A for the original input speaker, (ii) machine translation of uncertain text from Language A to B, (iii) speech synthesis of Language B for the listener; (iv) speech recognition in Language B, (v) machine translation of uncertain text from Language B to A, (vi) and speech synthesis of Language A for the original speaker who is now the listener. While a number of groups have been active in this area, IBM TJ Watson (their demo video is available online at [4]) recently demo’d their handheld MASTOR (Multilingual Automatic Speech-to-Speech Translator) device, which is completely hosted on off-the-shelf smartphones requiring no server connections, at both IEEE ICASSP-2010 and InterSpeech-2010 (which received several awards from DARPA; and one of the advancements cited for IBM T.J. Watson in them receiving the IEEE Corporate Innovation Recognition in 2009 "For long-term commitment to pioneering research, innovative development, and commercialization of speech recognition."[5]). This reveals the continuation and a significant improvement of the first handheld S2S system allowing for bidirectional (English-Mandarin) large-vocabulary free-form speech input and output that IBM developed in 2003-2004, and now many other languages are also supported. While their solution is focused on a set of specific domains, this represents one of the strategic advancements many would argue to be a "Grand Challenge" in speech processing.

Speech Recognition for All Languages

Another topic which should be considered in the content of a "Grand Challenge" is speech recognition for all languages. It is estimated that there are more than 6900 languages spoken in the world, with countless dialects, languages with no written form, and languages which are considered "dying languages" because the numbers of speakers are dwindling to a point where the language will become extent. Wikipedia [6] lists 10 languages spoken by more than 100M speakers, another 12 with between 50-100M speakers, and it is estimated that 330 languages are spoken by more than 1M . However, if one considers which languages enjoy the most effective working speech recognition platforms, the number might be less than 30. As such, there is a cultural, economic, and societal need to see speech recognition, as well as various forms of language technology (e.g., spoken document retrieval, dialect/language ID, automatic translation (see above), etc.) move to new and under-researched languages. IARPA recently announced their goals to focus on this topic in the BABEL program[7]. Also, organizations such as CALICO – The Computer Assisted Language Instruction Consortium has been focused on this for language learning for a number of years [8]. Advancements here clearly would represent one of the core "Grand Challenges" in speech and language technology.

Thought-to-Speech Signal Production

One area which has challenged speech scientists is the ability to tap directly into the thought process of the brain in order to translate this to a speech signal. Mapping Broca’s Area and Weinicke’s Area, along with the speech articulators in the motor cortex is not an easy task. For those individuals who suffer from permanent paralysis/inability to vocalize any speech, some have suggested the prospects of implanting microelectrode array into the language portion of your brain, and when you "think" of what you would like to say, that information is sensed and transmitted perhaps wirelessly to an external speech synthesis engine where artificial speech is produced. For subjects who know they will lose their ability to speak (e.g., pending surgery, etc.), collecting sufficient speech content prior to surgery allows an individual to maintain their voice (those of you who know the Movie Critic Roger Ebert [10], know and example of how saving prior speech can help restore one’s voice after severe health recovery). For this area, there was a very interesting paper in the Interspeech-2009 conference which considered an artificial speech synthesizer control by a brain-computer interface. The subject had a neural prosthesis for speech restoration and was able to perform vowel production from thought to artificial synthesis. While there are other topics in our field, these might represent a starting point. With this, I look forward to seeing all of you in Prague at this upcoming ICASSP!

References

[1] http://ecedha.org/membership/conference_schedule2011.asp
[2] http://www.engineeringchallenges.org/
[3] http://en.wikipedia.org/wiki/DARPA_Grand_Challenge
[4] https://researcher.ibm.com/researcher/view_page.php?id=2323 (Demos available here)
[5] http://www.ieee.org/about/awards/recognitions/corpinnov.html
[6] http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
[7] http://www.iarpa.gov/Babel_PD_post.pdf
[8] https://calico.org/
[9] http://www.interspeech2009.org/conference/programme/session.php?id=6710
[10] http://en.wikipedia.org/wiki/Roger_Ebert

John H.L. Hansen is Chair, Speech and Language Processing Technical Committee.

Book announcements

To post a book announcement, please email speechnewseds [at] listserv (dot) ieee [dot] org.

Spoken Language Understanding: Systems for Extracting Semantic Information from Speech

Gokhan Tur and Renato De Mori, Eds.
John Wiley & Sons, May 2011, 268 pp., Hardcover. ISBN: 978-0-470-69683-5

Spoken language understanding (SLU) is an emerging field in between speech and language processing, investigating human/ machine and human/ human communication by leveraging technologies from signal processing, pattern recognition, machine learning and artificial intelligence. SLU systems are designed to extract the meaning from speech utterances and its applications are vast, from voice search in mobile devices to meeting summarization, attracting interest from both commercial and academic sectors.

Both human/machine and human/human communications can benefit from the application of SLU, using differing tasks and approaches to better understand and utilize such communications. This book covers the state-of-the-art approaches for the most popular SLU tasks with chapters written by well-known researchers in the respective fields. Key features include:

Presents a fully integrated view of the two distinct disciplines of speech processing and language processing for SLU tasks.
Defines what is possible today for SLU as an enabling technology for enterprise (e.g., customer care centers or company meetings), and consumer (e.g., entertainment, mobile, car, robot, or smart environments) applications and outlines the key research areas.
Provides a unique source of distilled information on methods for computer modeling of semantic information in human/machine and human/human conversations.

This book can be successfully used for graduate courses in electronics engineering, computer science or computational linguistics. Moreover, technologists interested in processing spoken communications will find it a useful source of collated information of the topic drawn from the two distinct disciplines of speech processing and language processing under the new area of SLU.

http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470688246.html

Awards

To share an award announcement, please email speechnewseds [at] listserv (dot) ieee [dot] org.

MIT Lincoln Laboratory 2010 Best Paper Awarded to Tianyu Wang and Thomas F. Quatieri

March 2011

MIT student Tianyu Wang and advisor Thomas F. Quatieri recently received the MIT Lincoln Laboratory 2010 Best Paper Award for their publication: "High-Pitch Formant Estimation by Exploiting Temporal Change of Pitch," published in the IEEE Transactions on Audio, Speech, and Language Processing, January 2010. The Best Paper Award recognizes the Lincoln Laboratory author(s) of the most outstanding published paper appearing in a peer-reviewed journal or peer-selected conference publication during an approximate one-year period preceding the award announcement. At a recent MIT Lincoln Laboratory awards ceremony, Tianyu and Tom received plaques with a citation that reads: "Recognized for creativity, technical and seminal importance, and rigorous analysis in spectral-temporal processing and representation of speech." Tianyu and Tom are members of the MIT Lincoln Laboratory Human Language Technology Group.

http://www.ll.mit.edu/news/ebel-ross-excellenceawards.html

Babel Program - Broad Agency Announcement (BAA)

Contributed by Mary Harper

SLTC Newsletter, April 2011

The IARPA Babel BAA was released on April 07, 2011. A synopsis of this new speech program is given below, along with important dates and links.

Synopsis

The Babel Program will develop agile and robust speech recognition technology that can be rapidly applied to any human language in order to provide effective search capability for analysts to efficiently process massive amounts of real-world recorded speech. Today’s transcription systems are built on technology that was originally developed for English, with markedly lower performance on non-English languages. These systems have often taken years to develop and cover only a small subset of the languages of the world. Babel intends to demonstrate the ability to generate a speech transcription system for any new language within one week to support keyword search performance for effective triage of massive amounts of speech recorded in challenging real-world situations.

Important Dates:

Proposers' Day: January 20, 2011
BAA Release Date: April 07, 2011
BAA Question Period: April 07, 2011 - May 24, 2011
Proposal Due Date: June 07, 2011

For more information, see:

Machine Learning for Speech and Language Processing - Symposium and Special Interest Group

Joseph Keshet and Geoffrey Zweig

SLTC Newsletter, April 2011

The newly created Special Interest Group (SIG) on Machine Learning for Speech and Language Technology (SIGML) is organizing a first symposium, to be held June 27th in Bellevue, Washington.

Symposium on Machine Learning in Speech and Language Processing

A symposium on Machine Learning in Speech and Language Processing will held June 27th in Bellevue, Washington. The goal of the symposium is to foster communication and collaboration between researchers in machine learning, speech recognition and natural language processing. The symposium takes advantage of the nearby locations of ACL-HLT 2011 (the annual meeting of the association for computational linguistics) and ICML 2011 (international conference on machine learning). It will bring together members of the Association for Computational Linguistics, the International Speech Communication Association, and the International Machine Learning Society.

The symposium will feature a series of invited talks and general submissions by speakers from all the three communities, such as Yoshua Bengio, Jeff Bilmes, Ming-Wei Chang, Stanley Chen, Ed Hovy, Sanjoy Dasgupta, Mark Hasegawa-Johnson, David McAllester, Bhuvana Ramabhadran, George Saon, Lawrence Saul, and Mark Steedman. Submissions focusing on novel research are solicited, and position and review papers are especially encouraged. The organizing committee of the symposium includes Hal Daume III, Joseph Keshet, Dan Roth and Geoffry Zweig. The scientific program committee includes Jeff Bilmes, Brian Kingsbury and Karen Livescu.

The symposium website is http://www.ttic.edu/sigml/symposium2011/. Papers are due April 15, 2011. The symposium will be held at the Hyatt Regency Hotel in Bellevue, Washington, the site of ICML on June 27, 2011. This is an especially pleasant time of year in the Pacific Northwest, and attendees may want to consider spending some extra time exploring Seattle itself, and the surrounding natural attractions. The Olympic National Park, Mount Rainier National Park and the North Cascades National Park are just three examples of nearby scenic areas.

Special Interest Group (SIG) on Machine Learning in Speech and Language Processing

The symposium is part of the activity of the newly created Special Interest Group (SIG) on Machine Learning for Speech and Language Technology (SIGML) within the International Speech communication Association (ISCA). SIGML have the overall aim of promoting research in modern machine learning specifically applied to speech and language processing, and of encouraging interaction between the speech community and the machine learning community. To this end, it has the following objectives: promoting conferences and workshops, promoting electronic discussion through the internet (web site, discussion list, etc.), maintaining a database of active researchers, and providing a channel of communication between the two communities.

A website with resources for both the speech and machine learning communities, including member lists, links to data, code, and more will be available soon here: http://www.ttic.edu/sigml/. The ISCA SIG is interested in continuing an annual workshop series along the lines of the symposium.

Book Review: Johanna Drucker's "SpecLab: Digital Aesthetics and Projects in Speculative Computing"

Antonio Roque

SLTC Newsletter, April 2011

This book review describes how Johanna Drucker's "SpecLab: Digital Aesthetics and Projects in Speculative Computing" approaches language processing from a humanist's perspective.

Many language technologists are aware of the value of an interdisciplinary perspective to language research, having worked with linguists, psychologists, educators, and others. As language technology takes on more sophisticated texts and more complex interactions, it is worth considering the efforts of researchers in the humanities, such as those who have been studying the ways that human aesthetic texts (fiction, for example) are best understood.

Technology has greatly influenced the humanities: the wide availability of inexpensive networked computers has created the field of digital humanities (also known as humanities computing) to exploit those new resources. Predictably, there have already been several reactions against digital humanities; this book "SpecLab: Digital Aesthetics and Projects in Speculative Computing" is part of one such reaction, which embraces the new technology while emphasizing the need to remember the lessons learned by literary theorists over the past century. Drucker's book discusses the projects of her SpecLab group in the early 00s, and the theories she developed from it. The book is principally aimed at other researchers in the humanities, but provides an interesting perspective to language research that may be useful to language technologists.

SpecLab was a group of researchers at the University of Virginia from 2000 to around 2008; the lab was cofounded by Johanna Drucker, a literary theorist, poet, and visual artist. Drucker's Speclab projects were a reaction to turn-of-the-century digital humanities projects that digitized texts, performed counts and statistical analyses on those texts, defined metadata to annotate and classify texts, and built tools and information structures to work with the texts. Drucker acknowledges that these computational efforts are good in requiring that humanities scholars make their ideas precise enough to model computationally. However, she believes computational methods are not as beneficial if one pretends that their constraints do not involve interpretation. Specifically, computation should not be used as an excuse to make decisions that tacitly assume the objectivity of knowledge and the transparency of its transmission.

Research in the humanities involves dealing with complexity and ambiguity in texts as well as in interpretation. Creating a model of knowledge, such as a coding scheme to create tags for annotating documents, can be difficult when modeling complex and ambiguous phenomena; doing so requires a subjective stance. A problem arises when digital humanities methodologies assume objectivity of knowledge and transparency of its transmission.

This is because Drucker subscribes to a theory of knowledge as partial, subjective, and situated. Subjectivity in this context has two components: a point of view inscribed in the possible interpretations of a work, and "inflection, the marked presence of affect and specificity, registered as the trace of difference, that inheres in material expressions" (p. 21) To Drucker, subjectivity of knowledge is evident in the fact that interpretation occurs in modeling, encoding, processing, and accessing knowledge.

Her solution is developing projects that make subjectivity explicit in the production of knowledge. The point of these projects (which are described below) is not just the artifact being studied, but where and how the artifact is processed, and how the assumptions of objectivity built into that processing can be exposed. Part of her approach is to use aesthetics and design as a way of challenging assumptions of objectivity in knowledge representation and use, by highlighting how perception and experience are important to knowledge.

Drucker supplements poststructuralist critical theory of the late 20th century with ideas from other disciplines. Philosopher Charles Peirce is favored over linguist Ferdinand de Saussure: signification involves an agent to whom a sign means something, not just a signifier and signified. Inspired by cognitive scientists and systems theorists, Drucker can maintain that "knowledge is always interpretation, and thus located in a perceiving entity whose position, attitudes, and awareness are all constituted in a codependent relation with its environment." (pg. 20) She uses quantum physics as a metaphor to suggest that a text is a field of potential readings; meaning is created when a reader intervenes. Also influential is the 'patacritical approach of the writer Alfred Jarry, which values outliers and exceptions over generalizations.

Drucker develops the notion of Speculative Computing as opposed to Digital Humanities. This involves a focus on humanities tools in digital contexts rather than digital tools in humanities contexts. Using the theoretical ideas described above, she supports interpretation as deformance rather than analysis, identity as codependent over self-identity, and the value of individual cases rather than general characteristics, for example (p. 25-26).

Drucker's theoretical ideas developed along with the SpecLab projects, so reading about the specifics of the SpecLab projects after reading her ideas is a bit anticlimactic. This is because what's most interesting about the projects is not the result of their analyses, but the ways in which the projects influenced her ideas. Also, the project descriptions focus on the design process and usually include a number of sketches which may be interesting from a design point of view but are not necessarily relevant to language technologists.

SpecLab's projects were:

The Temporal Modeling project, which involved designing timeline software to focus on temporal relations rather than time. This involved integrating items important to humanists such as foreshadowing, mood, causality, and point of view.
Ivanhoe, which was a system whose users select roles and develop personalities inspired by textual interpretations, to perform rewritings using texts and multimedia objects. Doing so highlights subjectivity and the creation of a discourse field. Social interactions among the users was a key component.
Subjective Meteorology, which was a design for a proposed system whose content uses meteorology as a metaphor for an individual's experience.
DTDs for AbsOnline, which were Document Type Definitions that highlighted how metatexts act as interpretation were developed for the Artists Books Online collection.
And finally, the 'Patacritical Demon, which was a conceptual representation of the act of interpretation, and is more of a mascot or symbol than something to ever be implemented.

The book ends with a series of essays on various topics in new media, aesthetics, and text processing, guided by the ideas developed earlier.

If interpretation is subjective, I should feel free to say that I was fascinated by the book's general theme on the subjectivity of knowledge and the suggestion of ways it could be used in implemented projects, while being frustrated on an almost per-page basis by the details of it. Consider Drucker's statement that "Humanists are skilled at complexity and ambiguity. Computers, as is well known, are not." (p 7) It's true that humans have natural skills at managing ambiguous inputs in ways that most contemporary algorithms do not. But it's also true that many systems can handle information processing of a complexity that no human can, as well as store, track, and reason about conflicting information in ways that human cannot. This is one of many examples that show Drucker's understanding of computation to be closer to the level of XML than to the level of natural language processing research. It's unfortunate that language engineers were not part of Drucker's lab, as they might have introduced her to the statistical methods that were literally becoming textbook approaches at the time. For example, there was no need to resort to quantum physics to inspire the multi-dimensionality of text interpretation; statistical NLP would have been enough.

However, Drucker has a good point that few humanists, let alone engineers, are likely to consider the impact of subjectivity during interpretation when developing novel computer tools for knowledge management. Although this book is not aimed at language technologists, it provides a useful perspective on language, knowledge, and computation.

For more information, see:

Johanna Drucker, "SpecLab: Digital Aesthetics and Projects in Speculative Computing", University Of Chicago Press (June 1, 2009). Preview: http://books.google.com/books?id=VPXCk396uPYC&lpg=PP1&dq=speclab&pg=PP1#v=onepage&q&f=false
"Humanistic Approaches to the Graphical Expression of Interpretation", an online video of a talk by Drucker that discusses many of the same issues: http://mitworld.mit.edu/video/796

An Overview of the TransTac Evaluation and the IBM Speech-to-Speech System

Tara N. Sainath

SLTC Newsletter, April 2011

Background on TransTac Eval

The TransTac evaluation methodology can be described as follows. Soldiers are given a system from a particular group being evaluated. The soldier asks questions in English to a person who speaks a foreign language, for example Pashto. First, ASR is performed to convert the spoken English words to text. Next, an MT system translates the English text into Pashto. Then, a TTS system converts the Pashto text into synthesized speech. The Pashto speaking person hears the question and replies back in Pashto. The S2S engine works bidirectionally, translating from English to Pashto and also from Pashto to English.

In the above process, everything is recorded and manually transcribed. There are a number of components of the above process that are scored. First, the word error rate (WER) of the ASR system is calculated. Next, the performance of the English to Pashto translation is evaluated using a variety of scores including BLUE, TER and METEOR. Finally, the Likert score is used to evaluate the speech synthesis.

Overview of IBM S2S System

Research related the TransTac evaluation has led to an increased need to make real-time speech-to-speech (S2S) translation technology more robust and suitable for real world applications [1]. Specifically, research in robustness has been focused in three main categories. First, research has focused on improving model robustness to different speaker and environmental conditions. Second, research has focused on developing models which are robust to different amounts and quality of training data. Third, research has focused on developing code to run on low-computational platforms. Below, we describe specific techniques within the ASR, MT and TTS components which focus on improving various aspects of robustness.

ASR

First, research in ASR has focused on improving recognition accuracy while reducing the recognition latency. Specifically, discriminative training and speaker adaptation methods are being explored to improve model robustness. In addition, low latency ASR is being explored in using techniques such as feature streaming and fast Gaussian computation [2]. Finally, bootstrapping and model restructuring is used to improve acoustic models for low-resourced languages [3].

MT

Second, novel aspects of the MT system have also been introduced. Improvements to accuracy include an integrated speech translation decoding which combines ASR and MT. To improve latency, the system uses fixed-point arithmetic and efficient memory management. In addition, a number of techniques have been explored to address MT with limited amounts of data. This includes methods such as improved alignment with bilingual chart parsing [4], optimized word alignment combination and phrase table extraction [5], active learning using monolingual data [6] and improved reordering models using POS constraints [7]. Furthermore, improved translation of conversational speech with limited computational overhead using soft syntactic constraints has also recently been explored [8].

TTS

Finally, research in TTS has focused on improving the quality and intelligibility of the synthesized spoken output, particularly for under-resourced languages. This is accomplished by using statistical methods, such as automatic generation of word to phoneme sequence, syllable boundaries and stress conversion, improving the robustness of the TTS engine. By combining these techniques with concatenative Text-to-speech techniques [9], it is possible to rapidly build and deploy TTS engines for lower-resourced languages.

Acknowledgements

Thank you to Bowen Zhou and Sameer Maskey of the Speech-to-Speech Translation Group at IBM T.J. Watson Research Center for useful discussions related to the content of this article.

References

[1] B. Zhou et. al, ``Towards Robust Speech-to-Speech Translation For Real World Applications," submitted to Computer Speech and Language, 2011.
[2] L. Gu, J. Xue, X. Cui, Y. Gao, ``High-performance Low-Latency Speech Recognition via Multi-Layered Feature Streaming and Fast Gaussian Computation", in Proc. Interspeech, 2008.
[3] X. Cui et. al, ``Acoustic Modeling with Bootstrap and Restructuring for Low-resourced Languages," in Proc. Interspeech, 2010.
[4] M. Cmejrek and B. Zhou, ``Two Methods for Extending Hierarchical Rules from the Bilingual Chart Parsing," in Proc. Coling, 2010.
[5] Y. Deng and B. Zhou, ``Optimizing Word alignment Combination for Phrase Table Training", in Proc. ACL, Short Papers, 2009.
[6] B. Xiang, B. Zhou and M. Cmejrek, ``Advances in Syntax-based Malay-English Speech Translation", in Proc. ICASSP, 2009.
[7] S. Maskey and B. Zhou, ``Rapid Integration of Parts of Speech Information to Improve Reordering Model for English-Farsi Speech to Speech Translation", in Proc. ICASSP, 2010.
[8] Z. Huang, M. Cmejrek and B. Zhou, "Soft Syntactic Constraints for Hierarchical Phrase-based Translation using Latent Syntactic Distributions," in Proc. EMNLP, 2010.
[9] W. Zhang and X. Cui, ``Applying Scalable Phonetic Context Similarity in Unit Selection of Concatenative Text-to-Speech," in Proc. Interspeech, 2010.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com

What is the SLTC? A brief history

Svetlana Stoyanchev and Rashmi Prasad

SLTC Newsletter, April 2011

Asking questions is a fundamental cognitive process that underlies higher-level cognitive abilities such as comprehension and reasoning. Ultimately, question generation allows humans, and in many cases artificial intelligence systems, to understand their environment and each other. This is a call for participation for the Fourth Question Generation workshop at AAAI Fall Symposia 2011, to be held Friday - Sunday, November 4-6 in Arlington, Virginia, adjacent to Washington, DC.

Introduction

Research on question generation (QG) has a long history in artificial intelligence, psychology, education, and natural language processing. One thread of research has been theoretical, with attempts to understand and specify the triggers (e.g., knowledge discrepancies) and mechanisms (e.g., association between type of knowledge discrepancy and question type) underlying QG. The other thread of research has focused on automated QG, which has far-reaching applications in intelligent technologies, such as dialogue systems, question answering systems, web search, intelligent tutoring systems, automated assessment systems, inquiry-based environments, adaptive intelligent agents and game-based learning environments.

The Past QG Workshops

The past Question Generation workshops were held in conjunction with the 10th International Conference on Intelligent Tutoring Systems in 2010 and the International Conference on Artificial Intelligence in Education in 2009. They have served as a highly interdisciplinary forum for a diverse community of researchers from Artificial Intelligence, Natural Language Processing, Cognitive Science, and Psychology, working on theories and applications of QG. The Question Generation Workshop 2010 was collocated with the Question Generation Shared Task Evaluation Challenge.

The 2010 QG workshop and Shared Task Challenge included computational papers that used a variety of approaches to automatic question generation employing theories of syntax, minimal recursion semantics, discourse relations, and semantic sentence structure. From a psychology perspective, one of the papers described the effects of cognitive disequilibrium (a situation when a learner is forced to handle an obstacle that prevents them from achieving a goal) on Question Generation by humans.

Art Graesser, the invited speaker of the QG2010 pointed out the importance of question asking for human learners in a learning task and effectiveness of vicarious learning when students observe animated conversational agents asking deep questions (Craig, Sullins, Witherspoon, & Gholson, 2006). These findings suggest that question generation technology can play an important role in developing students' learning skills and abilities.

Workshop Topics

QG-2011 is being held in the same spirit as the previous workshops, to foster theoretical and applied research on computational and cognitive aspects of QG. As such, it seeks the involvement of participants from diverse disciplines, including, but not limited to, Natural Language Processing, Artificial Intelligence, Linguistics, Psychology, and Education. We invite submissions that deal with theoretical, empirical, and computational aspects of QG, encouraging completed as well as speculative or in-progress work. Topics will include, but will not be limited to, the following:

Cognitive models of QG
Role of QG in learning and problem solving
Question taxonomies
Empirical approaches to QG
QG tasks and subtasks
Corpus annotation schemes for QG
Representation language(s) for data/resource sharing between QG systems
Impact of NLP technologies on QG tasks
Context-sensitive question type selection or ranking
Descriptions of implemented systems or components
Applications of QG (intelligent tutoring systems, dialogue systems, web querying, querying over information repositories, etc.)
Generation from different inputs - knowledge bases, ontologies, text, queries

The goal for this year's meeting is to create an environment for an active exchange of ideas among researchers working on the QG problem from different perspectives, and to move forward in designing focused QG tasks. The broad format of the symposium will consist of "thematic sessions", "panel discussions", "break-out sessions", and "interactive poster and demo sessions", and will feature two invited talks.

Workshop Organizers

This year's workshop organizers are: Arthur Graesser, James Lester, Jack Mostow, Rashmi Prasad, and Svetlana Stoyanchev.

Summary

We invite participants in Question Generation Workshop 2011 at AAAI Fall symposia. The deadline for submitting an extended abstract (1500-2000 words) is May 13, 2011. Please see the Question Generation website for up-to-date information on the workshop and paper submission. If you are interested in joining our mailing list, please send an email to s.stoyanchev [at] open.ac.uk.

If you have comments, corrections, or additions to this article, please contact the authors and organizers of the workshop: Svetlana Stoyanchev, s.stoyanchev [at] open.ac.uk and Rashmi Prasad rjprasad [at] seas.upenn.edu

Svetlana Stoyanchev is Research Fellow at the Open University. Her interests are in dialogue, discourse, language generation, question answering, and information presentation.

Rashmi Prasad is a Senior Research Scientist at the Institute for Research in Cognitive Science, University of Pennsylvania. She received her PhD in Linguistics from the Department of Linguistics, University of Pennsylvania. Her research interests include discourse and dialogue processing, biomedical NLP, and natural language generation.

The Spoken Dialog Challenge 2010 and 2011

Jason D. Williams

SLTC Newsletter, April 2011

The Spoken Dialog Challenge 2010 was an exercise to investigate how different spoken dialog systems perform on the same task. Organization was led by Alan Black and Maxine Eskenazi at Carnegie Mellon University, and funding was provided by the US National Science Foundation. The task was providing bus timetable information for Pittsburgh, PA, USA. Participation was open to any research team -- see the Call for Participation issued in July 2009.

Research teams could participate by entering in one (or more) of three challenge tracks:

Build a spoken dialog system;
Build a simulated user; or
Propose and illustrate a method for evaluating the systems and/or simulated users.

The Spoken Dialog Challenge 2010 had entries in all three tracks.

Carnegie Mellon University (CMU) made several resources available to research teams. First, CMU has operated a spoken dialog system called "Let's Go" for several years [1]. "Let's Go" is used by real Pittsburgh bus riders during the evening, when the Pittsburgh bus company's call center is closed. The source of the "Let's Go" system, including everything needed for operation including speech recognizer, language models, bus timetable database, and text-to-speech were provided to participants. This allowed teams to start with an end-to-end system rather than building from scratch, if desired. Second, CMU made call and utterance data collected over several years available to research teams, including approximately 250,000 utterances transcribed by crowd-sourcing [2].

There were two phases to the Spoken Dialog Challenge 2010. In the first phase, each dialog system was used by a small group of test subjects. Test subjects were given pre-defined tasks to accomplish usnig the system, such as finding the next bus from downtown to the airport. After each call, test subjects were asked to fill out a short survey about the call. This phase has the benefit that task completion can be accurately computed, because the tasks are prescribed. During this phase, simulated users also called each system. All four dialog systems entered in the challenge participated in this phase.

In the second phase, calls from real bus riders were routed to each dialog system. This phase has the benefit of testing systems under real use, where people actually suffer the consequences of failed dialogs. It has been shown in the past that real users behave quite differently than usability subjects [3], so testing on real users is crucial to understanding true system performance. However, in this phase, it is not possible to measure user satisfaction well. Measuring task completion is also subject to some interpretation -- for example, a user hang-up can indicate a failed call, or simply that the user's bus has arrived so no further dialog is needed. Three dialog systems participated in this phase. Each dialog system received approximately 500-800 phonecalls in July and August of 2010.

A special session was held at IEEE Workshop on Spoken Language Technology (SLTC) in December 2010. At this session, results from the first phase with usability subjects were published, including an overview of the first phase with overall performance of the four dialog systems entered [4]. Papers and demonstrations from Challenge participants described their dialog systems, simulated users, and evaluation methods. In the dialog systems track, two of the systems used recent statistical techniques. In the user simulation track, one paper showed a method for simulating user behavior at the audio level, demonstrated by calling all four of the dialog systems. In the evaluation track, one paper showed a method for predicting user satisfaction using collaborative filtering, and another showed how finite state machines could be used for system evaluation. Other papers showed new methods for building dialog systems and how the challenge could be used for teaching spoken dialog development.

Results from the second phase were not available at the time of SLT in December. It is expected that they will be published soon at upcoming conferences such as SigDial 2011.

The Spoken Dialog Challenge 2011 is now under way. In 2011, there will be 2 "streams": the first stream runs from Nov 2010 - Dec 2011, and is designed to enable participants to present complete results at SigDial 2012. The second stream runs from April 2011 - May 2012. In this stream, development activities are concentrated during the summer months. This schedule is designed for people pursuing summer projects, such as students and interns.

Data from the Spoken Dialog Challenge will be made available to the research community. For more information about the challenge, see the Spoken Dialog Challenge website: dialrc.org/sdc/. This website includes an audio and slide presentation describing the challenge task, recorded in November 2010.

References

[1] Raux, A., Langner, B., Black, A., Eskenazi, M. 2005. Let's Go Public! Taking a Spoken Dialog System to the Real World, Interspeech 2005 (Eurospeech), Lisbon, Portugal.
[2] Parent, G., Eskenazi, M. 2010. Toward better crowdsourced transcription: Transcription of a year of the let’s go bus information system data. In Proc SLT, Berkeley, CA.
[3] Ai, H., Raux, A., Bohus, D., Eskenzai, M., Litman, D. 2008. Comparing spoken dialog corpora collected with recruited subjects versus real users. In Proc SIGdial, Columbus, Ohio, USA.
[4] Black, A., Burger, S., Langner, B., Parent, G., and Eskenazi, M. 2010. Spoken Dialog Challenge 2010, SLT 2010, Berkeley, USA.

Jason Williams is Principal Member of Technical Staff at AT&T Labs Research. His interests are dialog systems and planning under uncertainty. He is also on the Organizing Committee of the Spoken Dialog Challenge. Email: jdw@research.att.com

Speech science and technology for real life at INTERSPEECH 2011 in Florence

Piero Cosi, Renato De Mori, Roberto Pieraccini, and Giuseppe Di Fabbrizio

SLTC Newsletter, April 2011

Exactly 20 years after the second EUROSPEECH Conference, which was held in Genoa, INTERSPEECH returns to Italy, this time in the cradle of renaissance: Florence, on 27-31 August 2011.

INTERSPEECH is the world largest and most comprehensive conference on the science and technology of human and machine spoken language. Its proceedings are indexed in ISI Web of Science, Engineering Index and Scopus.

INTERSPEECH is also a unique event that expresses the very interdisciplinary essence of the spoken language field. It brings together a large research community ranging from psychology, linguistics, physiology, and physics, to medicine, education, and engineering.

Looking in prospective, these 20 past years beheld a remarkable growth in spoken language science and technologies, fostering user acceptance and creating a favorable environment for applications at scale. Such life-pervasive presence is conveyed in this year INTERSPEECH theme: "Speech science and technology for real life" and expressed in many of the conference events.

In addition to regular oral and poster sessions, the conference will host:

3 plenary talks by Prof. Julia Hirschberg, Prof. Sandy Pentland, and Prof. Tom Mitchell respectively on “Speaking More Like You: Entrainment in Conversational Speech, “Honest Signals”, and “Neural Representations of Word Meanings”;
A round-table on the theme of the conference "Speech Science and Technology for real life", where three internationally renowned experts, Prof. Gabriele Miceli, Prof. Björn Granström, and Prof. Hiroshi Ishiguro, will share their research and vision on: “Language disorders: viewpoints on a complex object”, “Speech technology in (re)habilitation of persons with communication disabilities”, and “From teleoperated androids to cellphones as surrogates” respectively;

10 tutorials

Functional Data Analysis for Speech Research, Michele Gubian
Automatic Summarization, Ani Nenkova, Sameer Maskey, Yang Liu
Blind Speech Separation based on Independent Component Analysis and Sparse Component Analysis, Shoji Makino, Hiroshi Sawada
Low-dimensional speech representation based on Factor Analysis and its applications, Najim Dehak, Stephen Shum
Pitch Estimation: Advances and Speech Applications, Mads Grasboll Christensen
Spoken Query Understanding, Xiao Li, Gokhan Tur
More than Words Can Say: Prosodic Analysis Techniques and Applications, Andrew Rosenberg
Building an Open Vocabulary ASR System using Open Source Software, Stefan Hahn, David Rybach
Learning with Rich Prior Knowledge, Joao Graca, Gregory Druck, and Kuzman Ganchev
Registers and Resonances in Singing, Joe Wolfe, John Smith, & Maëva Garnier

7 special sessions

2 special events

Moreover, for the first time, a Show & Tell session where attendees will get a chance to experience in first person exciting demonstrations of the latest and most advanced technologies.
Seven satellite events will take place immediately before and after the conference:

AVSP2011 - Workshop on Audio Visual Speech Processing, August 31 - September 3, 2011, Volterra, Italy
Blizzard_Challenge_2011 - Workshop on Evaluating corpus-based speech synthesis on common databases, September 2, 2011, Turin, Italy
CHiME 2011 - Workshop on Machine Listening in Multisource Environments, September 1st, 2011, Florence, Italy
IWSDS 2011 - Workshop on Paralinguistic Information and its Integration in Spoken Dialogue Systems, September 1-3, 2011, Granada, Spain
MAVEBA 2011 - Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, August 25-27, 2011, Florence, Italy
MediaEval 2011 - Workshop on MediaEval Benchmarking Initiative, 1-2 September 2011 Pisa, Italy.
SLaTE 2011 - Workshop on Speech and Language Technology in Education (SLaTE), August 24-26, 2011, Venice, Italy.

Search for “Interspeech 2011” on Facebook or “@IS_2011” on Twitter for last minute updates and news on the conference.

More information: INTERSPEECH 2011 website: http://www.interspeech2011.org/

Chairs: Piero Cosi (ISTC-CNR UOS Padova), Renato De Mori (University of Avignon, FRANCE)
Technical Chairs: Roberto Pieraccini (SpeechCycle, USA), Giuseppe Di Fabbrizio (AT&T Labs Research, USA)