Speech and Language Processing Technical Committee Newsletter

November 2013

Welcome to the Winter 2013-2014 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter! This issue of the newsletter includes 7 articles and announcements from 25 contributors, including our own staff reporters and editors. Thank you all for your contributions! This issue includes news about IEEE journals and recent workshops, SLTC call for nominations, and individual contributions.

We believe the newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions.

To subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.

Dilek Hakkani-Tür, Editor-in-chief
William Campbell, Editor
Haizhou Li, Editor
Patrick Nguyen, Editor


From the SLTC and IEEE

From the IEEE SLTC chair

Douglas O'Shaughnessy


The SLaTE 2013 Workshop

Pierre Badin, Thomas Hueber, Gérard Bailly, Martin Russell, Helmer Strik

SLaTE 2013 was the 5th workshop organised by the ISCA Special Interest Group on Speech and Language Technology in Education. It took place between 30th August and 1st September 2013 in Grenoble, France as a satellite workshop of Interspeech 2013. The workshop was attended by 68 participants from 20 countries. Thirty-eight submitted papers and 14 demonstrations were presented in oral and poster sessions.


The INTERSPEECH 2013 Computational Paralinguistics Challenge - A Brief Review

Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani

The INTERSPEECH 2013 Computational Paralinguistics Challenge was held in conjunction with INTERSPEECH 2013 in Lyon, France, 25-29 August 2013. This Challenge was the fifth in a series held at INTERSPEECH since 2009 as an open evaluation of speech-based speaker state and trait recognition systems. Four tasks were addressed, namely social signals (such as laughter), conflict, emotion, and autism. Sixty-five teams registered, the baselines given by the organisers were exceeded, and a new reference feature set computed with the openSMILE feature extractor, together with the four corpora used, is publicly available in the repository of the series.


An Overview of the Base Period of the Babel Program

Tara N. Sainath, Brian Kingsbury, Florian Metze, Nelson Morgan, Stavros Tsakalidis

The goal of the Babel program is to rapidly develop speech recognition capability for keyword search in previously unstudied languages, working with speech recorded in a variety of conditions with limited amounts of transcription. Several issues and observations frame the challenges driving the Babel Program. The speech recognition community has spent years improving the performance of English automatic speech recognition systems. However, applying techniques commonly used for English ASR to other languages has often resulted in large performance gaps for those languages. In addition, there is an increasing number of languages for which there is a vital need for speech recognition technology but few existing training resources [1]. It is easy to envision a situation in which there is a large amount of recorded data containing important information in a language for which there are very few people available to analyze it and no existing speech recognition technologies. Having keyword search in that language to pick out important phrases would be extremely beneficial.


MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker-Recognition Research

Seyed Omid Sadjadi, Malcolm Slaney, and Larry Heck

We are happy to announce the release of the MSR Identity Toolbox: a MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. It provides researchers with a test bed for developing new front-end and back-end techniques, allowing replicable evaluation of new advancements. It will also help newcomers in the field by lowering the "barrier to entry," enabling them to quickly build baseline systems for their experiments. Although the focus of this toolbox is on speaker recognition, it can also be used for other speech-related applications such as language, dialect, and accent identification. Additionally, it provides many of the functionalities available in other open-source speaker recognition toolkits (e.g., ALIZE [1]) but with a simpler design, which makes it easier for users to understand and modify the algorithms.


The REAL Challenge

Maxine Eskenazi

The Dialog Research Center at Carnegie Mellon (DialRC) is organizing the REAL Challenge. The goal of the REAL Challenge (dialrc.org/realchallenge) is to build speech systems that are used regularly by real users to accomplish real tasks. These systems will give the speech and spoken dialog communities steady streams of research data as well as platforms they can use to carry out studies. It will engage both seasoned researchers and high school and undergrad students in an effort to find the next great speech applications.


SPASR workshop brings together speech production and its use in speech technologies

Karen Livescu

The Workshop on Speech Production in Automatic Speech Recognition (SPASR) was recently held as a satellite workshop of Interspeech 2013 in Lyon on August 30.


Speaker Identification: Screaming, Stress and Non-Neutral Speech, is there speaker content?

John H.L. Hansen, Navid Shokouhi

The field of speaker recognition has evolved significantly over the past twenty years, with great efforts worldwide from many groups/laboratories/universities, especially those participating in the biannual U.S. NIST SRE - Speaker Recognition Evaluation [1]. Recently, there has been great interest in the ability to perform effective speaker identification when speech is not produced in "neutral" conditions. Effective speaker recognition requires knowledge and careful signal processing/modeling strategies to address any mismatch that could exist between the training and testing conditions. This article considers some past and recent efforts, as well as suggested directions, for speaker recognition when subjects move from a "neutral" speaking style to increased vocal effort and ultimately pure "screaming". In the United States recently, there has been discussion in the news regarding the ability to accurately perform speaker recognition when the audio stream consists of a subject screaming. Here we present a probe experiment, preceded by some background on speech under non-neutral conditions.



From the SLTC Chair

Douglas O'Shaughnessy

SLTC Newsletter, November 2013

Welcome to the final SLTC Newsletter of 2013.

I am pleased to announce the results of this year's election to renew the membership of the SLTC. We elected 17 members to replace the 17 members whose terms expire next month, as well as a new Vice Chair for the committee, Bhuvana Ramabhadran. She has been a member of the SLTC since 2010 and is a well-known member of our technical community, having been at IBM's T.J. Watson Research Center since 1995. Currently, she manages a team of researchers in the Speech Recognition and Synthesis Research Group and coordinates research activities at IBM's labs in China, Tokyo, Prague and Haifa.

She was a technical area chair for ICASSP 2013 and for Interspeech 2012, and was one of the lead organizers and technical chair of IEEE ASRU 2011 in Hawaii. She co-organized the HLT-NAACL workshop on language modeling in 2011, special sessions on Sparse Representations in Interspeech 2010 and on Speech Transcription and Machine Translation at the 2007 ICASSP in Honolulu, and organized the HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval.

Of the 17 newly-elected members, seven are members continuing for a second term, while ten are new to the SLTC. By area, we elected members in Speech Recognition (Jasha Droppo, Yifan Gong, Mark Hasegawa-Johnson, John Hershey, Frank Seide, Peder Olsen, George Saon, Hagen Soltau, Andreas Stolcke, Shinji Watanabe), in Speaker Recognition (Nick Evans), in Speech Synthesis (Tomoki Toda), in Natural Language Processing (Larry Heck), in Dialogue Systems (Svetlana Stoyanchev), in Speech Analysis (Gernot Kubin, Deep Sen), and in Speech Enhancement (Maurizio Omologo). The large number in ASR is due to the fact that we have so many papers in this area at ICASSP and needed a significant increase to help with the reviews; we had been proportionally under-represented in this area.

At this time, we look forward to ASRU-2013 next month (the IEEE Automatic Speech Recognition and Understanding Workshop in the Czech Republic, www.asru2013.org). In addition to the excellent program of oral and poster papers, drawn from the submissions evaluated by our committee and others this past summer, and a nice social program (recall the pleasant pre-Christmas locale of Merano, Italy for ASRU-2009), there are 14 invited speakers scheduled, grouped around three major topics: neural nets, limited resources, and ASR in applications.

I remind our speech and language community of the importance of doing reviews for the ICASSP-2014 paper submissions. If you have not signed up to be a reviewer and wish to help, please contact the committee. The reviews should start in a few weeks.

While still a bit distant, we are looking forward to the next IEEE Spoken Language Technology Workshop, to be held at the South Lake Tahoe Resort Hotel in South Lake Tahoe, California (Dec. 6-9, 2014).

In closing, I hope you will consider reviewing ICASSP submissions as well as participating at ASRU-2013, ICASSP-2014 and SLT-2014. We look forward to meeting friends and colleagues in beautiful Olomouc, Florence, and Tahoe.

Best wishes,

Douglas O'Shaughnessy

Douglas O'Shaughnessy is the Chair of the Speech and Language Processing Technical Committee.


The SLaTE 2013 Workshop

Pierre Badin, Thomas Hueber, Gérard Bailly, Martin Russell, Helmer Strik

SLTC Newsletter, November 2013

SLaTE 2013 was the 5th workshop organised by the ISCA Special Interest Group on Speech and Language Technology in Education. It took place between 30th August and 1st September 2013 in Grenoble, France as a satellite workshop of Interspeech 2013. The workshop was attended by 68 participants from 20 countries. Thirty-eight submitted papers and 14 demonstrations were presented in oral and poster sessions.

SLaTE 2013

SLaTE is the ISCA Special Interest Group (SIG) on Speech and Language Technology in Education. SLaTE 2013 was the 5th SLaTE workshop, with previous workshops in Farmington (USA, 2007), Wroxall Abbey (UK, 2009), Tokyo (Japan, 2010) and Venice (Italy, 2011). It was organised by Pierre Badin, Gérard Bailly, Thomas Hueber and Didier Demolin from GIPSA-lab, and Françoise Raby from LIDILEM.

SLaTE 2013 was attended by 68 participants from 20 countries. Thirty-eight submitted papers and 14 demonstrations were presented in oral and poster sessions over three days. SLaTE 2013 also featured plenary lectures by invited speakers, each an expert in his or her field. Diane Litman (University of Pittsburgh) presented her work on enhancing the effectiveness of spoken dialogue for STEM (Science, Technology, Engineering and Mathematics) education. Jozef Colpaert (Universiteit Antwerpen) described his work on the role and shape of speech technologies in well-designed language learning environments. The third invited speaker, Mary Beckman (Ohio State University), was unable to attend the workshop and her talk on enriched technology-based annotation and analysis of child speech was presented by Benjamin Munson (University of Minnesota).

Topics covered in the regular technical sessions included speech technologies for children and children's education, computer assisted language learning (CALL), and prosody, phonetics and phonology issues in speech and language in education.

The SLaTE Assembly was held on Saturday 31st August. Martin Russell (University of Birmingham, UK) and Helmer Strik (Radboud University, Nijmegen, the Netherlands) were re-elected as chair and secretary/treasurer of SLaTE, respectively. The assembly felt that SLaTE has become a successful and established SIG, with regular workshops typically attracting 70 attendees and 45 submitted papers. Nevertheless, there was enthusiasm for improving and expanding SLaTE, leading to a discussion about the future locations and frequency of SLaTE workshops - because SLaTE is currently a biennial satellite workshop of Interspeech, it will always be held in Europe. The discussion expanded to include special sessions at SLaTE, invited plenary talks, the submitted-paper acceptance rate, and possible colocation with other conferences. It was agreed that a questionnaire covering these issues will be circulated to the SLaTE mailing list.

In addition to its technical content, SLaTE 2013 featured an excellent social programme, including a cheese and wine reception in the sunny courtyard of the bar 'Le Saxo' in Grenoble, and a banquet in the 'Chez le Pèr Gras' restaurant in the mountains overlooking the city, reached via Grenoble's Bastille cable car.

The next SLaTE workshop will be held in Leipzig in 2015, as a satellite event of Interspeech 2015.

Acknowledgements and more information

Pierre Badin (Pierre.Badin@gipsa-lab.grenoble-inp.fr), Thomas Hueber (Thomas.Hueber@gipsa-lab.grenoble-inp.fr) and Gérard Bailly (Gerard.Bailly@gipsa-lab.grenoble-inp.fr) are at GIPSA-lab, CNRS - Grenoble-Alps University, France; Martin Russell (m.j.russell@bham.ac.uk) is at the University of Birmingham, UK; Helmer Strik (w.strik@let.ru.nl) is at Radboud University, the Netherlands.


The INTERSPEECH 2013 Computational Paralinguistics Challenge - A Brief Review

Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani

SLTC Newsletter, November 2013

The INTERSPEECH 2013 Computational Paralinguistics Challenge was held in conjunction with INTERSPEECH 2013 in Lyon, France, 25-29 August 2013. This Challenge was the fifth in a series held at INTERSPEECH since 2009 as an open evaluation of speech-based speaker state and trait recognition systems. Four tasks were addressed, namely social signals (such as laughter), conflict, emotion, and autism. Sixty-five teams registered, the baselines given by the organisers were exceeded, and a new reference feature set computed with the openSMILE feature extractor, together with the four corpora used, is publicly available in the repository of the series.

The INTERSPEECH 2013 Computational Paralinguistics Challenge was organised by Björn Schuller (Université de Genève, Switzerland / TUM, Germany / Imperial College London, England), Stefan Steidl (FAU, Germany), Anton Batliner (TUM/FAU, Germany), Alessandro Vinciarelli (University of Glasgow, Scotland / IDIAP Research Institute, Switzerland), Klaus Scherer (Université de Genève, Switzerland), Fabien Ringeval (Université de Fribourg, Switzerland), and Mohamed Chetouani (UPMC, France). The Challenge dealt with short-term states (emotion in twelve categories) and long-term traits (autism in four categories). In addition, group discussions were analysed to detect conflict (in two classes, but also given as a continuous 'level of conflict' from -10 to +10), and a frame-level task was presented with the detection of the 'social signals' laughter and filler. The area under the receiver operating curve (AUC), which is particularly suited to detection tasks, was used for the first time as one of the competition measures, alongside unweighted average recall (UAR) for the classification tasks. UAR is the sum of the per-class accuracies (i.e., the recall values) divided by the number of classes. With four tasks, this was the largest Computational Paralinguistics Challenge so far. Also, enacted data had not been considered alongside naturalistic data before this year's edition. A further novelty of this year's Challenge was the provision of a script for reproducing the baseline results on the development set in an automated fashion, including pre-processing, model training, model evaluation, and scoring by the competition measures and the further measures outlined below.
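
As a concrete illustration of the UAR measure described above, the following minimal Python sketch (not part of the Challenge tooling; class labels and data are made up for illustration) computes the unweighted average recall and contrasts it with plain accuracy:

    import numpy as np

    def unweighted_average_recall(y_true, y_pred, classes):
        """Mean of per-class recalls; every class counts equally,
        regardless of how many instances it has."""
        recalls = []
        for c in classes:
            mask = np.asarray(y_true) == c
            if mask.sum() == 0:
                continue  # class absent from the reference labels
            correct = (np.asarray(y_pred)[mask] == c).sum()
            recalls.append(correct / mask.sum())
        return float(np.mean(recalls))

    # Toy example with an imbalanced two-class problem:
    y_true = ['laughter'] * 2 + ['garbage'] * 8
    y_pred = ['laughter', 'garbage'] + ['garbage'] * 8
    print(unweighted_average_recall(y_true, y_pred, ['laughter', 'garbage']))
    # 0.75: recall is 0.5 for 'laughter' and 1.0 for 'garbage',
    # even though overall accuracy is 0.9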

The following corpora provided clearly defined training, development, and test partitions incorporating speaker independence, as needed in most real-life settings: the Scottish-English SSPNet Vocalisation Corpus (SVC) from mobile phone conversations, the Swiss-French SSPNet Conflict Corpus (SC²) featuring broadcast political debates - both provided by the SSPNet - the Geneva Multimodal Emotion Portrayals (GEMEP) featuring professionally enacted emotions in 16 categories, provided by the Swiss Center for Affective Sciences, and the French Child Pathological Speech Database (CPSD), provided by UPMC, including speech of children with Autism Spectrum Condition.

Four Sub-Challenges were addressed: in the Social Signal Sub-Challenge, the non-linguistic events laughter and filler had to be detected and localised based on acoustic information. In the Conflict Sub-Challenge, group discussions had to be automatically evaluated with the aim of recognising conflict as opposed to non-conflict; for the training and development data, the continuous level of conflict was also given, and this information could be used for model construction or for reporting more precise results on the development partition. In the Emotion Sub-Challenge, the emotion of a speaker had to be determined from a closed set of twelve categories by a learning algorithm operating on acoustic features. In the Autism Sub-Challenge, three types of pathology had to be distinguished from the absence of pathology by a classification algorithm operating on acoustic features.

A new set of 6,373 acoustic features per speech chunk, again computed with TUM's openSMILE toolkit, was provided by the organisers. The set was based on low-level descriptors that can be extracted at the frame level by a script provided by the organisers. For the Social Signals Sub-Challenge, which requires localisation, a frame-wise feature set was derived from these descriptors. The features could be used directly, sub-sampled, altered, etc., and combined with other features.

As in the 2009-2012 Challenges, the labels of the test set were unknown, and all learning and optimisation had to be based only on the training material. Each participant could upload instance predictions up to five times to receive results. The format was instance and prediction, optionally with additional per-class probabilities. This allowed a final fusion of all participants' results, to demonstrate the potential maximum achievable by combined efforts. As is typical in the field of Computational Paralinguistics, the classes were unbalanced. Accordingly, the primary measures to optimise were UAR and unweighted average AUC (UAAUC); taking the unweighted average is necessary to properly reflect the imbalance of instances among classes. In addition, the correlation coefficient (CC) was given for the continuous level of conflict, though not as a competition measure.

The organisers did not take part in the Sub-Challenges but provided baselines using the WEKA toolkit, a standard tool, so that the results would be reproducible. As in previous editions, Support Vector Machines (and Support Vector Regression for the additional continuous information) were chosen for classification and optimised on the development partition. The baselines on the test sets were 83.3% UAAUC (social signals), 80.8% UAR (2-way conflict classification), 40.9% UAR (12-way emotion categorisation), and 67.1% UAR (4-way autism diagnosis), with chance levels of 50%, 50%, 8%, and 25% UAR, respectively.
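
A rough sketch of such a baseline pipeline is shown below. It uses scikit-learn in Python as a stand-in for the WEKA tools actually used by the organisers, generates random data in place of the 6,373-dimensional openSMILE features, and searches an illustrative (not official) grid of SVM complexity values on the development partition:

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import recall_score

    # Stand-in data: in the Challenge these would be the openSMILE feature
    # vectors and class labels of the training and development partitions.
    rng = np.random.RandomState(0)
    X_train, y_train = rng.randn(300, 50), rng.randint(0, 2, 300)
    X_devel, y_devel = rng.randn(100, 50), rng.randint(0, 2, 100)

    best_uar, best_c = -1.0, None
    for c in (1e-5, 1e-4, 1e-3, 1e-2, 1e-1):   # illustrative complexity grid
        clf = make_pipeline(StandardScaler(), LinearSVC(C=c, max_iter=10000))
        clf.fit(X_train, y_train)
        uar = recall_score(y_devel, clf.predict(X_devel), average='macro')  # UAR
        if uar > best_uar:
            best_uar, best_c = uar, c

    print('best C on development set: %g (UAR %.3f)' % (best_c, best_uar))
    # A final system would be retrained on train+devel with best_c and its
    # predictions for the blind test partition uploaded for scoring.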

All participants were encouraged to compete in all Sub-Challenges, and each participant had to submit a paper to the INTERSPEECH 2013 Computational Paralinguistics Challenge Special Event. The results of the Challenge were presented in a Special Event of INTERSPEECH 2013 (a double session), and the winners were awarded in the closing ceremony by the organisers. Four prizes (each 125 EUR, sponsored by the Association for the Advancement of Affective Computing (AAAC), the former HUMAINE Association) could be awarded under the pre-conditions that the accompanying paper was accepted for the Special Event in the INTERSPEECH 2013 general peer review, that the provided baseline was exceeded, and that the best result in a Sub-Challenge was reached in the respective competition measure. Overall, 65 sites registered for the Challenge, 33 groups actively took part and uploaded results, and 15 participant papers were finally accepted for presentation.

The Social Signals Sub-Challenge was awarded to Rahul Gupta, Kartik Audhkhasi, Sungbok Lee, and Shrikanth Narayanan, all from the Signal Analysis and Interpretation Lab (SAIL), Department of Electrical Engineering, University of Southern California at Los Angeles, CA, U.S.A., who reached 0.915 UAAUC in their contribution "Paralinguistic Event Detection from Speech Using Probabilistic Time-Series Smoothing and Masking".

The Conflict Sub-Challenge Prize was awarded to Okko Räsänen and Jouni Pohjalainen both from the Department of Signal Processing and Acoustics, Aalto University, Finland, who obtained 83.9% UAR in their paper "Random Subset Feature Selection in Automatic Recognition of Developmental Disorders, Affective States, and Level of Conflict from Speech".

The Emotion Sub-Challenge Prize was awarded to Gábor Gosztolya, Róbert Busa-Fekete, and László Tóth, all from the Research Group on Artificial Intelligence, Hungarian Academy of Sciences and University of Szeged, Hungary, for their contribution "Detecting Autism, Emotions and Social Signals Using AdaBoost". Róbert Busa-Fekete is also with the Department of Mathematics and Computer Science, University of Marburg, Germany. They won this Sub-Challenge with 42.3% UAR.

The Autism Sub-Challenge Prize was awarded to Meysam Asgari, Alireza Bayestehtashk, and Izhak Shafran from the Center for Spoken Language Understanding, Oregon Health & Science University, Portland, OR, U.S.A. for their publication "Robust and Accurate Features for Detecting and Diagnosing Autism Spectrum Disorders". They reached 69.4% UAR.

Overall, the results of the 33 uploading groups were mostly very close to each other, and significant differences between the results accepted for publication were as rare as one might expect in such a close competition. However, by late fusion (equally weighted voting over the N best participants' results), new baseline scores in terms of UAAUC and UAR exceeding all individual participants' results could be established in all Sub-Challenges except the Autism Sub-Challenge. The general lesson learned is thus again "together we are best": the different feature representations and learning architectures evidently contribute added value when combined. In addition, the Challenge clearly demonstrated the difficulty of dealing with real-life data; this challenge remains.
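
For the classification tasks, the late-fusion scheme mentioned above can be sketched as simple majority voting over the N best participants' label files. The Python illustration below is a hypothetical, simplified version; the organisers' exact fusion procedure (e.g., tie-breaking and handling of detection scores) may differ:

    from collections import Counter

    def majority_vote(prediction_lists):
        """Equally weighted voting over several systems' per-instance labels.
        prediction_lists: list of label sequences, all of the same length."""
        fused = []
        for labels in zip(*prediction_lists):
            # most_common(1) breaks ties by first occurrence; the Challenge
            # organisers' tie-breaking rule is not specified here
            fused.append(Counter(labels).most_common(1)[0][0])
        return fused

    system_a = ['A', 'B', 'B', 'C']
    system_b = ['A', 'B', 'C', 'C']
    system_c = ['B', 'B', 'C', 'A']
    print(majority_vote([system_a, system_b, system_c]))  # ['A', 'B', 'C', 'C']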

In the last time slot of the second session of the Challenge, the past and (possible) future of these challenges were discussed, and the audience filled in a questionnaire. The answers from the 35 questionnaires we received can be summarised as follows, and corroborate the pre-conditions and the setting chosen so far: the time provided for the experiments (ca. 3 months) and the number of possible uploads (5) were considered sufficient. The performance measures used (UAR and AUC) are preferred over other possible measures. Participation should only be possible if the paper is accepted in the review process; "additional 2-page material" papers should not be established for rejected papers, but possibly offered as a voluntary option. There is a strong preference for a "Special Event" at Interspeech rather than a satellite workshop of Interspeech or an independent workshop. The benefits of these challenges for the community, and hence the appropriate criteria for accepting a paper, are foremost considered to be interesting or new computational approaches and/or phonetic/linguistic features; boosting performance above the baseline was the second most important criterion.

For more information on the 2013 Computational Paralinguistics Challenge (ComParE 2013), see the webpage on emotion-research.net.

The organisers would like to thank the sponsors of INTERSPEECH 2013 ComParE: The Association for the Advancement of Affective Computing (AAAC), SSPNet, and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 289021 (ASC-Inclusion).

Björn Schuller is an Associate of the University of Geneva’s Swiss Center for Affective Sciences and Senior Lecturer at Imperial College London and Technische Universität München. His main interests are Computational Paralinguistics, Computer Audition, and Machine Learning. Email: schuller@IEEE.org

Stefan Steidl is a Senior Researcher at Friedrich-Alexander University Erlangen-Nuremberg. His interests are multifaceted ranging from Computational Paralinguistics to Medical Image Segmentation. Email: steidl@cs.fau.de

Anton Batliner is Senior Researcher at Technische Universität München. His main interests are Computational Paralinguistics and Phonetics/Linguistics. Email: Anton.Batliner@lrz.uni-muenchen.de

Alessandro Vinciarelli is Senior Lecturer at the University of Glasgow (UK) and Senior Researcher at the Idiap Research Institute (Switzerland). His main interest is Social Signal Processing. Email: vincia@dcs.gla.ac.uk

Klaus Scherer is a Professor emeritus at the University of Geneva. His main interest is Affective Sciences. Email: Klaus.Scherer@unige.ch

Mohamed Chetouani is a Professor at University Pierre and Marie Curie-Paris 6. His main interests are Social Signal Processing and Social Robotics. Email: mohamed.chetouani@upmc.fr

Fabien Ringeval is an Assistant-Doctor at Université de Fribourg. His main interest is Multimodal Affective Computing. Email: fabien.ringeval@unifr.ch


An Overview of the Base Period of the Babel Program

Tara N. Sainath, Brian Kingsbury, Florian Metze, Nelson Morgan, Stavros Tsakalidis

SLTC Newsletter, November 2013

Program Overview

The goal of the Babel program is to rapidly develop speech recognition capability for keyword search in previously unstudied languages, working with speech recorded in a variety of conditions with limited amounts of transcription. Several issues and observations frame the challenges driving the Babel Program. The speech recognition community has spent years improving the performance of English automatic speech recognition systems. However, applying techniques commonly used for English ASR to other languages has often resulted in large performance gaps for those languages. In addition, there is an increasing number of languages for which there is a vital need for speech recognition technology but few existing training resources [1]. It is easy to envision a situation in which there is a large amount of recorded data containing important information in a language for which there are very few people available to analyze it and no existing speech recognition technologies. Having keyword search in that language to pick out important phrases would be extremely beneficial.

The languages addressed in the Babel program are drawn from a variety of different language families (e.g., Afro-Asiatic, Niger-Congo, Sino-Tibetan, Austronesian, Dravidian, and Altaic). Consistent with the differences among language families, the languages have different phonotactic, phonological, tonal, morphological, and syntactic properties.

The program is divided into two phases, Phase I and Phase II, each of which is divided into two periods. In the 28-month Phase I, 75-100% of the data is transcribed, while in the 24-month Phase II, only 50% of the data is transcribed. The first 16-month period of Phase I focuses on telephone speech, while the next 12-month period uses both telephone and non-telephone speech. These channel conditions continue in Phase II.

During each program period, researchers work with a set of development languages to develop new methods. Between four and seven development languages are provided per period, with the number of languages increasing (and the development time decreasing) as the program progresses. At the end of each period, researchers are evaluated on an unseen surprise language, with constraints on both system build time and the amount of available transcribed data. These constraints help to put the focus on developing methods that are robust across different languages, rather than tailored to specific languages. For this reason, the development languages are not identified until the kickoff meetings for each program period, and the surprise languages are revealed only at the beginning of each evaluation exercise.

In addition to the challenges associated with limited transcriptions and build time, technical challenges include developing methods that are effective across languages, achieving robustness to speech recorded in noisy conditions with channel diversity, designing effective keyword search algorithms for speech, and analyzing the factors contributing to system performance.

Evaluation

All research teams are evaluated on a keyword search (KWS) task for both the development and surprise languages. The goal of the KWS task is to find all of the occurrences of a "keyword", i.e., a sequence of one or more words in a language's original orthography, in an audio corpus of unsegmented speech data [1,2].

April 2013 marked the completion of the base period of the Babel program, and teams were evaluated on their KWS performance for the Vietnamese surprise language task. The larger Vietnamese corpus consisted of 100 hours of speech in the training set, using only language resources limited to the supplied language pack (BaseLR) and no test audio reuse (NTAR). This condition is known as Full Language Pack (FullLP) + BaseLR + NTAR. The smaller corpus consisted of 10 hours of speech in the training set, again under the BaseLR and NTAR conditions. This condition is known as LimitedLP + BaseLR + NTAR [2].

The performance of the FullLP and LimitedLP systems is measured in terms of Actual Term Weighted Value (ATWV) for keyword search and Token Error Rate (% TER) for transcription accuracy.
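
For reference, the term-weighted value underlying ATWV is defined in the OpenKWS evaluation plan [2] roughly as follows (the beta value quoted here is the one we recall from that plan):

    \mathrm{TWV}(\theta) = 1 - \frac{1}{K} \sum_{k=1}^{K} \Big[ P_{\mathrm{miss}}(k,\theta) + \beta \, P_{\mathrm{FA}}(k,\theta) \Big]

where K is the number of evaluated keywords, P_miss(k,θ) and P_FA(k,θ) are the miss and false-alarm probabilities for keyword k at decision threshold θ, and β (999.9 in OpenKWS13) weights false alarms relative to misses. ATWV is the TWV obtained at the threshold actually chosen by the system, so a value of 1.0 corresponds to perfect search and 0.0 to a system that returns nothing.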

Below we highlight the system architectures and results of the four teams participating in the program.

Team Babelon

The Babelon team consists of BBN (lead), Brno University of Technology, Johns Hopkins University, LIMSI/Vocapia, Massachusetts Institute of Technology, and North-West University. In addition to improving fundamental ASR, the principal focus of the Babelon team is to design and implement the core KWS technology, which goes well beyond ASR technology. The most critical areas for this effort thus far are described below.

The primary KWS system is a combination of different recognizers using different acoustic or language models across the sites within the Babelon team, including (1) HMM systems from BBN, (2) DNN and HMM systems from Brno University of Technology, and (3) HMM systems from LIMSI/Vocapia. All three systems use robust neural-network-based features. The scores of each system's output are normalized before combination. The team also focused on single systems; the best single-system result was never more than 10% behind the combined result.
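
One common way to normalize keyword scores before combination is keyword-specific sum-to-one normalization of the detection scores; the Python sketch below is a hedged illustration of that general idea, not necessarily the exact scheme used by the Babelon team, and the tuple layout for detections is assumed for the example:

    from collections import defaultdict

    def sum_to_one_normalize(detections):
        """detections: list of (keyword, start_time, score) tuples from one system.
        Rescales scores so that, for each keyword, they sum to one; this puts
        keywords with very different numbers of hits on a comparable scale
        before system combination and thresholding."""
        totals = defaultdict(float)
        for kw, _, score in detections:
            totals[kw] += score
        return [(kw, t, score / totals[kw] if totals[kw] > 0 else 0.0)
                for kw, t, score in detections]

    hits = [('xin chao', 12.4, 0.9), ('xin chao', 80.1, 0.3), ('hanoi', 5.0, 0.2)]
    print(sum_to_one_normalize(hits))
    # [('xin chao', 12.4, 0.75), ('xin chao', 80.1, 0.25), ('hanoi', 5.0, 1.0)]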

Team LORELEI

LORELEI is an IBM-led consortium participating in the Babel program, and includes researchers from Cambridge University, Columbia University, CUNY Queens College, New York University, RWTH Aachen, and the University of Southern California. The approach to the Babel problem taken by LORELEI has emphasized combining search results from multiple indexes produced by a diverse collection of speech recognition systems. The team has focused on system combination for three reasons: it produces the best performance in the face of limited training data and acoustically challenging material; it allows a wide range of tradeoffs between computational requirements and keyword search performance by varying the number and complexity of the speech recognition systems used in combination; and the system combination framework provides a good environment in which to implement and test new ideas. In addition to fundamental work on speech recognition and keyword search technologies, the consortium is also pursuing work in automatic morphology, prosody, modeling of discourse structure, and machine learning, all with the aim of improving keyword search on new languages.

The primary entries from LORELEI in the surprise language evaluation on Vietnamese used the same general architecture for indexing and search. First, all conversation sides were decoded with a speaker-independent acoustic model, and the transcription output was post-processed to produce a segmentation of the evaluation data. Next, a set of speech transcription systems, most of which were multi-pass systems using speaker adaptation, were run to produce word-level lattices. Then, as the final step in indexing, the lattices from each transcription system were post-processed to produce lexical and phonetic indexes for keyword search. All indexes were structured as weighted finite-state transducers.

The LORELEI primary full language pack evaluation system combined search results from six different speech recognition systems: four using neural-network acoustic models and two using GMM acoustic models with neural-network features. One of the neural-network acoustic models was speaker-independent, while the other five models were speaker-adapted. Three of the models performed explicit modeling of tone and used pitch features, while the other three did not.

Likewise, the LORELEI primary limited language pack evaluation system combined search results from six different speech recognition systems: one conventional GMM system, two GMM systems using neural-network features, and three neural-network acoustic models. One of the neural-network acoustic models was speaker-independent, while the other five models were speaker-adapted. A notable feature of the limited language pack system is that one of the neural-network feature systems used a recurrent neural network for feature extraction.

Team RADICAL

The RADICAL consortium is the only university-led consortium in Babel, consisting of Carnegie Mellon University (lead and integrator, Pittsburgh and Silicon Valley campuses), The Johns Hopkins University, Karlsruhe Institute of Technology, Saarland University, and Mobile Technologies. Systems are developed using the Janus and Kaldi toolkits, which are benchmarked internally and combined at suitable points of the pipeline.

The overall system architecture of the RADICAL submissions to the OpenKWS 2013 evaluation [2] is best described as a combination of Janus-based and Kaldi-based subsystems.

Janus-based systems use retrieval based on confusion networks, while Kaldi-based systems use OpenFST-based retrieval. System combination was found to be beneficial on all languages, both development and surprise. While development focused on techniques useful for the LimitedLP conditions, the 2013 evaluation systems were tuned first and foremost for the primary FullLP condition. The evaluation systems also incorporated a number of further techniques.

Team Swordfish

Swordfish is a relatively small team, consisting of ICSI (the lead and system developer), University of Washington, Northwestern University, Ohio State University, and Columbia University.

Given the team size, most of the effort was focused on improving single systems. Swordfish developed two systems that shared many components: one based on HTK and the other on Kaldi. In each case the front end incorporated hierarchical bottleneck neural networks whose inputs were both vocal tract length normalization (VTLN)-warped mel-frequency cepstral coefficients (MFCCs) and novel pitch and probability-of-voicing features generated by a neural network that used critical-band autocorrelations as input. Speech/nonspeech segmentation was implemented with an MLP-FST approach. The HTK-based system used a cross-word triphone acoustic model with 16 mixtures per state for the FullLP case and an average of 12 mixtures per state for the LimitedLP case. The Kaldi-based system incorporated SGMM models.

For both systems, the primary LM was a standard Kneser-Ney smoothed trigram, but the team also experimented (for the LimitedLP) with sparse plus low-rank language modeling, and in some cases obtained small improvements. The HTK-based system learned multiwords from the highest-weight non-zero entries in the sparse matrix. For Vietnamese, some pronunciation variants were collapsed across dialects. Swordfish's keyword search has thus far focused primarily on a word-based index (except for Cantonese, where merged character and word posting lists were used), discarding occurrences where the time gap between adjacent words is more than 0.5 seconds.
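
The word-index search with a 0.5-second gap constraint described above can be sketched roughly as follows. This is a simplified Python illustration assuming a posting list of (start, end, score) hits per word; it is not the Swordfish code itself:

    def find_keyword(postings, keyword_words, max_gap=0.5):
        """postings: dict word -> sorted list of (start, end, score) hits.
        Returns keyword occurrences built by chaining adjacent word hits,
        discarding chains where the gap between words exceeds max_gap seconds."""
        # Seed candidate occurrences with hits of the first word.
        candidates = [([hit], hit[2]) for hit in postings.get(keyword_words[0], [])]
        for word in keyword_words[1:]:
            extended = []
            for chain, score in candidates:
                prev_end = chain[-1][1]
                for start, end, s in postings.get(word, []):
                    if 0.0 <= start - prev_end <= max_gap:
                        extended.append((chain + [(start, end, s)], score * s))
            candidates = extended
        return [(chain[0][0], chain[-1][1], score) for chain, score in candidates]

    postings = {
        'hello': [(1.0, 1.3, 0.9), (7.0, 7.2, 0.4)],
        'world': [(1.4, 1.8, 0.8), (9.0, 9.5, 0.7)],
    }
    print(find_keyword(postings, ['hello', 'world']))
    # [(1.0, 1.8, ~0.72)]: the 7.0s 'hello' has no 'world' within 0.5s, so it is dropped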

Results

The performance of the surprise language full and limited language pack primary systems, measured in terms of Actual Term Weighted Value (ATWV) for keyword search and Token Error Rate (% TER) for transcription accuracy, is summarized below:

Team        FullLP                LimitedLP
            TER (%)    ATWV       TER (%)    ATWV
Babelon     45.0       0.625      55.9       0.434
LORELEI     52.1       0.545      66.1       0.300
RADICAL     51.0       0.452      65.9       0.223
Swordfish   55.9*      0.332*     71.0*      0.120*

Table 1: Official NTAR condition surprise language results for Base period languages (* indicates single system results. All other results are based on system combination.)

Acknowledgements

Thank you to Mary Harper and Ronner Silber of IARPA for their guidance and support in helping to prepare this article.

References

[1] "IARPA broad agency announcement IARPA-BAA-11-02," 2011, https://www.fbo.gov/utils/view?id= ba991564e4d781d75fd7ed54c9933599.

[2] OpenKWS13 Keyword Search Evaluation Plan, March 2013 www.nist.gov/itl/iad/mig/upload/OpenKWS13-EvalPlan.pdf.

[3] M. Karafiat, F. Grezl, M. Hannemann, K. Vesely, and J. H. Cernocky, "BUT BABEL System for Spontaneous Cantonese," in INTERSPEECH, 2013.

[4] B. Zhang, R. Schwartz, S. Tsakalidis, L. Nguyen, and S. Matsoukas, "White listing and score normalization for keyword spotting of noisy speech," in INTERSPEECH, 2012.

[5] D. Karakos et al., "Score Normalization and System Combination for Improved Keyword Spotting," in ASRU, 2013.

[6] R. Hsiao et al., "Discriminative Semi-supervised Training for Keyword Search in Low Resource Languages," in ASRU, 2013.

[7] F. Grezl and M. Karafiat, "Semi-supervised Bootstrapping Approach for Neural Network Feature Extractor Training," in ASRU, 2013.

[8] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative Training of Deep Neural Networks," in INTERSPEECH, 2013.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com


MSR Identity Toolbox v1.0: A MATLAB Toolbox for Speaker-Recognition Research

Seyed Omid Sadjadi, Malcolm Slaney, and Larry Heck

Microsoft Research

SLTC Newsletter, November 2013

We are happy to announce the release of the MSR Identity Toolbox: a MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. It provides researchers with a test bed for developing new front-end and back-end techniques, allowing replicable evaluation of new advancements. It will also help newcomers in the field by lowering the "barrier to entry," enabling them to quickly build baseline systems for their experiments. Although the focus of this toolbox is on speaker recognition, it can also be used for other speech-related applications such as language, dialect, and accent identification. Additionally, it provides many of the functionalities available in other open-source speaker recognition toolkits (e.g., ALIZE [1]) but with a simpler design, which makes it easier for users to understand and modify the algorithms.

The MATLAB tools in the Identity Toolbox are computationally efficient for three reasons: vectorization, parallel loops, and distributed processing. First, the code is simple and easy for MATLAB to vectorize. With long vectors, most of the CPU time is spent in optimized loops, which are the core of the processing. Second, the code is designed for parallelization available through the Parallel Computing Toolbox (i.e., the toolbox codes use "parfor" loops). Without the Parallel Computing Toolbox, these loops execute as normal "for" loops on a single CPU. But when this toolbox is installed, the loops are automatically distributed across all the available CPUs. In our pilot experiments, the codes were run across all 12 cores in a single machine. Finally, the primary computational routines are designed to work as compiled programs. This makes it easy to distribute the computational work to all the machines on a computer cluster, without the need for additional licenses.

Speaker ID Background

In recent years, the design of robust and effective speaker-recognition algorithms has attracted significant research effort from academic and commercial institutions. Speaker recognition has evolved substantially over the past few decades; from discrete vector quantization (VQ) based systems [2] to adapted Gaussian mixture model (GMM) solutions [3], and more recently to factor analysis based Eigenvoice (i-vector) frameworks [4]. The Identity Toolbox, version 1.0, provides tools that implement both the conventional GMM-UBM and state-of-the-art i-vector based speaker-recognition strategies.


Figure 1: Block diagram of a typical speaker-recognition system.

As shown in Fig. 1, a speaker-recognition system includes two primary components: a front-end and a back-end. The front-end transforms acoustic waveforms into more compact and less redundant acoustic feature representations; cepstral features are most often used for speaker recognition. It is practical to retain only the high signal-to-noise ratio (SNR) regions of the waveform, so a speech activity detector (SAD) is also needed in the front-end. After dropping the low-SNR frames, the acoustic features are further post-processed to remove linear channel effects. Cepstral mean and variance normalization (CMVN) [5] is commonly used for this post-processing; it can be applied globally over the entire recording or locally over a sliding window. Feature warping [6], which is also applied over a sliding window, is another popular feature-normalization technique that has been successfully applied to speaker recognition. The toolbox provides support for these normalization techniques, although no tool for feature extraction or SAD is provided. The Auditory Toolbox [7] and VOICEBOX [8], which are both written in MATLAB, can be used for feature extraction and SAD.
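
As an illustration of the normalization step described above, a minimal sliding-window CMVN might look like the following. This is a Python/NumPy sketch of the general technique, not code from the toolbox (which implements it in MATLAB), and the window length is only an assumed example value:

    import numpy as np

    def sliding_cmvn(features, win=301):
        """Cepstral mean and variance normalization over a sliding window.
        features: [n_frames, n_coeffs] array; win: window length in frames
        (301 frames is roughly 3 seconds at a 10 ms frame shift)."""
        n_frames, _ = features.shape
        half = win // 2
        out = np.empty_like(features, dtype=float)
        for t in range(n_frames):
            lo, hi = max(0, t - half), min(n_frames, t + half + 1)
            window = features[lo:hi]
            out[t] = (features[t] - window.mean(axis=0)) / (window.std(axis=0) + 1e-10)
        return out

    # Global CMVN is the special case where the window covers the whole recording:
    feats = np.random.randn(500, 13) * 3.0 + 1.5   # stand-in cepstral features
    normed = sliding_cmvn(feats, win=10**6)
    print(normed.mean(), normed.std())             # ~0.0 and ~1.0 after normalization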

The main component of every speaker-recognition system is the back-end, where speakers are modeled (enrolled) and verification trials are scored. The enrollment phase involves estimating a model that represents (summarizes) the acoustic (and often phonetic) space of each speaker. This is usually accomplished with the help of a statistical background model from which the speaker-specific models are adapted. In the conventional GMM-UBM framework, the universal background model (UBM) is a Gaussian mixture model (GMM) trained on a pool of data (known as the background or development data) from a large number of speakers [3]. The speaker-specific models are then adapted from the UBM using maximum a posteriori (MAP) estimation. During the evaluation phase, each test segment is scored either against all enrolled speaker models to determine who is speaking (speaker identification), or against the background model and a given speaker model to accept or reject an identity claim (speaker verification).
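
A highly simplified sketch of the GMM-UBM verification score described above is shown below in Python, using scikit-learn Gaussian mixtures on random stand-in data. Note that it simply re-estimates a speaker model initialised from the UBM rather than performing true MAP adaptation, so it is only a conceptual stand-in for the toolbox's MATLAB implementation:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    background = rng.randn(5000, 13)               # pooled development features
    enroll = rng.randn(500, 13) + 0.5              # features of the target speaker
    test = rng.randn(300, 13) + 0.5                # features of a test segment

    # Universal background model: a GMM trained on the pooled background data.
    ubm = GaussianMixture(n_components=8, covariance_type='diag',
                          random_state=0).fit(background)

    # Speaker model: here simply initialised from the UBM parameters and
    # re-estimated on the enrollment data (the toolbox instead performs
    # MAP adaptation of the UBM means).
    spk = GaussianMixture(n_components=8, covariance_type='diag',
                          weights_init=ubm.weights_, means_init=ubm.means_,
                          random_state=0).fit(enroll)

    # Verification score: average per-frame log-likelihood ratio between the
    # speaker model and the UBM; accept the claim if it exceeds a threshold.
    llr = spk.score(test) - ubm.score(test)
    print('LLR per frame: %.3f' % llr)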

In the i-vector framework, on the other hand, the speaker models are estimated through a procedure called Eigenvoice adaptation [4]. A total variability subspace is learned from the development set and is used to estimate a low- (and fixed-) dimensional latent factor, called the identity vector (i-vector), from adapted mean supervectors (the term "i-vector" sometimes also refers to a vector of "intermediate" size, bigger than the underlying cepstral feature vector but much smaller than the GMM supervector). Unlike the GMM-UBM framework, which uses acoustic feature vectors to represent the test segments, in the i-vector paradigm both the model and the test segments are represented as i-vectors. The dimensionality of the i-vectors is normally reduced through linear discriminant analysis (with the Fisher criterion [9]) to annihilate the non-speaker-related directions (e.g., the channel subspace), thereby increasing the discrimination between speaker subspaces. Before the dimensionality-reduced i-vectors are modeled with a generative factor analysis approach called probabilistic LDA (PLDA) [10], they are mean- and length-normalized, and a whitening transformation learned from i-vectors in the development set is applied. Finally, a fast, linear scoring strategy [11], which computes the log-likelihood ratio (LLR) between the same-speaker and different-speaker hypotheses, is used to score the verification trials.
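
The i-vector post-processing chain described above (whitening followed by length normalization before scoring) can be sketched as follows. This is a hedged NumPy illustration on random stand-in i-vectors, with cosine scoring standing in for the toolbox's PLDA log-likelihood-ratio scoring:

    import numpy as np

    def train_whitener(dev_ivectors):
        """Learn a whitening transform (mean plus inverse Cholesky factor of the
        covariance) from development-set i-vectors."""
        mu = dev_ivectors.mean(axis=0)
        cov = np.cov(dev_ivectors - mu, rowvar=False)
        W = np.linalg.inv(np.linalg.cholesky(cov))
        return mu, W

    def whiten_and_length_norm(ivecs, mu, W):
        """Whiten, then project onto the unit sphere (length normalization)."""
        x = (ivecs - mu) @ W.T
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    rng = np.random.RandomState(0)
    dev = rng.randn(1000, 400)          # stand-in development i-vectors
    enroll = rng.randn(1, 400)          # i-vector of the enrolled speaker
    test = rng.randn(1, 400)            # i-vector of the test segment

    mu, W = train_whitener(dev)
    e = whiten_and_length_norm(enroll, mu, W)
    t = whiten_and_length_norm(test, mu, W)

    # With unit-length vectors the cosine score is just a dot product; the
    # toolbox would instead compute a PLDA log-likelihood ratio here.
    print('cosine score: %.3f' % float(e[0] @ t[0]))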

Identity Toolbox

The Identity Toolbox provides tools for speaker recognition using both the GMM-UBM and i-vector paradigms. We have attempted to keep the naming conventions in the code consistent with the formulation and notation used in the literature. This makes it easier for users to compare the theory with the implementation and helps them better understand the concept behind each algorithm. The tools can be run from the MATLAB command line using the available parallelization (i.e., parfor loops), or compiled and run on a computer cluster without the need for a MATLAB license.

The toolbox includes two demos which use artificially generated features to show how different tools can be combined to build and run GMM-UBM and i-vector based speaker recognition systems. In addition, the toolbox contains scripts for performing a small-scale speaker identification experiment using the TIMIT database. Moreover, we have replicated state-of-the-art results on the large-scale NIST SRE-2008 core tasks (i.e., the short2-short3 conditions [15]). The list below shows the different tools available in the toolbox, along with a short description of their capabilities:

Feature normalization

GMM-UBM

i-vector-PLDA

EER and DET plot

The Identity Toolbox is available from the MSR website (http://research.microsoft.com/downloads) under a Microsoft Research License Agreement (MSR-LA) that allows use and modification of the source codes for non-commercial purposes. The MSR-LA, however, does not permit distribution of the software or derivative works in any form.

[1] A. Larcher, J.-F. Bonastre, and H. Li, "ALIZE 3.0 - Open-source platform for speaker recognition," in IEEE SLTC Newsletter, May 2013.

[2] F. Soong, A. Rosenberg, L. Rabiner, and B.-H. Juang, "A vector quantization approach to speaker recognition," in Proc. IEEE ICASSP, Tampa, FL, vol.10, pp.387-390, April 1985.

[3] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, pp. 19-41, January 2000.

[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE TASLP, vol. 19, pp. 788-798, May 2011.

[5] B.S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Am., vol. 55, pp. 1304-1312, June 1974.

[6] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. ISCA Odyssey, Crete, Greece, June 2001.

[7] M. Slaney. Auditory Toolbox - A MATLAB Toolbox for Auditory Modeling Work. [Online]. Available: https://engineering.purdue.edu/~malcolm/interval/1998-010/

[8] M. Brooks. VOICEBOX: Speech Processing Toolbox for MATLAB. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

[9] K. Fukunaga, Introduction to Statistical Pattern Recognition. 2nd ed. New York: Academic Press, 1990, Ch. 10.

[10] S.J.D. Prince and J.H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Proc. IEEE ICCV, Rio de Janeiro, Brazil, October 2007.

[11] D. Garcia-Romero and C.Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. INTERSPEECH, Florence, Italy, August 2011, pp. 249-252.

[12] P. Kenny, "A small footprint i-vector extractor," in Proc. ISCA Odyssey, The Speaker and Language Recognition Workshop, Singapore, Jun. 2012.

[13] D. Matrouf, N. Scheffer, B. Fauve, J.-F. Bonastre, "A straightforward and efficient implementation of the factor analysis model for speaker verification," in Proc. INTERSPEECH, Antwerp, Belgium, Aug. 2007, pp. 1242-1245.

[14] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey, The Speaker and Language Recognition Workshop, Brno, Czech Republic, Jun. 2010.

[15] "The NIST year 2008 speaker recognition evaluation plan," 2008. [Online]. Available: http://www.nist.gov/speech/tests/sre/2008/sre08_evalplan_release4.pdf

[16] "The NIST year 2010 speaker recognition evaluation plan," 2010. [Online]. Available: http://www.itl.nist.gov/iad/mig/tests/sre/2010/NIST_SRE10_evalplan.r6.pdf

Seyed Omid Sadjadi is a PhD candidate at the Center for Robust Speech Systems (CRSS), The University of Texas at Dallas. His research interests are speech processing and speaker identification. Email: omid.sadjadi@ieee.org

Malcolm Slaney is a Researcher at Microsoft Research in Mountain View, California, a Consulting Professor at Stanford CCRMA, and an Affiliate Professor in EE at the University of Washington. He doesn’t know what he wants to do when he grows up. Email: malcolm@ieee.org

Larry Heck is a Researcher in Microsoft Research in Mountain View California. His research interests include multimodal conversational interaction and situated NLP in open, web-scale domains. Email: larry.heck@ieee.org


The REAL Challenge

Maxine Eskenazi

SLTC Newsletter, November 2013

This is a description of the REAL Challenge for researchers and students to invent new systems that have real users.

Overview

The Dialog Research Center at Carnegie Mellon (DialRC) is organizing the REAL Challenge. The goal of the REAL Challenge (dialrc.org/realchallenge) is to build speech systems that are used regularly by real users to accomplish real tasks. These systems will give the speech and spoken dialog communities steady streams of research data as well as platforms they can use to carry out studies. It will engage both seasoned researchers and high school and undergrad students in an effort to find the next great speech applications.

Why have a REAL Challenge?

Humans rely heavily on spoken language to communicate, so it seems natural that we would want to communicate with objects via speech as well. Some speech interfaces do exist and they show promise, demonstrating that smart engineering can palliate imperfect recognition. Yet the general public has not picked up this means of communication as readily as it has the tiny keyboard. About two decades ago, many researchers were using the internet, mostly to send and receive email. They were aware of the potential that it held and waited to see when and how the general public would adopt it. Practically a decade later, thanks to providers such as America Online, who had found how to create easy access, everyday people started to use the internet. And this has dramatically changed our lives. In the same way, we all know that speech will eventually replace the keyboard in many situations in which we want to speak to objects. The big question is what interface or application will bring us into that era.

Why hasn't speech become a more prevalent interface? Most of today’s speech applications have been devised by researchers in the speech domain. While they certainly know what types of systems are “doable”, they may not be the best at determining which speech applications would be universally acceptable.

We believe that students who have not yet had their vision limited by knowledge of the speech and spoken dialog domains, and who have grown up with computers as a given, are the ones who will find new, compelling and universally appealing speech applications. Along with their good ideas, they will need some guidance to gain focus. Having a mentor, attending webinars and participating in a research team can provide this guidance.

The REAL Challenge will combine the talents of these two very different groups. First, it will call upon the speech research community, who know what it takes to implement real applications. Second, it will advertise to and encourage participation from high school students and college undergraduates who love to hack and have novel ideas about using speech.

How can we combine these two types of talent?

The REAL Challenge is starting with a widely-advertised call for proposals. Students can propose an application. Researchers can propose to create systems or to provide tools. A proposal can target any type of application in any language. The proposals will be lightly filtered and the successful proposers will be invited to a workshop on June 21, 2014 to show what they are proposing and to team up. The idea is for students to meet researchers and for the latter to take one or more students on their team. Students will present their ideas and have time for discussion with researchers. A year later, a second workshop will assemble all who were at the first workshop to show the resulting systems (either WOZ experiments with real users or prototypes), measure success and award prizes. Student travel will be taken care of by DialRC through grants.

Preparing students

Students will have help from DialRC and from researchers as they formulate their proposals. DialRC will provide webinars on such topics as speech processing tool basics and how to present a poster. Students will also be assigned mentors. Researchers in speech and spoken dialog can volunteer to be a one-on-one mentor to a student. This consists of being in touch either in person or virtually. Mentors can tell the students about what our field consists of, what the state of the art is, and what it is like to work in research. They can answer questions about how the student can talk about their ideas. If you are a researcher in speech and/or spoken dialog and you would like to be a mentor, please let us know at realchallenge@speechinfo.org.

What is an entry?

The groups will create entries. Here are the characteristics of a successful entry.