Speech and Language Processing Technical Committee Newsletter

January 2009

Welcome to the Winter 2009 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter.

Editors Mike Seltzer, Stephen Cox, and Brian Mak completed their three-year terms in December 2008, and this is the first edition produced by the next slate of editors, Jason Williams, Pino Di Fabbrizio, and Chuck Wooters. Thanks to Mike, Stephen, and Brian for their support in getting us started! Over the next three years, we look forward to maintaining the high standards Mike and his team have established.

This newsletter is our first on the signalprocessingsociety.org website -- thanks to the great crew at IEEE for getting us running here.

Thanks also to Bano Banerjee, Svetlana Stoyanchev, and Antonio Roque -- our trusty senior staff reporters -- for continuing to serve. I'm delighted they will continue to write articles, and also get their feet wet with reviewing for the newsletter. In addition, it is a pleasure to introduce two new staff reporters, Annie Louis and Filip Jurcicek. Annie is a PhD candidate at the University of Pennsylvania, and Filip is a post-doc at the University of Cambridge, UK.

In addition to articles from our reporters, guest articles are always welcome. This newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions. You can reach us at speechnewseds [at] listserv (dot) ieee [dot] org.

Finally, to subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.

Jason Williams, Editor-in-chief
Pino Di Fabbrizio, Editor
Chuck Wooters, Editor


Welcome from the New SLTC Chair

Steve Young

First of all, I would like to wish everyone in our speech and language community a happy and a prosperous New Year.

Farewell from the Outgoing SLTC Chair

Roberto Pieraccini

The beginning of 2009 marks the end of my two-year term as the Chair of the SLTC, the IEEE Speech and Language Technical Committee.

What is the SLTC? A brief history

Jason Williams

Readers new to the newsletter may not be familiar with the Speech and Language Processing Technical Committee. This article provides a quick introduction and a brief history.

Report from IEEE Workshop on Spoken Language Technology in Goa

Alex Rudnicky

The IEEE workshop on Spoken Language Technology (SLT) took place during December 2008 in Goa (the first IEEE speech meeting to take place in India). SLT focuses on topics that connect speech recognition and understanding to some of its applications, such as machine translation, dialog, search and summarization. A particular focus of this year’s meeting was Spoken Language Technology for Development, the application of SLT to social needs.

Language Death

Stephen Cox

Language, in either its written or spoken form, is the raw material for our research. So it is a shock to realise that a large proportion of the world's languages are under threat of extinction.

Keyword spotting in Czech: First evaluation complete, with funding from Ministries of Interior and Defense

Honza Cernocky

With funding from several Czech Government Ministries, three Czech university speech labs have tackled keyword spotting. In late 2008, an evaluation showed the strengths and weaknesses of their keyword spotting systems. The work received positive marks from its Interior and Defense sponsors, and the researchers look forward to further advances with additional funding.

Loebner Prize at InterSpeech 2009

Filip Jurcicek

The Loebner Prize contest, which is based on the Turing Test, has been held every year since 1991. This year, the contest will be held in conjunction with InterSpeech, in Brighton, UK.

Speech Mashups – Speech Processing on Demand for the Masses

Pino Di Fabbrizio

AT&T is now granting free access to its speech processing technology for use in building speech recognition prototype applications that run on iPhones, BlackBerrys, and other networked devices.

Introducing Toastmasters, a club for improving public speaking skills

Svetlana Stoyanchev

Toastmasters is an international nonprofit organization that helps its members improve their presentation skills. It teaches specific techniques that help people communicate their ideas effectively and master the skills of a competent speaker. The organization has clubs all over the world and can be a useful resource for students and scientists.

Spoken Document Summarization - An Overview

Annie Louis

With the increasing amount of multimedia information generated today, providing suitable interfaces for accessing this data is an interesting research problem. Summarizing the speech contained in multimedia documents promises efficient organization, retrieval and browsing of their contents. Techniques from text summarization can be adapted for speech, but there are also unique challenges associated with spoken documents.

2008 - A Busy Year for the ISCA Student Advisory Committee

Ebru Arisoy, Marco Piccolino-Boniforti, Tiago Falk, Antonio Roque, and Sylvie Saget

The Student Advisory Committee of the International Speech Communication Association has been active in a number of efforts recently. These include a Round Table Research Discussion and Lunch Event, development of an Online Grant Application System, transcriptions of interviews with speech and language technology researchers, and web resource development.


Welcome from the New SLTC Chair

Steve Young

SLTC Newsletter, January 2009

First of all, I would like to wish everyone in our speech and language community a happy and a prosperous New Year. Given the general economic doom and gloom, such a greeting might appear to many to be hopelessly optimistic. However, there are reasons to be hopeful in 2009. Speech and language technology continues to make inroads into the mainstream and new developments in areas such as voice search, multi-modal interfaces, expressive speech and speech translation provide good reasons for commercial and government support to continue. In research, there are also exciting new developments which should continue to entice enthusiastic grad students to enter our field.

Closer to home we have an excellent ICASSP in prospect in Taipei which, for the first time this year, will include a number of thematic symposia designed to showcase specific technology areas and encourage collaboration between technical areas. Then later in the year we will hold our biennial workshop on Automatic Speech Recognition and Understanding, which this year will be in Merano, Italy. I think I am the first ever British chair of the SLTC so I am going to use that as an excuse to also plug a non-IEEE event: Interspeech 2009 will be in Brighton, UK and with the pound sinking fast, it will not only be very good, it will also be very cheap! So forget the doom and gloom, there is much to look forward to in 2009.

As we start the year, we have 14 committee members retiring and I would like to thank them all for the valuable contributions that they have made. We also have 22 new members starting. So as you can see we are expanding the committee to accommodate the growing demand. Not only do we have more ICASSP papers to deal with, but we have also been actively working to improve the quality of the ICASSP review process itself. No peer review procedure can be perfect, but recent changes, which include a target of four reviews per paper and a meta-review process to handle cases of reviewer disagreement, mean that final accept/reject decisions are now being made with a very high level of confidence.

In addition to ICASSP reviews, SLTC members serve on a variety of sub-committees which are busy working on a whole range of topics. We have a new Newsletter team led by Jason Williams, and a new Communications team led by Doug O'Shaughnessy. We will try to keep improving these to ensure that our community continues to be well-informed. However, to make them really useful, we need everyone in the community to become involved. So if you have some news, or an idea, or an opinion that you want to share, then send it to Jason. If you have ideas for improving our web site, then send them to Doug. If you look at the full list of sub-committees on our web site, then you will see that there are lots of other behind-the-scenes activities including organizing and promoting workshops; liaising with cognate organizations; promoting educational activities; and ensuring that we continue to recognize the achievements of our members via award and fellow nominations.

Being elected Chair of the SLTC is a great honor but it is also a significant responsibility and a considerable challenge. Fortunately, I am not alone. I have a great team to help me and I have the benefit of all the solid foundations laid by my predecessors. In particular, I would like to pay tribute to the excellent job done over the last two years by the outgoing SLTC chair, Roberto Pieraccini. Roberto has worked hard to ensure that our TC is a shining star of the Signal Processing Society. And best of all, he will stay on the committee for one final year to hold my hand.

So best wishes again for the New Year and see you all at ICASSP 2009!


Farewell from the Outgoing SLTC Chair

Roberto Pieraccini

SLTC Newsletter, January 2009

The beginning of 2009 marks the end of my two-year term as the Chair of the SLTC, the IEEE Speech and Language Technical Committee. I have to say that these two years have gone by rather quickly and smoothly thanks to the excellent support I received from everyone on the committee and at IEEE. I was very fortunate to have the priceless guidance of former chair Mazin Gilbert at the beginning of my term, the always-available words of wisdom of IEEE Signal Processing Society VP of technical directions Alex Acero, and the continuous and unconditional help of all the members who took their commitment to the SLTC very seriously. They have been truly terrific and my warm gratitude goes to all of them. However, I would like to mention those friends and colleagues on the SLTC who played a major role in several activities, who always responded promptly to my many requests, and with whom I communicated sometimes almost on a daily or hourly basis. In particular, I would like to thank Gokhan Tur who, among many other things, was in charge of making and gathering nominations for the many IEEE awards, which on several occasions resulted in actual awards given to worthy members of our community. Gokhan was also an ICASSP area chair for two consecutive years, a coordinator of the student technical committees and the student awards, and the speech and language liaison for the organization of ICASSP 2009.

I would like to thank Ciprian Chelba for taking ownership of the Web site and the mailing list, and for being an active ICASSP 2008 area chair. My special thanks also go to Mike Seltzer who has been leading this newsletter and chasing, every three months, potential column writers for interesting stories and articles. Many times we have been commended as one of the technical committees within the Signal Processing Society with the most extensive Web site and with one of the best newsletters. Thanks Ciprian and Mike!

Many thanks to John Hansen, who coordinated the fellowship nominations, as a result of which we got a record number of four new fellows this year. Many thanks also go to Murat Saraclar for his constant advice and for being an area chair for two consecutive years, and to Brian Mak for his hard work as one of the ICASSP 2008 area chairs. Many thanks to all the other SLTC members of the past two years -- too many to mention all of them here -- for their advice, hard work, and for being there when I needed them.

The bulk of the work for the SLTC comes with the ICASSP review. In 2007 we received about 620 papers in the speech and language processing areas. This year we received a total of 689 papers, 565 in the speech processing and 124 in the language processing areas. As in 2007, we required 3 reviews per paper, generally performed by external reviewers, in addition to one meta-review performed by an SLTC member. Because of the uneven allocation of papers across the various topics, some SLTC members had more than 20 papers to review in less than a month! Thanks to all of those who, in spite of the large number of reviews, the busy end-of-the-year schedule, and the strict deadlines, completed their reviews on time.

Almost all of the ICASSP review organizational work is carried out by the area chairs. Each area chair is assigned a set of topics and the job of overseeing a number of papers -- typically between 100 and 300 -- throughout the whole process. Among their responsibilities are overseeing the assignment of three external reviewers and one SLTC reviewer to each paper, reassigning papers in case of conflict, and making sure that all the papers get their reviews within the strict schedule imposed by the process. When all the reviews are in place, the area chairs are responsible for deciding a consistent acceptance criterion across topics, grouping the accepted papers into a number of sessions, and resolving other potential conflicts and issues that may appear after the authors have been notified. The overall acceptance rate was about 46%. I would like to give my special thanks to the ICASSP 2009 SLTC area chairs: T.J. Hazen, Murat Saraclar, and Gokhan Tur. They worked really hard to make sure that the process went on as smoothly as possible. And last but not least, many thanks to Lance Cotton, the man behind the curtain, the administrator of the ICASSP review Web site, who was always available to make last-minute changes both to the allocation of papers and to the review system software.

Finally I would like to formally pass the leadership of the SLTC on to my successor. I am very happy that Steve Young took on this responsibility, and I am more than sure he will be a great Chair and will bring a lot of value to SLTC. Steve will be in charge of SLTC during 2009 and 2010 and, as it is tradition, I will be helping him for all 2009 as a former Chair.

Happy new year to everyone!


What is the SLTC? A brief history

Jason Williams

SLTC Newsletter, January 2009

Readers new to the newsletter may not be familiar with the Speech and Language Processing Technical Committee. This article provides a quick introduction and a brief history.

Overview

The "Speech and Language Processing Technical Committee" (SLTC) is a technical committee of the Signal Processing Society (SPS) of the Institute for Electrical and Electronic Engineers (IEEE).

The overall aim of the SLTC is to promote activities in speech and language processing. Much of its effort is devoted to the annual IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), where the SLTC manages reviews of papers covering speech and language, and organizes conference sessions, special sessions, and tutorials.

In addition, the SLTC has a number of other responsibilities. First, it promotes and supports workshops such as the workshop on Spoken Language Technology (SLT) and the workshop on Automatic Speech Recognition and Understanding (ASRU). It also endorses several events sponsored by other professional organizations, such as the International Conference on Multimodal Interfaces (ICMI), the Human Language Technology Workshop (HLT), and SIGdial, the conference of the Special Interest Group on Discourse and Dialogue.

Second, it makes nominations for IEEE awards, such as best paper and best column awards, technical achievement awards, and education awards. It also promotes the nomination of members to the IEEE Fellow grade, and promotes nominations for Distinguished Speakers.

Third, it supports the IEEE Transactions on Audio, Speech, and Language Processing by reviewing articles and assisting the Editor-in-Chief in identifying Associate Editor candidates. In addition, the SLTC recommends changes to the sub-categories for paper submissions to the Transactions and ICASSP (called EDICS), promotes relations with ISCA and ACL, produces this quarterly newsletter, and interfaces with the Signal Processing Society and its Board of Governors to ensure that speech and language are adequately represented in conferences, workshops, fellowships, and awards across the IEEE.

The SLTC consists of up to 60 members who each serve a 3-year term, beginning Jan 1. There are currently 44 members. The terms overlap, so that each year about one third of the members are replaced. New members are nominated via a nominations sub-committee, and elected by a vote of the existing committee. The SLTC has no budget: its members are uncompensated volunteers.

The SLTC is the largest of 12 technical committees in the IEEE Signal Processing Society. Other technical committees include "Signal Processing Theory and Methods", "Audio and Electroacoustics", "Bio Imaging and Signal Processing", and "Sensor Array and Multichannel Signal Processing".

History

The Speech and Language Processing Technical Committee traces its roots back to the Institute of Radio Engineers (IRE) Audio Group, which was founded in 1947 and met for the first time on March 22, 1948. The IRE Executive Committee (with real foresight!) originally named it the "Audio, Video and Acoustics Group", but its charter members shrunk the name to just the "Audio Group". Then in 1963 the IRE merged with the American Institute of Electrical Engineers (AIEE) to form the IEEE, and the Audio Group followed along into the IEEE. It changed its name in 1965 to the "Group on Audio and Electroacoustics", in 1974 to the "Group on Acoustics, Speech and Signal Processing" (ASSP), and in 1976 to the "Acoustics, Speech and Signal Processing Society". The name then remained stable for 14 years before changing again in 1990 to the "Signal Processing Society", which it remains today.

In 1968, this audio group established its first technical committee, on Digital Signal Processing. Then in late 1968 or early 1969, it established its second committee, called the "Speech Processing and Sensory Aids Technical Committee". At some point in the early 1970s "Sensory Aids" was dropped from the name, and for over 30 years it remained the "Speech Processing Technical Committee". Then in 2006 its scope was expanded and its name changed to the "Speech and Language Processing Technical Committee".

These early technical committees first organized and held workshops, and then in 1976, the ASSP Society held the first International Conference on Acoustics, Speech and Signal Processing (ICASSP) in Philadelphia. It included 600 attendees and 226 published papers. ICASSP has continued each year since, and is now substantially larger, with the 2008 ICASSP attracting 2062 registered attendees and 1352 published papers.

As is true today, much of the early effort of the Speech Processing Technical Committee was devoted to reviewing papers for ICASSP, but the process looked very different then. In the early days, the technical committees performed all of the reviews, with each member performing as many as 50 reviews. Hard copies of extended abstracts were sent via postal mail to committee members, and the committee met together in New York City, and later in Piscataway, NJ, to make program decisions. Steve Young recalls, "We would stand in a room with all of the ICASSP proposals (paper only!) in piles and we would make sessions by physically placing the papers on top of sheets with the name of the session on. When we had a good paper but no session, we would wave it in the air and shout 'Could anybody use a paper on XY widgets in one of their sessions?' or 'I have a paper on XY widgets but I really need one on ZZ flow - can anybody do a swap'. It was hugely inefficient but great fun!" Today, external reviewers are used extensively, and all reviewing is electronic, removing the need for an in-person meeting.

Past Chairs of the SLTC (1990 - present)

Acknowledgements and more information

Thanks to Rich Cox, Jay Wilpon, Mazin Gilbert, Roberto Pieraccini, and Steve Young for providing extensive input to this article.

For more information, see:

If you have comments, corrections, or additions to this article, please contact the author: Jason Williams, jdw [at] research [dot] att [dot] com.


Report from IEEE Workshop on Spoken Language Technology in Goa

Alex Rudnicky

SLTC Newsletter, January 2009

The second biennial IEEE workshop on Spoken Language Technology (SLT) took place December 15-18, 2008 in Goa, India. The SLT workshop alternates with the IEEE ASRU (Automatic Speech Recognition and Understanding) workshop and is meant to focus on topics that connect speech recognition and understanding to applications such as machine translation, dialog, search and summarization. A particular focus of this year's meeting was Spoken Language Technology for Development (sometimes abbreviated as SLT4D), the application of SLT to social needs, and several sessions were devoted to this topic. SLT 2008 was also the first IEEE speech technology meeting to be held on the Indian sub-continent and it provided a long-overdue opportunity for European, Asian and American researchers to interact with members of the very active Indian speech research community. Amitav Das, the Chair, and Srinivas Bangalore, the Co-Chair, did an outstanding job of organizing the meeting and its associated activities. Unfortunately, due to circumstances beyond the organizers' control, attendance was much smaller than anticipated: there were 88 attendees, out of 122 registrations. This year's workshop attracted 154 regular submissions, of which 72 were accepted.

A special session on SLT in India introduced participants to the linguistic landscape of India. Your correspondent was struck by both the sheer diversity and sizes of language groups (22 official languages, 29 with more than a million speakers each, 122 languages spoken by at least 10,000 people). Some interesting accommodations have been prompted by practical considerations; for example, communities without a written language might borrow script from multiple neighboring groups. Dr. Mallinkarjun of the Central Institute of Indian Languages described a major effort currently under way to collect corpora for 24 major languages, under the direction of the Linguistic Data Consortium for Indian Languages. The collection effort extends beyond speech and encompasses a variety of linguistic resources, including parallel corpora and dictionaries. Prof Hema Murthy of IIT-Madras described a variety of ways that language technologies are being used for education and training (for example, to provide exam preparation); of particular interest were descriptions of how the needs of practical applications drive development in core technology (for example, speech synthesis). Dr Amitav Das of Microsoft India provided a comprehensive overview of speech technology activity in India taking place both in industry and in academia. Dr Das reinforced the point that understanding how to use speech technology to meet people's needs in turn generates interesting research questions.

The workshop keynote address was given by Prof Giuseppe Riccardi of the University of Trento and described a comprehensive program of research in exploring next-generation spoken language interfaces, including multi-modality and system-originated transactions (and featured a compelling demonstration of the latter in action).

The workshop also included two tutorials addressing spoken language technology in development. "Rapid Language Adaptation Tools and technologies for Multilingual Speech Processing System" prepared by Prof Tanja Schultz of Karlsruhe and Carnegie Mellon Universities (and presented by Prof Pascale Fung of HKUST) focused on the SPICE project and the development of tools for rapid acquisition of language data and simplified configuration of models and applications that would allow non-specialists to create useful artifacts. The second tutorial, on the World Wide Telecom Web, given by Dr Nitendra Rajput of IBM India in Delhi, described the development and successful deployment of a speech-only web service that enables access to information resources over mobile phones (for example, market prices). Of particular note is that the system allows ordinary users to create their own content entirely over the phone (for example, advertising services or sharing knowledge about farming techniques).

Following the model set by previous workshops, the meeting consisted of a single track of poster sessions organized around a specific topic and preceded by an introductory overview lecture. In contrast to the previous SLT meeting, SLT 2008 featured a full session on spoken language generation; the kinds of problems addressed in summarization and search seem to have evolved to ones focusing more on complete systems rather than component technologies. In this meeting, as in others recently, machine translation featured prominently.


Language Death

Stephen Cox

SLTC Newsletter, January 2009

Language, in either its written or spoken form, is the raw material for our research. So it is a shock to realise that a large proportion of the world's languages are under threat of extinction. Estimates vary, but one contends that 90% of the more than 6,000 languages spoken in the world today could become moribund or extinct by 2100 [1]. Between 20 and 40 percent of languages are estimated to be already moribund, and only 5 to 10 percent are "safe" in the sense of being widely spoken or having official status. Another survey [2] compared the threat of extinction of languages with that of birds, and found that languages were considerably more threatened: there have already been more recorded language extinctions, and there are substantially more rare languages than rare bird species. This might seem a fanciful comparison, but interestingly, the conditions for languages and bird species to flourish appear to be similar. Once the effect of different land area sizes has been taken into account, languages, birds and mammals are all more diverse in low-latitude countries, in countries with large areas of forest, and in mountainous countries. Factors that we might expect to be correlated with the number of languages, such as GDP per person or the number of televisions per 1000 people (as an indicator of national and global communication), are, in fact, uncorrelated.

In 2003, Abrams and Strogatz proposed a simple and elegant mathematical model of how one language might decline at the expense of another in a bilingual community [3]. They assumed that the two languages compete for speakers and hence there is a certain probability of switching from one to the other. The model accurately predicts the number of speakers of each language over time in a number of bilingual communities. Ominously, it also predicts only two stable points: all speakers speak either one language or the other. The authors point out that successful bilingual communities do exist, but these have so far been essentially split monolingual populations without significant interaction. However, the evidence from, for example, Welsh and Quebec French is that policies to encourage the minority language can arrest decline once mixing of the communities takes place.
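
For readers curious about the dynamics, here is a minimal simulation sketch of the model, assuming the published form dx/dt = (1-x) c x^a s - x c (1-x)^a (1-s), where x is the fraction of the population speaking language X, s is its perceived status, and a is a volatility exponent (around 1.31 in the paper). The parameter values and the simple forward-Euler integration below are illustrative assumptions, not part of the original study.

    # Minimal sketch of the Abrams-Strogatz (2003) language-competition model.
    # x(t): fraction of the population speaking language X; s: perceived status
    # of X (0 < s < 1); a: volatility exponent (~1.31 fitted in the paper).
    # Parameter values and step size below are illustrative assumptions.

    def simulate(x0=0.4, s=0.45, a=1.31, c=1.0, dt=0.01, steps=20000):
        x = x0
        trajectory = [x]
        for _ in range(steps):
            # Probability per unit time of switching Y -> X grows with x and s;
            # switching X -> Y grows with (1 - x) and (1 - s).
            p_yx = c * (x ** a) * s
            p_xy = c * ((1.0 - x) ** a) * (1.0 - s)
            dx = (1.0 - x) * p_yx - x * p_xy   # net flow into language X
            x += dx * dt                        # forward-Euler step
            trajectory.append(x)
        return trajectory

    if __name__ == "__main__":
        final = simulate()[-1]
        # With status s < 0.5 the minority language X decays toward extinction;
        # the only stable fixed points are x = 0 and x = 1.
        print(f"fraction speaking X after integration: {final:.4f}")

Running the sketch with a status below one half shows the minority language draining away, which is exactly the instability the paragraph above describes.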

Are we aiding or hindering the demise of languages with our technology? On the one hand, the commercialisation of language and speech technology will tend to favour popular languages, which have bigger markets, and hence possibly increase their dominance. On the other hand, the tools created by our activity (analysis technology, teaching and learning technology, translation technology, etc.) are excellent resources for helping these languages regain speakers and become active again, as activities such as the EuroBABEL project in Europe (promoting research on under-described endangered languages) and the E-MELD project in the USA (aiding the preservation of data and documentation of endangered languages) are showing.

[1] Krauss, M. E. (1992). "The World's Languages in Crisis". Language, 68(1), 4-10.
[2] Sutherland, W.J. (2003). "Parallel extinction risk and global distribution of languages and species". Nature, 423, 276-279.
[3] Abrams, D.M. and Strogatz, S.H. (2003). "Modelling the dynamics of language death". Nature, 424, 900.


Keyword spotting in Czech: First evaluation complete, with funding from Ministries of Interior and Defense

Honza Cernocky

SLTC Newsletter, January 2009

The Czech Republic, with a population of 10 million, is surprisingly home to three successful university research groups working in the area of speech recognition: Brno University of Technology (BUT), the Technical University of Liberec (TUL), and the University of West Bohemia (UWB).

Since 2007, these three groups have been cooperating on a research project, "Overcoming the language barrier complicating investigation into financing terrorism and serious financial crimes", sponsored by the Czech Ministry of the Interior under number VD20072010B16. The project aims at the analysis of spontaneous telephone calls from the security and defense domains. Unlike English, where corpora such as Switchboard and Fisher provide sufficient amounts of training material, Czech lacked a well-transcribed database of spontaneous telephone calls. This is why, in 2007, the activities of the project concentrated on the creation of such a database -- the consortium now holds almost 100 hours of transcribed and checked spontaneous speech data.

After several discussions with Interior and (later) Defense representatives, keyword spotting in Czech was defined as the top priority task for security and defense analysts. To compare performance across sites, an evaluation of keyword spotting systems was organized in 2008, with the final run in November and a post-evaluation workshop on November 21st. Systems were compared using standard metrics such as FOM (figure of merit) and EER (equal error rate), and their speed and ability to handle OOV (out of vocabulary) words were also compared.
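
As an illustration of one of these metrics, the short sketch below estimates an equal error rate from lists of detection scores for true keyword hits and for false alarms by sweeping a decision threshold; the scores and the brute-force sweep are hypothetical and are not taken from the actual evaluation.

    # Illustrative equal-error-rate (EER) computation for a keyword spotter.
    # 'target_scores' are detector scores for true keyword occurrences,
    # 'nontarget_scores' are scores for false alarms; the numbers are made up.

    def equal_error_rate(target_scores, nontarget_scores):
        # Sweep the decision threshold over all observed scores and find the
        # point where the miss rate and false-alarm rate are closest.
        best = None
        for thr in sorted(set(target_scores + nontarget_scores)):
            miss = sum(s < thr for s in target_scores) / len(target_scores)
            fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
            gap = abs(miss - fa)
            if best is None or gap < best[0]:
                best = (gap, (miss + fa) / 2.0)
        return best[1]

    if __name__ == "__main__":
        hits = [0.91, 0.83, 0.75, 0.66, 0.42]      # hypothetical true detections
        alarms = [0.71, 0.48, 0.35, 0.30, 0.12]    # hypothetical false alarms
        print(f"EER ~ {equal_error_rate(hits, alarms):.2f}")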

TUL built on its extensive experience with the recognition of Czech and designed a system based on LVCSR with a large vocabulary of 350k words and 410k pronunciation variants, without a language model. Note that Czech is a highly inflected language, so the 50k vocabularies common for English are not sufficient. The advantages of the system are its high speed of 0.15 xRT and its ability to detect colloquial variants even if the correct form of a word is entered. The TUL group uses simple acoustic modeling based on context-independent models with high numbers of Gaussians, combined with state-of-the-art feature extraction: MFCC coefficients processed by a Heteroscedastic Linear Discriminant Analysis (HLDA) transform.

UWB experimented with two systems: the first was a purely acoustic keyword model which works against a background model, with the resulting likelihood ratio thresholded; the second used LVCSR lattices. While the acoustic system provided better results on the development set, with its rather artificial selection of "good" (read: "long") keywords, the advantages of the LVCSR-based system fully emerged in the test on evaluation data, where the selection of keywords was not limited. The acoustic modeling in this system uses not only discriminative training of HMMs, but also discriminative adaptation to individual conversation sides. UWB also experimented with fusion of the individual systems and showed their complementarity.
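
As a generic illustration of the first, purely acoustic approach (a sketch of the general idea, not UWB's actual implementation), keyword detection of this kind reduces to a likelihood-ratio test: score a segment under a keyword model and under a background model, and report a hit when the log-likelihood ratio exceeds a threshold.

    # Generic likelihood-ratio keyword detection, illustrating the
    # "acoustic keyword model vs. background model" idea described above.
    # The scoring function stands in for real HMM/GMM evaluations and uses
    # hypothetical toy models.
    import math

    def log_likelihood(segment, model):
        # Placeholder: a real system would run the segment's feature vectors
        # through a keyword HMM or a background model here.
        return sum(math.log(model.get(obs, 1e-6)) for obs in segment)

    def detect_keyword(segment, keyword_model, background_model, threshold=2.0):
        llr = log_likelihood(segment, keyword_model) - \
              log_likelihood(segment, background_model)
        return llr > threshold, llr

    if __name__ == "__main__":
        # Toy discrete "observations" and model probabilities, for illustration.
        keyword_model = {"a": 0.5, "b": 0.4, "c": 0.1}
        background_model = {"a": 0.2, "b": 0.2, "c": 0.6}
        hit, llr = detect_keyword(["a", "b", "a"], keyword_model, background_model)
        print(f"detected={hit}, log-likelihood ratio={llr:.2f}")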

BUT tested four systems in this evaluation: FastLVCSR was based on LVCSR with insertion of keywords into the language model; HybridLVCSR used full-fledged word and subword recognition and indexing; and two acoustic systems were based on GMM/HMM and NN/HMM. While the LVCSR systems are more precise, the advantage of the acoustic ones is their speed. HybridLVCSR is worth mentioning as it allows large quantities of data to be pre-processed off-line, with subsequent very fast searches, including OOVs. BUT built on its experience in LVCSR and keyword spotting in the European Community-sponsored AMI and AMIDA projects, as well as its participation in the 2006 NIST Spoken Term Detection evaluation.

The research groups consider this event very important, as it builds confidence in speech technologies within the Czech security and defense community, and they hope it will have a positive impact on their future funding.

The leaders of Czech speech groups (from the left): Honza Cernocky (BUT), Jan Nouza (TUL), Ludek Muller (UWB)


Loebner Prize at InterSpeech 2009

Filip Jurcicek

SLTC Newsletter, January 2009

Since 1991, the Loebner Prize contest has tried to find a computer which is able to pass the Turing Test. The test was proposed by Alan Turing in his article "Computing Machinery and Intelligence" as a proxy for his original question "Can machines think?"

In the Loebner Prize's implementation of the Turing Test, a human judge is engaged in a conversation with one human and one chatbot, a computer program which mimics humans. The only means of communication is a keyboard and a computer screen, and both the chatbot and the human must try to convince the judge that they are human. At the end, the judge must decide which of the two responded in the more human-like way.

Hugh Loebner offers three awards. First, there is an annual competition for the most human-like chatbot among contestants. The winner of the annual competition receives a bronze medal and a cash award of $3,000. This year’s annual competition will be held in conjunction with InterSpeech, in Brighton, UK, on 6 September 2009. Chatbots will interact with judges for 10 minutes each.

Hugh Loebner also offers a periodic competition for a silver medal. To win the silver medal, which includes a cash prize of $25,000, a chatbot must interact with judges for 25 minutes and convince 50% of the judges that the chatbot is human.

Once technology advances to an appropriate level, a competition for a gold medal grand prize will be organized. The grand prize, which includes a cash award of $100,000, will go to a chatbot which passes the wider Turing Test involving audiovisual input.

During the last annual Loebner Prize contest on October 12, 2008 at the University of Reading, five chatbots competed in the contest. The bronze medal was won by a chatbot named Elbot, which was developed by Fred Roberts from Artificial Solutions, in Germany. Elbot convinced three of the twelve judges that it was human, whereas the other four chatbots managed to convince at least one judge.

Although these chatbots are not perfect, many companies have already successfully deployed chatbots on their web sites. For example, you can find Valerie at Virgin Holidays, Jenn at Alaska Airlines, or Spike on the Gonzaga University website. If you ask Jenn about a flight from Boston to LA next Sunday, she will take you to a partially filled web page where you can specify the rest of the details of your trip and list available flights. Spike correctly answers questions such as "What can I study at the university?" and "How much does it all cost?" These chatbots improve the user experience by easing access to information buried in the websites.

As the theme of InterSpeech 2009 is 'Speech and Intelligence', the contest perfectly complements the conference. After all, the contest is about designing intelligent machines that perceive and respond using natural language. "The contest draws public attention to NLP field and public benchmarking has become one of the main drivers in the field, and a live test should be one of a battery of such tests," says Prof. Roger Moore, General Chair for InterSpeech 2009.

In addition to finding which chatbot is the most human-like, some other interesting findings can be observed. For example, Dr. Philip Jackson, responsible for the Loebner Prize contest at InterSpeech, expects to learn something new about how humans differentiate between human and machine conversation.

In any case, the contest promises a lot of fun! The imperfection of the chatbots is always amusing and there is nothing more pleasant than to know that humans are unlikely to be replaced with chatbots.


Speech Mashups – Speech Processing on Demand for the Masses

Giuseppe (Pino) Di Fabbrizio

SLTC Newsletter, January 2009

AT&TTM is now granting access to its speech processing technology for use in building speech recognition prototype applications that run on iPhones, BlackBerrys, and other networked devices. This access is being offered via a new approach to speech services, the speech mashup, that merges speech and web services into one consistent application framework without the need to install, configure, or manage speech recognition software and equipment.

Traditional speech-enabled services rely on telephony platforms that combine media processing, media interaction and network signaling into a single architecture driven by high-level programming languages such as VoiceXML. While this approach reduces latencies and increases channel density, many speech services require only minimal network interaction, are able to utilize simpler media processing capabilities, and can tolerate larger latencies.

Speech access to information search, for example, is a growing business in the mobile services domain where broadband wireless data access is now a viable channel to transmit speech and create rich media interactions. Thanks to the proliferation of service-oriented architecture (SOA) and public web-based interfaces, producing new web services is now easy and within reach of ordinary developers. Major web industry players are opening up their walled garden of proprietary content (the Google Web API, Yahoo! Developer API, Flickr Services, YELLOWPAGES.COM API, etc.), allowing consumers and enterprises to access technology that would otherwise be unavailable. Mashups, or web application hybrids, are rapidly becoming the most popular approach to aggregate these web services and create new ones.

AT&T speech mashup architecture.

AT&T Labs – Research extended this successful paradigm by adding speech processing capabilities and created the AT&T Speech Mashups - a new software framework that casts AT&T’s WATSON speech recognition and Natural Voices Text-to-Speech Synthesis as a web service to economically bring speech processing technologies to the larger web and mobile developer community. This new capability provides network-hosted speech technologies for multimedia devices with broadband access (iPhone, BlackBerry®, IPTV set-top box, SmartPhones, etc.) without having to install, configure, and manage speech recognition software or equipment. Speech mashups enable easy and rapid development of new speech and multimodal mobile services as well as new web-based services. The software implementation is based on well-established web programming models, such as SOA, REST, AJAX, JavaScript and JSON.

AT&T CTO, John Donovan, talking to the press at the 2008 AT&T Technology Showcase.

The concept behind the speech mashup technology is intuitive and similar to the familiar web application approach. Speech is first captured on the device (the client) through the microphone and compressed using one of the available speech coders (for example, the AMR coder at 12.2 kb/s). Then an HTTP connection is established with the speech mashup portal (the server), which delivers the bit stream to the AT&T WATSON speech recognition engine along with a set of parameters, including a reference to the grammar used to recognize the utterance. The recognition results are posted back to the client and used by the client to take the next action. Depending on the complexity of the task, a semantic interpretation can be added to the results, so that natural language variations of the same intent can be interpreted properly. The speech mashup portal makes the AT&T WATSON speech engine accessible from any network as a web service and exposes it through a simple HTTP API. It takes care of uploading and compiling the user's grammars and logging the service activities, and it provides tools for utterance transcription. Full documentation and code samples are provided online as well.
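
The following sketch illustrates this client flow in Python using only the standard library; the portal URL, query parameters, and response fields are hypothetical placeholders rather than the actual speech mashup API, which is documented on the portal itself.

    # Illustrative client flow for a network-hosted recognizer, as described
    # above: capture/compress audio, POST it over HTTP with a grammar
    # reference, then parse the JSON result. The URL, parameters, and response
    # fields are hypothetical placeholders, not the actual speech mashup API.
    import json
    import urllib.request

    PORTAL_URL = "https://example.org/speech-mashup/recognize"  # placeholder

    def recognize(audio_path, grammar="pizza-ordering"):
        with open(audio_path, "rb") as f:
            audio = f.read()  # e.g. AMR-encoded speech captured on the device
        req = urllib.request.Request(
            PORTAL_URL + "?grammar=" + grammar,
            data=audio,
            headers={"Content-Type": "audio/amr"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        # A hypothetical response might carry the best hypothesis and a
        # semantic interpretation the client uses to decide its next action.
        return result.get("hypothesis"), result.get("interpretation")

    if __name__ == "__main__":
        text, semantics = recognize("utterance.amr")
        print(text, semantics)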

iPizza - multimodal pizza ordering prototype.

Speech mashup technology was publicly demonstrated during the AT&T Technology Showcase held in New York City on September 15, 2008. AT&T showed several futuristic services that envision the combination of the iPhone (or iPhone-like devices) and speech recognition as the main service interaction mode. Among many service concepts, the integration of U-Verse, the AT&T IP-based TV service, and the iPhone inspired several potential new multimodal prototypes. One example is iMOD (Movie On Demand), a multimodal interface that combines speech input with graphical interaction on the iPhone to enable users to rapidly find movies on demand using a mobile device. Users can speak queries like "Action movies with Bruce Willis" or "Movies directed by Woody Allen and starring Diane Keaton" and play video clips on the phone itself or start watching the movie on TV.

iMOD - multimodal movies on demand.

Another service prototype, iPizza, implements a multimodal interface for pizza ordering. It combines speech input with graphical interaction on the iPhone to enable users to rapidly select menu items on a mobile device. Users can speak naturally and request multiple items at the same time. The web interface allows users to easily navigate and update the items in the shopping cart. A full ordering request can be formulated in one sentence, like: "I'd like to order a pizza with mushrooms and ham, two Diet Pepsi and baked cinnamon sticks."

Speak4it - multimodal local business search.

Finally, created in collaboration with YELLOWPAGES.COM and available in the Apple App Store for iPhone customers, Speak4it (http://www.speak4it.com) demonstrates how to access local business listings with natural language queries. Examples include "Italian restaurants in Florham Park, New Jersey" or, relying on the phone's GPS, "Show me the nearest Bank of America offices."

AT&T is planning to make more tools available for the speech research community, including more code examples for the iPhone and more general purpose precompiled grammars. Send an email to watsonadm [at] research.att.com to request a speech mashup account for non-commercial use.


Introducing Toastmasters, a club for improving public speaking skills

Svetlana Stoyanchev

SLTC Newsletter, January 2009

Technical presentations play an important role in a scientist's career. For scientists in speech and natural language processing, as in many other research fields, conferences and workshops are the places that allow researchers to share their work with peers, learn about the research of their colleagues, and find future collaborators. Despite the importance of presentation skills in a scientist's career, an informal survey I conducted of students and graduates from seven different universities found that computer science graduate programs often lack formal training in presentation skills. Although graduate students do usually receive comments from peers and supervisors about the technical content of their talks, the advice about the presentation itself is often limited to comments about its length and amount of detail. Students seem to rarely receive advice on how to improve their core presentation skills. Without an effective presentation, solid technical work risks going unnoticed.

It is a common misconception that being an effective speaker and being able to speak confidently in front of an audience are innate talents. In fact, these are skills that can be learned by anyone and can be of great help in anyone's career. This article highlights one non-obvious venue for this training available to many graduate students and researchers: Toastmasters.

Toastmasters International is a nonprofit organization, started in 1924, that provides public speaking training and helps its members become better speakers. There are currently 11,700 Toastmasters clubs in 92 countries, and New York City alone has 70 clubs. The clubs meet bi-monthly and provide a flexible way for people with busy schedules to improve their presentation skills. Toastmasters members come from different career paths. At Toastmasters, members learn a number of techniques that help them become better presenters, overcome stage fright, and deliver their speeches more effectively. The learning is done through theoretical explanation, practice, and detailed evaluation. Among other things, Toastmasters exercises include learning to use gestures and vocal variety, and how to make a point effectively and inspire an audience. The meetings provide a friendly atmosphere and detailed evaluation for the presenters. Each Toastmasters member gives short 5-10 minute speeches, as frequently as their schedule allows, on topics of their choice. Each speech focuses on one speaking skill, such as using metaphors or getting to the point effectively. After a presentation, the speaker receives a detailed verbal evaluation from other meeting participants, with constructive criticism as well as positive encouragement about the speech.

Besides making prepared speeches, each Toastmasters meeting has a "Table topics" session, where one of the participants prepares a set of questions. This exercise allows one to practice giving an impromptu speech in front of an audience.

In my experience attending Toastmasters meetings for the past six months, my view is that students and researchers can benefit greatly from Toastmasters' informal and flexible meetings. Joining the program costs around $100 a year, and anyone can try out different meetings for free as a guest until they find the right club. I believe that Toastmasters can improve the presentation skills of researchers across a range of experience levels, from new students with limited experience to highly proficient instructors. I encourage researchers interested in honing their presentation skills to find a Toastmasters club nearby and try it.


Spoken Document Summarization - An Overview

Annie Louis

SLTC Newsletter, January 2009

Efficient organization, retrieval and convenient browsing of multimedia content is an attractive application today. A large proportion of multimedia documents involve speech -- news broadcasts, meetings, interviews, technical presentations, movies and lectures. Summaries of such spoken documents will be an integral part of the browsing interface, facilitating search, indexing and retrieval. These summaries can either be in text format or be a selection of key audio snippets from the documents. Without doubt, spoken document summarization has generated a lot of interest lately.

Automatic text summarization has made great strides, thanks to decades of research in this area. We have a fairly good understanding of the techniques and evaluation methods that work best for both single-document and multi-document summarization. Methods have been developed for texts ranging from newswire articles to scientific literature, biographies to online blogs, and multilingual documents. Problems like reducing redundancy and organizing summary sentences in a readable format have all been examined, and improvements are constantly being hypothesized, tested and accepted. Content selection for summaries has also been designed to cater to specific queries and user needs.

Speech summarization has evolved more slowly and poses a different set of challenges [3, 6]. In the case of written text, there is usually a clear organization of content into titles, sentences and paragraphs. On the other hand, speech transcripts are significantly different from written texts in structure. Speech disfluencies and errors in ASR propagate into transcripts. There is no paragraph information and segmentation into utterances is not trivial. Moreover, as one moves from monologue to dialogue, there needs to be a significant change in the approach to summarization.

Consider meetings, for example. A summary should be similar to the minutes of the meeting: it should give the proposals, plans, agreements and disagreements, and decisions reached during its course. The identification of these items of content in conversational speech needs special techniques based on speech acts, the identity of the speaker, turn taking, and the acoustic and prosodic characteristics of the utterances [4, 5]. Apart from content selection, we must devise evaluation measures and also tackle the problem of producing coherent summaries. In text summarization, a common approach is to select important sentences from the input and present them with some ordering. Utterances chosen from a dialogue can be more difficult to combine into a coherent audio summary: the utterances may come from different participants and vary in acoustic properties.

A similar situation arises with other aspects of spoken document processing like named entity recognition, information extraction, segmentation and topic analysis [2]. To make faster progress with multimedia processing, valuable technology transfer from the text domain needs to be augmented with speech-specific techniques. Some techniques used for text summarization have been adapted to work with speech transcripts [8, 9], and evaluation metrics from the text domain also show promise for evaluating speech summaries [7]. To this end, it is necessary to actively seek opportunities for interaction, knowledge and technology transfer between the text and speech communities.
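
As a concrete example of the kind of text technique that carries over, the sketch below ranks utterances with a simple frequency-based salience score, in the spirit of early extractive summarizers; it is a generic illustration under assumed inputs, not the method of any particular system cited here.

    # A generic frequency-based extractive summarizer, of the kind that has
    # been adapted from text to speech transcripts: score each utterance by
    # the average corpus frequency of its content words, then pick the top few.
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "we",
                 "that", "this", "on", "for", "i", "you", "uh", "um"}

    def tokenize(utterance):
        return [w for w in re.findall(r"[a-z']+", utterance.lower())
                if w not in STOPWORDS]

    def summarize(utterances, n=2):
        # Word frequencies over the whole transcript serve as a salience signal.
        freq = Counter(w for u in utterances for w in tokenize(u))
        def score(u):
            words = tokenize(u)
            return sum(freq[w] for w in words) / len(words) if words else 0.0
        ranked = sorted(range(len(utterances)),
                        key=lambda i: score(utterances[i]), reverse=True)[:n]
        # Present the selected utterances in their original order.
        return [utterances[i] for i in sorted(ranked)]

    if __name__ == "__main__":
        transcript = [
            "um so the budget for the project is still open",
            "we agreed to finalize the budget by friday",
            "uh can someone close the window",
            "the project plan depends on the budget decision",
        ]
        for line in summarize(transcript):
            print(line)

On a speech transcript, a sketch like this would typically be preceded by disfluency removal and utterance segmentation, which is precisely where the speech-specific challenges discussed above enter.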

The Document Understanding Conferences (DUC) have been conducted by NIST yearly since 2001. These started out as large-scale text summarizer evaluation workshops, but starting in 2008 they have been folded into the Text Analysis Conference (TAC), which also includes Textual Entailment and Question Answering tracks and reaches a wider audience. At the planning session at TAC 2008, a proposal (from the International Computer Science Institute, Berkeley) to include a meeting summarization task was positively received by many participants as a new and exciting challenge. If such a track is accepted in future TAC workshops, we can expect faster development of resources, an opportunity to compare methods on common test sets, standardization of evaluation techniques, and a sense of a larger community.

  1. I. Mani and M. T. Maybury, editors, "Advances in Automatic Text Summarization", MIT Press, 1999.
  2. L. Lee and B. Chen, "Spoken Document Understanding and Organization", IEEE Signal Processing Magazine, September 2005.
  3. K. Zechner, "Summarization of Spoken Language - Challenges, Methods, and Prospects", Speech Technology Expert eZine, Issue 6, January 2002.
  4. D. Hillard, M. Ostendorf and E. Shriberg, "Detection of agreement vs disagreement in meetings: training with unlabelled data", Proceedings of HLT/NAACL, 2003.
  5. E. Shriberg, A. Stolcke, D. Hakkani-Tur and G. Tur, "Prosody-Based Automatic Segmentation of Speech into Sentences and Topics", Speech Communication, September 2000.
  6. K. McKeown, J. Hirschberg, M. Galley, and S. Maskey, "From text summarization to speech summarization", Proceedings of ICASSP, Special Session on Human Language Technology Applications and Challenges for Speech Processing, 2005.
  7. A. Nenkova, "Summarization evaluation for text and speech: issues and approaches", INTERSPEECH 2006.
  8. A. Waibel, M. Bett and M. Finke, "Meeting browser: tracking and summarizing meetings", Proceedings of the DARPA Broadcast News Workshop, 1998.
  9. T. Kikuchi, S. Furui and C. Hori, "Two-stage automatic speech summarization by sentence extraction and compaction", Proceedings of the IEEE/ISCA Workshop on Spontaneous Speech Processing and Recognition, 2003.

2008 - A Busy Year for the ISCA Student Advisory Committee

By: Ebru Arisoy, Marco Piccolino-Boniforti, Tiago Falk, Antonio Roque, and Sylvie Saget

SLTC Newsletter, January 2009

The ISCA Student Advisory Committee (ISCA-SAC) was established in 2005 by the International Speech Communication Association (ISCA) as part of its efforts to expand its services to all student members through student-driven initiatives. ISCA-SAC's goal is to organize and coordinate student-driven projects that will have an impact on the speech and language processing community. This article describes some of the major projects carried out in 2008; a description of projects completed before that can be found in a previous article.

Project Descriptions:

Round Table Research Discussion and Lunch Event: Student volunteers organized the Round Table Research Discussion and Lunch Event for students attending the Interspeech 2008 conference in Brisbane, Australia. The event was a success and gave students the opportunity to discuss their research topics, future research trends, and job opportunities with senior researchers from industry and/or academia. Discussion topics ranged from speech coding, enhancement, and recognition to spoken language understanding and assistive speech technologies.

Online Grant Application System (OGAS): OGAS was developed by student volunteers to facilitate the grant application process for both the applicant and the ISCA grants coordinator. Previously, the grant application procedure consisted of an extensive application package that would be submitted to the grant coordinator via electronic mail. Currently, OGAS allows grant requests to be submitted via a simple online electronic form and the upload of a few supplementary documents. OGAS was first tested during Interspeech 2008 and successfully handled over 80 grant applications. Student volunteers are now actively working to continue to improve OGAS for both the users and the ISCA grant coordinator.

Transcription of Saras Institute Interviews: In conjunction with the IEEE-SLTC e-newsletter and the History of Speech and Language Technology Project, student volunteers have worked hard to provide transcriptions of interviews with various researchers who have made seminal contributions to the development of speech and language technology. Transcriptions were made from comments by researchers such as Frederick Jelinek, Steve Young, Sadaoki Furui, and others, describing how they became involved in the area of speech technology; these are available in four earlier articles in the e-newsletter. ISCA-SAC plans to continue and extend these efforts in 2009.

Website Development: Several enhancements were also incorporated into the ISCA-SAC website. For example, spam subscriptions have been eliminated by means of a system that asks users to perform character recognition before registering on the website. Also, a voting system has been introduced which allows users to rate the resources available on the website. And a targeted survey was conducted in order to assess site usability and usefulness; changes are in the process of being implemented. Website redesign and enhanced user interaction remain a major goal for the near future.

Planned Projects for 2009:

Several projects have been planned (or are in the planning phase) for 2009. For example, the ISCA-SAC logo competition was launched during Interspeech 2008. Also, we are planning an outreach program for Interspeech 2009 to increase awareness of speech and language research and studies among both high-school and undergraduate students.

If you have interesting ideas on how to further improve our website, or if you think of other student-oriented services/projects that could be implemented by ISCA-SAC, please do not hesitate to contact us. Moreover, ISCA-SAC is always looking for motivated volunteers; if you are interested, please email volunteer[at]isca-students[dot]org for more details.