Speech and Language Processing Technical Committee Newsletter

February 2010

Welcome to the Winter 2010 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter.

In this issue we are pleased to provide another installment of brief articles representing a diversity of views and backgrounds. This issue includes articles from 15 guest contributors and our own six staff reporters. Also in this issue, we welcome Tara N. Sainath of the IBM T.J. Watson Research Center, who joins us as a staff reporter.

We believe the newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions. You can reach us at speechnewseds [at] listserv (dot) ieee [dot] org.

Finally, to subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.

Jason Williams, Editor-in-chief
Pino Di Fabbrizio, Editor
Chuck Wooters, Editor


From the SLTC and IEEE

From the SLTC Chair: New Year's Message

Steve Young

News from our TC: election of new members and a vice-chair, and ICASSP review results.

2010 Signal Processing Awards

Philip Loizou

Two researchers from the SLTC community have received IEEE honors: Shrikanth (Shri) Narayanan has been named a Distinguished Lecturer (2010-2011), and Isabel M. Trancoso has received the Meritorious Service Award.

IEEE SPS Board of Governors Nominations Open

Jason Williams

IEEE SPS Nominations Open for Two Director Positions

Jason Williams

IEEE Signal Processing Society Newsletter

The IEEE Signal Processing Society, our parent organization, also produces a monthly newsletter, "Inside Signal Processing".


Articles

Cloudy with a chance of ... speech

Danilo Giulianelli and Giuseppe (Pino) Di Fabbrizio

This article provides a quick introduction to speech processing in the cloud, an approach that addresses the scalability challenges of speech-enabled applications running on mobile devices.

A Brief Overview of Meeting Understanding Tasks

Dilek Hakkani-Tür

This article gives an overview of the main tasks of meeting understanding: Meeting summarization, Action item detection, and Decision detection.

Linguists Targeted for Redundancy at King's College London

Ruth Kempson

As part of a restructuring program, King's College London may eliminate several positions held by well-known researchers.

Setting up the Interaction Lab

Oliver Lemon

Setting up a new research group presents special challenges and opportunities. This article describes the work of the Interaction Lab, and some of the main issues we considered when setting up this new research facility.

Text Analysis Conference 2009: An Overview

Annie Louis

The 2009 Text Analysis Conference (TAC) was held at NIST in Gaithersburg, Maryland as a two-day event on 16-17 November 2009. TAC covers a variety of shared tasks organized every year. Participant systems are evaluated at NIST, and the workshop brings together researchers and participants to share their experiences and ideas. This article provides an overview of this year's tasks.

Researchers developing way to synthesize speech with a brain-machine interface

Matthew Marge

Advances in brain-machine interfaces are now allowing a paraplegic patient to control a real-time speech synthesizer using a BMI that reads the neural signals that relate to the production of speech.

ISCA Student Activities Update

Samer Al Moubayed, Marco A. Piccolino-Boniforti, Antonio Roque, Sylvie Saget, Sunayana Sitaram, Marcel Waeltermann

The ISCA Student Advisory Committee (ISCA-SAC) was established in 2005 by the International Speech Communication Association (ISCA) to organize and coordinate student-driven projects. We were especially active in 2007 and 2008, were pleased to continue our efforts in 2009, and look forward to 2010.

Memories from ASRU 2009, Merano, Italy

Giuseppe Riccardi and Renato de Mori

The IEEE Workshop on Automatic Speech Recognition and Understanding, held in December 2009 in Merano Italy, published 96 papers and attracted 180 international attendees.

Island-Driven Search

Tara N. Sainath

Island-driven search is an alternative method to better deal with noisy and unintelligible speech. In this article, we discuss past research in island-driven search and propose some future research directions.

eNTERFACE Workshops on Multimodal Interfaces: A success story

Albert Salah and Thierry Dutoit

The eNTERFACE workshops are the largest activity of the SIMILAR European Network of Excellence on multimodal interfaces. Their aim is to gather a team of leading professionals in multimodal man-machine interfaces together with students in one place, to work on a pre-specified list of challenges for four complete weeks.

Second Life as a research platform

Svetlana Stoyanchev

Second Life (SL) is a virtual social environment with over 11 million registered users. This article describes some of the projects that use the Second Life environment for research.


New Year's Message

Steve Young

SLTC Newsletter, February 2010

Since this is the first newsletter of 2010, I would like to start by wishing everyone in our speech and language community a happy and prosperous New Year.

As always at this time of year, it is my pleasure to welcome new members to our Technical Committee and to thank those at the end of their term. The new members coming in are: Tom Quatieri, Bastiaan Kleijn, Bhuvana Ramabhadran, Carol Espy-Wilson, Dirk Van Compernolle, Douglas Reynolds, Hong-Kwang Jeff Kuo, Les Atlas, Olivier Pietquin, Simon King, David Suendermann, Kate Knill, Olov Engwall, Martin Russell, Junlan Feng, Seiichi Nakagawa, and Takayuki Arai. And those leaving are: Alistair D. Conkie, Yuqing Gao, Jasha Droppo, Murat Saraclar, Philipos Loizou, Stephen Cox, Tomoki Toda, Douglas O'Shaughnessy, Andrej Ljolje, Kay Berkling, Kuldip K. Paliwal, Roberto Pieraccini, and Thomas Hain. I would like to thank all of the outgoing TC members for the support that they have provided the Speech and Language community over the last 3 years.

The end of 2009 also marked the end of Alex Acero's very successful tenure as VP Technical Directions. One of the last tasks that Alex set himself was to reform the SPS bylaws such that all Technical Committees follow a uniform set of policies and procedures. Thankfully, our TC's existing bylaws are very similar to the new rules, so we will not need to make any significant changes. One exception, however, is that the new rules require TCs to have a vice-Chair who is elected for a 2-year term and then automatically takes over as Chair for a second 2-year term. Anticipating this change, the Speech and Language TC has elected John Hansen as its vice-Chair. As a transitional arrangement, John will serve just 1 year as vice-Chair before taking over from me in January 2011. We will therefore need to elect another vice-Chair at the end of this year to work alongside John.

As well as being vice-Chair of our TC, John is also Technical Program Chair of ICASSP 2010, to be held in Dallas in March. In my last newsletter article, I described the imminent ICASSP 2010 review process. In the event, we had a total of 620 papers submitted: 510 were in speech, of which we accepted 249; the remaining 110 submissions were in the language area, of which we accepted 59. Hence, the overall acceptance rate in the Speech and Language area was 50%. Of the 620 papers submitted, 600 received four reviews and the remainder had three. All papers received a meta-review from a TC member to minimise the effects of reviewer inconsistency or disagreement. This puts quite a load on TC members, who have to respond very quickly, and I would like to take this opportunity to thank them for their help, especially those who were able to conduct reviews on very short notice! Whilst no reviewing system can be perfect, I believe that our TC has done as much as could reasonably be done to ensure that every paper submitted was assessed fairly. The overall process was managed by the four area chairs: Pascale Fung, TJ Hazen, Thomas Hain and Brian Kingsbury. They did a tremendous job and we owe them a great vote of thanks.

So best wishes again for the New Year and see you all at ICASSP 2010!

Steve Young

Steve Young is Chair, Speech and Language Technical Committee.


2010 Signal Processing Awards

Philip Loizou

SLTC Newsletter, February 2010

Each year the Speech and Language Processing Technical Committee (SLTC) submits nominations to the IEEE Awards Board for Distinguished Lecturer (DL) positions, numerous society awards (e.g., technical achievement, meritorious service, etc.) and best-paper awards. Nominations for the DL were solicited by the SLTC Awards Subcommittee in June 2009, and nominations for the IEEE awards were solicited in August 2009. After receiving all nominations, a poll was administered via the Survey Monkey web site. The SLTC members were asked to rank each candidate for every award. Voters with an apparent conflict of interest were asked to abstain and select the “abstain” button on the web site. Over 95% of SLTC members cast their votes. The names of the top candidates for each award were submitted to the IEEE Awards Board, who in turn selected the following:

Shrikanth (Shri) Narayanan for Distinguished Lecturer (2010-2011) – Dr. Narayanan is currently the Andrew J. Viterbi Professor of Engineering at the University of Southern California. Prof. Narayanan’s contributions are in the broad area of human-centric multimodal signal processing and applications spanning speech, audio, image and bio signal processing. His work uniquely bridges science and engineering through innovative interdisciplinary approaches, and his contributions are at the forefront of the field, integrating fundamental and applied problems in human communication. The DLs are expected to deliver “a message of importance to the technical community”, and the suggested lecture topics put forward in the nomination included: (1) Human-like Speech and Audio Processing, (2) Expressive Human Communication: Automatic Multimodal Recognition and Synthesis of Emotions, (3) Speech-to-Speech Translation: Advances, Open Challenges and Opportunities, and (4) Speaking and Singing Production: Data, Models and Applications.

Isabel M. Trancoso for Meritorious Service Award (shared with Arye Nehorai) – Dr. Trancoso is currently a Professor at Instituto Superior Técnico (IST) in Lisbon, Portugal. This award honors contributions to the Signal Processing Society. Some of Dr. Trancoso’s contributions include her long and extensive service in several European associations affiliated with speech and language processing (e.g., ESCA) as well as her heavy involvement in numerous IEEE-sponsored conferences, workshops and editorships. Among others, she served as Associate Editor of Speech Communication (1993-2002), and she launched the electronic European Student Journal of Language and Speech, with the support of ELSNET and the associations ESCA and EACL, serving as its Editor-in-Chief during its first 3 years. She served as Editor-in-Chief of the IEEE Transactions on Speech and Audio Processing (2003-2005) and Editor of the IEEE Signal Processing Magazine (2006-2008). She was elected to the ESCA (European Speech Communication Association) Board in 1993, and was re-elected in 1997 for a second 4-year term. She was elected member-at-large of the Board of Governors of the Signal Processing Society (2006-2008) and most recently served as a member of the SLTC. In summary, her contributions to the Signal Processing Society and the field of spoken language processing are immense, spanning two continents.

Let us extend our congratulations to Prof. Narayanan and Dr. Trancoso for their accomplishments and contributions to the Signal Processing Society.

Philip Loizou was the Chair of the SLTC Awards Subcommittee which also included Frank Soong and Hideki Kawahara.


IEEE SPS Board of Governors Nominations Open

Jason Williams

SLTC Newsletter, February 2010

The following message has been reproduced from an IEEE Signal Processing Society announcement from José M.F. Moura, SPS Nominations and Appointments Committee.

In accordance with the Bylaws of the IEEE Signal Processing Society, the membership will elect, by direct ballot, THREE Members-at-Large to the Board of Governors (BoG) for three-year terms commencing 1 January 2011 and ending 31 December 2013, as well as ONE Regional Director-at-Large for each of the corresponding regions: Regions 1-6 (US), Regions 7 and 9 (Canada and Latin America), Region 8 (Europe/Middle East/Africa) and Region 10 (Asia/Pacific Rim) for two-year terms commencing 1 January 2011 and ending 31 December 2012. Nominations are requested no later than 19 MARCH 2010.

Regional Directors-at-Large are elected locally by members of the corresponding region. They serve as non-voting members of the Board of Governors and voting members of the Membership Board. They promote and foster local activities (such as conferences, meetings, and social networking) and encourage new chapter development; represent their regions to the core of SPS; offer advice to improve membership relations, recruiting and service to their regions; guide and work with their corresponding chapters to serve their members; and assist the Vice President-Awards in conducting chapter reviews.

Board of Governors Members-at-Large are directly elected by the Society's membership to represent the member viewpoint in Board decision-making. They typically review, discuss, and act upon a wide range of items affecting the actions, activities and health of the Society.

Please do not hesitate to nominate yourself, particularly if you have not yet served on the Board of Governors.

The formal procedures established for this year’s election are:

  1. Publication of a call for nominations for the positions of BoG Members-at-Large and Regional Directors-at-Large. Nominees must hold SPS Member grade to hold elective office (JANUARY).
  2. From the responses received, a list of candidates will be assembled for each election by the Past President, who chairs the Nominations and Appointments Committee, for presentation to the Nominations and Appointments Committee (APRIL).
  3. The Nominations and Appointments Committee reviews the candidates and conducts a ballot to create a short list of at least SIX candidates (by bylaw, at least two candidates must be submitted for each BoG Member-at-Large position becoming vacant) (APRIL-MAY). Currently, there is no minimum number of candidates required for the Regional Directors-at-Large race. Nevertheless, the Society has a stated preference for contested elections, so more than one nomination per race is desirable.
  4. After the N&A Committee ranking ballot, the top candidates who are willing and able to serve for Director-at-Large and Member-at-Large are advanced for ballot to the SPS's voting members (JULY).
  5. Collection and tabulation of returned ballots will be handled by IEEE Technical Activities Society Services Department in a manner that the results of the election can be confirmed by the Board of Governors at its Fall meeting (JULY-SEPTEMBER).
  6. The three candidates receiving the highest number of votes who confirm their ability to serve will be declared elected Members-at-Large to the Board of Governors with three-year terms commencing 1 January 2011, and the candidate from each specified region receiving the highest number of votes who confirms their ability to serve will be declared elected a Regional Director-at-Large with a two-year term commencing 1 January 2011 (SEPTEMBER).

I look forward to receiving your nominations for Regional Directors-at-Large and Members-at-Large via e-mail no later than 19 MARCH 2010. Please send them to Theresa Argiropoulos (t.argiropoulos@ieee.org). However, if you first wish to communicate with me privately before making a nomination, you can reach me at moura@ece.cmu.edu.

In making your nomination, please provide the following information:

   Name of Candidate:
   Candidate Contact Information: (address, phone, fax, e-mail)
   Brief Background of the Individual: (no more than 100 words, please)
   Information about Current SP, IEEE, or Other Society Volunteer Activities.

Thank you for participating in the Society’s nominations process. Your nominations are requested no later than 19 MARCH 2010.

Jason Williams is Principal Member of Technical Staff at AT&T Labs Research. His interests are dialog systems and planning under uncertainty. Email: jdw@research.att.com


IEEE SPS Nominations Open for Two Director Positions

Jason Williams

SLTC Newsletter, February 2010

The following message has been reproduced from an IEEE Signal Processing Society announcement from Michael D. Zoltowski, SPS Vice President-Awards and Memberships.

Michael D. Zoltowski, SPS Vice President-Awards and Memberships, invites nominations for the positions of Director-Industrial Relations and Director-Membership Services. The term for each Director position will be three years (1 January 2011-31 December 2013).

The Director-Membership Services is responsible for identifying value/benefit/services to members, as well as recommending policy/mechanism for recruiting members. The Director is a non-voting, ex-officio member of both the Conference and Publications Boards. The Director-Membership Services also chairs the Membership Services Committee.

The Director-Industrial Relations is responsible for identifying value/benefit/services to attract/retain industrial members. The Director is a non-voting, ex-officio member of both the Conference and Publications Boards. The Director-Industrial Relations also chairs the Industrial Relations Committee.

Nominations should be submitted to Michael D. Zoltowski, SPS Vice President-Awards and Memberships, via mikedz@ecn.purdue.edu. When making your nomination, please provide the following information:

   Name of candidate
   Contact information
   Brief background
   Information about current SP, IEEE, or other Society volunteer activities.

Nominations must be received no later than 15 February 2010.

Jason Williams is Principal Member of Technical Staff at AT&T Labs Research. His interests are dialog systems and planning under uncertainty. Email: jdw@research.att.com


Cloudy with a chance of ... speech

Danilo Giulianelli and Giuseppe (Pino) Di Fabbrizio

SLTC Newsletter, February 2010

This article provides a quick introduction to speech processing in the cloud, an approach that addresses the scalability challenges of speech-enabled applications running on mobile devices.

Cloud computing for the enterprise

In October 2009, Amazon Web Services (AWS), a division of Amazon.com, hosted an event in New York City about "AWS Cloud for the Enterprise". The event was open to technology and business stakeholders interested in learning about the latest bleeding-edge technologies in cloud computing. The auditorium was crowded with more than 200 attendees representing a broad range of enterprises from finance, entertainment, telecommunications, insurance, pharmaceuticals, and more. An unusual mix of industries congregated in one place to listen to the latest news about outsourcing computational and networking resources into the cloud. In his opening remarks, Dr. Werner Vogels, chief technology officer and vice president of Amazon.com, defined cloud computing as "a style of computing where massively scalable IT-related capabilities are provided 'as-a-service' across the Internet to multiple external customers." Right after the introduction, happy customers such as Netflix, Wired Magazine, Nasdaq OMX, Sony Music, and New York Life Insurance shared their experiences with the attentive audience. Then the stage passed to a series of technical presentations pointing out security tricks, architectural configurations, and step-by-step procedures to get you started putting enterprise applications into the AWS cloud.

What motivated the devoted participants is a concrete and affordable new way to get immediate access to highly available and powerful hardware on demand, without the burden of hosting and managing the entire infrastructure needed to run your own data center. Instead of physical servers, computing resources are 'dispersed' in the Internet cloud, where IT managers don't need to know where the machines are physically located and don't have to worry about hardware maintenance. Computing resource virtualization is not new to IT companies, however: before cloud computing, approaches such as distributed computing, software as a service (SaaS), and service-oriented architectures (SOA) exploited the same concept. But only recently have companies like Amazon created the infrastructure to improve the way hardware procurement is handled and streamlined the process of bringing a new service online.

The success of cloud computing is also due to the introduction of sophisticated management and monitoring capabilities (e.g., Amazon EC2 auto-scaling), where the platform is capable of monitoring system resources and automatically scaling those resources based on parameters such as CPU and memory utilization, or network traffic. The cloud infrastructure as a service (IaaS) also enforces complete user separation by providing both CPU and network virtualization, i.e., not only are the processes from a given user separated from other users' processes, but network traffic is also kept separate, and each running virtual image only receives and processes traffic destined for it.

Besides the Amazon EC2 product, there are other examples of commercial cloud computing infrastructures, including GoGrid, SunCloud, 3tera, Eucalyptus, and open source solutions such as Nimbus and Xen Hypervisor. There are also cloud platforms available, delivering a computing platform with a complete solution stack (e.g., running native cloud applications) to customers. Cloud platforms include, for example, Heroku (Ruby on Rails), Google App Engine (Java), Microsoft Azure (.Net), Rackspace Cloud (PHP), Salesforce (Java/C# like), and, on the open source side, GridGain (Java).

One of the biggest advantages of cloud computing is that users can convert capital expenditures into operating expenses, since there is no hardware or software to buy and they only pay for whatever resources are being used. The cloud computing model in fact promotes acquiring resources on demand, releasing them when no longer needed, and paying for them based on usage. From the user's perspective, though, it is as if the cloud had infinite capacity. Scalability is then achieved by providing APIs that allow starting and stopping a new server instance in a matter of minutes instead of weeks, as typically required with a traditional in-house procurement process. Besides scalability, other important properties addressed by cloud computing are cost-effectiveness, reliability, high availability, and security.

Speech in the cloud

Cloud computing seems ideal for today's speech processing computational needs. To date, there are two major areas where this approach has been tested: in traditional speech processing applications such as IVR (Interactive Voice Response) systems and in more recently exploited web-based speech processing services or speech mashups [1][2].

IVR services typically require dedicated hardware and closed software in order to support speech in real time, and are usually very CPU intensive. With CPU virtualization, adding and removing processing resources can be driven by channel utilization rather than by servers pre-allocated (and often under-utilized) on the basis of traffic analysis. An example of a cloud-based communication approach that also includes IVR-like functionality is Tropo.com, where typical telecommunication functions are exposed on demand through common APIs.

The second class of services relies on network-enabled devices such as BlackBerry, iPhone and Android-based mobiles, which are capable of accessing the Internet at broadband (e.g., 3G, WiFi) speeds and of capturing/playing the user's speech directly over the data channel. In this case, speech processing resources can be seen as network endpoints with a web-based interface that can be easily replicated in the cloud model by creating new cloud instances. Mobile services with this model are usually created by aggregating different web services with the speech processing component (e.g., speech mashups). In general, this class of services also applies to any device supporting HTTP, such as desktops, laptops, netbooks, and PDAs.

Considering that Apple recently announced 3 billion worldwide App Store downloads and that the number of applications available for the iPhone alone has passed the 140,000 mark, it is fair to expect that more and more of these applications will eventually be voice enabled. The uncertainty of the success of such applications and the unpredictability of the traffic hitting the speech processing engine make this scenario an ideal candidate for putting the speech engine into the cloud, enabling the service provider to easily scale the processing power up or down according to traffic demand, without the need to buy and manage dedicated hardware in advance.

Figure 1: AT&T Speech Mashup Cloud Architecture

To validate and measure the performance of this model, the AT&T speech mashup prototyping platform has been ported to the Amazon cloud (Figure 1). In this illustrative architecture, the Speech Mashup Manager (SMM) runs on its own virtual instance and takes care of the user accounts and resource management databases. It also forwards all the speech mashup client requests to the appropriate speech service (either WATSON ASR or Natural Voices TTS). The ASR and TTS engines are bundled together on the same machine image, and at runtime the instance can be automatically configured to run either one or both services. To support auto-scaling, an elastic load balancer sits in front of the ASR/TTS instances and balances the load by sending requests to the instances in a round-robin fashion. In a minimal configuration the system requires one instance for the SMM, one load balancer, and one instance for the ASR/TTS engines. The system is currently configured so that if the CPU utilization on the ASR/TTS engine instance reaches 80 percent, the Amazon auto-scaler will start another ASR instance and automatically add it to the load balancer pool.
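
To make the scaling rule concrete, here is a minimal Python sketch of the decision logic described above and in the test results further below: add an ASR/TTS instance when average CPU utilization stays above 80 percent for a sustained period, and release one when it falls back below 40 percent. The functions get_avg_cpu, start_instance and stop_instance are hypothetical placeholders, not the actual Amazon auto-scaling API, which performs this monitoring automatically.

    import time

    # Illustrative scaling policy only: the thresholds mirror the configuration
    # described in the article (scale up above 80% CPU, scale down below 40%),
    # but the monitoring and instance-management calls are stand-ins.
    SCALE_UP_CPU = 80.0       # percent
    SCALE_DOWN_CPU = 40.0     # percent
    SUSTAIN_SECONDS = 600     # load must persist for ten minutes
    POLL_SECONDS = 60         # how often the pool is checked

    def autoscale(pool, get_avg_cpu, start_instance, stop_instance):
        """Grow or shrink the ASR/TTS pool based on sustained CPU load."""
        over_since = under_since = None
        while True:
            cpu = get_avg_cpu(pool)          # average utilization across the pool
            now = time.time()
            if cpu >= SCALE_UP_CPU:
                over_since = over_since or now
                under_since = None
                if now - over_since >= SUSTAIN_SECONDS:
                    pool.append(start_instance())     # add capacity
                    over_since = None
            elif cpu <= SCALE_DOWN_CPU and len(pool) > 1:
                under_since = under_since or now
                over_since = None
                if now - under_since >= SUSTAIN_SECONDS:
                    stop_instance(pool.pop())         # release spare capacity
                    under_since = None
            else:
                over_since = under_since = None       # reset both timers
            time.sleep(POLL_SECONDS)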

The database containing the user accounts and application grammars is instead stored on an Elastic Block Storage volume mounted on the SMM instance. In order to make the user and built-in grammars available to the instances running the WATSON ASR engine, a virtual tunnel over TCP/IP is set up between the SMM and the ASR/TTS instances. The file system containing the grammars is then NFS-mounted over the tunnel to allow the WATSON ASR engines to view and receive updated grammars in real time.

Figure 2: ASR CPU utilization during traffic load simulation

Testing the speech mashups in the cloud involved using a traffic simulator that generates a random traffic pattern based on a Poisson process [3]. The ASR engine was stressed at different traffic loads and grammar sizes. In all test conditions, the auto-scale trigger was activated when the CPU load exceeded 80% for more than ten minutes, and a new instance was created within a few minutes to absorb the extra load (Figure 2). Conversely, the extra instance was removed from the ASR pool when the load fell back below the 40% threshold.
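
As an illustration of the kind of traffic simulator mentioned above, the short Python sketch below draws exponentially distributed inter-arrival gaps, which yields a Poisson arrival process. The rate and duration are arbitrary example values; the actual AT&T load generator is not shown here.

    import random

    def poisson_arrivals(rate_per_second, duration_seconds):
        """Yield request timestamps following a Poisson process with the given rate."""
        t = 0.0
        while True:
            t += random.expovariate(rate_per_second)   # exponential inter-arrival gap
            if t >= duration_seconds:
                break
            yield t

    # Example: simulate ten minutes of recognition requests at an average of
    # 2 requests/second; each timestamp would trigger one (hypothetical) ASR request.
    timestamps = list(poisson_arrivals(rate_per_second=2.0, duration_seconds=600))
    print(len(timestamps), "requests generated")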

Reliability and security

Compared to traditional carrier-grade telephony services, where uptime is typically a solid 99.999% (about five minutes of downtime per year), cloud computing offers fewer 'nines', with an uptime on the order of 99.95% (roughly four and a half hours of downtime per year; see, for example, the Amazon EC2 SLA). In fact, service instances running in a cloud infrastructure can sometimes crash or slow down when neighboring instances (e.g., processes running on the same physical server) are heavy CPU or network users. Under heavy load, load balancers can stop routing traffic and network latencies can suddenly increase. A solution to this problem is architecting services for redundancy and handling failures gracefully. For example, running multiple instances in different geographical locations (i.e., deploying services in physically distinct data centers) and taking frequent snapshots of the service data can help mitigate downtime in case of failures. In terms of user experience, however, failures usually translate into sluggish response times, which can be alleviated with visual feedback on the mobile screen.

Another frequently asked question surrounding cloud computing concerns security. How secure is the cloud environment? The answer is: as secure as the service designer is willing to make it, which translates into spending more time architecting application security. This typically involves several techniques: 1) enforcing strict firewall policies; 2) opening only the minimum set of communication ports necessary to provide the service; 3) using secure shell to connect to instances; 4) disabling privileged logins; 5) encrypting data for extra protection; etc. This process is also known as system hardening and can be consolidated by using tools such as Bastille for Linux.

Conclusion

Speech processing in the cloud is a viable solution for applications running on mobile devices, desktops, laptops and IVR systems. The infrastructure is highly available, economical, scalable, and reliable. Moreover, the cloud model ultimately converts capital expenses into operational costs, allowing service providers to focus on delivering new technologies and applications and thus reducing time to market. As an example, the AT&T Speech Mashup prototyping framework [1] was successfully ported to the Amazon cloud, and the current speech mashup applications were migrated to the cloud framework.

Finally, it is easy to predict that the benefits of cloud computing will soon make traditional IT solutions obsolete and will, in the near future, win over more and more speech-enabled application providers.

References

[1] Giuseppe Di Fabbrizio, Jay G. Wilpon, Thomas Okken, A Speech Mashup Framework for Multimodal Mobile Services, The Eleventh International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), Cambridge, MA, USA, November 2-6, 2009.

[2] A. Gruenstein, I. McGraw, I. Badr, The WAMI toolkit for developing, deploying, and evaluating web-accessible multimodal interfaces, ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces, ACM, 2008, 141-148

[3] James F. Brady, Load Testing Virtualized Servers Issues and Guidelines

[4] George Reese, Cloud Application Architectures - Building Applications and Infrastructure in the Cloud, O'Reilly, 2009.

If you have comments, corrections, or additions to this article, please contact the authors: Giuseppe (Pino) Di Fabbrizio, pino [at] research [dot] att [dot] com, Danilo Giulianelli, danilo [at] research [dot] att [dot] com.

Danilo Giulianelli is Principal Member of Technical Staff at AT&T Labs Research.

Giuseppe (Pino) Di Fabbrizio is Lead Member of Technical Staff at AT&T Labs Research.


A Brief Overview of Meeting Understanding Tasks

Dilek Hakkani-Tür

SLTC Newsletter, February 2010

Meetings provide an efficient way of interaction, and create unique knowledge sharing opportunities between people with different areas of expertise. Every day, especially within organizations, people meet for various reasons, such as discussing issues, task assignments, and planning. Although meetings are so common, there is still no globally adopted automated or semi-automated mechanism for tracking meeting discussions, following up with task assignments, saving the meeting content for later use by participants or non-participants, or checking the efficiency of meetings.

Recently, the availability of meeting corpora such as the ICSI [1] and AMI [2] corpora, and shared task evaluations such as the ones performed by NIST, have facilitated research on automatic processing of meeting speech. The ICSI corpus [1] consists of 75 natural meetings, with a variable number of participants, that took place at ICSI. The AMI corpus [2] is a multi-modal, scenario-driven data collection from meetings of 4 participants each. More generally, stored audio of human-human communication has grown rapidly, providing ample source material for later use. In particular, the prominence of voice search as a basic user activity has increased significantly, along with advances in speech recognition and retrieval.

However, there are still open questions: What information from these meetings would be useful for later use? Does this information depend on the user or purpose? How could the transcripts of these conversations be used? Our goal in this article is not to answer these questions, but to provide a brief (and far from complete) overview of some of the research done in the last decade to process meetings and provide access to their contents:

  1. Meeting summarization, which aims to condense the discussion into its most important content [3,4,5,6,7].
  2. Action item detection, which aims to find utterances where tasks are assigned to participants [8].
  3. Decision detection, which aims to identify where decisions are made during the discussion [9].

All of these tasks, along with lesser-studied tasks such as argument diagramming [10] and detection of sentiments [13], agreements and disagreements [11,12], would greatly assist the ability to automatically browse, summarize, and graphically visualize various aspects of the spoken content of meetings.

References:

  1. E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, and H. Carvey. The ICSI meeting recorder dialog act (MRDA) corpus. In SIGdial Workshop on Discourse and Dialogue, 2004.
  2. J. Carletta, S. Ashby, S. Bourban, M. Guillemot, M. Kronenthal, G. Lathoud, M. Lincoln, I. McCowan, T. Hain, W. Kraaij, W. Post, J. Kadlec, P. Wellner, M. Flynn, and D. Reidsma. The AMI meeting corpus. In Proceedings of MLMI 2005, Edinburgh, 2005.
  3. G. Murray, S. Renals, and J. Carletta. Extractive summarization of meeting recordings. In Proceedings of Interspeech. September, 2005.
  4. S. Xie and Y. Liu. Using Corpus and Knowledge-Based Similarity Measure in Maximum Marginal Relevance for Meeting Summarization. In Proceedings of the ICASSP. Las Vegas, NV, 2008.
  5. K. Riedhammer, D. Gillick, B. Favre, and D. Hakkani-Tur. Packing the Meeting Summarization Knapsack. In Proceedings of Interspeech. Brisbane, Australia, 2008.
  6. K. Riedhammer, B. Favre, and D. Hakkani-Tur. A Keyphrase-Based Approach to Interactive Meeting Summarization. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT). Goa, India, 2008.
  7. S. Banerjee and A. Rudnicky. Detecting the Noteworthiness of Utterances in Human Meetings. In the Proceedings of the 2009 SIGDial Conference on Dialog and Discourse Research. London, UK. 2009.
  8. M. Purver, P. Ehlen, and J. Niekrasz. Detecting action items in multi-party meetings: Annotation and initial experiments. In Proceedings of Machine Learning for Multi-modal Interaction. Springer-Verlag. 2006.
  9. R. Fernandez, M. Frampton, P. Ehlen, M. Purver, and S. Peters. Modelling and detecting decisions in multi-party dialogue. In Proc. of the 9th SIGdial Workshop on Discourse and Dialogue. 2008.
  10. R. Rienks, D. Heylen, and E. van der Weijden. Argument diagramming of meeting conversations. In Vinciarelli, A. and Odobez, J., editors, Multimodal/Multiparty Meeting Processing, Workshop at the 7th Intl. Conference on Multimodal Interfaces, 2005.
  11. D. Hillard, M. Ostendorf, and E. Shriberg. Detection of agreement vs. disagreement in meetings: training with unlabeled data. In Proc. of HLT/NAACL, 2003.
  12. M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proc. of ACL, pages 669-676, 2004.
  13. T. Wilson. Annotating subjective content in meetings. In Proc. of LREC 2008, Marrakech, Morocco, 2008.

Dilek Hakkani-Tür is a Senior Research Scientist at the International Computer Science Institute (ICSI), with research interests in natural language and speech processing, spoken dialog processing, and active and unsupervised learning for language processing. Email: dilek@icsi.berkeley.edu


Linguists Targeted for Redundancy at King's College London

Ruth Kempson

SLTC Newsletter, February 2010

The management of King's College London has embarked on a program of radical restructuring in both its Science and Humanities Faculties. This program involves redefining the research missions of departments in these schools under thematic areas that, in Humanities at least, downplay many major areas of research and teaching, and explicitly exclude linguistics from any department.

As part of this process, Jonathan Ginzburg has been targeted for redundancy on the grounds that he no longer fits the redefined group for applied logic and the theory of computing in the Computer Science Department. With testimony from leading international researchers that his work on the logical analysis of interaction falls directly within the new group description, this matter is now under appeal.

Professor Shalom Lappin and Dr. Wilfried Meyer-Viol of the Philosophy Department have been summarily told that the College is "disinvesting" from computational linguistics, and that the plan for restructuring the School of Arts and Humanities will mean the elimination of their positions in Philosophy by August 31. The move is an attempt to use computational linguistics, a non-existent entity in Philosophy, to target these two researchers. Lappin's research is fundamentally interdisciplinary, integrating core areas of philosophy, specifically intensional logic, formal semantics, and philosophy of language, into cognitive science, machine learning, and computational learning theory. Meyer-Viol is first and foremost a philosophical logician who does 75% of the logic teaching in the University and who publishes in this area, in formal grammar, and on issues at the syntax/semantics interface. They are fully integrated into the Department's research, teaching, and administrative activities. The School is also targeting other linguists, and other departments. It is worth pointing out that the King's Philosophy Department was ranked third in the UK in the 2008 Research Assessment Exercise, and the Principal, Rick Trainor, has referred to the Philosophy Department as "the jewel in the crown" of the College, so this is an extraordinary way to encourage one of its strongest departments, ironically at a time when interdisciplinary research is very generally recognized by national and international funding bodies as the way forward.

The issue is, however, wider than this, since these moves are being made as an instance of a broad procedure affecting the entire academic community of two primary Schools of the College. Moves to give people notice are going on in parallel with the supposed consultation, with no attempt at finding alternative ways of reducing the salary budget: indeed, all academics in Humanities have been told that they are in effect “at risk of redundancy”, even though the consultation process is only now beginning to take place. This is part of a pattern of widespread cuts and “restructuring” currently under way in British universities, but for such a distinguished institution of research, scholarship, and teaching to allow management-led procedures to determine both academic content and the individuals and areas to be targeted sets a very dangerous precedent for the suppression of academic freedom and the destruction of the autonomy of research across the UK and beyond.

We urge you to write to the management at King's to protest these actions. The people to send email to are:

If you have comments or responses to this article, please leave a comment below, or contact the author: Ruth Kempson (ruth.kempson@kcl.ac.uk).

Ruth Kempson FBA is Emeritus Professor of Linguistics at King's College London.


Setting up the Interaction Lab

Oliver Lemon

SLTC Newsletter, February 2010

Setting up a new research group presents special challenges and opportunities. This article describes the work of the Interaction Lab, and some of the main issues we considered when setting up this new research facility.

The Interaction Lab

It is just over three months since our unique group of five postdoctoral researchers moved to the School of Mathematics and Computer Science (MACS) at Heriot-Watt University, Edinburgh, to set up a new research lab. The Interaction Lab was created to focus on the data-driven development and evaluation of novel multimodal interfaces, using machine learning techniques to achieve increasing levels of adaptivity and robustness.

Setting up a new research lab is an exciting opportunity, which presents a variety of challenges. Reflecting on our progress in the past months, the most important stages in our development have been:

Regarding our vision and mission statement, it is important to develop shared goals about the long-term aspirations for our research, and especially to allow room for future growth and exploration of new research areas. A focus on computational models of "Interaction" is allowing us to develop from our core expertise in machine learning for data-driven multimodal interaction, into new explorations of statistical modelling in interactive systems more widely, for example in environmental and ecological modelling, robotics, and affective computing.

In terms of our hardware setup, we are developing the capacity to deploy speech interfaces to the general public. One of the current challenges in Spoken Dialogue System development is testing and gathering data in realistic environments with self-motivated users, rather than in staged lab experiments with scripted scenarios. Addressing this is a key part of our strategy. We are setting up a facility to collect research data from real users, rather than in lab-based experiments, and to test our systems "in the wild". We will soon have the capability to handle multiple parallel calls, both from standard phones and via VoIP. This is possible through a plug-in to our lab-based dialogue system that automatically generates VoiceXML on the fly. The pilot system will enable visitors during the popular Edinburgh Festival to use our dialogue systems to get information about places of interest in Edinburgh, such as restaurants.
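
As a rough illustration of the idea (not the Interaction Lab's actual plug-in), the Python sketch below wraps a prompt and a grammar chosen by a dialogue manager into a minimal VoiceXML form for one system turn; the URLs are hypothetical placeholders.

    from xml.sax.saxutils import escape, quoteattr

    def turn_to_vxml(prompt, grammar_url, submit_url):
        """Return a minimal VoiceXML document for a single system turn."""
        lines = [
            '<?xml version="1.0" encoding="UTF-8"?>',
            '<vxml version="2.1">',
            '  <form id="turn">',
            '    <field name="user_input">',
            '      <prompt>%s</prompt>' % escape(prompt),
            '      <grammar src=%s type="application/srgs+xml"/>' % quoteattr(grammar_url),
            '      <filled>',
            '        <submit next=%s namelist="user_input"/>' % quoteattr(submit_url),
            '      </filled>',
            '    </field>',
            '  </form>',
            '</vxml>',
        ]
        return "\n".join(lines)

    # Example turn for a tourist-information dialogue; URLs are placeholders.
    print(turn_to_vxml("What kind of restaurant are you looking for?",
                       "http://example.org/grammars/cuisine.grxml",
                       "http://example.org/dialogue/next_turn"))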

A sustainable funding stream is obviously vital to provide continuity of our research and to develop our research environment. We currently have funded projects in the region of £1 million. These projects cover an exciting range of techniques and applications, for example combining vision and multitouch interaction for Technology Enhanced Learning, and applying statistical planning methods in spoken dialogue systems. This funding is a combination of European Commission FP7 (see the CLASSiC project), EPSRC, and ESRC projects (see the ECHOES project). We are also actively involved in new funding proposals, in particular targeting the European Commission's Cognitive Systems and Language Technologies calls. We collaborate with many other institutions and with industry -- for example Cambridge University, the Institute of Education (London), and France Telecom/Orange Labs.

Lab members

The Interaction Lab would not exist without its researchers. We have brought together an excellent team of individuals, each with an international research profile, with a unique blend of complementary skills and experience:

Acknowledgements and more information

Thanks to the Scottish Informatics and Computer Science Alliance (SICSA) for the funding which created the Interaction Lab. We also thank the School of Mathematics and Computer Science at Heriot-Watt University for providing a nourishing, supportive, and stimulating modern research environment.

For more information, see:

If you have comments, corrections, or additions to this article, please contact the author: Oliver Lemon, o.lemon [at] hw [dot] ac [dot] uk.

Oliver Lemon is leader of the Interaction Lab at Heriot-Watt University, Edinburgh. His interests are in statistical learning approaches to multimodal interaction. Email: o.lemon@hw.ac.uk

Recent publications include:


Text Analysis Conference 2009: An Overview

Annie Louis

SLTC Newsletter, February 2010

The annual Text Analysis Conference (TAC) provides a setting where participant systems are evaluated on common test sets to create direct comparisons, to develop and test evaluation methods, and to assess system progress. There were three main tracks at this year’s evaluation: Knowledge Base Population, Update Summarization, and Recognizing Textual Entailment. More than 50 teams from countries around the world, including Australia, China, Japan, Israel, India and several parts of Europe, took part in these tracks. The results of this year’s evaluation were presented at the workshop held in November 2009, which featured technical sessions with presentations and posters from the participants as well as overviews of system performance analyzed by NIST.

The Knowledge Base Population (KBP) track was conducted for the first time at TAC this year. The focus was on systems that can mine specific pieces of information from a large text collection with a view to adding them to an existing knowledge base. Over a million newswire documents were assembled by the Linguistic Data Consortium (LDC) to serve as a corpus for the various search tasks. The knowledge base was compiled from Wikipedia “info boxes”, the section in each Wikipedia article listing the most important snippets of information about the entity discussed. An example snapshot from the page for ‘Max Planck’ is below.

Wikipedia infobox snapshot for Max Planck

One of the tasks for the KBP systems was “entity linking”. Given the name of an entity and a document that mentions the name, systems had to determine whether the entity being referred to existed in the knowledge base. Systems need to disambiguate the mention using the provided document to distinguish it from others with the same name; for example, the name ‘Colorado’ could refer to a state, a river or a desert. A link to the entry in the knowledge base must then be provided if the entity already exists in it. Systems performed well overall on the entity linking task. There were, however, several highly ambiguous mentions that most systems had problems disambiguating. In the KBP “slot filling” task, systems must use the collection of newswire documents to obtain missing pieces of information for an entity in the database. The entities could be people, organizations or locations, and a list of desired attributes was defined for each of them.
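
Many entity-linking systems score the document containing the mention against the text associated with each knowledge-base candidate, linking to the best match only if the score is high enough. The Python sketch below shows this general idea with a simple cosine similarity over word counts; the candidate entries and threshold are illustrative, not TAC data or any particular participant's approach.

    from collections import Counter
    import math

    def cosine(a, b):
        """Cosine similarity between two Counter word-count vectors."""
        dot = sum(a[w] * b[w] for w in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def link_entity(mention_doc, candidates, threshold=0.1):
        """Return the best-matching KB id, or None if the entity seems absent."""
        doc_vec = Counter(mention_doc.lower().split())
        scored = {kb_id: cosine(doc_vec, Counter(text.lower().split()))
                  for kb_id, text in candidates.items()}
        best_id, best_score = max(scored.items(), key=lambda kv: kv[1])
        return best_id if best_score >= threshold else None

    # Toy knowledge base with two 'Colorado' candidates (illustrative only).
    kb = {"Colorado_(state)": "state united states denver rocky mountains",
          "Colorado_River": "river flows grand canyon gulf of california"}
    print(link_entity("The Colorado flows through the Grand Canyon", kb))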

Continuing along the lines of the previous few years, this year’s summarization track focused on the creation of update summaries. Two sets of 10 newswire documents each were given to the participants, with the second set containing articles published later than the first. Systems should assume that the user has read the first set of documents; they must create a single summary of the second collection, compiling only the new information as an update summary for the user. It was observed from this year’s evaluations that selecting good content for the update summaries remains difficult for systems, with most of them obtaining scores significantly below human performance on this task.

This year NIST also introduced an evaluation track where participants could submit their own automatic metrics for summary evaluation. The currently used automatic metric, ROUGE, is a suite of n-gram statistics that compare a system summary with gold-standard human summaries, both written for the same source article. The degree of overlap gives a measure of content quality which has been shown to correlate highly with human judgments. Track participants were given the output of systems from the update task for scoring. The scores submitted by the participants were then checked for correlation with those assigned manually by NIST assessors. Interestingly, some of the participant metrics outperformed, or did comparably to, some of the ROUGE variants in obtaining good agreement with human judgments.
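
For readers unfamiliar with ROUGE, the Python sketch below computes a ROUGE-N-style recall score (here with N=1): the clipped fraction of reference n-grams that also appear in the system summary. The real ROUGE toolkit adds stemming, multiple references, and several other variants; this is only a minimal illustration.

    from collections import Counter

    def ngrams(tokens, n):
        """Count the n-grams in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(system_summary, reference_summary, n=1):
        """Fraction of reference n-grams covered by the system summary (clipped counts)."""
        sys_counts = ngrams(system_summary.lower().split(), n)
        ref_counts = ngrams(reference_summary.lower().split(), n)
        overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
        total = sum(ref_counts.values())
        return overlap / total if total else 0.0

    # Tiny example: unigram recall of a system summary against one reference.
    print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))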

The third track was Recognizing Textual Entailment (RTE). Given two pieces of text, the task of an RTE engine is to predict whether one entails the other. Recognizing entailment relationships between texts is useful for a variety of tasks; in summarization, for example, such a system could help remove redundant facts. While systems are usually given two self-contained pieces of text for predicting entailment, a new setup was also introduced this year in which the candidates for entailment judgments included the full collection of sentences in a set of documents. This setup poses two challenges. First, systems need to work with a realistic distribution of entailment examples, where positive examples are considerably fewer than non-entailing ones. Second, individual sentences of the collection may not contain all the information needed for prediction. Indeed, runs of this pilot task brought out several issues that systems must address to do better on the task. For example, both within-document and cross-document references to people, dates and places need to be successfully resolved for entailment prediction.

In addition, this year all RTE participants were required to present ablation results for the knowledge resources they used. Participants were asked to submit results obtained by leaving out one resource at a time and running their system with the remaining modules. In this way, one could analyze which resources, when removed, had the greatest impact on performance. The results showed that the impact of resources varied greatly depending on how systems used them. For example, removing WordNet, one of the most commonly used resources, hurt performance for some systems while improving it for others. Still, such analyses are helpful for understanding which resources are beneficial overall, so that they can be shared to assist faster system development.

In sum, the new tasks that were introduced, and the evaluations themselves, brought out several relevant issues in task design and evaluation for text analysis systems. Participants shared their experiences with one another at the talk and poster sessions. More information is available at: http://www.nist.gov/tac/

If you have comments, corrections, or additions to this article, please contact the author: Annie Louis, lannie [at] seas [dot] upenn [dot] edu.


Researchers developing way to synthesize speech with a brain-machine interface

Matthew Marge

SLTC Newsletter, February 2010

One of the great challenges for near-total paraplegics is the inability to communicate effectively with others. This problem is often due to the lack of control of the articulators responsible for speech production. Within the past decade, researchers in neurology have developed "brain-machine interfaces" - or BMIs - that allow people to communicate with only their thoughts. Advances in this field are now allowing a paraplegic patient to control a real-time speech synthesizer using a BMI that reads the neural signals related to the production of speech.

Until recently, the most advanced communication technology for paraplegics was mental control of typing keys - at a rate of about one word per minute. Keys can be selected when a brain-machine interface reads a patient's neural signals from the motor cortex of the brain. Unfortunately, this method of communication, although beneficial, makes it challenging for patients suffering from paralysis to hold live conversations with others, such as family members or caretakers. Dr. Frank Guenther, Professor of Cognitive and Neural Systems at Boston University, and his collaborators are leading an innovative next step forward for BMIs - reading a paraplegic patient's thoughts and converting them to speech using a speech synthesizer [1].

The patient in this study (male, 26) suffers from "locked-in syndrome," which is almost complete paralysis, but with full control of cognitive abilities. The only voluntary control that the patient has is his ability to blink. For Guenther and his group, the focus was to locate the areas in the brain that control speech and its associated articulatory motor controls. They determined the precise areas that relate to the production of formant frequencies, the spectral representation of speech. Their primary goal for this study was to determine when the patient was attempting to produce vowels with his thoughts - a goal that could potentially be achieved just by learning about the brain signals that control formant frequencies in speech.

Once electrodes are implanted in the brain, a synthesis BMI can read the neural signals that pass through them - wirelessly. In this system setup, when the patient's brain produces signals that correlate with the production of formant frequencies, an FM receiver picks up the neural signals transmitted (after amplification) by the implanted electrodes. Guenther and his colleagues used an existing model, the DIVA model, to determine where to place the electrodes in the brain [2]. These signals are then digitized by a recording system, which converts them into "spikes". The spikes are then processed into the first and second formant frequencies, F1 and F2. This information is passed as input to a speech synthesizer, which provides almost immediate speech-based feedback to the patient.
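
To make the pipeline concrete, here is a much-simplified Python sketch of the frame-by-frame decoding step: per-unit spike counts are mapped to a pair of formant frequencies that would then drive a formant synthesizer. The linear decoder and its weights are placeholders for illustration only; the actual system uses a decoder trained on the patient's own neural data, and the synthesizer itself is not shown.

    import numpy as np

    N_UNITS = 16                                     # recorded neural units (placeholder)
    rng = np.random.default_rng(0)
    W = rng.normal(scale=10.0, size=(2, N_UNITS))    # decoder weights (placeholder values)
    b = np.array([500.0, 1500.0])                    # bias near a neutral F1/F2 in Hz

    def decode_formants(spike_counts, frame_seconds=0.05):
        """Map per-unit spike counts in one frame to an (F1, F2) pair in Hz."""
        rates = np.asarray(spike_counts, dtype=float) / frame_seconds  # firing rates
        f1, f2 = W @ rates + b
        # Clamp to a plausible vowel space so a downstream synthesizer stays stable.
        return float(np.clip(f1, 200.0, 1000.0)), float(np.clip(f2, 600.0, 3000.0))

    # One step of the real-time loop: read a frame of spikes, decode formants, and
    # hand them to a formant synthesizer (not shown) for immediate audio feedback.
    frame = rng.poisson(3, size=N_UNITS)             # stand-in for one 50 ms frame of spikes
    print(decode_formants(frame))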

The system's accuracy at determining the patient's intended vowels increased with practice. In practice sessions, the patient was instructed to listen to an example of the vowel he should produce (e.g., the "e" sound in "heat") and then think about speaking that vowel. Response time from the speech synthesizer after reading the patient's neural signals was almost instantaneous - about 50 ms. Over a period of 25 sessions, the patient's accuracy with BMI-based control of the synthesizer improved dramatically - from 45% to 70%.

Their key finding was that implanted neural electrodes can potentially provide a reliable way for patients to communicate using a real-time speech synthesizer, dramatically increasing the pace of conversation that paraplegics can have with others. This research can also help us understand how the areas of the brain associated with speech motor control transmit information. This BMI is the first of its kind to be controlled wirelessly by an implanted electrode - no other major hardware beyond an FM receiver and a computer is needed. Guenther and his colleagues believe that their results could be improved by implanting more electrodes in the speech areas of the brain.

While vowel production by the patient has become fairly accurate with practice (70%), open questions remain when it comes to producing consonants. Guenther and his colleagues suggest one novel approach that holds promise when combined with their current work: reading the neural signals that control the motor pathways to the articulatory speech muscles. Although this strategy has potential, it needs to be tested with people who have motor control of the speech articulators. They believe that this approach will be similar to what has been done with mouse-cursor-control BMIs. Despite the serious challenges that lie ahead for these researchers, the end goal is worth their efforts - those without the ability to speak may one day be able to hold real-time conversations with only their thoughts.

For more information, please see:


ISCA Student Activities Update

Samer Al Moubayed, Marco A. Piccolino-Boniforti, Antonio Roque, Sylvie Saget, Sunayana Sitaram, Marcel Waeltermann

The ISCA Student Advisory Committee (ISCA-SAC) was established in 2005 by the International Speech Communication Association (ISCA) to organize and coordinate student-driven projects. We were especially active in 2007 and 2008, were pleased to continue our efforts in 2009, and look forward to 2010.

Interspeech Student Panel Discussion

The ISCA-SAC organized a student panel discussion session during Interspeech 2009. The aim was to offer students a fruitful discussion with senior researchers regarding different ways of doing research. This year, we addressed the question of 'Research in Academia vs. Research in Industry', and the invited speakers were Geoffrey Zweig (Microsoft Research), Philippe Bretier (Orange Labs) and Mike McTear (University of Ulster).

ISCA 10th Anniversary Design Contest

This year, ISCA is celebrating its 10th anniversary. To promote this event, a poster design contest on the topic of the 10th anniversary was announced by ISCA-SAC in early 2009. It turned out to be a really difficult task to select a winner from the 72 entries. The ISCA Board selected five posters for the final round, consisting of contributions by Karem Laguna, Robert Lawley, Jamal Issawi, Alberto Forcada and Johanna Vinterhav. Out of these finalists, one poster came out on top according to the jury's judgement: the winner of the 'ISCA 10th Anniversary Poster Design Contest' is Jamal Issawi (Lebanon). The winning poster was presented by the president of ISCA, Prof. Isabel Trancoso, at the opening ceremony of the Interspeech 2009 conference.

Online Grant Application System

ISCA-SAC managed the Online Grant Application System (OGAS) during Interspeech 2009. OGAS was developed in 2008 by student volunteers to facilitate the grant application process for both the applicant and the ISCA grants coordinator.

Communications

One of ISCA-SAC's main goals is the transition of the website to Web 2.0 while simplifying access to the large range of information it hosts. A first version of the newly developed website, including a blog, was launched in late 2009.

To develop a more interactive exchange with students and senior researchers working in the field of speech communication science and technology, ISCA-SAC has embraced new methods of social networking. A Facebook group and a Twitter page were created in early July 2009; they already have 60 and 16 members, respectively.

Transcription of Saras Institute Interviews

In conjunction with the IEEE-SLTC e-newsletter and the History of Speech and Language Technology Project, student volunteers have worked hard to transcribe interviews with researchers who have made seminal contributions to the development of speech and language technology. The students transcribed interviews with Joseph Perkell, Lauri Karttunen, and William Labov, who described how they came to be involved in the field of speech technology. These transcriptions are available in three articles in the e-newsletter.

Upcoming Projects

During 2010, ISCA-SAC will continue to provide the services it has offered in previous years; this includes managing the online grant application system for the 2010 ISCA-sponsored conferences and workshops, most notably the Interspeech 2010 conference taking place in Makuhari, Japan.

During the conference, ISCA-SAC is planning to organize one or two scientific events, such as a panel discussion or a round-table lunch discussion. All students are welcome to attend and participate. In addition to Interspeech, ISCA-SAC is co-organizing the Young Researchers' Roundtable on Spoken Dialog Systems (YRRSDS-10).

ISCA-SAC is also planning to widen its reach by increasing its efforts to contact Master's and PhD students around the world and to advertise its activities and work.

A further goal ISCA-SAC is working towards is reaching young researchers in developing countries, for example by creating special interest groups. This project aims at boosting regional efforts and collaborations among students in these countries.

We welcome student participation at any level, and are always open to new ideas from anyone; please feel free to contact us on our web page or by email at volunteer[at]isca-students[dot]org.


Memories from ASRU 2009, Merano, Italy

Giuseppe Riccardi and Renato de Mori

SLTC Newsletter, February 2010

We hosted ASRU 2009 in Merano (Italy), a lovely Italian town at the heart of Europe, close to the borders with Austria and Switzerland. The town and the surrounding region are reminiscent of an era when the multilingual European elite would gather here for important social events.

We had 180 attendees from all over the world, mostly from North America, Europe and Japan. Central and South America as well as Asia were not well represented, as at past ASRUs, and we believe that as a scientific community we should make an effort to extend our reach to those regions. A total of 223 papers were submitted for the poster sessions and 96 high-quality papers were accepted. We also presented the best paper and best student awards, selected by the technical chairs.

The technical program was a high-paced schedule with interleaved paper presentations and thematic invited lectures. The invited lectures were outstanding and gave junior and senior researchers the opportunity to review and appreciate the fundamentals and challenges of topics ranging from human speech recognition to acoustic modelling and machine learning. The keynote speakers addressed important research issues in ASR, such as generalization, and new trends in speech and language processing, such as social computing. We would like to thank our invited speakers for the technical contributions that made ASRU 2009 so greatly appreciated by attendees.

Social events are an important part of the workshop and are important for building networks between community newcomers, colleagues and friends. The central event was the banquet at Katzenzungen castle, one of the hundreds of privately and publicly owned castles in this region. The castle is famous for hosting one of the oldest wineries in Europe. For once, we had the opportunity to dine in a magical medieval setting, and we closed the dinner with a special guest appearance by our talented colleagues ( Video ).

One of our objectives for ASRU 2009 was to make it a long-lasting event and to preserve the scientific contributions as well as the discussions and impressions of our community at this point in time. For this reason we launched an initiative to record such memories in terms of the knowledge that was shared (invited talks), discussions, pictures and interviews with members of the community. We are now in the process of populating this archive and turning it into a searchable multimedia event. We believe such digital content could be used, tagged and augmented for research by our (extended) community. Researchers will be able to access, contribute to and peruse the ASRU 2009 digital event starting from the website, as we make progress on the sponsoring research project at the University of Trento, Livememories. Stay tuned! So that this effort does not remain isolated, we hope future ASRU organizers will share the same vision, and we will be glad to help them.

Last but not least, we would like to thank the members of the workshop technical and organizing committees, as well as our public and private sponsors, for making ASRU 2009 a memorable event - or so they say. See you at ASRU 2011.

For more information, see:

If you have comments, corrections, or additions to this article, please contact: Giuseppe Riccardi, riccardi [at] disi [dot] unitn [dot] it.

Giuseppe Riccardi is Professor at the University of Trento (Italy). Renato De Mori is Professor Emeritus at McGill University (Canada) and the University of Avignon (France).


Island-Driven Search

Tara N. Sainath

SLTC Newsletter, February 2010

Many speech scientists believe that human speech processing first identifies regions of reliability in the speech signal and then fills in unreliable regions using a combination of contextual and stored phonological information [1]. However, most current decoding paradigms in speech recognition consist of a left-to-right (and optionally right-to-left) scoring and search component that does not utilize knowledge of reliable speech regions. This is particularly problematic when the search space is pruned: pruning algorithms generally do not make use of the reliable portions of the signal, and hence may prune away too many hypotheses in unreliable regions while keeping too many hypotheses in reliable regions. Island-driven search is an alternative method to better deal with noisy and unintelligible speech. The strategy works by first hypothesizing islands from regions of the signal which are reliable; recognition then works outwards from these anchor points to hypothesize the unreliable regions. In this article, we discuss past research in island-driven search and propose some future research directions.
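To make the island/gap distinction concrete, the following minimal Python sketch segments a word sequence into islands and gaps using per-word confidence scores. The 0.8 threshold and the word-level granularity are illustrative assumptions for this newsletter, not details of any particular published system.

    # Minimal sketch: split a hypothesis into "island" and "gap" regions
    # using per-word confidence scores. The 0.8 threshold and the
    # word-level granularity are illustrative assumptions only.

    def segment_islands(words, confidences, threshold=0.8):
        """Return a list of (label, run_of_words) pairs, where label is
        'island' (confidence >= threshold) or 'gap'."""
        regions = []
        for word, conf in zip(words, confidences):
            label = "island" if conf >= threshold else "gap"
            if regions and regions[-1][0] == label:
                regions[-1][1].append(word)
            else:
                regions.append((label, [word]))
        return regions

    if __name__ == "__main__":
        words = ["the", "cat", "sat", "on", "the", "mat"]
        confs = [0.95, 0.92, 0.40, 0.35, 0.90, 0.88]
        for label, run in segment_islands(words, confs):
            print(label, run)

Recognition could then treat the island runs as anchor points and expand hypotheses outwards into the neighbouring gaps.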

Early Work in Island-Driven Search

The BBN Hear What I Mean (HWIM) speech understanding system [2] was one of the first systems to utilize island-driven search for sentence parsing. In this system, parsing does not start at the beginning of the word network, but rather can start at confident regions within the network, at places known as islands. Parsing works outwards left and right from these island regions to parse a set of gap regions. While this type of approach showed promise for small grammars, it has not been explored in large vocabulary speech recognition systems due to the computational complexities of the island parser.

In addition, [3] explores using island information for sentence parsing. This paper looks at computing language model probabilities when the sentence is generated by a stochastic context-free grammar (SCFG). Specifically, the language model is computed by first having the parser focus on islands, defined as words with high semantic relevance. The parser then proceeds outwards towards gaps to compute the total language model score. While this paper presented solid theory behind an island-driven parser, subsequent use of this theory in modern-day ASR systems has not been heavily explored.

Furthermore, [4] explores island-driven search for isolated-word handwriting recognition. The authors identify reliable islands in isolated words to obtain a small filtered vocabulary, after which a more detailed second-pass recognition is performed. This technique is similar to that explored by Tang et al. in [5] for improving lexical access using broad classes. However, these solutions cannot be directly applied to continuous speech recognition.

Current Work in Island-Driven Search

More recent work in island-driven search has focused on extending theories from the above papers into modern day continuous ASR. For example, Kumaran et al. have explored an island-driven search strategy for continuous ASR [6]. In [6], the authors perform a first-pass recognition to generate an N-best list of hypotheses. A word-confidence score is assigned to each word from the 1-best hypothesis, and islands are identified as words in the 1-best hypothesis which have high confidence. Next, the words in the island regions are held constant, while words in the gap regions are re-sorted using the N-best list of hypotheses. This technique was shown to offer a 0.4% absolute improvement in word error rate on a large vocabulary conversational telephone speech task. However, if a motivation behind island-driven search is to identify reliable regions to influence effective pruning, identifying these regions from an N-best list generated from a pruned search space may not be an appropriate choice.
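As a rough illustration of this kind of island-constrained rescoring - loosely inspired by, but not a reimplementation of, the method in [6] - the Python sketch below keeps the high-confidence (island) words of the 1-best hypothesis fixed and considers only N-best entries that agree with them, so that in effect only the gap regions are re-sorted. The assumption that the N-best hypotheses are word-aligned with the 1-best, the confidence threshold, and the single combined score are all simplifications.

    # Sketch of island-constrained N-best re-sorting, loosely inspired by
    # the idea in [6] but NOT a reimplementation of it. N-best hypotheses
    # are assumed to be word-aligned with the 1-best; the threshold and
    # the single combined score are illustrative only.

    def rescore_with_islands(nbest, onebest_words, onebest_confs, threshold=0.8):
        """nbest: list of (words, score) pairs, word-aligned with the
        1-best. Returns the highest-scoring hypothesis that agrees with
        the 1-best on every island (high-confidence) position."""
        islands = [i for i, c in enumerate(onebest_confs) if c >= threshold]

        def respects_islands(words):
            return all(i < len(words) and words[i] == onebest_words[i]
                       for i in islands)

        candidates = [(words, score) for words, score in nbest
                      if respects_islands(words)]
        if not candidates:            # no hypothesis matches all islands
            return onebest_words      # fall back to the original 1-best
        # Among the survivors only the gap positions differ, so sorting by
        # score effectively re-ranks the gap regions.
        return max(candidates, key=lambda pair: pair[1])[0]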

In [7], broad phonetic class (BPC) knowledge was utilized for island-driven search. First, islands were detected from knowledge of hypothesized broad phonetic classes (BPCs). Using this island/gap knowledge, a method was explored to prune the search space so as to limit computational effort in unreliable areas. In addition, the authors investigate scoring less detailed BPC models in gap regions and more detailed phonetic models in islands. Experiments on small-scale vocabulary tasks indicate that the proposed island-driven search strategy improves both recognition accuracy and computation time. However, the performance of the island-driven search method on large-scale vocabulary tasks is worse than the baseline methods. One possible reason for this degradation is that the proposed island-driven search remains left-to-right, which motivates the idea of a bi-directional search.
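One simple way that island/gap labels could influence pruning - an illustrative sketch of the general idea, not the specific algorithm in [7] - is to vary the beam width per frame: a tight beam where the signal is judged reliable and a wider beam in gaps, so that plausible hypotheses are not discarded prematurely in unreliable regions. The beam values below are arbitrary.

    # Frame-synchronous beam pruning with island/gap-dependent beam widths.
    # Illustrative only; the beam values are arbitrary and this is not the
    # pruning scheme of [7].

    def prune_frame(hypotheses, frame_label, island_beam=8.0, gap_beam=14.0):
        """hypotheses: list of (state, log_score) pairs active at this
        frame. Keep those within the beam of the best score; the beam is
        tighter in island frames than in gap frames."""
        beam = island_beam if frame_label == "island" else gap_beam
        best = max(score for _, score in hypotheses)
        return [(state, score) for state, score in hypotheses
                if score >= best - beam]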

Open Research Directions

While some of the current work in island-driven search has shown promise, there are plenty of interesting research directions and unanswered questions. As the work in [6] indicated, one major question is how to define an island. For example, as [1] discusses, when humans process speech, they first identify distinct acoustic landmarks to segment the speech signal into articulator-free broad classes. Acoustic cues are then extracted from each segment to produce a set of features, making use of both articulator-bound and articulator-free features. Finally, knowledge of syllable structure is incorporated to impose constraints on the context and articulation of the underlying phonemes. Perhaps one idea is to define islands through a combination of articulator-free and articulator-bound cues, in conjunction with syllable knowledge.

A second topic of interest is how to perform the island-driven search itself. The nature of speech recognition places some constraints on the type of island-driven search strategy that is preferable. While island searches have been explored both unidirectionally and bi-directionally, the computational and on-line benefits of unidirectional search in speech recognition make this approach more attractive. Furthermore, if reliable regions are identified as sub-word units rather than words, a bidirectional search requires a very complex vocabulary and language model. However, unidirectional island-driven search might not always address the problem of good hypotheses being pruned away. Therefore, an interesting future research direction might be to explore bi-directional search strategies which first start in the reliable island regions and work outwards to the gaps.

For more information, please see:

[1] K. N. Stevens, "Toward a Model for Lexical Access Based on Acoustic Landmarks and Distinctive Features," Journal of the Acoustical Society of America, 111(4):1872–1891, 2002.
[2] W. A. Lea, Trends in Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1980.
[3] A. Corazza, R. De Mori, R. Gretter, and G. Satta, "Computation of Probabilities for an Island-Driven Parser," IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):936–950, 1991.
[4] J. F. Pitrelli, J. Subrahmonia, and B. Maison, "Toward Island-of-Reliability-Driven Very-Large-Vocabulary On-Line Handwriting Recognition Using Character Confidence Scoring," in Proc. ICASSP, pages 1525–1528, 1991.
[5] M. Tang, S. Seneff, and V. W. Zue, "Modeling Linguistic Features in Speech Recognition," in Proc. Eurospeech, pages 2585–2588, 2003.
[6] R. Kumaran, J. Bilmes, and K. Kirchhoff, "Attention Shift Decoding for Conversational Speech Recognition," in Proc. Interspeech, 2007.
[7] T. N. Sainath, "Island-Driven Search Using Broad Phonetic Classes," in Proc. ASRU, Merano, Italy, December 2009.

If you have comments, corrections, or additions to this article, please leave a comment below or contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are in acoustic modeling. Email: tnsainat@us.ibm.com


eNTERFACE Workshops on Multimodal Interfaces: A success story

Albert Salah and Thierry Dutoit

SLTC Newsletter, February 2010

The eNTERFACE workshops were born in Mons, Belgium, in 2005 as the largest and most original dissemination action of the SIMILAR European Network of Excellence on multimodal interfaces. Their aim is to gather a team of leading professionals in multimodal man-machine interfaces together with students in one place, to work on a pre-specified list of challenges for four complete weeks. In this respect, it is an innovative and intensive collaboration scheme, designed to allow researchers to integrate their software tools, deploy demonstrators, collect novel databases, and work side by side with a large number of experts. After the completion of the SIMILAR project, the OpenInterface Foundation was established to continue this fruitful series of workshops.

The workshop is held on a yearly basis, in July-August, and is organized around ten research projects dealing with multimodal human-machine interface design. As such, it is radically different from traditional scientific workshops, in which specialists meet for a few days to discuss state-of-the-art problems but do not really work together.

The senior participants of the workshop are mainly university professors and industrial and governmental researchers working in widely dispersed locations. Junior researchers are mostly PhD students. The organization of these workshops was inspired by the successful DARPA workshops on speech recognition organized yearly by Prof. Jelinek at Johns Hopkins University. A call for projects is first issued, aimed at senior researchers, who respond by proposing projects. This is an international call, and it is widely circulated in all major related scientific networks. The eNTERFACE scientific committee evaluates the proposals and selects about ten projects. In the second phase, a call for participation is circulated, and the project leaders build their teams months before the workshop. Would-be participants send in a CV and a statement of why they are interested in joining a project. Graduate students familiar with the field are selected according to their demonstrated performance. On the basis of outstanding academic promise, a small number of undergraduates and even high-school students have been selected in the past to participate in the workshop.

The funding scheme of the workshop is unusual. There are no registration fees, but participants need to cover their own travel, lodging, and catering expenses, using their own project finances or other EU, national, or regional funding. The university organizing the workshop arranges catering and lodging at minimal student rates whenever possible. Grants and sponsorships are sought from scientific societies (including the OpenInterface Foundation and EURASIP), governmental bodies (including national scientific organizations such as SenterNovem), municipalities, international networks (such as SSPNet) and industry.

The workshop aims for scientific collaboration by bringing together many researchers in human-computer interaction to work on specific projects. During the workshop it becomes possible to tap into the expertise of these researchers. Participants who have developed novel tools (e.g. middleware, toolboxes, platforms, hardware) organize tutorial sessions to disseminate them, and researchers who attend these sessions learn new tools in the fastest possible way. The projects themselves are ideal platforms for bringing together modules developed by different groups. Some projects foster collaboration between the computer scientists who develop the tools and other parties who ask the relevant questions (e.g. artists, psychologists, industrial partners).

The projects of the workshop deal with many different aspects of multimodal interfaces. A few past projects should illustrate the span of the workshop:

The workshop is also an ideal platform for networking, which is important for young PhD students. A lab visit to another university exposes the student to a new group, but eNTERFACE exposes the student to a host of groups (in addition to several important plenary speakers), and there are plenty of opportunities for senior researchers to assess a student in one month. Many PhD students find postdoc connections during the workshop. The senior researchers leading the groups also find connections for new projects, as well as industrial partners.

The workshop uses many channels of dissemination. The mid-term and final project presentations are open to the public. Each group prepares a detailed project report; these are collected in an edited volume and sent to the participants. The reports, collected databases, and jointly developed (open-source) software are made available on the workshop website. All such material from previous workshops is open to the public and can be accessed. Groups often continue their collaboration after the workshop, and project reports are extended into peer-reviewed joint publications. Springer Verlag's Journal on Multimodal User Interfaces has in the past run a special issue dedicated to extended papers from the eNTERFACE workshop. Other publications have appeared in many leading journals. Each workshop so far has also enjoyed press coverage in national media.

Clearly, the eNTERFACE workshops provide a unique opportunity for younger researchers to develop their skills and enlarge their professional networks by being fully involved in a collaborative project. They also make it possible for senior researchers to start short-term multi-laboratory projects, which can then give birth to larger project proposals. Thanks to active social programs, all participants also benefit from literally living together for four weeks with a clear and ambitious scientific goal.

The first eNTERFACE was organized by the Faculté Polytechnique de Mons in 2005. Since then, the workshop has been hosted by a different group each year: the University of Zagreb in 2006, Bogaziçi University in 2007, CNRS-LIMSI in 2008, and the University of Genova in 2009. It has created an international community around itself and has regular attendees who join the workshop every year with new projects. The eNTERFACE'10 workshop is being organized by the Intelligent Systems Lab Amsterdam (ISLA-ISIS) of the University of Amsterdam. The call for participation is open until 15 February 2010.

Albert Salah is with the University of Amsterdam, Netherlands. Email: a.a.salah@uva.nl. Thierry Dutoit is with the University of Mons. Email: Thierry.Dutoit@umons.ac.be.


Second Life as a research platform

Svetlana Stoyanchev

SLTC Newsletter, January 2010

Second Life (SL) is a virtual social environment with over 11 million registered users. SL is now also becoming a platform for research in dialogue, gestures, social behaviour, and information presentation. It provides a resource for creating and using virtual spaces and for designing and manipulating virtual characters, and it gives access to a large pool of potential users for evaluating new technology. This article describes some of the projects that use the Second Life environment for research.

Embodied Agents

Embodied conversational agents (ECAs) are virtual characters in multimodal environments that typically have abilities such as speaking, moving, gesturing, changing eye gaze, and making facial expressions. Virtual characters often have application-specific reasoning abilities and can answer a user's questions or guide a user through a task. For example, virtual characters in research projects have played roles such as real estate agents [1] or museum guides [2]. Researchers model dialogue, social, and visual communication behaviour using virtual characters and study how aspects of a virtual agent's behaviour, such as gestures, facial expressions, eye gaze, physical appearance, and positioning, affect human-agent interaction.

Virtual agent behaviour, like spoken dialogue, is difficult to evaluate automatically because there is no gold standard for gestures in a particular situation or a predefined answer to an unexpected question. Improving behavioural and dialogue models therefore relies mainly on human evaluation, so systems implementing virtual agents can benefit from access to a large pool of users.

Description of Second Life

Second Life is a virtual world accessible via the internet using a free client program. Membership is free, and users create and customize their own characters. SL has grown into a virtual collaborative community with over 11 million registered users. Users can buy land and build elaborate virtual spaces for art, music events, business, and educational resources. Universities have a significant presence in Second Life as both users and researchers. They create virtual spaces to host meetings in Second Life, provide educational resources for students and visitors, and hold virtual courses, as well as study social interaction and user behaviour in the SL environment. For example, here is a video of a space in SL created at the Open University. The space is now open for the use of researchers and educators.

Researchers at AT&T Labs have introduced services such as telephone messaging and web publishing that allow communication between SL users and the real world. This technology was demonstrated and tested by the attendees of the WWW 2008 conference using an SL site constructed for this purpose [3].

Use of 3D Environments in Research

The Second Life virtual environment is becoming a platform for conducting research. In addition to SL 'residents' who are human-controlled avatars, automatically controlled avatars (bots) can enter the environment and communicate with the residents.

Second Life provides a rich infrastructure with graphics, interaction, and a social network, making ECA research much more accessible. The availability of virtual spaces and potential users makes Second Life a promising environment for research on virtual agents, dialogue, models of communication, and information presentation.

For example, Breitfuss and Prendinger [4] evaluate an automatic algorithm for gesture and eye-gaze generation using Second Life characters.

The AI lab at Bielefeld University has transferred a virtual agent to Second Life [5] for research on modelling social skills.

Jan et al. [6] describe a virtual guide character that shows visitors around the Army island in Second Life. The agent detects a user's presence, offers a tour, and leads the user to different destinations. It stops when the user falls behind and can hold a dialogue about the island. To try this tour in Second Life, search for the location named "Army Information Center" and teleport there. Once teleported, look for 'StaffDutyOfficer Moleno', who will give you a tour of the island.

Some of the Tools for Creating Second Life Bots

The OpenMetaverse Foundation provides an open-source solution for creating bots in Second Life.

Researchers at the NII lab have also developed MPML3D, a 3D markup language for virtual characters. Scripts in this language allow the manipulation of the verbal and non-verbal behaviour of agents in Second Life, and the language's interpreter runs as a server and is publicly available.

Advantages of Second Life

Second Life provides an environment where 3D characters can be created, used, and tested. It is already populated with 11 million users, providing a rich subject pool for running research studies on virtual characters and for testing dialogue and multimodal applications. The SL environment is multi-modal, multi-user, and situated. An additional bonus is that text-based conversations are common in SL, allowing for natural dialogues through a text interface that bypasses Automatic Speech Recognition. Users may be attracted to a research project through advertising, and research participants may be paid with Linden Dollars (250 L$ = 1 US$ according to the official SL exchange rate). However, if an experiment involves a game, participants may take part without monetary compensation. Examples of such game-based experiments (not conducted in Second Life) include a treasure hunt game used to evaluate the natural language generation of instructions for players [7], the ESP game, in which players aim at reaching consensus on image descriptions, and the Restaurant Game, which captures behaviour and dialogue from human-human interactions.

Researchers venturing into SL need to address the ethical issues of research on human subjects and to ensure compliance with their Institutional Review Board (in the US). Bots that invade privacy or annoy other users raise concerns in the user community.

Second Life is a social and gaming environment that at the same time provides a resource for researchers to design and run experiments with virtual characters. It makes research on virtual agents much more accessible and gives access to a pool of potential users for evaluating experimental models. Second Life is only one of many virtual worlds, and the future of virtual environments as experimental platforms is promising.

Thanks to Antonio Roque and Giuseppe Di Fabbrizio for providing extensive input to this article.

References:

  1. J. Cassell, M. Stone, and H. Yan, Coordination and Context-Dependence in the Generation of Embodied Conversation, INLG, 2000.
  2. S. Kopp et al., A Conversational Agent as Museum Guide - Design and Evaluation of a Real-World Application, IVA, LNCS vol. 3661, Springer, 2005.
  3. R. Jana et al., Bridging Communication Services - Connecting Virtual Worlds to Real World: A Service Provider's Perspective, WWW 2008.
  4. W. Breitfuss et al., Automatic Generation of Non-Verbal Behavior for Agents in Virtual Worlds: A System for Supporting Multimodal Conversations of Bots and Avatars, Proc. HCI International, 2009.
  5. E. Weitnauer et al., Intelligent Agents Living in Social Virtual Environments - Bringing Max into Second Life, Intelligent Virtual Agents (IVA), 2008.
  6. D. Jan et al., A Virtual Tour Guide for Virtual Worlds, Intelligent Virtual Agents (IVA), 2009.
  7. D. Byron et al., Report on the First NLG Challenge on Generating Instructions in Virtual Environments (GIVE).

If you have comments, corrections, or additions to this article, please contact the author: Svetlana Stoyanchev, s.stoyanchev [at] open [dot] ac [dot] uk.

Svetlana Stoyanchev is a Research Associate at the Open University. Her interests are in dialogue and information presentation.