Speech and Language Processing Technical Committee Newsletter

February 2011

Welcome to the Winter 2011 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter.

In this issue we are pleased to provide another installment of brief articles representing a diversity of views and backgrounds. This issue includes articles from 13 guest contributors, and our own 8 staff reporters and editors.

We believe the newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions. You can submit job postings here, and reach us at speechnewseds [at] listserv (dot) ieee [dot] org.

Finally, to subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.

Jason Williams, Editor-in-chief
Pino Di Fabbrizio, Editor
Martin Russell, Editor
Chuck Wooters, Editor


From the SLTC and IEEE

From the IEEE SLTC chair

John Hansen

An update on ICASSP 2011, what to do when authors can't present papers, and a look ahead.

IEEE Signal Processing Society Newsletter

The IEEE Signal Processing Society, our parent organization, also produces a monthly newsletter, "Inside Signal Processing".


2010 ISCA Student Activities Update

Samer Al Moubayed, Catherine Lai, Matt Speed, Marcel Waeltermann

The ISCA Student Advisory Committee (ISCA-SAC) was established in 2005 by the International Speech Communication Association (ISCA) to organize and coordinate student-driven projects. After our work in 2008 and 2009 we are looking forward to a busy 2011.

First "Show & Tell" event at Interspeech 2011

Mazin Gilbert

We are delighted to host the first "Show & Tell" event at Interspeech 2011. Show & Tell will provide researchers, technologists and practitioners from academia, industry and government the opportunity to demonstrate their latest research systems and interact with the attendees in an informal setting.

The CLASSiC Project: Computational Learning in Adaptive Systems for Spoken Conversation

Oliver Lemon

This article describes some of the main achievements to date in the EC FP7 project "CLASSiC", which ends in early 2011. The project focuses on statistical methods for dialogue processing, and is one of the largest current European research projects in speech and language technology. It is coordinated by Heriot-Watt University, and is a collaboration between Cambridge University, the University of Geneva, University of Edinburgh, Supelec, and France Telecom / Orange Labs.

Interview: AAAI Symposium on Dialog with Robots Brings Together Researchers in Speech Technology, Artificial Intelligence, and Human-Robot Interaction

Matthew Marge

In November, an AAAI symposium set out to address a growing problem in robotics - how should we build dialog systems for robots? This meeting brought together researchers from communities that traditionally haven't had a clear line of communication - human-robot interaction and spoken dialog. We had a chance to interview one of the organizers about the workshop and existing challenges facing human-robot dialog researchers.

Automatic speech recognition: can we speak far from the microphones?

Maurizio Omologo

This article covers recent work in distant speech processing, including the EU-funded DICIT project.

Some thoughts on Language, Dialect and Accents in Speech and Language Technology

Martin Russell, Abualsoud Hanani and Michael Carey

This short article presents some brief thoughts on regional accent and dialect in speech and language technology research, and their relationship with language recognition.

An Overview of Watson and the Jeopardy! Challenge

Tara N. Sainath

Over the past few years, IBM Research has been actively involved in a project to build a computer system, known as Watson, to compete at the human championship level on the quiz show Jeopardy!. After four years of intense research, Watson can perform on the Jeopardy! show at the level of human expertise in terms of precision, confidence and speed. The official first-ever man vs. machine Jeopardy! competition will air on television February 14, 15, and 16.

Automatic Identification of Discourse Relations in Text

Svetlana Stoyanchev

Structure makes text coherent and meaningful. Discourse relations (also known as rhetorical and coherence relations) link clauses in text and compose overall text structure. Discourse relations are used in natural language processing, including text summarization and natural language generation. In this article we discuss human agreement on the discourse annotation task and review approaches to automatic discourse relation identification.

Odyssey presentations indexed in Brno University of Technology's superlectures.com

Josef Zizka, Igor Szoke and Honza Cernocky

Superlectures.com is an innovative lecture video portal that enables users to search for spoken content. This brings a significant speed-up in accessing lecture video recordings. The aim of this portal is to make video content as easily searchable as any textual document. The speech processing system automatically recognizes and indexes Czech and English spoken words.


From the SLTC Chair

John H.L. Hansen

SLTC Newsletter, February 2011

This is the first SLTC Newsletter of 2011, and also represents my first contribution as SLTC Chair. I want to extend sincere thanks and appreciation to Steve Young, the outgoing SLTC Chair, for his outstanding leadership, dedication, and commitment to the Speech and Language Processing TC. His hard work has resulted in great progress for speech and language within the IEEE Signal Processing Society. I also want to extend a warm welcome to our new vice-Chair, Doug O'Shaughnessy. I look forward to working with Doug and the rest of the SLTC Members over the next two years.

Well, it seems like IEEE ICASSP-2010 was just here, and we're now well into the ICASSP-2011 process. One of the issues the IEEE Signal Processing Society has moved forward on this round is to better articulate the process for presenting papers at ICASSP and ICIP. The task of determining which papers were not presented has been left to the local organizers. With so many travel restrictions these days, including visas, as well as questions of what constitutes a personal reason, it is often hard to resolve the "no-show" papers. We all expect papers which have been accepted to be presented, and the high quality of papers accepted to ICASSP (and ICIP) should result in all of these papers being presented. I am happy to say that IEEE SPS, based on feedback from the ICASSP-2010 Technical Committee, has drafted formal guidelines which make it easier to resolve these issues (if you do not know: when a paper is accepted, at least one author must be registered at the full rate and present the paper; if he/she cannot present the paper for any reason, it is the author's responsibility to find a suitable person, technically knowledgeable on the subject matter, to present the oral/poster). Posting these guidelines when authors register and making this clear will help reduce conflicts and ensure all accepted papers are presented.

Now, with respect to ICASSP 2011, we are only several months away and all acceptance/rejection notes were sent out as of last week. This year, we had 691 papers submitted to the Speech-Language areas: 582 in speech processing (Area 13) and 109 in language processing (Area 14). In total, 2,616 reviews were completed, and 96.8% of the papers had 4 or more reviews completed (plus a Meta-Review from an SLTC Member). Many thanks to all of the TC members for their work in the reviewing/meta-review process - this has been a major accomplishment. In particular, the enormous efforts of our Area Chairs, Pascale Fung, TJ Hazen, Brian Kingsbury and David Suendermann cannot be overstated. Over the past three months, they have been hard at work resolving countless issues. What is left to do now is to determine the best student paper nominees and session chairs. Many thanks to all who participated in this process!

The process of running the SLTC depends on the work of many volunteers, and I ask for your continued time so that we can build on the advances of the last three years. The number of TC members has been expanded, which should help in addressing the range of duties/tasks we have to accomplish. Also, the vice-Chair will help ensure that we have continuity as we move forward. I will be sending out the Sub-Committee list in the next few weeks. We need to continue to work towards having speech and language research recognized in the SPS, including paper awards, fellow nominations, and service/technical awards. The 2010 SLT Workshop in Berkeley, CA (http://www.slt2010.org/) was very successful. Finally, while speech and language continues to expand in the IEEE Signal Processing Society, one of our interests is to reach out to other parts of the world to include new members. I encourage you to renew or establish collaborations with places that have not seen much representation at ICASSP. With more than 6,000 languages spoken in the world, we should be able to reduce communication barriers for speech processing and language technology advancements in all languages.

I look forward to seeing all the SLTC members in Prague at this ICASSP!

John H.L. Hansen
January 2011

John H.L. Hansen is Chair, Speech and Language Technical Committee.


2010 ISCA Student Activities Update

Samer Al Moubayed, Catherine Lai, Matt Speed, Marcel Waeltermann

SLTC Newsletter, February 2011

The ISCA Student Advisory Committee (ISCA-SAC) was established in 2005 by the International Speech Communication Association (ISCA) to organize and coordinate student-driven projects. After our work in 2008 and 2009 we are looking forward to a busy 2011.

Interspeech 2010

ISCA-SAC was proud to host student-oriented events during Interspeech 2010, held in Makuhari, Japan. The jewel in the crown was perhaps the student panel session, building on the success of the event at the previous year's conference. The theme of the panel session was "2010-2020 - Speech Technology in the Next Decade". The event gave research students an invaluable opportunity to discuss the likely challenges and themes for speech research over the next ten years with prominent figures in the field. The contributing speakers and panelists were Alan Black (Carnegie Mellon University), Nick Campbell (Trinity College Dublin), Ciprian Chelba (Google) and Bowen Zhou (IBM Watson Research Center).

A very successful social reception was also held for students during the conference, allowing members to network in a more relaxed environment.

Website Goes Live

After an overhaul of the main ISCA website, the new ISCA students website has gone live after many months of development. The website features an interactive blog, a new forum, links to various ISCA resources and the ISCA grant system. Take a look now at http://www.isca-students.org/.

Contributions to YRDDS

The 2010 Young Researchers' Roundtable on Spoken Dialog Systems (YRRSDS) was held at Waseda University, Tokyo in September and met with great success. The event is designed for students, post docs, and junior researchers working in research related to spoken dialogue systems. ISCA-SAC has a history of contributing to the event, with past and present members of ISCA-SAC on the organising committee. ISCA-SAC looks forward to the 2011 event in Portland, Oregon.

Interspeech 2011

In 2011 Interspeech will be held in Florence, Italy. ISCA-SAC is very excited to be planning a number of student-oriented events for the conference, and looks forward to seeing you there.

Call for Volunteers

The ultimate goal of ISCA-SAC is to drive student-oriented events and support structures in speech and language research (read our mission at http://www.isca-students.org/?q=mission). To fulfil this we rely on our student volunteers.

Volunteering is a fantastic opportunity to really get involved and noticed in this research community. There are many different ways students can help out, and the nature of the tasks involved makes the work very flexible.

If you're interested in getting involved, contact the committee at volunteer [at] isca-students [dot] org


First "Show & Tell" event at Interspeech 2011

Mazin Gilbert

SLTC Newsletter, February 2011

Overview

We are delighted to host the first Show & Tell event at Interspeech 2011. Show & Tell will provide researchers, technologists and practitioners from academia, industry and government the opportunity to demonstrate their latest research systems and interact with the attendees in an informal setting. Demonstrations need to be based on innovations and fundamental research in areas of human speech production, perception, communication, and speech and language technology and systems.

Demonstrations will be peer-reviewed by members of the Interspeech Program Committee, who will judge the originality, significance, quality, and clarity of each submission. At least one author of each accepted submission must register for and attend the conference, and demonstrate the system during the Show & Tell sessions. Each accepted demonstration paper will be allocated 2 pages in the conference proceedings.

At the conference, all accepted demonstrations will be evaluated and considered for the Best Show & Tell Award.

Submission guidelines

People and dates

Chair: Mazin Gilbert (AT&T Labs Research, USA)

Important dates:

More information:


The CLASSiC Project: Computational Learning in Adaptive Systems for Spoken Conversation

Oliver Lemon

SLTC Newsletter, February 2011

This article describes some of the main achievements to date in the EC FP7 project "CLASSiC", which ends in early 2011. The project focuses on statistical methods for dialogue processing, and is one of the largest current European research projects in speech and language technology. It is coordinated by Heriot-Watt University, and is a collaboration between Cambridge University, the University of Geneva, University of Edinburgh, L'Ecole Supérieure d'électricité (SUPELEC), and France Telecom / Orange Labs.

Project Summary

The overall goal of the CLASSiC project has been to develop statistical machine learning methods for the deployment of accurate and robust spoken dialogue systems (SDS). These systems can learn from experience - either from dialogue data that has already been collected, or online through interactions with users. We have deployed systems (for data collection and evaluation) for tourist information, customer support, and appointment scheduling. One system, for appointment scheduling, has been available for public use in France since March 2010.

The CLASSiC architecture

CLASSiC has proposed and developed a unified treatment of uncertainty across the entire SDS architecture (speech recognition, spoken language understanding, dialogue management, natural language generation, and speech synthesis). This architecture allows multiple possible analyses (e.g. n-best lists of ASR hypotheses, distributions over user goals) to be represented, maintained, and reasoned with robustly and efficiently. It supports a layered hierarchy of supervised learning and reinforcement learning methods, in order to facilitate mathematically principled optimisation and adaptation techniques. However, the CLASSiC architecture still maintains the modularity of traditional SDS, allowing the separate development of statistical models of speech recognition, spoken language understanding, dialogue management, natural language generation and speech synthesis. For more details, see citations below, or Deliverable 5.1.2.


A system "belief state" showing a probability distribution over possible user goals (size of the bar on the left indicates relative probability of the corresponding meanings on the right).
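
To make the notion of a belief state concrete, the toy Python sketch below updates a distribution over user goals from an n-best list of recognition hypotheses. It illustrates only the general idea; the goal names, confidence values and update rule are simplified assumptions, not the CLASSiC implementation.

    # A minimal sketch of a belief update over user goals, driven by an
    # n-best list of ASR/SLU hypotheses with confidence scores.
    # Illustrative only; goal names and confidences are hypothetical.

    def update_belief(belief, nbest, p_correct=0.8):
        """belief: dict goal -> probability; nbest: list of (goal, confidence)."""
        new_belief = {}
        for goal, prior in belief.items():
            # Weight each hypothesis by its confidence; hypotheses naming
            # this goal support it, competing hypotheses support it weakly.
            likelihood = sum(conf * (p_correct if hyp == goal else (1 - p_correct))
                             for hyp, conf in nbest)
            new_belief[goal] = prior * likelihood
        total = sum(new_belief.values()) or 1.0
        return {g: p / total for g, p in new_belief.items()}

    # Example: the user probably asked for an Italian restaurant, but the
    # recognizer also returned a competing "indian" hypothesis.
    belief = {"italian": 1/3, "indian": 1/3, "chinese": 1/3}
    nbest = [("italian", 0.6), ("indian", 0.3)]
    print(update_belief(belief, nbest))  # probability mass shifts towards "italian"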

Research Areas

Progress is being made in several areas:

Evaluation Results

The CLASSiC systems and components have been evaluated both in simulation and in trials with real users, both in laboratory conditions and "in the wild" (i.e. with real users outside of the lab). The final evaluation of the systems is ongoing at the time of writing. (Please see our publications page for the referenced papers.)

We have obtained the following evaluation results to date:

Performance in shared challenges

CLASSiC partners have deployed the technology in the Spoken Dialogue Challenge and the CoNLL shared tasks on syntactic-semantic dependency parsing. The CLASSiC technologies were amongst the top performers on these tasks.

New and open dialogue data-sets

The CLASSiC project will release freely available dialogue data to the research community at the end of the project (project Deliverable D6.5). This can be expected towards the middle of 2011. This data will consist of anonymised system audio, logs, transcriptions, and some annotated data from several of the CLASSiC dialogue systems. The released data will amount to several thousand dialogues.

Acknowledgements and more information

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216594 (CLASSiC project).

Thanks to the CLASSiC partners for providing input to this article.

For more information, see:

Oliver Lemon is a Professor in the School of Mathematics and Computer Science at Heriot-Watt University, Edinburgh, where he leads the Interaction Lab. He is the Coordinator of the CLASSiC Project.


Interview: AAAI Symposium on Dialog with Robots Brings Together Researchers in Speech Technology, Artificial Intelligence, and Human-Robot Interaction

Matthew Marge

SLTC Newsletter, February 2011

Overview

The idea that we can have fluid conversations with robots has largely been limited to such fictional robots as C3PO from Star Wars and Data from Star Trek. Researchers in human-robot dialogue believe that we can leverage an existing established technology (spoken dialog systems), add it to robots, and redesign robots to work well with people for a variety of tasks (human-robot interaction). While several groups have developed dialog systems for robots, the spoken dialog and human-robot interaction communities have largely worked independently, publishing their work at such conferences as SIGdial and HRI.

In November, an AAAI symposium set out to address a growing problem in robotics - how should we build dialog systems for robots? This meeting brought together researchers from human-robot interaction and spoken dialog. The AAAI Fall Symposium on Dialog with Robots provided a forum for researchers in speech technology, robotics, human-computer interaction and related fields to identify key challenges in the field, present new work in the area, and have open discussions on several emerging topics. Leading the efforts behind the workshop were Dan Bohus (Microsoft Research), Eric Horvitz (Microsoft Research), Takayuki Kanda (Advanced Telecommunications Research Institute International), Bilge Mutlu (University of Wisconsin-Madison), and Antoine Raux (Honda Research Institute).

We had a chance to interview one of the organizers about the workshop and existing challenges facing human-robot dialog researchers.

SLTC: Would you consider the workshop a success? What do you think attendees learned most from the workshop?

Dan Bohus:

My sense is that overall the symposium was a great success - I thought we had a great response from the community, with a larger than expected number of technical submissions and participants. A large number of ideas and viewpoints were discussed and reflected upon both during the technical presentations, the keynotes, and the very lively open discussion sessions. People I spoke with were all excited about the symposium and about getting the dialog and HRI communities to interact more closely, and I am hopeful we can carry that momentum forward: a few follow-up efforts are already in place, for instance there is a special theme on situated dialog at this year's SIGdial...personally I enjoyed a lot seeing the diversity of viewpoints and angles from which people come towards this space.

SLTC: What were your impressions from the discussion sessions? Can you identify any focus areas from them? Particularly those that speech and language researchers should consider.

DB:

Like you point out, there were indeed a variety of topics raised and discussed throughout the symposium. For me, some of the major areas that emerged from the paper submissions and from the discussions we had surrounded issues of modeling communicative mechanisms in embodied settings (e.g., attention, engagement, turn-taking, grounding), the role of physicality in communication, the interplay between communication and actions (and the various challenges that interplay raises at different time scales), aspects of learning from and through interaction, spatial language understanding, etc. I think each of these areas poses interesting challenges and I think as a community we are only beginning to chart the landscape at this intersection of spoken dialog and HRI. There were also a number of interesting papers at the symposium describing various research systems, platforms, toolkits, etc., reflecting vibrant efforts in this space, and also highlighting the need for developing common challenges, toolkits, metrics, etc. as this nascent community moves forward.

Lessons learned

The organizers also composed a final report for the meeting [1]. Here's an excerpt summarizing the key contributions from the workshop:

Ideas spanning a spectrum of interrelated research topics were presented and discussed during oral presentations and a poster session. Recurrent themes centered around challenges and directions with the use of dialog by physically embodied agents, taking into consideration aspects of the task, surrounding environment, and broader context. Several presentations highlighted problems with modeling communicative competencies that are fundamental in creating, maintaining and organizing interactions in physical space, such as engagement, turn-taking, joint attention, and verbal and non-verbal communicative skills. Other presenters explored the challenges of leveraging physical context in various language understanding problems such as reference resolution, or the challenges of coupling action and communication in the interaction planning and dialog management process. A number of papers reported on developmental approaches for acquiring knowledge through interaction, and focused on challenges such as learning new words, concepts, meanings and intents, and grounding this learning in the interaction with the physical world. The topics covered also included interaction design challenges, descriptions of existing or planned systems, research platforms and toolkits, theoretical models, and experimental results.

Several discussion sessions allowed researchers from the diverse fields that gathered together at the meeting to talk about what's next:

The symposium included three moderated, open discussions that provided a forum for exchanging ideas on some of the key topics in the symposium. The first discussion aimed to address some of the challenges at the crossroads of dialog and HRI. The physicality of such interactions was highlighted as a critical factor and the prospect of identifying a core, yet simple set of principles and first-order concepts to be reasoned about, or a “naïve physics” of situated dialog and discourse, was raised and discussed. A second open discussion centered around the interplay between action and communication, and highlighted ideas such as viewing communication as joint action and the importance of creating models for incremental processing that can support recognition and generation of actions and phenomena occurring on different time scales. The final discussion addressed several other fundamental issues such as how we might move forward in this nascent field. Discussion touched on the need for unified platforms and challenges for supporting comparative evaluations of different techniques, the pros and cons of simulation-based approaches, and even the value of revisiting fundamental questions: Why should we endow robots with the ability to engage in dialogue with people? What assumptions are we making - and which can we make?

We look forward to hearing more about how this rather new field grows!

References

If you have comments, corrections, or additions to this article, please contact the author: Matthew Marge, mrma...@cs.cmu.edu.


Automatic speech recognition: can we speak far from the microphones?

Maurizio Omologo

SLTC Newsletter, February 2011

Automatic Speech Recognition (ASR) technologies are often used under highly mismatched conditions, due to environmental noise and to room acoustics (in particular, reverberation), which combine with the speech input at the microphone. The degree of degradation in the input signal, and the consequent drop in ASR performance, can be very significant when the distance between user and microphone increases [1].

Although this limitation is not a crucial problem for some applications in which the user can adopt a hand-held microphone or wear a head-mounted one (e.g., dictation), in many other cases it nowadays represents one of the main reasons hindering wide acceptance of voice technologies by customers. During the last decade, ASR has become an increasingly popular core technology in several application domains in which no constraints on user-microphone distance should hold (e.g., the automated home, support for impaired users, robot companions, gaming, etc.).

Over the years, significant progress has been made on microphone array processing and technologies. Very effective array solutions and devices have been realized, and are available on the market, for capturing talkers in conferencing applications and meetings. However, since the design of the array processing has been driven by the primary goals of tracking the speaker, enhancing distant speech, and reducing acoustic echo in a communication, its use for ASR-based applications is often limited to rather "controlled" and simple tasks. In other words, the current fragility of ASR technology to all the variability introduced by distant-talking interaction cannot be overcome simply by replacing the close-talking microphone with a microphone array and its processing. For more complex recognition and understanding tasks, the characteristics of the array output signal are in general still far from the ideal signal expected at the input of the ASR engine.

Recently, under the European Union (EU)-funded project DICIT (Distant-talking Interfaces for Control of Interactive TV) [2], we have observed that some benefits can be obtained by training the acoustic models of the ASR engine on signals that have been pre-processed by the multi-microphone front-end (including processing for automatic speaker localization, adaptive beamforming, and acoustic echo cancellation). The reason for this improvement is that the typical distortions introduced by the given front-end represent new knowledge that is learned and then exploited by the system. This is, however, a rather crude approach that does not allow us to understand the problem deeply.
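
As a point of reference for readers less familiar with multi-microphone front-ends, the short Python sketch below implements plain delay-and-sum beamforming, the simplest building block of such processing. It is an illustrative toy only; the DICIT front-end uses adaptive beamforming, speaker localization and acoustic echo cancellation, none of which are shown here.

    # A minimal delay-and-sum beamformer: align the microphone signals for a
    # given plane-wave direction and average them. Toy illustration only.
    import numpy as np

    def delay_and_sum(signals, mic_positions, propagation_dir, fs, c=343.0):
        """signals:         array (num_mics, num_samples)
           mic_positions:   array (num_mics, 3), metres
           propagation_dir: unit vector of wave propagation (source -> array)
           fs:              sampling rate in Hz; c: speed of sound in m/s"""
        num_mics, num_samples = signals.shape
        # Relative arrival time at each microphone for the assumed direction.
        delays = mic_positions @ propagation_dir / c            # seconds
        shifts = np.round((delays - delays.min()) * fs).astype(int)
        out = np.zeros(num_samples)
        for m in range(num_mics):
            # Advance later-arriving channels so all channels line up
            # (wrap-around at the edges is ignored in this toy example).
            out += np.roll(signals[m], -shifts[m])
        return out / num_mics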

Efforts spent so far at Fondazione Bruno Kessler (FBK)-IRST labs in realizing showcases and prototypes of distant-talking interaction have also shown us that applying a speech enhancement technique that seems very good at the perceptual level, or that decreases the relative word error rate for a given recognition task, often does not correspond to any significant impact in terms of understanding capabilities.

Another piece of evidence to recall (many others are not reported here for lack of space) regards the quality of the array device: our experience is that only a microphone array with very good performance (e.g., in terms of the quality of analog-to-digital conversion and of immunity to electrical noise) allows one to obtain experimental results that confirm the validity of a given theory and of the related practical solutions.

Many innovation steps are necessary to reach a target such as the human-computer interaction capability shown by HAL 9000 in 2001: A Space Odyssey, or by the robots of Star Wars, i.e., recognizing, understanding and reacting intelligently to fluent spontaneous speech, with the human far from the microphones and free to move. Realizing this type of scenario, based for instance on several microphone arrays distributed in space, is extremely complex due to the diverse disciplines involved, from acoustics-oriented signal processing to spoken dialogue management and human-computer interaction design. The best results will probably come from approaches based both on a deep knowledge of all the basic technical problems and on a synergetic combination of the related techniques.

Different research communities are clearly interested in the above-mentioned technical problems. For a more effective process of innovation and progress (as very well highlighted in [1]), all of these communities of experts in different fields should interact and cooperate tightly, sharing corpora and tasks, tackling the same final objectives and targeted scenarios together, and eventually benefiting from this complementarity.

Along this direction, some actions have been taken worldwide over the last years. This article does not aim to provide a state-of-the-art survey in this regard. However, let us mention the EU-funded projects CHIL [3], AMI and AMIDA [4], which addressed both microphone array processing and speech recognition technologies, with application scenarios related to lectures and meetings. More recently, the EU-funded DICIT project realized a spoken dialogue system for voice-enabled control of TV and access to related information, which supported three languages (English, German, and Italian). Public documents as well as video clips showing the use of the final DICIT prototype in real noisy conditions can be found on the project web site [2].

In terms of international actions, it is also worth mentioning the HSCMA workshop [5], which will be held this year in Edinburgh and aims to continue a tradition initiated by two camps of specialists, one mainly comprising experts in acoustics-oriented signal processing and Microphone Arrays (MA), and the other composed largely of experts in Hands-free Speech Communication (HSC), most especially in automatic speech recognition.

Another forthcoming event that deserves mention is the PASCAL CHiME Speech Separation and Recognition Challenge [6],[7], which addresses the problem of separating and recognizing speech artificially mixed with other speech. Speech separation is, in fact, another frontier area for making distant-talking interaction systems robust enough to manage multiple subjects speaking simultaneously, to track them in space, and, in general, to reduce the negative impact on ASR of other active sources that interfere with the user.

For any questions or comments, please contact Maurizio Omologo at the following e-mail address: omologo@fbk.eu. Information about the research activities being conducted at FBK-IRST on these topics is also available at http://shine.fbk.eu.

For more information, see:

References

Maurizio Omologo is a Senior Researcher and Project Leader at the Fondazione Bruno Kessler in Trento, Italy. Email: omologo@fbk.eu


Some thoughts on Language, Dialect and Accents in Speech and Language Technology

Martin Russell, Abualsoud Hanani and Michael Carey

SLTC Newsletter, February 2011

Accent and Dialect in Speech and Language Technology


In recent years the topics of 'accent' and 'dialect' have become common in speech and language technology research. A search of the Interspeech 2010 proceedings for the word 'dialect' returns 74 references, of which 64 refer to some extent to work concerned with variations caused by dialect or accent (this is approximately 8% of the total number of papers presented at Interspeech). Of these, 40% are concerned with speech science, referring to 'dialect' in the contexts of 18 different languages, and 60% with technology. In speech technology the most common references are to dialect as a source of variability in speech recognition; however, five papers address the problem of dialect recognition directly. There is some ambiguity in the speech technology literature between the terms 'dialect' and 'accent', and some authors also use the term 'variety' (for example, Koller, Abad, Trancoso and Viana discuss varieties of Portuguese in [1]). In British English 'accent' normally refers to systematic variations in pronunciation associated with particular geographic regions, while 'dialect' also includes the use of characteristic words in those regions. So, for example, when a speaker from Yorkshire in the North of England pronounces "bath" to rhyme with "cat" rather than with "cart", they are exhibiting a Yorkshire (or at least northern English) accent, but when they use the word "lug" to mean "ear" or "flag" to mean "paving stone" these are examples of Yorkshire dialect [2]. These issues are discussed in more depth in the three volumes by Wells, probably the best known works on the accents of English [3].

In these terms, most speech technology is concerned with accent. A search for the word 'accent' returns 1771 instances in 117 documents in the Interspeech 2010 proceedings, but of course, accent certainly has more than one meaning in the context of speech and language science.

Early work on accent recognition follows the well-known characterisation of a language as a "dialect with an army and a navy" [4], with researchers typically using GMM-SVM (Gaussian Mixture Model / Support Vector Machine) and PRLM (Phone Recognition – Language Modelling) methods from Language Identification. However, some recent research has exploited specific properties of accents. In [5] Biadsy, Hirschberg and Collins use the fact that, at least to a first approximation, accents share the same phone set, but the realisation of these phones may differ. They build phone-dependent GMMs which in turn are used to create 'supervectors' for classification using an SVM, and report improved performance compared with a conventional GMM-SVM-based language recognition system. Huckvale [6] takes this a step further with his ACCDIST measure, by exploiting the fact that British English accents can be characterised by the similarities and differences between the realisations of vowels in specific words (for example, from the previous paragraph, for a northern English accent the 'distance' will be small between the vowels in "bath" and "cat" but large between the vowels in "bath" and "cart", whereas for a southern English accent the opposite will be the case). Huckvale reports an accent recognition accuracy of 92.3% on the 14-accent Accents of the British Isles (ABI) speech corpus [7].
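
To give a flavour of the idea behind ACCDIST-style measures, the sketch below represents each speaker by a table of distances between the vowels of selected words and compares speakers by correlating these tables. The feature vectors and word set are hypothetical, and this is a simplification rather than Huckvale's exact formulation.

    # Sketch of an ACCDIST-style comparison: accents are characterised by
    # which vowels are close to which, not by the absolute vowel qualities.
    import itertools
    import numpy as np

    def distance_table(vowel_features):
        """vowel_features: dict word -> feature vector (e.g. mean formants/MFCCs)."""
        words = sorted(vowel_features)
        return np.array([np.linalg.norm(vowel_features[a] - vowel_features[b])
                         for a, b in itertools.combinations(words, 2)])

    def accent_similarity(speaker_a, speaker_b):
        """Correlation between the two speakers' inter-vowel distance tables."""
        ta, tb = distance_table(speaker_a), distance_table(speaker_b)
        return np.corrcoef(ta, tb)[0, 1]

    # Hypothetical 2-D "vowel" features: "bath" patterns with "cat" for the
    # northern speaker but with "cart" for the southern speaker.
    northern = {"bath": np.array([1.0, 0.1]), "cat": np.array([1.1, 0.2]),
                "cart": np.array([0.2, 1.0])}
    southern = {"bath": np.array([0.3, 0.9]), "cat": np.array([1.1, 0.2]),
                "cart": np.array([0.2, 1.0])}
    print(accent_similarity(northern, southern))  # low similarity across accents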

In more recent work in our laboratory we have applied techniques from language identification to the problem of discriminating between speakers from different groups from within the same accent, specifically between second-generation Asian and white speakers born in Birmingham. We achieved a recognition accuracy of 96.51% using 40s of test data with a Language Identification system that fuses the outputs of several acoustic and phonotactic systems. This is much better than we expected and compares well with the 90.24% achieved by human listeners [8].

The fact that it is possible to decide automatically which accent group, or even which social group within a particular accent group, an individual belongs to, and to achieve this using as little as 40s of data, has interesting implications for automatic speech recognition. First, it confirms that there are significant acoustic and phonotactic differences even within a "homogeneous" accent group. Second, it shows that these differences are sufficiently large to be detected automatically. Hence it may be possible to identify suitable acoustic, lexical and even grammatical models automatically for rapid adaptation. It will also be interesting to see if the ideas that are developing in the context of dialect and accent recognition can be 'pulled back' to achieve improved results in Language Identification.

References

  1. Oscar Koller, Alberto Abad, Isabel Trancoso and Ceu Viana, "Exploiting variety-dependent Phones in Portuguese Variety Identification applied to Broadcast News Transcription", Proc. Interspeech 2010, pp. 749-752.
  2. The Yorkshire dialect website
  3. J. C. Wells, "Accents of English", volumes 1, 2 and 3, Cambridge University Press, 1982.
  4. "A language is just a dialect with an army and navy" - wikipedia reference
  5. Fadi Biadsy, Julia Hirschberg, Michael Collins, "Dialect Recognition Using a Phone-GMM-Supervector-Based SVM Kernel", Proc. Interspeech 2010, pp. 753-756.
  6. Mark Huckvale, "ACCDIST: An Accent Similarity Metric for Accent Recognition and Diagnosis", in Speaker Classification II, Lecture Notes in Computer Science, Volume 4441, 2007.
  7. ABI website
  8. Abualsoud Hanani, Martin Russell and Michael Carey, "Speech-Based Identification of Social Groups in a Single Accent of British English by Humans and Computers", to appear in Proc. IEEE ICASSP 2011

Martin Russell is a Professor, Michael Carey an Honorary Professor and Abualsoud Hanani a Research Student in the School of Electronic, Electrical and Computer Engineering at the University of Birmingham, UK. Email: {m.j.russell@bham.ac.uk, m.carey@bham.ac.uk, aah648@bham.ac.uk}


An Overview of Watson and the Jeopardy! Challenge

Tara N. Sainath

SLTC Newsletter, February 2011

Over the past few years, IBM Research has been actively involved in a project to build a computer system, known as Watson, to compete at the human championship level on the quiz show Jeopardy!. After four years of intense research, Watson can perform on the Jeopardy! show at the level of human expertise in terms of precision, confidence and speed. The official first-ever man vs. machine Jeopardy! competition will air on television February 14, 15, and 16.

A deep question-answering (QA) architecture and technology, DeepQA [1], was developed to handle QA problems like Jeopardy!. In this article, we describe the DeepQA architecture in more detail, focusing specifically on the speech and natural language processing (NLP) components of the system.

Deep QA Architecture

Watson's Sources

The first step in any QA architecture such as DeepQA is to create a content server which contains relevant information that can be used when answering a question. The creation of the content server is done offline in two steps. First, example questions from the Jeopardy! problem space are analyzed to produce a description of the types of questions that must be answered, as well as to characterize the domain of the questions. Given the large-vocabulary nature of a task such as Jeopardy!, sources such as encyclopedias, dictionaries, Wikipedia, news articles, etc., were analyzed to create the content server.

Second, a process called automatic corpus expansion was performed. In this process, a set of seed documents is identified. Then text nuggets extracted from retrieved web documents are scored to see which nuggets are most informative for each seed document. The most informative nuggets are added to an expanded corpus.
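
As a rough illustration of nugget scoring (not the actual DeepQA corpus-expansion algorithm), the sketch below simply ranks candidate nuggets by TF-IDF cosine similarity to a seed document; the seed text and nuggets are made up for the example.

    # Toy nugget scoring: rank text nuggets by similarity to a seed document.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_nuggets(seed_document, nuggets, top_k=2):
        vectorizer = TfidfVectorizer(stop_words="english")
        matrix = vectorizer.fit_transform([seed_document] + nuggets)
        # Similarity of each nugget (rows 1..n) to the seed document (row 0).
        scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
        return sorted(zip(scores, nuggets), reverse=True)[:top_k]

    seed = "Isaac Newton formulated the laws of motion and universal gravitation."
    nuggets = ["Newton's law of gravitation describes the attraction between masses.",
               "The 2010 football season opened in September.",
               "Newton also developed calculus independently of Leibniz."]
    for score, text in rank_nuggets(seed, nuggets):
        print(round(score, 3), text)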

Question Analysis

When a question is received in real-time by the DeepQA system, the first step in the process is question analysis. In this step, the system tries to understand the question itself and performs initial analysis to determine how the question should be further analyzed by other components in the system. A parser, tuned to address the Jeopardy! question phrasing, is used to perform various analyses on the question. Furthermore, additional NLP analysis components are used to process the question, including named entity detection, relation detection, semantic role labeling, etc. The analysis is particularly challenging because Jeopardy! questions can have many nuances, including imprecise information and extraneous information. Therefore, questions must be analyzed to determine what is more vs. less important, and then different search queries are formulated based on this analysis.

Search and Scoring

In the search component, hypotheses are generated by taking the various search queries and searching through the large corpus described above, roughly the size of 1 million books. The search process reduces the 1 million books to fewer than a hundred candidate passages. This process has been heavily optimized for speed and takes approximately 1 second.

Next, the passages are analyzed by our NLP components much the same way the question was, and a set of candidate answers is extracted. The candidates and the analyzed passages are then passed to a scoring component, which contains over 50 scorers that score different aspects of how well a proposed answer matches what the question was asking. These aspects could include temporal scoring, spatial relation scoring, categorical scoring, etc. The scores from the different methods are combined to find the top hypothesis from the set of candidate answers.
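
The following toy sketch illustrates the general idea of merging many scorers into a single confidence per candidate answer. The scorer names and weights are hypothetical; in DeepQA the combination weights are learned from training data rather than set by hand.

    # Toy weighted logistic combination of per-scorer evidence for candidates.
    import math

    def combine(scores, weights, bias=0.0):
        """Map the weighted sum of scorer outputs to a confidence in (0, 1)."""
        z = bias + sum(weights[name] * value for name, value in scores.items())
        return 1.0 / (1.0 + math.exp(-z))

    weights = {"temporal": 1.5, "spatial": 0.8, "type_match": 2.0}  # hypothetical
    candidates = {
        "answer A": {"temporal": 0.9, "spatial": 0.2, "type_match": 1.0},
        "answer B": {"temporal": 0.1, "spatial": 0.7, "type_match": 0.0},
    }
    ranked = sorted(candidates, key=lambda c: combine(candidates[c], weights),
                    reverse=True)
    print(ranked[0], round(combine(candidates[ranked[0]], weights), 3))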

Answer Generation

Once the textual answer is formulated, the computer then "speaks" the answer. This is handled by a real-time unit-selection concatenative text-to-speech (TTS) engine, consisting of a language-dependent text-processing front-end and a back-end that handles unit search and waveform generation. One of the challenges in designing the TTS system for Jeopardy! is the large, open-ended vocabulary of the task, as most TTS systems are customized for applications requiring a narrower vocabulary. A significant portion of this large vocabulary contains words of foreign origin that do not conform well to the letter-to-sound rules of English and required special attention. Another aspect was how to handle text-normalization issues. Examples are the commonly occurring Roman numerals, as well as the idiosyncratic punctuation often observed in Jeopardy! categories, such as word-internal quotes, multiple dashes, etc. Homograph disambiguation is yet another issue the system needs to contend with: the answers to the questions are usually fairly short, so the system has access to little or no disambiguating context to decide between alternative pronunciations for a given spelling.

In order to design a robust TTS system, researchers searched through records of previous games to identify and rank salient topics according to frequency of occurrence, and then used these to consult on-line and other sources (Wikipedia, dictionaries, etc.) and assemble a representative vocabulary for each topic.

Human listeners spent more than a year listening to the TTS outputs to identify and correct errors. Related groups of errors that were more systemic were corrected by improving the set of rules in the text-processing front-end. More exceptional cases (such as foreign words that violate the phonotactics of English) were addressed by adding them to an exceptions dictionary (or, more generally, to context-dependent or phrase dictionaries) that is consulted for look-up before the standard text-processing rules of the front-end are activated.

Future

The DeepQA system has certainly pushed the state of the art in open-domain QA systems, in terms of depth, breadth and robustness. This QA architecture is now being extended to other applications with scenarios similar to Jeopardy!, such as call center and medical domain applications.

Acknowledgements

Thank you to Jennifer Chu-Carroll, Raul Fernandez and Bhuvana Ramabhadran of IBM T.J. Watson Research Center for useful discussions related to the content of this article.

For further information about the project, please visit the following website:
http://www-03.ibm.com/innovation/us/watson/index.shtml

References

[1] D. Ferrucci et al., "Building Watson: An Overview of the DeepQA Project," Association for the Advancement of Artificial Intelligence, Fall 2010.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com


Automatic Identification of Discourse Relations in Text

Svetlana Stoyanchev

SLTC Newsletter, February 2011

Structure makes text coherent and meaningful. Discourse relations (also known as rhetorical and coherence relations) link clauses in text and compose overall text structure. Discourse relations are used in natural language processing, including text summarization and natural language generation. The area of discourse has become a prominent part of natural language processing. Discourse theory was formalized by Mann and Thompson in 1988 [1]. It proposes a set of relations such as evidence, concession, justification, background, circumstance, etc. These relations were further refined, extended, and applied to corpus annotations. In a recent SLTC Newsletter article, Annie Louis described the Penn Discourse Treebank (PDTB) [2], a one-million-word corpus annotated with discourse relations. In this article we discuss human agreement on the discourse annotation tasks of the PDTB and the RST corpus (another publicly available resource annotated with discourse relations) and review approaches to automatic discourse relation identification.

Uses of Discourse

Discourse relations are extensively used in natural language processing applications, including text summarization and natural language generation. In summarization, discourse relations help identify which text should be included in a summary [5, 6] and produce appropriate ordering of sentences in a summary [7]. In generating speech from text, discourse relations are used to achieve higher prosodic quality [8]. In our current work on dialogue generation from text we rely on discourse relation parsing as an initial text processing step [9]. Education is another field that uses discourse relation theory. In automatic essay scoring, discourse relations have been shown to improve performance [10]. In automated tutoring systems, the discourse structure of the tutoring dialogue is used for student performance analysis [11]. In question answering, discourse relations about reason can help identify answers to "why" questions.

Practical NLP applications rely on automatic detection of discourse relations. Louis et al. [13] note that the structure of discourse relations is the most useful information for content selection in the summarization task, but that "the state of current parsers might limit the benefits obtainable from discourse".

Human Agreement on Annotations

Performance of automatic classifiers is bounded by the inter-annotator agreement on the experimental dataset. If human annotators do not reliably agree on their tag assignments, automatic algorithms cannot outperform them.

In the RST corpus annotation [4], the authors report a marked improvement in annotation agreement over a 10-month period. Inter-annotator agreement on relation assignment rises from kappa = 0.60 in April 2000 to 0.76 in January 2001. This shows that discourse relation detection is a very difficult task even for humans. To achieve a reliable level of discourse annotation, people have to be extensively trained.
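
For readers unfamiliar with the kappa values quoted above, the short sketch below computes Cohen's kappa for two annotators: observed agreement corrected for the agreement expected by chance. The label sequences are invented for illustration.

    # Cohen's kappa for two annotators labelling the same items.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        # Agreement expected if both annotators labelled at random with
        # their own label frequencies.
        expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
        return (observed - expected) / (1 - expected)

    a = ["evidence", "contrast", "contrast", "background", "evidence", "contrast"]
    b = ["evidence", "contrast", "background", "background", "evidence", "evidence"]
    print(round(cohens_kappa(a, b), 2))  # raw agreement 0.67, kappa about 0.52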

The RST annotation scheme posits a single tree structure over the document. RST annotation involves 1) segmenting the text and 2) assigning a discourse relation from a list of 110 relations (a rather daunting task). The PDTB [2] annotation scheme, on the other hand, does not assume a particular structure. In comparison with the RST annotation process, PDTB annotation involves 1) span detection for explicit and implicit discourse connectives ("but", "because", "when", etc.) and 2) disambiguation of the discourse connectives. For an implicit relation, where a discourse connective is not present, the PDTB annotator first chooses the most applicable connective. The PDTB annotation scheme allows annotators to choose between three levels of granularity: four classes at the topmost level ("Comparison", "Contingency", "Temporal" and "Expansion") are further subdivided into types and subtypes. At the topmost level inter-annotator agreement is 94%, while at the lowest, subtype level (which corresponds to the RST annotations) it is 80%.

Observation on Tree Structure in RST Annotation

Relations at the leaf (lowest) level of the RST tree correspond to relations within a sentence or between consecutive sentences. Relations at higher levels of the tree correspond to relations between multiple sentences or paragraphs. In our attempt to annotate a full tree structure on a monologue paraphrase of an expository dialogue [14], we found practically no agreement on the higher-level relations and the structure of the RST trees. The agreement at the leaf level is 0.62, comparable with the initial (before training) RST annotations. This lack of agreement at higher levels of the tree may be caused by the type of data set, or by the fact that discourse relations between multiple sentences are more ambiguous. Further experiments are needed to determine the cause of the lower agreement between the leaf and higher levels of the tree.

Automatic Detection of Discourse Relations

Automatic detection of discourse relations has been applied and tested on different data sets:

Soricut and Marcu (2003) [15] use the RST corpus to train and test a sentence-level discourse parser. The authors use lexical and syntactic information in a sentence first to identify segments and second to identify discourse relations between them. They report that the segmentation task achieves an f-score of 0.85 and is not strongly affected by syntactic parser errors. They find that the relation tagging task on manually segmented data achieves an f-score of 0.75, not very different from the human performance of 0.77.

In an approach to automatic sense disambiguation based on the GraphBank corpus, Wellner et al. [16] use data pre-processing techniques such as event detection, modal parsing (identifying subordinate verb relations and their types), and temporal parsing over events. The authors also use knowledge resources (the Word Sketch Engine and the Brandeis Semantic Ontology) for similarity measures. Using maximum entropy classification with an extensive set of linguistically motivated features, their method achieves 81% accuracy on the sense assignment task (using a set of 10 coarse-grained relations and assuming that nuclearity, i.e. relation direction, is given).

Pitler et al. [17, 18] investigate automatic detection of discourse relations between and within sentences using the PDTB corpus, evaluating explicit (signalled by a discourse connective) and implicit (not signalled) relations separately. The authors find that recognition of explicitly signalled relations is very good (>90% accuracy). But for relations that are conveyed implicitly, their approach yields accuracy below 50% for 6-way (4 high-level relation classes + Entity relation + no relation) classification. Similarly, 40% accuracy on identifying implicit relations was achieved by Lin et al. [19].
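
The gap between explicit and implicit relations is easy to appreciate with a toy baseline: for explicit relations, the connective itself is such a strong cue that simply predicting the most frequent sense seen for each connective already performs well. The tiny training set below is hypothetical; real experiments use the PDTB annotations.

    # Most-frequent-sense-per-connective baseline for explicit relations.
    from collections import Counter, defaultdict

    training = [("but", "Comparison"), ("but", "Comparison"),
                ("because", "Contingency"), ("when", "Temporal"),
                ("when", "Contingency"), ("also", "Expansion")]  # hypothetical

    sense_counts = defaultdict(Counter)
    for connective, sense in training:
        sense_counts[connective][sense] += 1

    def classify_explicit(connective, fallback="Expansion"):
        """Return the most frequent top-level sense seen for this connective."""
        counts = sense_counts.get(connective.lower())
        return counts.most_common(1)[0][0] if counts else fallback

    print(classify_explicit("but"))      # Comparison
    print(classify_explicit("because"))  # Contingency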

duVerle and Prendinger [20] develop a full RST structure parser using a range of lexical, semantic, and structural features with Support Vector Machine classification. They report achieving 73% of the human inter-annotator agreement f-score. The parser's web interface is publicly available.

Conclusions

Discourse relations are useful in many NLP tasks, and automatic detection of discourse structure can benefit practical NLP applications. While sentence-level structure can be extracted with accuracy close to human agreement, extracting overall document structure is more challenging. Explicit discourse relations (signalled with a connective such as "because", "but" or "however") between clauses can be recognized with high accuracy; however, there is room for improvement in the recognition of implicit discourse relations, which constitute over 45% of relations in the PDTB corpus.

For more information, see:

References on theory of discourse:

[1] William C. Mann and Sandra A. Thompson, "Rhetorical structure theory: Towards a functional theory of text organization", Text, 8, 1988

[2] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber, "The Penn Discourse Treebank 2.0", Proceedings of LREC 2008.

[3] Florian Wolf and Edward Gibson, "Representing discourse coherence: A corpus-based study", Computational Linguistics, 2005

[4] Lynn Carlson, Daniel Marcu and Mary Okurowski, "Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory", In Proceedings of the Second SIGdial Workshop on Discourse and Dialog, 2001

References on the use of discourse annotations:

[5] D. Marcu. From discourse structures to text summaries. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pages 82-88, 1997.

[6] I. Mani, E. Bloedorn, and B. Gates. Using cohesion and coherence models for text summarization. In AAAI Symposium Technical Report SS-989-06, pages 69-76. AAAI Press, 1998.

[7] R. Barzilay, N. Elhadad, and K. McKeown. Inferring strategies for sentence ordering in multidocument news summarization. J. Artif. Intell. Res. (JAIR), 17:35-55, 2002.

[8] M. Theune. Contrast in concept-to-speech generation. Computer Speech & Language, 16(3-4), 2002.

[9] P. Piwek and S. Stoyanchev. Generating expository dialogue from monologue: motivation, corpus and preliminary rules. In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2010.

[10] E. Miltsakaki and K. Kukich. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering, 10:25-55, March 2004.

[11] M. Rotaru and D. J. Litman. Discourse structure and performance analysis: Beyond the correlation.

[12] M. Theune. Contrast in concept-to-speech generation. Computer Speech & Language, 16(3-4), 2002.

[13] Annie Louis, Aravind Joshi and Ani Nenkova, Discourse indicators for content selection in summarization, Proceedings of SIGDIAL 2010.

[14] S. Stoyanchev and P. Piwek. Constructing the CODA Corpus: A Parallel Corpus of Monologues and Expository Dialogues. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), Malta, 2010.

References on automatic detection of discourse relations:

[15] R. Soricut and D. Marcu. 2003. Sentence level discourse parsing using syntactic and lexical information. In HLT-NAACL.

[16] B. Wellner, J. Pustejovsky, C. Havasi, A. Rumshisky, and R. Sauri. 2006. Classification of discourse co-herence relations: An exploratory study using multiple knowledge sources. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue.

[17] Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, Aravind Joshi, "Easily Identifiable Discourse Relations", Proceedings of COLING, 2008.

[18] E. Pitler et al. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009

[19] Z. Lin, M. Kan and H. T. Ng. Recognizing Implicit Discourse Relations in the Penn Discourse Treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.

[20] D. A. duVerle and H. Prendinger. A novel discourse parser based on support vector machine classification. In Proceedings of ACL, 2009.

If you have comments, corrections, or additions to this article, please contact the author: Svetlana Stoyanchev, s.stoyanchev [at] open [dot] ac [dot] uk.

Svetlana Stoyanchev is a Research Fellow at the Open University. Her interests are in dialogue systems, natural language generation, discourse, and information presentation.


Odyssey presentations indexed in Brno University of Technology's superlectures.com

Josef Zizka, Igor Szoke and Honza Cernocky

SLTC Newsletter, February 2011

Superlectures.com is an innovative lecture video portal that enables users to search for spoken content. This brings a significant speed-up in accessing lecture video recordings. The aim of this portal is to make video content as easily searchable as any textual document. The speech processing system automatically recognizes and indexes Czech and English spoken words.

The main features of the Brno University of Technology (BUT) lecture browser are:

Search in speech

The BUT lecture browser enables users to search for what was spoken across all lectures, or the search can be restricted to a specific lecture. If the search is performed globally, a list of talks matching the search query is shown. The user can then select a list of results for each talk. The results are displayed along with their confidence scores, with the highest-confidence results shown first. The user can begin playing the video from any result. In playback mode, the words are highlighted as they are spoken. The results are accompanied by a transcript of the surrounding segment and are shown on the timeline to help the user navigate.

The transcripts are generated using the BUT speech recognition system. The system architecture design and the training of the ASR language models were contributed by members of the BUT Speech@FIT group, while the recognition software is based on the BS-CORE library produced in cooperation between BUT and its spin-off Phonexia. The indexing and search system is built on Apache Lucene.
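
As a rough sketch of the indexing idea (the production system is built on Apache Lucene in Java, which is not shown here), the toy Python inverted index below stores recognized words with their lecture, timestamp and confidence, and returns the highest-confidence hits first.

    # Toy inverted index for recognized words; illustrative only.
    from collections import defaultdict

    index = defaultdict(list)   # word -> list of (lecture_id, time_sec, confidence)

    def add_hypothesis(lecture_id, word, time_sec, confidence):
        index[word.lower()].append((lecture_id, time_sec, confidence))

    def search(query, lecture_id=None):
        """Return hits sorted by confidence, optionally restricted to one lecture."""
        hits = [h for h in index.get(query.lower(), [])
                if lecture_id is None or h[0] == lecture_id]
        return sorted(hits, key=lambda h: h[2], reverse=True)

    # Hypothetical recognizer output.
    add_hypothesis("odyssey_talk_03", "MLLR", 512.4, 0.93)
    add_hypothesis("odyssey_talk_07", "MLLR", 88.1, 0.71)
    print(search("mllr"))   # highest-confidence occurrences first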

Odyssey2010

In summer 2010, Odyssey: The Speaker and Language Recognition Workshop took place at the Faculty of Information Technology, Brno University of Technology. All talks were recorded using a fixed camera facing the projection screen and positioned in the middle of the lecture room. This set-up ensured that the recorded video included both the lecturer and the projected slides. Each speaker was asked to fill in a consent form confirming what could be done with the recordings. Most of them agreed to make their recordings publicly available. However, some restricted the availability of their recordings to the workshop attendees, and some did not want to make them public at all.

As the language used during the conference contains a lot of technical terms, the recognition vocabulary had to be extended to include new words. These were extracted mainly from various scientific papers in the field of speech processing and from the presented slides. Special attention was given to the pronunciation of various abbreviations, such as JFA, GMM, etc.


Search results page with hits for "MLLR system"

Data processing

The lecture browser works with preprocessed data. As soon as a video recording is available, a sequence of scripts prepares the data for the lecture browser. First, the audio track is extracted from the video recording. Then it is normalized, converted into a suitable format and merged back with the video. After that, the video recording is converted into Flash video format and, afterwards, image thumbnails and an MP3 file are created. The audio track is processed by our speech processing system. Finally, files with the transcription, subtitles and other information are generated.
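
As an illustration of the first step of such a pipeline, the snippet below extracts a mono 16 kHz audio track from a video file with ffmpeg; the file names are hypothetical and the remaining normalization, conversion and recognition stages are not shown.

    # Extract the audio track from a video recording as a mono 16 kHz WAV file.
    import subprocess

    def extract_audio(video_path, wav_path, sample_rate=16000):
        subprocess.run(["ffmpeg", "-i", video_path, "-vn",
                        "-ac", "1", "-ar", str(sample_rate), wav_path],
                       check=True)

    extract_audio("lecture01.mp4", "lecture01.wav")   # hypothetical file names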

Lectures, particularly in the academic environment, are mostly based on the presentation of slides. Identifying when a particular slide was presented in a video is an important cue for navigating in the recordings and necessary for providing the user with a high-quality version of the projected slide. If the PPT or PDF file is available, it can be automatically synchronized with the video recording.


Lecture page with transcription and synchronized slides

Conclusion

The ability to search within video can dramatically help users navigate the recordings. Thanks to its web-based interface, the lecture browser runs on many computers without the need to install any special software. The demo version containing recordings of the Odyssey workshop is available at: http://www.superlectures.com/odyssey. This website also includes more information on the BUT lecture browser.

The lecture browser was primarily developed to help students prepare for their final examinations at our faculty; however, there are plenty of other cases where the BUT lecture browser can be of great help. We will be happy to hear about them.

Josef Zizka is a staff member of the BUT Speech@FIT group. He is responsible for superlectures.com system development and user interface. Email: zizkaj@fit.vutbr.cz

Igor Szoke is a researcher in the BUT Speech@FIT group. He is responsible for BUT's keyword spotting and spoken term detection technologies. Email: szoke@fit.vutbr.cz

Honza Cernocky is Head of Department of Computer Graphics and Multimedia, Faculty of Information Technology, Brno University of Technology, and managing head of BUT Speech@FIT group. Email: cernocky@fit.vutbr.cz