<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" ><head> <h1>Speech and Language Processing Technical Committee Newsletter</h1> <h4>October 2009</h4> <p>Welcome to the Autumn 2009 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter.</p> <p>In this issue we are pleased to provide another installment of brief articles representing a diversity of views and backgrounds. We are delighted to feature articles from 19 guest contributors and our own 5 staff reporters. We believe the newsletter is an ideal forum for updates, reports, announcements and editorials that do not fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions. You can reach us at speechnewseds [at] listserv (dot) ieee [dot] org.</p> <p>Finally, to subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.</p> <p>Jason Williams, Editor-in-chief <br>Pino Di Fabbrizio, Editor <br>Chuck Wooters, Editor</p> <hr> <h2>From the SLTC and IEEE</h2> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/sltc-new-members/" title="SLTC Newsletter : October 2009 : SLTC New Members Election">SLTC New Members Election</a></h3> <h4>Geoffrey Chan</h4> <p>The SLTC has elected 18 new members, and expanded to a total of 48 members to handle the growing number of speech and language submissions to ICASSP.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/icassp-paper-review-process/" title="SLTC Newsletter : October 2009 : The ICASSP Paper Review Process">The ICASSP Paper Review Process</a></h3> <h4>Steve Young</h4> <p>One of the main responsibilities of the Speech and Language Processing 
Technical Committee is to manage the review process for speech and language papers at ICASSP. This article explains how this is done.</p> <h3><a href="http://www.signalprocessingsociety.org/newsletter">IEEE Signal Processing Society Newsletter</a></h3> <h4></h4> <p>The IEEE Signal Processing Society, our parent organization, also produces a monthly newsletter, "Inside Signal Processing".</p> <hr> <h2>CFPs, Jobs, and book announcements</h2> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/cfps-2009-10/" title="SLTC Newsletter : October 2009 : CFPs">Calls for papers, proposals, and participation</a></h3> <p>Edited by Chuck Wooters</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/jobs-2009-10/" title="SLTC Newsletter : October 2009 : Job Announcements">Job advertisements</a></h3> <p>Edited by Chuck Wooters</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/books-2009-10/" title="SLTC Newsletter : October 2009 : Book Announcements">Book announcements</a></h3> <p>Edited by Jason Williams</p> <hr> <h2>Articles</h2> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/yrrsds/" title="SLTC Newsletter : October 2009 : Young researchers face-to-face on human-machine dialog">Young researchers face-to-face on human-machine dialogue</a></h3> <h4>David Pardo, Milica Ga&#353;i&#263;, Joana Paulo Pardal, Ricardo Ribeiro, Matthew Marge and Fran&ccedil;ois Mairesse</h4> <p>The fifth Young Researchers' Roundtable on Spoken Dialogue Systems gathered 41 researchers from academia and industry, representing a wide variety of institutions, at Queen Mary University in London last month.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/saras-2009-10/" title="SLTC Newsletter : October 2009 : Hy Murveit, William Labov, Bruce Millar, 
and Wolfgang Hess talk to the Saras Institute">Hy Murveit, William Labov, Bruce Millar, and Wolfgang Hess talk to the Saras Institute</a></h3> <h4>Olga Pustovalova, Antonio Roque, and Tadesse Anberbir</h4> <p>We continue the series of excerpts of interviews from the <a href="http://www.sarasinstitute.org/">History of Speech and Language Technology Project</a>. In these segments Hy Murveit, William Labov, Bruce Millar, and Wolfgang Hess discuss how they became involved with the field of speech and language technology.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/interspeech-emotion-challenge/" title="SLTC Newsletter : October 2009 : The INTERSPEECH 2009 Emotion Challenge">The INTERSPEECH 2009 Emotion Challenge: Results and Lessons Learned</a></h3> <h4>Bjoern Schuller, Stefan Steidl, Anton Batliner, and Filip Jurcicek</h4> <p>The <a href="http://emotion-research.net/sigs/speech-sig/emotion-challenge">INTERSPEECH 2009 Emotion Challenge</a>, organised by Bjoern Schuller (TUM, Germany), Stefan Steidl (FAU, Germany), and Anton Batliner (FAU, Germany), was held in conjunction with <a href="http://www.interspeech2009.org/conference/programme/session.php?id=6810">INTERSPEECH 2009</a> in Brighton, UK, September 6-10.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/sigdial/" title="SLTC Newsletter : October 2009 : Overview of Sigdial 2009 Conference">Overview of Sigdial 2009 Conference</a></h3> <h4>Svetlana Stoyanchev</h4> <p>On September 11-12, Queen Mary College in London hosted Sigdial, for the first time as a conference. This year Sigdial had 125 participants and 104 submissions. There were 24 oral and 24 poster presentations, 3 demos, and two invited talks. 
This article gives an overview of the topics, invited talks, and awards presented at the conference.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/blizzard-report/" title="SLTC Newsletter : October 2009 : Report of the Blizzard Challenge 2009 Workshop">Report of the Blizzard Challenge 2009 Workshop</a></h3> <h4>Tomoki Toda</h4> <p>This article provides a report of the Blizzard Challenge 2009 Workshop, an annual speech synthesis event, that took place on September 4th 2009 in Edinburgh, UK.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/nist-rt/" title="SLTC Newsletter : October 2009 : NIST Conducts Rich Transcription Evaluation">NIST Conducts Rich Transcription Evaluation</a></h3> <h4>Satanjeev "Bano" Banerjee</h4> <p>NIST recently concluded the latest edition of their Rich Transcription evaluation exercise series. The results of the exercise show that automatic transcription of overlapping speakers with distant microphones remains a difficult task, with little improvement over the previous evaluation conducted in 2007.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/gunnar-fant/" title="SLTC Newsletter : October 2009 : Gunnar Fant - 1919-2009">Gunnar Fant, 1919-2009</a></h3> <h4>Rolf Carlson and Bj&ouml;rn Granstr&ouml;m</h4> <p>A pioneering giant in speech research has passed away. 
Professor Emeritus Gunnar Fant died on June 6th at the age of 89 after a long and prominent research career.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/multimodal-epg/" title="SLTC Newsletter : October 2009 : Multimodal Voice Search of TV Guide in a Digital World">Multimodal Voice Search of TV Guide in a Digital World</a></h3> <h4>Harry Chang</h4> <p>This article discusses the interesting characteristics of the language models associated with multimodal search applications in the area of digital TV guides in terms of the linguistic properties associated with their text content and from a user&rsquo;s perspective with respect to spoken or typed queries.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/interspeech-report/" title="SLTC Newsletter : October 2009 : INTERSPEECH 2009">INTERSPEECH 2009</a></h3> <h4>Filip Jurcicek </h4> <p>10th Annual Conference of the International Speech Communication Association (<a href="http://www.interspeech2009.org/conference/">INTERSPEECH 2009</a>), which was held in Brighton, UK, September 6-10 2009, provided researchers with a great opportunity to share recent advances in the area of speech science and technology.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/text-structure-models/" title="SLTC Newsletter : October 2009 : Data-driven Models of Text Structure and their Applications">Data-driven Models of Text Structure and their Applications</a></h3> <h4>Annie Louis</h4> <p>Information about text structure is important for computational language processing. For example, in multi-document summarization, such information is necessary to correctly order sentences chosen from multiple documents. Language generation and machine translation systems could be informed by coherence models for better structured output. 
This article describes some of the empirical models of text structure that have had considerable success in several NLP tasks.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/dod/" title="SLTC Newsletter : October 2009 : Students and Researchers Come Together for Dialogs on Dialogs">Students and Researchers Come Together for "Dialogs on Dialogs"</a></h3> <h4>Matthew Marge</h4> <p>"<a href="http://www.cs.cmu.edu/~dgroup/">Dialogs on Dialogs</a>" offers students and researchers an opportunity to regularly discuss their research on an international scale.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/avixd/" title="SLTC Newsletter : October 2009 : The Association for Voice Interaction Design">The Association for Voice Interaction Design, or <i>AVIxD</i></a></h3> <h4>Jenni McKienzie, Peter Krogh</h4> <p><a href="http://www.avixd.org">AVIxD</a> (the Association for Voice Interaction Design) recently hosted its 8th workshop on voice interaction design. 
These workshops are held once or twice a year as an opportunity for voice interaction professionals to come together, put companies and competition aside, and tackle issues facing the industry.</p> <h3><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-10/bonn/" title="SLTC Newsletter : October 2009 : Proposed Closure of Phonetics and CL in Bonn">Proposed Closure of Phonetics and CL in Bonn</a></h3> <h4>Bernd M&ouml;bius</h4> <p>The University of Bonn, Germany, is proposing to close down the division of Language and Speech, formerly known as the "Institut f&uuml;r Kommunikationsforschung und Phonetik".</p> <hr> <p><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/subscribe/" title="SLTC Newsletter : Subscribe">Subscribe to the newsletter</a></p> <p><a href="http://archive.signalprocessingsociety.org/technical-committees/list/sl-tc/" title="Speech and Language Processing Technical Committee">SLTC Home</a></p> <hr /> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <!-- TITLE --><html><head> <meta content="text/html; charset=unicode" http-equiv="Content-Type"> <meta name="GENERATOR" content="MSHTML 8.00.6001.18812"></head> <body> <h2>SLTC New Members Election</h2><!-- BY-LINE --> <h3>Geoffrey Chan</h3><!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- Queen's University --><!-- DATE (month) --> <p>SLTC Newsletter, October 2009</p><!-- ABSTRACT. This is used both in the article itself and also on the table of contents --><!-- <p>The SLTC has elected 18 new members, and expanded to a total of 48 members to handle the growing number of speech and language submissions to ICASSP.</p> --> <p>The Speech and Language Processing Technical Committee held an election in September to elect 18 new members to serve the three-year term from 2010-2012. 
The new members serve to fill 13 positions which will be vacated by members whose terms end in December 2009, as well as to expand the Committee by five positions to a total of 47 elected members plus the Chair. The expansion will enable the Committee to better fulfill its mandate of championing speech and language processing (SLP) activities. SLTC members are responsible for reviewing and overseeing the review of a growing number of SLP submissions to ICASSP, sponsoring various SLP workshops, nominating people engaging in SLP for IEEE awards, supporting SLP publications such as the IEEE Transactions on Audio, Speech and Language Processing, and working with other organizations within and external to the IEEE Signal Processing Society.</p> <p>A call for nominations was announced to the Committee in early August. By the end of August, 38 candidates were nominated, each running for a position in one of ten technical areas. The number of candidates x and positions y in the technical areas are as follows: speech recognition (x=13, y=6), speaker recognition (4, 3), speech synthesis (2, 1), natural language processing (4, 1), dialogue systems (3, 2), speech production (3, 1), speech analysis (2, 1), speech enhancement (2, 1), speech perception (2, 1), speech coding (3, 1).</p> <p>The poll was administered through the Survey Monkey web site. Voters were required to rank every candidate in every area. SLTC By-Laws stipulate that candidates must receive a majority of votes to be elected. To satisfy this requirement while avoiding time-consuming run-off elections, the 2008 SLTC Member Election Subcommittee adopted a "ranked pairs" election methodology to simulate run-off elections. A program authored by Tim Hazen to implement the method was used to process the primary vote data collected on the web site.</p> <p>When the poll was closed on September 14, 95% of SLTC members had cast their votes. 
The 18 newly elected members are as follows:</p> <ul> <li><b>Speech Recognition:</b> Dirk Van Compernolle, John H.L. Hansen, Kate Knill, Hong-Kwang Jeff Kuo, Bhuvana Ramabhadran, Martin Russell <li><b>Speaker Recognition:</b> Carol Espy-Wilson, Seiichi Nakagawa, Douglas Reynolds <li><b>Speech Synthesis:</b> Simon King <li><b>Natural Language Processing:</b> Junlan Feng <li><b>Dialogue Systems:</b> Olivier Pietquin, David Suendermann <li><b>Speech Production:</b> Olov Engwall <li><b>Speech Analysis:</b> Tom Quatieri <li><b>Speech Enhancement:</b> Les Atlas <li><b>Speech Perception:</b> Takayuki Arai <li><b>Speech Coding:</b> Bastiaan Kleijn </li></ul> <p>Let's extend our congratulations to these new members and offer our appreciation to the members who will complete their terms by December.</p> <p><i>Geoffrey Chan is a member of the SLTC Member Election Subcommittee, which also includes Yannis Stylianou and Frederic Bechet.</i></p> <hr /> <!-- TITLE --><h2>The ICASSP Paper Review Process</h2> <!-- BY-LINE --><h3>Steve Young</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- Cambridge University --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <!-- <p>One of the main responsibilities of the Speech and Language Processing Technical Committee is to manage the review process for speech and language papers at ICASSP. This article explains how this is done.</p> --> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <!-- <h4>Overview</h4> --> <p>The winter holidays are approaching and that signals one of the busiest times of the year for the Speech and Language Processing Technical Committee. 
More than 600 papers have been submitted to ICASSP 2010 in the speech and language area and it is the job of the technical committee (TC), and especially the four Area Chairs, to oversee the process by which papers are reviewed, selected and assigned to sessions.</p> <p>The first stage of the process is to allocate reviewers to papers in a way which ensures as best we can that papers are assigned to reviewers with the appropriate expertise and that there are no obvious conflicts of interest. This is done by firstly allocating papers based on their specified EDICS numbers and then manually checking the assignments.</p> <p>Our primary goal in managing the review process itself is to ensure that it is as fair and as transparent as possible. To help ensure this, we allocate four reviewers to every paper and we guarantee that all papers will receive at least three reviews. One of the four reviewers will be a TC member and once all the reviews are complete, the TC member conducts a meta-review of each paper assigned to him or her. When all of the reviewers agree, this meta-review will simply record the consensus opinion. However, when there is disagreement or the paper is on the borderline, the TC member will examine each of the reviews carefully and weigh the evidence provided by the reviewers to support the scores they gave. After weighing the evidence, the TC member makes a firm recommendation either one way or the other. By this process we hope to avoid papers being rejected because a reviewer misunderstood the paper or did not read it properly.</p> <p>The final stage of the process is to make the accept/reject decisions and allocate the accepted papers to sessions. Not surprisingly, this typically requires quite a few iterations to get right. However, eventually the programme is finalised and the results are announced to the authors. 
At this point, the Area Chairs will breathe a huge sigh of relief.</p> <p>The ICASSP review process is hard work for all of the TCs, but it is especially so for our TC simply because we have the most papers. Over 20% of all papers submitted to ICASSP are in the speech and language area. This year the TC members and our panel of reviewers will together conduct over 2,500 reviews, and TC members will then conduct more than 600 meta-reviews. Guiding all of this will be the Area Chairs: Pascale Fung, TJ Hazen, Thomas Hain and Brian Kingsbury. The Area Chairs are the unsung heroes of ICASSP - they work really hard for many weeks to keep the process on track and on schedule. So I would like to end by thanking in advance all of the volunteers who will contribute their time and expertise to the ICASSP review process and especially the Area Chairs on whom the process so crucially depends.</p> <p><i>Steve Young is Chair of the Speech and Language Processing Technical Committee of the IEEE Signal Processing Society.</i></p> <hr> <h2>Calls for Papers &amp; Participation</h2> <p>Edited by Andrew Rosenberg, newsletter sub-committee</p> <p>This page lists CFPs for near-term speech and language processing meetings. 
There is also a long-term list of key speech and language processing meetings over the next few years.</p> <h3>Last updated May 12, 2016</h3> <br/> <table border="0" width="94%" id="table1"> <tr> <td width="66%"> <a href="http://ttic.edu/livescu/MLSLP2016">Call for Papers/Abstracts - Workshop on Machine Learning in Speech and Language Processing (MLSLP)</a><br>&nbsp; </td> <td valign="top"> Due Date: <strike>May 20, 2016</strike> June 1, 2016<br/> Event Date: September 13, 2016 </td> </tr> <tr> <td width="66%"> <a href="http://www.uni-paderborn.de/en/veranstaltungen/itg2016/willkommen/">Call for Papers - 12th ITG Conference on Speech Communication</a><br>&nbsp; </td> <td valign="top"> Due Date: May 20, 2016<br/> Event Date: October 5-7, 2016 </td> </tr> <tr> <td width="66%"> <a href="https://sites.google.com/site/s4pdaiict2016/">Call for Participation -- Summer School on Speech Source Modelling and its Applications</a><br>&nbsp; </td> <td valign="top"> Due Date: May 31, 2016<br/> Event Date: July 4-8, 2016 </td> </tr> <tr> <td width="66%"> <a href="http://www.sigdial.org/workshops/conference17/">Call for Papers -- SIGDIAL 2016: 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Los Angeles, CA, USA</a><br>&nbsp; </td> <td valign="top"> Due Date: <strike>May 15, 2016</strike> May 22, 2016<br/> Event Date: September 13-15, 2016 </td> </tr> <tr> <td width="66%"> <a href="http://spline2016.aau.dk/">Call for Participation -- International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE 2016) </a><br>&nbsp; </td> <td valign="top"> Event Date: July 6-8, 2016 </td> </tr> <tr> <td width="66%"> <a href="http://www.signalprocessingsociety.org/news/870/653/Machine-Translation-Journal-Special-Issue-on-Spoken-Language-Translation/">Call for Papers -- Machine Translation Journal: Special Issue on Spoken Language Translation</a><br>&nbsp; </td> <td valign="top"> Due Date: July 15, 2016<br/> 
</td> </tr> <tr> <td width="66%"> <a href="http://www.signalprocessingsociety.org/uploads/special_issues_deadlines/JSTSP_SI_spoofing.pdf">Call for Papers -- IEEE Journal of Selected Topics in Signal Processing Special Issue on Spoofing and Countermeasures for Automatic Speaker Verification</a><br>&nbsp; </td> <td valign="top"> Due Date: August 1, 2016<br/> </td> </tr> <tr> <td width="66%"> <a href="http://capss2017.nytud.hu/">Call for Participation -- Challenges in analysis and processing of spontaneous speech (CAPSS 2017)</a><br>&nbsp; </td> <td valign="top"> Submission Deadline: September 1, 2016<br/> Event Dates: May 15-17, 2017 </td> </tr> <tr> <td width="66%"> <a href="http://alffa.imag.fr/interspeech-2016-special-session-proposal/">Call for Participation -- Sub-Saharan African languages : from speech fundamentals to applications - Interspeech Special Session</a><br>&nbsp; </td> <td valign="top"> Event Date: September 6-12, 2016 </td> </tr> <tr> <td width="66%"> <a href="https://sites.google.com/site/yfrsw2016/home">Call for Participation -- A Workshop for Young Female Researchers in Speech Science & Technology (YFRSW2016) </a><br>&nbsp; </td> <td valign="top"> Event Date: September 8, 2016 </td> </tr> <tr> <td width="66%"> <a href="https://goo.gl/3XlF6j">First Call for Participation -- IWSLT 2016</a><br>&nbsp; </td> <td valign="top"> Submission Deadline: September 30, 2016 <br/> Event Dates: December 8-9, 2016 </td> </tr> <tr> <td width="66%"> <a href="http://www.journals.elsevier.com/neurocomputing/call-for-papers/special-issue-on-machine-learning-for-non-gaussian-data/">Call for Papers -- Neurocomputing Special Issue on Machine Learning for Non-Gaussian Data Processing</a><br>&nbsp; </td> <td valign="top"> Submission Deadline: October 15, 2016 </td> </tr> <tr> <td width="66%"> <a href="http://goo.gl/forms/Tcl1BhkBeQ">Call for Participation -- MediaEval 2016</a>[<a href="http://goo.gl/forms/Tcl1BhkBeQ">feedback</a>] [<a 
href="http://www.multimediaeval.org/">website</a>]<br>&nbsp; </td> <td valign="top"> Event Date: October 20-21, 2016 </td> </tr> </table> <hr> <h2>Book announcements</h2> <p>To post a book announcement, please email speechnewseds [at] listserv (dot) ieee [dot] org.</p> <h4>Distant Speech Recognition</h4> <p><b>Matthias W&ouml;lfel and John McDonough</b> <br>ISBN: 978-0-470-51704-8. January 2009. Hardback 584 pp. Published by Wiley.</p> <p>The performance of conventional Automatic Speech Recognition (ASR) systems degrades dramatically as soon as the microphone is moved away from the mouth of the speaker. This is due to a broad variety of effects such as background noise, overlapping speech from other speakers, and reverberation. While traditional ASR systems underperform for speech captured with far-field sensors, there are a number of novel techniques within the recognition system as well as techniques developed in other areas of signal processing that can mitigate the deleterious effects of noise and reverberation, as well as separating speech from overlapping speakers.</p> <p>Distant Speech Recognition presents a contemporary and comprehensive description of both theoretic abstraction and practical issues inherent in the distant ASR problem.</p> <p>Key Features:</p> <ul> <li>Covers the entire topic of distant ASR and offers practical solutions to overcome the problems related to it <li>Provides documentation and sample scripts to enable readers to construct state-of-the-art distant speech recognition systems <li>Gives relevant background information in acoustics and filter techniques <li>Explains the extraction and enhancement of classification relevant speech features <li>Describes maximum likelihood as well as discriminative parameter estimation, and maximum likelihood normalization techniques <li>Discusses the use of multi-microphone configurations for speaker tracking and channel combination <li>Presents several applications of the methods and technologies 
described in this book <li>Accompanying website with open source software and tools to construct state-of-the-art distant speech recognition systems </ul> <p>This reference will be an invaluable resource for researchers, developers, engineers and other professionals, as well as advanced students in speech technology, signal processing, acoustics, statistics and artificial intelligence fields.</p> <hr> <!-- TITLE --><h2>Young researchers face-to-face on human-machine dialogue</h2> <!-- BY-LINE --><h3>David Pardo, Milica Ga&#353;i&#263;, Joana Paulo Pardal, Ricardo Ribeiro, Matthew Marge and Fran&ccedil;ois Mairesse</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- --> <!-- DATE (month) --><p>October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <!-- <p> The fifth Young Researchers' Roundtable on Spoken Dialogue Systems gathered 41 researchers in academia and industry, from a wide variety of institutions at Queen Mary University in London last month.</p> --> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <p>The fifth Young Researchers' Roundtable on Spoken Dialogue Systems (YRRSDS'09) was held at Queen Mary University in London last month. The event - organized by young (and not so young!) researchers, most of whom were participants in the 2008 edition in Columbus, Ohio - brought together 41 researchers in academia and industry, from a wide variety of institutions and from several continents. Many had also participated in the SIGDial or Interspeech conferences with which the event is affiliated.</p> <p>The first roundtable was organized in Lisbon, Portugal in 2005 by the "Dialogs on Dialogs" group based at Carnegie Mellon University. 
Its goal was to provide dialogue system researchers in the early stages of their careers an opportunity to engage in multidisciplinary discussions (this being the nature of the field), to meet other researchers and be in touch with other research efforts, and to build bridges between the realms of academia and industry.</p> <p> This year's edition was organized in the same spirit. It featured talks by senior researchers in industry and academia and special sessions on career advice and the "grand challenges" of evaluating the performance of, and the user's experience with, dialogue systems. A demo and poster session also allowed some of the participants to present their work. All of the participants had previously submitted a position paper describing their current research and topics of interest. The proceedings of the event are available at <a href="http://www.yrrsds.org">www.yrrsds.org</a>.</p> <h4>Invited Speakers</h4> <p>Among the attendees were seven invited speakers. Philippe Bretier outlined a new methodology that is being used at Orange (formerly France Telecom) for modelling dialogue based on reinforcement learning. Tim Paek spoke of his personal motivations for working at Microsoft Research and contrasted them with what he misses from the academic world. Alan Black (Carnegie Mellon University), Michael McTear (University of Ulster) and Jason Williams (AT&amp;T Labs - Research) shared their career experiences and offered advice to young researchers seeking to find their way in academia or industry. Michael McTear stressed the importance of publishing in high quality journals and conferences, and pointed out that young researchers should be aware of reviewing criteria and also be prepared to resubmit their work. Alan Black discussed how making progress in academia can work differently from country to country, and spoke of his own experience combining work in both academia and industry. 
Jason Williams spoke of the relative importance of different factors in deciding where to seek employment, such as the people with whom one will work (very important) and location (not important).</p> <p>Sebastian M&ouml;ller (Deutsche Telekom Labs) was the first to approach the evaluation theme. Sebastian addressed the issue of automatic evaluation of the quality and usability of spoken dialogue systems, and how the problem is handled differently in academia and industry. Dan Bohus demonstrated the complexity of the problem of evaluation by presenting a list including many of the main metrics that have been proposed in recent years, from 'number of dialogue turns' to 'F-score for definite exophoric reference resolution'. He pointed out, as M&ouml;ller had before him, that performance metrics are not "quality" from the point of view of the user's experience. Dan proposed addressing such complexity by performing local evaluations on particular aspects, although questions of significance to the whole could then arise. Finally, Alan Black captured the imagination of the attendees when he invited them to take part in the <a href = "http://dialrc.org/sdc">Spoken Dialogue Challenge</a>: the Let's Go system developed at the Language Technologies Institute at CMU will be provided to the participants so that they can replace any module with their own and thus develop improvements in specific areas to be assessed with real users. Selected systems will run live in Pittsburgh so that real users can interact with them.</p> <h4>YRR group discussions</h4> <p>Discussion sessions are the heart of the roundtable. Participants split into groups to discuss subjects of interest at greater length. The more popular topics dealt with data collection and analysis, learning systems and user-system (mutual) adaptation, which may be an indication of some of the current trends in dialogue research. 
The difficulty (and perhaps even undesirability) of developing standards for the development of corpora was identified, as well as ethical barriers to the collection and sharing of data from test users. Tim Paek, SIGdial president, brought forward a proposal to have research groups gather together their data in a variety of domains (including human-human interaction) and make it available so that collaborative efforts of annotation, classification and perhaps further processing may be undertaken.</p> <p>The trend towards a greater focus on the user was apparent throughout the event. The general opinion was that systems should be sensitive to context, particularly to the users' personal and changing needs (informational, cognitive and even emotional) throughout the interaction. It was widely suggested that knowledge of human-human interaction should guide the development of human-machine interaction systems, especially those with multi-modal capabilities. The importance of evaluating dialogue with real users was stressed. It was noted that it is not at all trivial to measure user satisfaction, but emotions and level of engagement were suggested as promising indicators. The peculiarities of multi-language and multi-domain dialogue systems were also discussed.</p> <h4>YRRSDS 2010...</h4> <p>All in all, for the fifth year in a row, the roundtable has proven to be a nice platform for sharing knowledge and ideas among young researchers in a friendly atmosphere. The organizers are now passing the torch for the next year's Young Researchers' Roundtable, which will take place in Japan!</p> <h4>Acknowledgements</h4> <p>Thanks to Christine Howes, Arash Eshghi and Gregory Mills for their great help with organising this event.</p> <hr> <!-- TITLE --><h2>Hy Murveit, William Labov, Bruce Millar, and Wolfgang Hess talk to the Saras Institute</h2> <!-- BY-LINE --><h3>Olga Pustovalova, Antonio Roque, and Tadesse Anberbir</h3> <!-- INSTITUTION. 
This is only used on the front page, so put it in a comment here --><!-- Moldova State University, University of Southern California, Ajou University --> <!-- DATE (month) --><p>SLTC Newsletter, September 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>We continue the series of excerpts of interviews from the <a href="http://www.sarasinstitute.org/">History of Speech and Language Technology Project</a>. In these segments Hy Murveit, William Labov, Bruce Millar, and Wolfgang Hess discuss how they became involved with the field of speech and language technology. </p> <h4>Hy Murveit</h4> <p><i>Transcribed by Olga Pustovalova, Moldova State University, Moldova</i></p> <p><b>Q:</b> We are going to start off by asking a question that we have asked everybody, which is: how did you get into speech?</p> <p><b>A:</b> Well, I'm pretty sure I know the answer to that one: I was in Berkeley, and I got my Master's degree and my PhD in Berkeley. I got sent to Berkeley by Bell Labs, they generously paid for my Master's degree, and I was looking for a Master's project. And in Berkeley you have to do something to write your Master's thesis, and you have to figure out what that was. So basically you go from office to office, and talk to professors, and say, you know, 'what have you got?' I went to professor Messerschmitt's office, Dave Messerschmitt, who taught a lot of signal processing at Berkeley, probably still does, and I remember he just suggested speech recognition, and he had, I forget which book, it was some kind of cepstral analysis chapter. Dave is a great guy, and I am sure I would have done fine with Dave, but for whatever reason I kept on walking and also talked to Bob Brodersen who became my thesis advisor. So Dave sort of lit this idea of speech recognition in me, but with Brodersen I connected... 
I think it was I walked in his office, and there was a pile of two feet of paper on his desk, and the place was a mess, and I said, 'That's me! I would do that too.' (laughs) </p> <p>So Bob was an IC guy, not really a speech recognition guy at all, but you know, you've got to build chips to do something. If the goal is to build integrated circuits, you've got to find an application for 'em, and he was looking for one in speech recognition, it was probably also in his mind. So that's what I wound up doing for my PhD, but first for my Master's, and so I connected with Bob, but I know it was Dave Messerschmitt who first planted the seed, I remember that vividly... Nelson Morgan from ICSI -- he was also, by the way, a graduate student at the same time -- and he says the same thing about how he saw the mess in Bob's office and said, 'This must be the advisor for me,' too. I don't know if Bob is that messy anymore, and I know Morgan is not that messy anymore, but I still am.</p> <p><b>Q:</b> That's great. And why don't you just tell us a little bit about what you did from there... and then how you proceeded after you got your doctorate.</p> <p><b>A:</b> Bob's focus in these days was Special Purpose Integrated Circuits... the focus was if you make special purpose integrated circuits they could outperform computers of those days, and perhaps be done cheaper, and the idea would be to push this kind of intelligence to the periphery and have all sorts of smart things and devices that could go around. So the goal was to make a speech recognition chip, and I became sort of the algorithm guy in the group, because somebody actually had to figure out how to do speech recognition, and other people were going to build the chips. I was sort of the software guy too, you know it was in the early days. This was 1978, I think, or probably the summer '78 when I started, because I got to Berkeley in September '77, and I got my thesis eventually in '83. 
So believe it or not in like '78 most of the grad students in electrical engineering didn't program, you know, few of them started programming. But I was actually programming, and stuff like that, so I was the sort of a software guy, and also I did experiments on dynamic time warp speech recognition, and learned a lot about that. There wasn't really an expertise in that when I joined Bob's group, but I sort of learned it and he's a smart guy, he may not have worked in the field, but he helped me a lot and got me connected with the right people. He introduced me to George White, who at the time was doing a lot of speech recognition work and George sent me code, and coached me in speech recognition. Anyway, so, eventually, I figured out with their help the dynamic time warping technology, and we built a chip that did a thousand word, continuous speech DTW, level building kind of recognizer.</p> <h4>William Labov</h4> <p><i>Transcribed by Antonio Roque, University of Southern California</i></p> <p><b>Q:</b> One of the questions we've asked everybody is how they got into the field. And I notice that on your website you actually have a <a href="http://www.ling.upenn.edu/~wlabov/HowIgot.html">very interesting essay</a> on that very point.</p> <p><b>A:</b> Yeah, I wrote it for undergraduates, but it's been pretty widely read. Basically I came to Harvard as an undergraduate. My advisor first said to me 'Where did you get this idolatry of science?' Because I was taking this one course, Chemistry B, and I walked out and I said, 'That man is intelligent, how did he know that I have an idolatry of science? But I do. And I think I'd like to be a part of it.' But I was majoring in English and Philosophy, and I went to work for a company... and I spent 11 years as a practical chemist formulating silkscreenings, and I was pretty good at it, but it gave me two kinds of experience. One, I learned that there's such a thing as being right or wrong. 
And if you spray a panel with a coating that you hope will last outdoors and you come back six months later and it's all cracked up you know you're wrong. You may not know why. But it gave me a sense of knowing that there's a reality out there that can prove you right or wrong. And I think that's different from a number of my colleagues who spent their lives in the university and they've never been tested that way. And when I decided to leave that field, it was because it didn't have the generality, I couldn't publish any of my results though I thought I was pretty good at what I was doing. </p> <p>So I went back to this field of linguistics, and I remember writing an essay, a notebook full of notes, essays for an experimental science, and I found that the whole field, because these are very bright people arguing with each other vigorously about interesting questions, but the data was entirely their own reactions or what they got in a book, or formal elicitations from people, and all around me I saw people walking and talking and it occurred to me that we could build a science based upon the actual data from everyday life. And that's what I tried to do and I think it's still a minor part of linguistics, but it's had some continuity. So the whole idea is that you want to be tested by the world and find out with hard data whether you're right or wrong. So that's one part of it. And that also involved quantification, because there's so much variation in everyday life, so we had to introduce mathematical tools, and another way to look at it is this part of linguistics, at least dealing with change and variation, has become a quantitative study with a number of fairly sophisticated tools, though I don't think it's possible for all of linguistics to become quantitative, because it's not likely. 
But we have now the 34th annual meeting on new ways of analyzing variation, and it's become a major part of the field.</p> <p>Now it just so happened that the second big research project we got funded from was from the Office of Education, and we were trying to answer the question, 'Is there any connection between the reading difficulties, reading failure, in the inner cities, and the difference in language between Black and White population?' So I got involved in that enterprise and some very interesting results about the difference between Black and White and showed that African-American English, as we now call it, is a very different system, and very systematic, but we didn't actually succeed in improving reading levels, instead we developed a whole line of scientific work, of which the work on African-American English is a very large part. But we keep returning to the question, 'Can we use our knowledge to change and improve the world?' And since I'm interested in language change and variation, I've done studies in a number of areas to show how language is changing, it became natural to ask, 'What are the practical applications of this kind of work?' And immediately you can see that speech recognition is an area that is important. The little bit that I knew about it, I knew that the problem for speaker independent recognition the big problem was dialect diversity. And gradually we've gotten a stronger and stronger hold on these topics and now we're just publishing this atlas of North American English, which covers all of North America for the first time, and what we found and continue from the earlier findings is linguistic change is very active and increasing diversity is the rule rather than increasing convergence. 
So that the dialects of North America are more different from each other than they ever were.</p> <h4>Bruce Millar</h4> <p><i>Transcribed by Antonio Roque, University of Southern California</i></p> <p><b>Q:</b> [How did you get into the field of speech research?]</p> <p><b>A:</b> I didn't realize it at the time, but I actually went for an interview with the scientific civil service, in Britain, and ended up being interviewed by Walter Lawrence, of PAT fame and speech synthesis. That didn't have a great impact at the time, but I was later that year given an opportunity to do a PhD in a new multidisciplinary program at the University of Keele in the UK, under the supervision of Professor Donald Mackay. He was very much into human sensory processing of various kinds, with a heavy emphasis of his work in vision. It just happened that he had a postdoc who had previously done a PhD under him in the vision area, and was returning from a postdoc in the US, and this was Bill Ainsworth, who recently has left us unfortunately. And so we put together a small speech group and I started working in 1964, looking at very primitive forms of speech analysis. We had no computers then, we had simply the best we could do with analog electronics. Not being even an expert in electronics, but learning as I went along, really I got involved in doing things which were little more than the processing of microphone signals, looking at the time interval structure of speech, and generating patterns on cathode ray oscilloscopes, using photography, and then visual comparison between sounds and looking at the variability that occurred and the consistency that also occurred with different vowel sounds. And it was mostly vowel-based analysis at that stage. So that was how I got into it, and I completed a PhD in 68 in that area.</p> <p><b>Q:</b> Great. So you mentioned analog hardware. 
What kind of hardware?</p> <p><b>A:</b> We were just on the change from valve-based electronics into transistor work, in fact there was very little transistor expertise around in the laboratory where I was working, so it was basically looking at measurements of the time interval structure of clipped speech.</p> <p><b>Q:</b> Of clipped speech? Why were you interested in clipped speech?</p> <p><b>A:</b> Well, I mean the earlier work by Licklider and co. who had showed the intelligibility of clipped speech so we knew there was information there. Of course we may rather naively at that point thought, 'Ah, it must be in the time intervals,' ignoring the fact that there's a fair bit of formant information available, and the formant harmonics by the clipping. But it was really driven by the limitations on how we could process things. I don't recall there being anything, not at that stage, even like filter banks around for us to use. And I guess funding wasn't all that plentiful. So raw research students were thrown into the lab with a bundle of components and boards and batteries and good wishes to see what you could do. So we did this. And we published a paper in 1965. My first exposure to the international community was in 1965, which looking back was really quite early. We went to the International Congress on Acoustics in Liege, and was rather, again looking back, amazed to realize that a fellow student and I actually managed to sit next to Homer Dudley on a bus on an outing, and so we had an opportunity to connect with one that we now recognize as a leader in the field, way before we actually got into it. Also Jim Flanagan was there as a relatively, not junior, but certainly rising star, so I remember those people.</p> <h4>Wolfgang Hess</h4> <p><i>Transcribed by Tadesse Anberbir, Ajou University, Korea</i></p> <p><b>Q:</b> It is a great honor to be here with Professor Wolfgang Hess, and we appreciate your coming and agreeing to be interviewed. 
We are asking everybody the same question as a starting question, which is: How did you get into this field? What brought you into the field of speech and language? </p> <p><b>A:</b> My diploma thesis. I studied Electrical Engineering at the University of Stuttgart in the 60s, and I was a student of the late Eberhard Zwicker, who was involved in hearing research and he had developed a functional model of the ear which was, say, crudely speaking something like a channel vocoder or a filter bank with filters adapted to the characteristics of the critical bands. And he wanted to prove that this device was able to give a good preprocessor for speech recognition and especially for the recognition of 10 digits. And they looked for students to implement an interface between their analog filter and a computer to process these things on that.</p> <p><b>Q:</b> Which computer?</p> <p><b>A:</b> That was an absolutely old one it was called ER56, a development by ALCATEL, at that time SEL, Standard Elektrik Lorenz in Germany, which still worked with valve tubes and had something like a memory of 10k and it worked with punched cards or punched tapes. And, so... it could be only programmed in machine code. Well, it was my task to make this interface and then I also had to be contacting the Engineer who was in charge of that computer. Then he told me, 'Why are you going and trying to make all this like some kind of time normalization which would be necessary to detect the beginning and the ending of these numbers, why are you doing that trying to do that in hardware? Come here and program it.' And so, as a diploma thesis, I made the device just fill in the frames into the computer, and did the rest by software. And that what I did as a, say, six month research project, and I've stayed in that field ever since.</p> <h4>Acknowledgements and more information</h4> <p>These interviews were conducted by Dr. 
Janet Baker in 2005 and were transcribed by members of <a href="http://www.isca-students.org/">ISCA-SAC</a> as described previously: <a href="http://www.ewh.ieee.org/soc/sps/stc/News/NL0801/SARAS.htm">1</a>, <a href="http://www.ewh.ieee.org/soc/sps/stc/News/NL0805/NL0805-SARAS2.htm">2</a>, <a href="http://www.ewh.ieee.org/soc/sps/stc/News/NL0807/NL0807-SARAS3.htm">3</a>, <a href="http://www.ewh.ieee.org/soc/sps/stc/News/NL0810/NL0810-SARAS4.htm">4</a>, <a href="http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-04/saras/">5</a>, <a href="http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-07/pols-karttunen-haton-saras/">6</a>. Sunayana Sitaram of ISCA-SAC coordinated transcription efforts.</p> <hr> <!-- TITLE --><h2>The INTERSPEECH 2009 Emotion Challenge: Results and Lessons Learnt</h2> <!-- BY-LINE --><h3>Bjoern Schuller, Stefan Steidl, Anton Batliner, and Filip Jurcicek</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AT&T Labs Research --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>The <a href="http://emotion-research.net/sigs/speech-sig/emotion-challenge">INTERSPEECH 2009 Emotion Challenge</a>, organised by Bjoern Schuller (TUM, Germany), Stefan Steidl (FAU, Germany), and Anton Batliner (FAU, Germany), was held in conjunction with <a href="http://www.interspeech2009.org/conference/programme/session.php?id=6810">INTERSPEECH 2009</a> in Brighton, UK, September 6-10. This challenge was the first open public evaluation of speech-based emotion recognition systems with strict comparability where all participants were using the same corpus. The German FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech of 51 children served as a basis. 
The corpus clearly defined test and training partitions incorporating speaker independence and different room acoustics, as needed in most real-life settings.</p> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <p>Three sub-challenges (Open Performance, Classifier, and Feature) addressed classification of five non-prototypical emotion classes (anger, emphatic, neutral, positive, remainder) or two emotion classes (negative, idle). The Open Performance Sub-Challenge allowed contributors to find their own features with their own classification algorithms. However, they had to stick to the definition of test and training sets. In the Classifier Sub-Challenge, participants designed their own classifiers and had to use a large set of standard acoustic features, computed with the <a href="http://fr.sourceforge.jp/projects/sfnet_opensmile/">openSMILE toolkit</a> provided by the organisers. Participants had the option to subsample, alter, and combine features (e.g. by standardisation or analytical functions). Training could be bootstrapped, several classifiers could be combined by techniques such as <a href="http://en.wikipedia.org/wiki/Ensemble_learning">Ensemble Learning</a>, and side tasks such as gender could be learned. However, the audio files could not be used for additional feature extraction in this task. In the Feature Sub-Challenge, participants were encouraged to design the 100 best features for emotion classification, to be tested by the organisers in an equivalent setting. In particular, novel, high-level, or perceptually adequate features were sought after.</p> <p>Participants did not have access to the labels of the test data, and all learning and optimisation was based only on the training data. 
However, each participant could upload instance predictions to receive the confusion matrix and results from the test data set up to 25 times. The format contained instance and prediction, and optionally additional probabilities per class. This later allowed a final fusion by majority vote of predicted classes of all participants' results to demonstrate the best possible performance of the combined efforts. As classes were unbalanced, the measure to optimise was primarily unweighted average (UA) recall, and secondarily accuracy. UA recall is the mean of the per-class recalls, so each class counts equally regardless of its number of instances. The choice of unweighted average recall was a necessary step to better reflect imbalance of instances among classes in real-world emotion recognition, where an emotionally &quot;idle&quot; state usually dominates. Other well-suited and interesting measures, such as <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">the area under the receiver operating characteristic (ROC) curve</a>, were also considered; however, they were not used as they were not yet common measures in the field at the time.</p> <p>The organisers did not take part in the sub-challenges but provided baselines using the two most popular approaches. First, the <a href="http://www.cs.waikato.ac.nz/ml/weka/">WEKA toolkit</a> and Support Vector Machines were used. Second, the <a href="http://htk.eng.cam.ac.uk/">HTK toolkit</a> was used to train Hidden Markov Models. They intentionally used standard tools so that the results were reproducible.</p> <p>All participants were encouraged to compete in multiple sub-challenges and each participant submitted a paper to the INTERSPEECH 2009 Emotion Challenge Special Session. The results of the challenge were presented at a Special Session of Interspeech 2009 and the winners received their awards from the organisers at the closing ceremony. 
Three prizes (each 125 GBP, sponsored by the HUMAINE Association and the Deutsche Telekom Laboratories) could be awarded subject to the following preconditions: 1) the awarded paper was accepted to the special session after the INTERSPEECH 2009 general peer-review, 2) the provided baseline (67.7% and 38.2% UA recall for the two- and five-class tasks) was exceeded, and 3) the best result in the sub-challenge and the task was achieved.</p> <p>The Open Performance Sub-Challenge Prize was awarded to Pierre Dumouchel et al. (Universit&eacute; du Qu&eacute;bec, Canada) for their victory in this sub-challenge: they had managed to obtain the best result (70.29% UA recall) in the two-class task, significantly ahead of their eight competitors. The best result in the five-class task (41.65% UA recall) was achieved by Marcel Kockmann et al. (Brno University of Technology, Czech Republic), who surpassed six further results and were also awarded the Best Special Session Paper Prize, as their paper had received the highest reviewers' score.</p> <p>The Classifier Sub-Challenge Prize was given to Chi-Chun Lee et al. (University of Southern California, USA) for their best result in the five-class task, ahead of three further participants. In the two-class task, the baseline was not exceeded by either of the two participants.</p> <p>Regrettably, no award could be given in the Feature Sub-Challenge. None of the feature sets provided by the three participants in this sub-challenge exceeded the baseline feature set provided by the organisers.</p> <p>Overall, the results of all 17 participating sites were often very close to each other, and significant differences were as rare as one might expect in such a close competition. However, by the &quot;democratic&quot; fusion of all participants' results, the performance exceeded all of the individual results: 71.16% and 44.01% UA recall for the two- and five-class tasks. 
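</p> <p>The two ideas behind these figures, UA recall as a measure that is robust to class imbalance and fusion of several systems by majority vote, can be illustrated with a minimal Python sketch. It is not the organisers' evaluation code: the function names, the toy labels (&quot;I&quot; for idle, &quot;N&quot; for negative) and the tie-breaking rule in the vote are all assumptions made for illustration.</p>

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances labelled correctly (dominated by frequent classes)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def majority_vote(predictions):
    """Fuse several systems' label sequences by per-instance majority vote.

    Assumption: ties go to the earliest-listed system among the tied labels.
    """
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        fused.append(next(v for v in votes if counts[v] == best))
    return fused


# Hypothetical toy data: 8 idle ("I") and 2 negative ("N") instances.
y_true = ["I"] * 8 + ["N"] * 2

# A trivial system that always predicts the dominant class.
always_idle = ["I"] * 10

# Three hypothetical systems that make errors on different instances.
sys_a = ["I", "I", "I", "I", "I", "I", "N", "N", "N", "I"]
sys_b = ["I", "I", "N", "N", "I", "I", "I", "I", "I", "N"]
sys_c = ["N", "N", "I", "I", "I", "I", "I", "I", "N", "N"]

fused = majority_vote([sys_a, sys_b, sys_c])

print("always idle:", accuracy(y_true, always_idle),
      unweighted_average_recall(y_true, always_idle))
for name, pred in [("A", sys_a), ("B", sys_b), ("C", sys_c), ("fused", fused)]:
    print(name, unweighted_average_recall(y_true, pred))
```

<p>On this toy data the always-idle system reaches an accuracy of 0.8 but a UA recall of only 0.5, which is why accuracy alone would reward ignoring the rare class. Because the three hypothetical systems err on different instances, their majority vote has a higher UA recall than any one of them, the same effect as reported for the fusion of the participants' results.</p> <p>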
The general lesson learned thus is &quot;together we are best&quot;: apparently the different feature representations and learning architectures perform best in combination. In addition, the challenge clearly demonstrated the difficulty of dealing with a real-life non-prototypical emotion recognition scenario - this challenge remains.</p> <p>The organisers plan to make the corpus used for the challenge publicly available. They also suggest that future challenges should consider evaluations across age groups, languages, and cultures; noisy, reverberated, or transmission-corrupted speech; multimodal sources; and naturally many further aspects and related topics such as non-linguistic vocalisations or automatic speech recognition of emotional speech.</p> <hr> <!-- TITLE --><h2>Overview of Sigdial 2009 Conference</h2> <!-- BY-LINE --><h3>Svetlana Stoyanchev</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- Open University --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>On September 11-12, Queen Mary College in London hosted Sigdial, held for the first time as a conference. This year Sigdial had 125 participants and 104 submissions. There were 24 oral and 24 poster presentations, three demos, and two invited talks. This article gives an overview of the topics, invited talks, and awards presented at the conference. </p> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <h4>Overview</h4> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <!-- EXAMPLE BIBLIOGRAPHY --> <p> The conference gathered dialogue researchers from all over the world: USA, Canada, Netherlands, Germany, UK, France, Japan, and others. The majority of papers presented this year were from North America. 
</p> <p> Topics featured at the conference included work on spoken dialogue systems as well as analysis and modeling of human conversations. The majority of papers focused on spoken dialogue systems addressing the problem of speech recognition and user utterance interpretation. A few papers on dialogue systems focused on information presentation, generation, or giving useful feedback to improve system performance (including Paksima et al.). </p> <p> One of the Sigdial focus areas this year was research on meetings, such as detection of noteworthy utterances, summarizing, and facilitating virtual meeting participants (including the papers of Akker et al., Niekrasz and Moore, Banerjee and Rudnicky, Bui et al.). Another focus area was multiparty and multimodal communication, such as a system assistant to virtual meeting participants or determining when a virtual meeting participant is being addressed. Research involving tutoring systems also had a significant presence at the conference (including Dzikovska et al., Rotaru and Litman, Dohsaka et al.). </p> <p> The best paper award was presented to Dan Bohus from Microsoft for the work on a multi-modal system that predicts user engagement in a multi-party situation. The system plays the role of a virtual secretary in a lobby of a Microsoft building. Using video and audio input, image processing for detection of faces and their orientation, the system predicts whether a user should be addressed by a virtual secretary. You can view a <a href="http://research.microsoft.com/en-us/um/people/dbohus/videos/SIdemo.wmv"> video of the system in action </a>. </p> <p> The best student paper award was presented to Luciana Benotti from Universit&eacute; Henri Poincar&eacute;, France for her paper on the clarification potential of instructions. In her work Luciana analysed clarification questions asked in a human-human task-oriented dialogue. 
</p> <h4> Invited Speakers</h4> <p> The invited speakers this year were Psychology professor <a href="http://web.uvic.ca/psyc/bavelas/">Janet Bavelas</a> from the University of Victoria and Computer Science professor <a href="http://www.dcs.shef.ac.uk/~yorick/">Yorick Wilks</a> from the University of Sheffield. </p> <p> Janet Bavelas discussed the uniqueness of dialogue. She raised an interesting question: 'what is dialogue?' How much feedback from the audience/addressees does a speaker need in order to classify speech as a dialogue? In her experiments Dr. Bavelas uses three scenarios with different levels of audience feedback. The two dialogue scenarios involved face-to-face communication with full multimodal feedback and telephone communication with only audio channel feedback. The monologue scenario involved speech into a dictaphone with no addressee feedback. Dr. Bavelas's experimental results show that gestures, figurative language, facial displays, and direct quotations are unique to dialogue situations with or without video feedback. </p> <p> Professor Yorick Wilks described a <a href="http://www.companions-project.org/">companion project </a>, a 4-year project that has completed its second year. A companion has speech recognition, speech synthesis, and internet connectivity capabilities. A companion can be a system that learns and records your experiences throughout your life. A companion can learn from pictures through verbal interaction with a user and build the user's illustrated family map. A companion may be used to entertain a lonely elderly person by pulling information from the web, news, or TV shows. A companion can be a positive motivator for a healthy lifestyle, such as good diet and exercise. A sample conversation with an automatic system companion can be <a href="http://www.youtube.com/watch?v=KQSiigSEYhU"> viewed on YouTube</a>. Professor Wilks described challenges and achievements of the ongoing Companion initiative. 
</p> <p> Questions about the evaluation of spoken dialogue systems were, as always, a topic of discussion at Sigdial. Researchers at Carnegie Mellon announced an open Spoken Dialogue Challenge that could serve as an evaluation mechanism for dialogue system technologies. The Dialogue Challenge invites researchers to participate in building and evaluating dialogue systems in the bus information domain, providing information such as schedules and route information for the city of Pittsburgh's Port Authority Transit (PAT) buses. The researchers are given an option to build their own system or to use and modify the existing Let's Go bus information system. By taking advantage of the existing system, the researchers may save time on system building and would be able to focus on their specific research question. See more details on the <a href="http://www1.atwiki.com/dialoguechallenge/">Dialogue Challenge wiki page</a>. </p> <h4>Business Meeting Announcements</h4> <p> Participants made several suggestions for innovations in the future Sigdial conferences. One of the suggestions was to introduce a mentoring/guidance service to help foreign authors. Another idea proposed expanding future conferences to 3 days. </p> <p> A new <a href="http://www.dialogue-and-discourse.org/">journal on Dialogue and Discourse</a> was announced at the business meeting. </p> <p> Sigdial's web page has moved to a new domain: <a href="http://www.sigdial.org"> http://www.sigdial.org</a>. The new website is designed to be interactive and to facilitate discussion between researchers. Everyone is invited to register, participate in discussions on the topics of Corpora, Tools and Methodologies, Discourse Processing and Dialogue Systems, Semantic and Pragmatic Modeling, or to add new discussion topics. 
</p> <p> For references, please see <a href="http://www.sigdial.org/workshops/">the online Sigdial proceedings.</a> </p> <p><i>Svetlana Stoyanchev is a research associate at the Open University, Computing Department. Her interests are dialogue systems, adaptation in dialogue, and conversion of text to dialogue for presentation. Email: s.stoyanchev@open.ac.uk</i></p> <hr> <!-- TITLE --><h2>Report of the Blizzard Challenge 2009 Workshop</h2> <!-- BY-LINE --><h3>Tomoki Toda</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- Nara Institute of Science and Technology --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p> This article provides a report of the Blizzard Challenge 2009 Workshop, an annual speech synthesis event that took place on September 4th, 2009 in Edinburgh, UK. </p> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <!-- 200 - 1000 words--> <p> The Blizzard Challenge has been devised in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer are then evaluated through listening tests. </p> <p> The Blizzard Challenge 2009 was the fifth annual Challenge. This year, not only the main tasks, called <b>hub</b> tasks, but also several other tasks, called <b>spoke</b> tasks, were conducted for two languages, UK English and Mandarin Chinese. More details of individual tasks are as follows: <ul> <li><b>Hub tasks in UK English</b> <ul> <li><b>EH1</b>: build a voice from the full UK English database <ul> <li>A voice was built using all the data of a single UK English male speaker. 
The speech corpus size was about 15 hours. <li>Synthetic voices were submitted from 14 participants. </ul> <li><b>EH2</b>: build a voice from the specified <i>ARCTIC</i> subset of the UK English database <ul> <li>A voice was built using only the <i>ARCTIC</i> subset, whose size was about 1 hour. <li>Synthetic voices were submitted from 15 participants. </ul> </ul> <li><b>Spoke tasks in UK English</b> <ul> <li><b>ES1</b>: build voices from the specified small datasets <ul> <li>Voices were built using three small subsets from the <i>ARCTIC</i> subset: (1) only the first 10 sentences, (2) the first 50 sentences, and (3) the first 100 sentences. <li>Voice conversion, speaker adaptation techniques or any other technique were allowed to be used. <li>Synthetic voices were submitted from 6 participants. </ul> <li><b>ES2</b>: build a voice suitable for synthesizing speech to be transmitted via a telephone channel <ul> <li>A voice was built using the full UK English database suitable for synthesizing speech to be transmitted via a telephone channel. <li>A telephone channel simulation tool was available to assist in system development. <li>Synthetic voices were submitted from 9 participants. </ul> <li><b>ES3</b>: build a voice suitable for synthesizing the computer role in a human-computer dialogue <ul> <li>A voice was built using the full UK English database suitable for synthesizing the computer role in a human-computer dialogue. <li>A set of development dialogues was provided. The test dialogues were from the same domain. <li>Although participants couldn't change any of the words in the sentences to be synthesized, they were allowed to add simple markup to the text, either automatically or manually, provided that such markup could be produced by a text-generation system; e.g., emphasis tags would be acceptable, but a handcrafted F0 contour would not. <li>Synthetic voices were submitted from 2 participants. 
</ul> </ul> <li><b>Hub task in Mandarin Chinese</b> <ul> <li><b>MH</b>: build a voice from the full Mandarin database <ul> <li>A voice was built using all the data of a young female professional radio broadcaster. The speech corpus size was about 10 hours. <li>Synthetic voices were submitted from 9 participants. </ul> </ul> <li><b>Spoke tasks in Mandarin Chinese</b> <ul> <li><b>MS1</b>: build voices from the specified small datasets <ul> <li>Voices were built using three small subsets from the full Mandarin database, (1) only the first 10 sentences, (2) the first 50 sentences, and (3) the first 100 sentences. <li>Voice conversion, speaker adaptation techniques or any other technique were allowed to be used. <li>Synthetic voices were submitted from 5 participants. </ul> <li><b>MS2</b>: build a voice suitable for synthesizing speech to be transmitted via a telephone channel <ul> <li>A voice was built using the full Mandarin database suitable for synthesizing speech to be transmitted via a telephone channel. <li>A telephone channel simulation tool was available to assist in system development. <li>Synthetic voices were submitted from 6 participants. </ul> </ul> </ul> <p>Several listening tests such as an opinion test on naturalness, an opinion test on similarity, and an intelligibility test were conducted independently for each task. The following 4 systems were also evaluated as benchmark systems: 1) <b>natural speech</b>; 2) <b>Festival</b>: concatenative speech synthesis system based on unit selection; 3) <b>HTS2005</b>: speaker-dependent HMM-based speech synthesis system; and 4) <b>HTS2007</b>: speaker-adaptive HMM-based speech synthesis system. There were <b>19 teams</b> from around the world in the challenge. </p> <p> The Blizzard Challenge 2009 Workshop took place in the Centre for Speech Technology Research, the University of Edinburgh as a satellite event of Interspeech 2009. There were around 70 attendees! 
(This number of attendees was surprisingly comparable to that of the 5th ISCA Speech Synthesis Workshop (SSW5) in Pittsburgh, 2004!) The workshop started with an overview and summary of results given by Dr. Simon King. Each team then gave a 15-minute talk presenting its system. After the system presentations, attendees held a general discussion about the Blizzard Challenge, and continued their exchanges at a pub after the workshop. It was truly a wonderful day! </p> <p> Corpus-based speech synthesis techniques have improved dramatically over the past several years. It may be no exaggeration to say that the Blizzard Challenges have contributed substantially to these improvements. We have learned a great deal from the challenges: we have found that statistical parametric speech synthesis, such as HMM-based speech synthesis, is effective and has tremendous potential for providing a very flexible synthesis framework; we have reaffirmed the effectiveness of concatenative speech synthesis based on unit selection; and we have organized our thoughts about the relationship between these two main approaches to corpus-based speech synthesis. Current techniques enable development of a general-purpose TTS system capable of synthesizing quite natural speech. Some speaking styles are also achieved well. However, much remains to be improved. In particular, it is worthwhile to develop speech synthesis techniques that provide appropriate speaking styles according to the demands of various speech applications. In this challenge, only 2 participants submitted voices in the <b>ES3</b> task, to build a voice suitable for synthesizing the computer role in a human-computer dialogue. It is expected that more research activity will soon be focused on the development of these techniques. </p> <p>Acknowledgements:</p> <ul> <li>Thanks to Dr. Simon King of CSTR, University of Edinburgh, for providing detailed data of the Blizzard Challenge 2009. 
</ul> </p> <!-- EXAMPLE BIBLIOGRAPHY --> <p>For more information, see: <ul> <li><a href="http://www.synsig.org/index.php/Blizzard_Challenge_2009_Workshop">Blizzard Challenge 2009 Workshop - SynSIG</a> <li><a href="http://www.synsig.org/index.php/Blizzard_Challenge_2009">Blizzard Challenge 2009 - SynSIG</a> <li><a href="http://www.festvox.org/blizzard/index.html">Papers and Results of Previous Blizzard Challenges</a> <li><a href="http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2009-04/blizzard/">&quot;The Blizzard Challenge&quot;, SLTC Newsletter, April 2009, by Alistair Conkie</a> </ul> </p> <p>If you have comments, corrections, or additions to this article, please contact the author: Tomoki Toda, tomoki [at] is [dot] naist [dot] jp.</p> <!-- OPTIONAL: INFORMATION ABOUT THE AUTHOR(S). OK TO INCLUDE LITTLE BIO, CONTACT INFO, RECENT WORK, RECENT PUBLICATIONS, ETC. USE <p><i> ... </i></p> --> <p><i> Tomoki Toda is an Assistant Professor at the Graduate School of Information Science, Nara Institute of Science and Technology, Japan. His interests are statistical approaches to speech processing. Email: tomoki@is.naist.jp </i></p> <hr> <!-- TITLE --><h2>NIST Conducts Rich Transcription Evaluation</h2> <!-- BY-LINE --><h3>Satanjeev "Bano" Banerjee</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AT&T Labs Research --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <!--<p>Abstract: NIST recently concluded the latest edition of their Rich Transcription evaluation exercise series. The results of the exercise show that automatic transcription of overlapping speakers with distant microphones remains a difficult task, with little improvement over the previous evaluation conducted in 2007.</p>--> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. 
--> <!-- <h4>Overview</h4> --> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <p>The National Institute of Standards and Technology (NIST) recently conducted the 2009 Rich Transcription Evaluation - a public exercise to evaluate the current state of the art in automatic speech transcription. It was the latest in a series of such evaluations that started in 2002. The ultimate goal of this exercise series is to evaluate "rich transcription" - a process that not only converts audio into a stream of words, but also captures other information in the speech such as speaker identity, prosody, disfluencies, etc. Rich transcription is expected to make speech transcriptions more useful both to humans and to downstream processes; it is likely to be particularly useful in the context of multi-party human-human dialog, such as meetings.</p> <p>As a step towards the ultimate goal of rich transcription, the just-concluded evaluation exercise featured three tasks:</p> <ul> <li>Speech-to-Text: Converting the audio into a sequence of words.</li> <li>Speaker Diarization: Detecting the speaker(s) at all points in time during the meeting.</li> <li>Speaker-Attributed Speech-to-Text: Assigning each transcribed word to one of the speakers.</li> </ul> <br/> <p>The audio used for evaluation consisted of meeting-room recordings only (unlike the previous edition of the exercise, held in 2007, in which data from broadcast news and conversational telephone speech was also used). The test audio was provided by the University of Edinburgh and the IDIAP Research Institute (60 minutes each of non-American English speech) and NIST (60 minutes of American English speech). As is typical of natural meetings, some of the data consisted of overlapping speech from multiple speakers; participants were required to transcribe the separate overlapping words. No limitation was placed on the time taken to produce the transcripts. 
System outputs were evaluated using the Word Error Rate metric - the number of word tokens inserted, deleted or substituted by the system (as compared to the reference transcription), divided by the total number of tokens in the reference. Speaker diarization was evaluated using Diarization Error Rate - the percentage of the meeting time that was attributed to an incorrect speaker. Speaker-attributed speech-to-text was evaluated using Word Error Rate, with the added constraint that a word was deemed to be correctly transcribed only if it was attributed to the right speaker.</p> <p>Eight teams participated in one or more of these three tasks:</p> <ul> <li>AMI (Augmented Multi-party Interaction): University of Sheffield, IDIAP, University of Edinburgh, University of Technology Brno, and University of Twente.</li> <li>I2R/NTU: Institute for Infocomm Research and Nanyang Technological University.</li> <li>FIT: Florida Institute of Technology.</li> <li>ICSI: International Computer Science Institute.</li> <li>LIA/Eurecom: Laboratoire Informatique d'Avignon/Ecole d'ing&eacute;nieurs et centre de recherche en Syst&egrave;mes de Communications.</li> <li>SRI/ICSI: SRI International and International Computer Science Institute.</li> <li>UPM: Universidad Polit&eacute;cnica de Madrid.</li> <li>UPC: Universitat Polit&egrave;cnica de Catalunya.</li> </ul> <br/> <p>The AMI, FIT and SRI/ICSI teams took part in the Speech-to-Text task, every team except FIT and SRI/ICSI took part in the speaker diarization task, and the AMI and the SRI/ICSI teams participated in the speaker-attributed speech-to-text task. The primary input condition was produced by three distant microphones placed in the center of the meeting-room table. 
Participants also had the option of comparing their results on this primary condition to other conditions including speech collected using array microphones and individual head-mounted microphones.</p> <p>Using multiple distant microphones and with at most 4 overlapping speakers, the word error rate of the systems was between 40 and 50%. The error rate was close to 30% for meeting segments in which the number of speakers was at most 1, and around 25% for the individual head-mounted microphone condition. These evaluation numbers are similar to the numbers achieved in the 2007 Rich Transcription (RT-07) exercise. For speaker diarization, there was a wide range of results among the six participants - from less than 10% Diarization Error Rate to over 30% in the multiple distant microphone condition. Results were slightly worse for both the array microphone condition and the single distant microphone condition. An analysis of the errors revealed that most of the errors were from attributing speech to the wrong speaker, rather than not detecting or falsely detecting speech. Compared to RT-07, the results were in the same range, although there was much variation from meeting to meeting in the test set. Unlike previous years, participants were allowed to use video recorded at the meeting to help with the diarization - this resulted in about 5% absolute improvement in the diarization error rate in the single distant microphone condition. Finally, the speaker-attributed speech-to-text word error rate was close to 50% for the multiple distant microphone condition, and around 60% for both the microphone array and the single distant microphone conditions.</p> <p>The main conclusion of the exercise was that the results represented little or no improvement over those obtained during RT-07. 
It was also clear that the distant microphone condition remains significantly more difficult than the close-talking microphone condition, and that more research is needed to improve speech recognition using distant microphones.</p> <!-- EXAMPLE BIBLIOGRAPHY --> <p>For more information, see:</p> <ul> <li><a href="http://www.itl.nist.gov/iad/mig/tests/rt">Webpage of NIST's Rich Transcription Evaluation project.</a></li> <li><a href="http://www.itl.nist.gov/iad/mig/tests/rt/2009/workshop/RT09-Agenda.htm">The 2009 RT Workshop agenda, including links to presentations.</a></li> </ul> <br/> <p>If you have comments, corrections, or additions to this article, please contact the author: Satanjeev Banerjee, banerjee [at] cs [dot] cmu [dot] edu.</p> <!-- OPTIONAL: INFORMATION ABOUT THE AUTHOR(S). OK TO INCLUDE LITTLE BIO, CONTACT INFO, RECENT WORK, RECENT PUBLICATIONS, ETC. USE <p><i> ... </i></p> --> <p><i>Satanjeev "Bano" Banerjee is a PhD student in the Language Technologies Institute at Carnegie Mellon University. His interests are in spoken language understanding of human-human dialog.</i></p> <hr> <!-- TITLE --><h2>Gunnar Fant, 1919-2009</h2> <!-- BY-LINE --><h3>Rolf Carlson and Bj&ouml;rn Granstr&ouml;m</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AT&T Labs Research --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>A pioneering giant in speech research has passed away. Professor Emeritus Gunnar Fant died on June 6th at the age of 89 after a long and prominent research career.</p> <p>Gunnar graduated from KTH in 1945 with a Master's degree in electrical engineering. The spirit of the postwar period was characterized by optimism and by the generous resources that could be devoted to research within communications technologies. 
The simmering activity was driven partly by the telecommunications industry and the goal of developing new aids for the hearing-impaired. At KTH, Gunnar had already written a thesis on how the perception of speech is dependent on disturbances and limitations in transmission. His first employment was at the telephone company Ericsson, where one of his responsibilities was to make acoustic analyses of Swedish speech sounds.</p> <p>Between 1949 and 1951, Gunnar was invited to the USA. At MIT and Harvard he made many stimulating contacts with colleagues from varied disciplines. One researcher who became interested in Gunnar's work was the Harvard linguist Roman Jakobson. Their collaboration led to &quot;Preliminaries to speech analysis&quot;, today considered a milestone in the history of linguistics.</p> <p>Back in Sweden and at KTH, he founded the Speech Transmission Laboratory and achieved major successes, one of which came from his attempts to produce synthetic speech that was very similar to human speech. Gunnar's good results were based on the theory that he formulated to describe at a general level what speech physically is, and to describe how vowels and consonants are produced and receive their acoustic properties. When the theory was presented in his doctoral thesis in 1959, it had already met with approval all over the world. Thus already at a young age, Gunnar was an established researcher with an international reputation.</p> <p>Gunnar's research objectives were to investigate the fundamentals of speech and to transform his findings and insights into practical applications. The key to his world-leading position lies in the fact that his description of speech is completely general and ready to apply in many areas. It is independent of language; it applies to both normal and deviant speech; it covers the human voice in song as well as in speech. Gunnar's Acoustic Theory of Speech Production (1960) became an international standard. 
Today it provides an important reference in all education, research and development in connection with voice, speech and language.</p> <p>Gunnar was a mild person who cared about his colleagues and supported their personal and professional development. He created a unique spirit in his department, which was characterized by kindness, care, respect, scientific openness and cooperation. This feeling of scientific focus and human concern created a unique environment that was appreciated by all the researchers who made the pilgrimage to meet Gunnar. All of us who learned from and worked with him share a deep feeling of loss, but also many fond memories.</p> <p>Gunnar himself gave a personal view of his scientific career in "Half a century in phonetics and speech research", which can be found at <a href="http://www.speech.kth.se/~gunnar/">http://www.speech.kth.se/~gunnar/</a>.</p> <p><i>Rolf Carlson and Bj&ouml;rn Granstr&ouml;m, Department of Speech Music and Hearing, KTH.</i></p> <hr> <!-- TITLE --><h2>Multimodal Voice Search of TV Guide in a Digital World</h2> <!-- BY-LINE --><h3>Harry Chang</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AT&T Labs Research --> <!-- DATE (month) --><p>SLTC Newsletter, September 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>This article discusses the characteristics of the language models associated with multimodal search applications for digital TV guides, both in terms of the linguistic properties of their text content and from a user's perspective with respect to spoken or typed queries.</p> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <h4>Overview</h4> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <p>Digital TV guides are one of the most popular sources of content on the web. 
Millions of Internet users access their favorite online TV guides regularly. Most of these online TV guides are organized as a 2-dimensional grid along channel and time axes similar to the look and feel of an electronic programming guide (EPG) on TV. In a desktop environment, a computer mouse and full sized keyboard enable quick and easy selections of the shows in an online TV guide and allow a user to navigate between different search sub-categories. However, when the search interface moves from a desktop environment to television viewing in home settings, we find interesting opportunities for speech-driven multimodal search applications.</p> <p>For much of television's history, a traditional TV remote control (RC) with 40+ buttons was sufficient for channel surfing. With the advent of broadcasting networks and the introduction of IPTV technology the number of available channels soared from 50 or so just 10 years ago to over 500 today. Fueled by an explosive growth of video content over the Internet, average TV viewers are now facing a daunting task to browse through vast repositories of on-demand content such as pay-per-view movies and sporting events. The search task can become highly complex and frustrating when using the limited capabilities of a traditional handheld RC. Adding a voice search modality to an RC device that would enable viewers to easily express search intent with spoken words is a natural and intuitive evolution in multimodal search interfacing for EPGs and other IPTV content.</p> <p>One of the first steps in building an effective language model for a spoken language understanding (SLU) application is analyzing the language characteristics of the target content. Specifically, we need to examine the relationship between written descriptions found in the EPG of interest and the referring expressions used by viewers in their spoken or typed queries to interactive media search engines. 
A simple method of analyzing the linguistic properties of written content in an EPG is to examine the usage of words and phrases, such as their frequency f and relative rank r, and the relationship between the two, which George Zipf [1] described sixty years ago with the following mathematical function (where c is a constant for the corpus):</p> <center><img border="0" src="/uploads/images/SLTC-Newsletter/epg_fig1.JPG" alt="Image 1"></center> <h4>Linguistic Properties of EPG</h4> <p>While typical EPG data sources contain many text categories such as titles, descriptions, channel names, cast names, and so on, the title texts always occupy the largest screen space on TVs as well as in printed media such as the TV section of local newspapers. It is reasonable to assume that the title texts have the most significant influence on users' vocabulary. To study EPGs from this perspective, speech technology researchers at AT&T Labs created a title corpus from an EPG covering ten TV markets in the U.S. over a 13-month period. For this article, the corpus is referred to as EPG10. Figure 1 shows the sentence length distribution of EPG10, where the average sentence length is 3.1 words.</p> <p>Figure 2 shows the Zipf curve versus the actual frequency data of the n-grams (n &lt;= 5) extracted from EPG10. As reported by others [2], Zipf's law clearly does not hold for the words in the top tier (r &lt;= 135) or for the words in the bottom tier (r &gt; 3700) of EPG10. The sharp drop-off in the Zipf curve at the end indicates that many low-frequency words are tightly clustered together. A closer analysis shows that many such clusters have over 50 members with the same rank order.</p> <h4>Experiments</h4> <p>In order to study the user language model from spoken queries for EPG contents, a multimodal search prototype was built and given to a small group of company employees to try out in a realistic home television-watching environment. 
A total of 606 voice input sessions (1,112 spoken words) were recorded from the users. The average sentence length of spoken expressions is 2.8 words. When expressions containing an actor's name are excluded, the users' language model is almost entirely made up of title words.</p> <p><center><img border="0" src="/uploads/images/SLTC-Newsletter/epg_fig2.JPG"></center></p> <p><center><img border="0" src="/uploads/images/SLTC-Newsletter/epg_fig3.JPG"></center></p> <p>To get a broader perspective on user language models for the same domain, a much larger data set of typed queries was collected from a commercial website used exclusively for searching the EPG offered by a TV service provider. The data collection took place during a 2-week period in December 2008, from an estimated 10,000 users in the U.S. The corpus contains about half a million typed queries with a vocabulary of 21,111 unique words. The analysis shows a similar influence of the EPG on the users' query language. After excluding the 1-word queries matched to common 1-word program titles (e.g., <i>Seinfeld</i>, <i>Lost</i>, <i>Heroes</i>, or <i>Frasier</i>), the average sentence length is 2 words. Most interestingly, over 95% of all word tokens in the corpus are within the vocabulary of EPG10, namely, all the title words in an EPG for regular TV programs.</p> <h4>Summary</h4> <p>The experimental results suggest a strong influence of online content on users' choice of vocabulary when querying EPGs via spoken or typed phrases. The words in the title texts dominate the users' query vocabulary. It is also interesting to note that the average query sentence length from the user population is very close to the average sentence length of the underlying corpus, which represents the word content written by a small group of professional writers. 
However, the low-frequency n-grams from the title corpus (EPG10) do not follow Zipf's law, making it difficult to model the relationship between their rank order and corresponding frequency. A future study will examine how the title n-grams whose frequency ranks fall in the bottom tier of the Zipf curve influence users' query language, so that we can build more effective models for SLU-based search applications for EPGs, where the underlying text content is updated daily.</p> <h4>Acknowledgements and more information</h4> <p>Thanks to Bernard Renger and Michael Johnston for their collaboration in the research work that led to this article.</p> <!-- EXAMPLE BIBLIOGRAPHY --> <h4>References</h4> <p>[1]&nbsp;&nbsp;G.K. Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.</p> <p>[2]&nbsp;&nbsp;L.Q. Ha, E.I. Sicilia-Garcia, J. Ming, and F.J. Smith. 2002. <a href="http://www.nslij-genetics.org/wli/zipf/ha02.pdf">Extension of Zipf's Law to Words and Phrases</a>.&nbsp;In Proc. of the 19th International Conference on Computational Linguistics - Volume 1.</p> <p>If you have comments, corrections, or additions to this article, please contact the author.</p> <!-- OPTIONAL: INFORMATION ABOUT THE AUTHOR(S). OK TO INCLUDE LITTLE BIO, CONTACT INFO, RECENT WORK, RECENT PUBLICATIONS, ETC. USE <p><i> ... </i></p> --> <p><i>Harry Chang is Lead Member of Technical Staff at AT&T Labs Research. His interests are multimodal dialog systems and IPTV content search. Email: harry_chang@labs.att.com</i></p> <hr> <!-- TITLE --><h2>The INTERSPEECH 2009 Emotion Challenge: Results and Lessons Learnt</h2> <!-- BY-LINE --><h3>Bjoern Schuller, Stefan Steidl, Anton Batliner, and Filip Jurcicek</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AT&T Labs Research --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. 
This is used both in the article itself and also on the table of contents --> <p>The <a href="http://emotion-research.net/sigs/speech-sig/emotion-challenge">INTERSPEECH 2009 Emotion Challenge</a>, organised by Bjoern Schuller (TUM, Germany), Stefan Steidl (FAU, Germany), and Anton Batliner (FAU, Germany), was held in conjunction with <a href="http://www.interspeech2009.org/conference/programme/session.php?id=6810">INTERSPEECH 2009</a> in Brighton, UK, September 6-10. This challenge was the first open public evaluation of speech-based emotion recognition systems with strict comparability, in which all participants used the same corpus. The German FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech of 51 children served as a basis. The corpus clearly defined test and training partitions incorporating speaker independence and different room acoustics, as needed in most real-life settings.</p> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <p>Three sub-challenges (Open Performance, Classifier, and Feature) addressed classification of five non-prototypical emotion classes (anger, emphatic, neutral, positive, remainder) or two emotion classes (negative, idle). The Open Performance Sub-Challenge allowed contributors to use their own features with their own classification algorithms. However, they had to stick to the definition of test and training sets. In the Classifier Sub-Challenge, participants designed their own classifiers and had to use a large set of standard acoustic features, computed with the <a href="http://fr.sourceforge.jp/projects/sfnet_opensmile/">openSMILE toolkit</a> provided by the organisers. Participants had the option to subsample, alter, and combine features (e.g. by standardisation or analytical functions). 
The training could be bootstrapped, several classifiers could be combined by tools such as <a href="http://en.wikipedia.org/wiki/Ensemble_learning">Ensemble Learning</a>, and side tasks such as gender could be learned. However, the audio files could not be used for additional feature extraction in this task. In the Feature Sub-Challenge, participants were encouraged to design their 100 best features for emotion classification, to be tested by the organisers in an equivalent setting. In particular, novel, high-level, or perceptually adequate features were sought.</p> <p>Participants did not have access to the labels of the test data, and all learning and optimisation was based only on the training data. However, each participant could upload instance predictions up to 25 times to receive the confusion matrix and results on the test data set. The format contained instance and prediction, and optionally additional probabilities per class. This later allowed a final fusion by majority vote over the predicted classes of all participants' results to demonstrate the best possible performance of the combined efforts. As classes were unbalanced, the primary measure to optimise was unweighted average (UA) recall, and the secondary measure was accuracy. The choice of unweighted average recall was a necessary step to better reflect the imbalance of instances among classes in real-world emotion recognition, where an emotionally &quot;idle&quot; state usually dominates. Other well-suited and interesting measures such as <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">the area under the receiver operating characteristic (ROC) curve</a> were also considered; however, they were not used as they were not yet common measures in the field at the time.</p> <p>The organisers did not take part in the sub-challenges but provided baselines using the two most popular approaches. First, the <a href="http://www.cs.waikato.ac.nz/ml/weka/">WEKA toolkit</a> and Support Vector Machines were used. 
Second, the <a href="http://htk.eng.cam.ac.uk/">HTK toolkit</a> was used to train Hidden Markov Models. The organisers intentionally used standard tools so that the results would be reproducible.</p> <p>All participants were encouraged to compete in multiple sub-challenges, and each participant submitted a paper to the INTERSPEECH 2009 Emotion Challenge Special Session. The results of the challenge were presented at a Special Session of Interspeech 2009 and the winners received their awards from the organisers in the closing ceremony. Three prizes (each 125 GBP, sponsored by the HUMAINE Association and the Deutsche Telekom Laboratories) could be awarded subject to the following pre-conditions: 1) the awarded paper was accepted to the special session after the INTERSPEECH 2009 general peer-review, 2) the provided baseline (67.7% and 38.2% UA recall for the two- and five-class tasks) was exceeded, and 3) the best result in the sub-challenge and the task was achieved.</p> <p>The Open Performance Sub-Challenge Prize was awarded to Pierre Dumouchel et al. (Universit&eacute; du Qu&eacute;bec, Canada) for their victory in this sub-challenge: they obtained the best result (70.29% UA recall) in the two-class task, significantly ahead of their eight competitors. The best result in the five-class task (41.65% UA recall) was achieved by Marcel Kockmann et al. (Brno University of Technology, Czech Republic), who surpassed six further results and were also awarded the Best Special Session's Paper Prize, as their paper had received the highest reviewers' score.</p> <p>The Classifier Sub-Challenge Prize was given to Chi-Chun Lee et al. (University of Southern California, USA) for their best result in the five-class task, ahead of three further participants. In the two-class task, the baseline was not exceeded by either of the two participants.</p> <p>Regrettably, no award could be given in the Feature Sub-Challenge. 
None of the feature sets provided by the three participants in this sub-challenge exceeded the baseline feature set provided by the organizers.</p> <p>Overall, the results of all 17 participating sites were often very close to each other, and significant differences were as rare as one might expect in such a close competition. However, the &quot;democratic&quot; fusion of all participants' results exceeded all of the individual results: 71.16% and 44.01% UA recall for the two- and five-class tasks. The general lesson learned is thus &quot;together we are best&quot;: apparently, the different feature representations and learning architectures are strongest in combination. In addition, the challenge clearly demonstrated the difficulty of dealing with a real-life non-prototypical emotion recognition scenario - this challenge remains.</p> <p>The organizers plan to make the corpus used for the challenge publicly available. They also suggest that future challenges should consider evaluations across age groups, languages, and cultures; noisy, reverberated, or transmission-corrupted speech; multimodal sources; and, naturally, many further aspects and related topics such as non-linguistic vocalisations or automatic speech recognition of emotional speech.</p> <hr> <!-- TITLE --><h2>Data-driven Models of Text Structure and their Applications</h2> <!-- BY-LINE --><h3>Annie Louis</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AT&T Labs Research --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <!-- <p> Information about text structure is important for computational language processing. For example, in multi-document summarization, such information is necessary to correctly order sentences chosen from multiple documents. 
Language generation and machine translation systems could be informed by coherence models for better structured output. This article describes some of the empirical models of text structure that have had considerable success in several NLP tasks.</p> --> <!-- EXAMPLE SECTION HEADING. Use h4. Section headings are optional - no need if your article is short. --> <!--<h4>Overview</h4> --> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. --> <p>Linguistic theories describe the structure of coherent text in several different ways. Some relate text units in terms of various discourse relations, such as "contrast", "elaboration", "cause" and "temporal". In other accounts, references to the same entities in adjacent sentences relate those sentences through a common topic. There has been considerable interest in discourse parsing: automatically identifying such discourse relations in free text. </p> <p>Independently, computational models that learn text structure in an unsupervised manner have been developed. Some salient properties of these approaches are:</p> <ul> <li>They aim to discover structure information by empirically examining large corpora of coherent texts.</li> <li>They exploit regularities present in naturally occurring discourse. Texts on the same topic tend to have similar organization. Their word distributions exhibit certain fixed patterns. Similarly, information that pertains to the same subtopic often appears contiguously. For example, sentences in the same paragraph have a common focus.</li> <li>Models have been developed that can learn local structure (between adjacent sentences) as well as global document structure.</li> </ul> <p>Empirical models of text structure have had considerable success in several NLP applications. Some of these approaches are described below.</p> <h4>Domain dependent word co-occurrence models:</h4> <p>Lexical co-occurrence forms one set of cues for text structure. 
Words tend to appear in regular patterns in documents from the same domain or those sharing the same topic. For example, in news reports about any disaster, information about casualties typically precedes the description of relief efforts. Examining a large number of documents on a topic can reveal such patterns. Different models based on this idea have shown that both relationships between adjacent sentences and global document structure can be induced from word co-occurrence information.</p> <p>For instance, one can learn which words are likely to appear in adjacent sentences from a large corpus of documents [1, 3]. Global models can be built by extending the same idea. Barzilay and Lee [2], for example, used documents on the same topic to obtain clusters of related sentences and an HMM to learn likely transitions between the clusters. In a different approach by Chen et al. [4], global structure was learnt directly as a permutation of subtopics or word distributions rather than as Markovian transitions between consecutive topics. </p> <h4>Topic independent entity coherence model:</h4> <p>Document structure can also be described as text units interconnected by the presence of common entities. Mentions of the same entity across adjacent sentences are indicative of common focus and local coherence. Corpus-based methods using this view aim to learn the different entity overlap patterns commonly present in texts. These patterns encode preferences that are characteristic of coherent organization. </p> <p>Under this perspective, the actual lexical identity of words can be abstracted away, resulting in a topic independent framework. In other words, local coherence can be computed based on whether entities are shared across sentences, without using the actual words themselves. The entity grid [7] is one model driven entirely by entity overlap across adjacent sentences. 
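</p> <p>As a rough illustration of the entity-overlap idea, the sketch below builds a small grid of entities by sentences and counts role transitions between adjacent sentences. It assumes each sentence has already been reduced to a mapping from entity to grammatical role (S for subject, O for object, X for other, - for absent); the toy annotations and the function names <code>build_grid</code> and <code>transition_probabilities</code> are hypothetical and are not taken from [7].</p>

```python
from collections import Counter
from itertools import product

ROLES = ("S", "O", "X", "-")  # subject, object, other, absent


def build_grid(sentences):
    """Return {entity: [role in each sentence]}; absent entities get "-"."""
    entities = sorted({entity for sent in sentences for entity in sent})
    return {e: [sent.get(e, "-") for sent in sentences] for e in entities}


def transition_probabilities(grid, length=2):
    """Relative frequency of every possible role transition of this length."""
    counts = Counter()
    for column in grid.values():
        for i in range(len(column) - length + 1):
            counts[tuple(column[i:i + length])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: counts[t] / total for t in product(ROLES, repeat=length)}


# Hand-made role annotations for a three-sentence toy text (illustrative only).
sentences = [
    {"Microsoft": "S", "market": "X"},
    {"Microsoft": "O", "evidence": "S"},
    {"products": "S", "market": "X"},
]

grid = build_grid(sentences)
probs = transition_probabilities(grid)
print(grid["Microsoft"])    # ['S', 'O', '-']
print(probs[("S", "O")])    # 0.125
```

<p>In this toy text each of the eight observed transitions occurs exactly once, so each receives probability 1/8 and every unobserved transition receives 0. Vectors of this kind, computed over many coherent documents, are the sort of evidence from which preferences about entity sharing could be learnt.</p> <p>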
While traditional theories like Centering [8] predict some patterns of entity sharing to be more coherent than others, data-driven approaches such as the entity grid learn these preferences as they exist in large collections. </p> <h4>Temporal model of narrative structure:</h4> <p>While the models described above learn general word distribution patterns, learning fine-grained relationships may be suitable for some domains. For example, narratives tend to have an event-driven structure, and a temporal structure could well describe such discourse. </p> <p>In Chambers and Jurafsky [5], related events (verbs) and their participants are learnt from free text. The generic models above widely exploit adjacency or co-occurrence to learn local coherence patterns. In this work, the space of possibilities is limited by considering only events in the same document that also share the same participant: events with a common "protagonist" are likely to be semantically related. Corpus counts of co-occurrence are then used to obtain confidence scores for a candidate pair of verbs. Narrative structure is built as a chain of verbs together with precedence relationships obtained using a temporal classifier. </p> <h4>Applications of data-driven models:</h4> <p>The models described above have been successfully used for text-ordering tasks, topic segmentation and for detecting important text units for summarization. Recent work by McIntyre and Lapata [6] is another typical example of a data-driven approach. As a first step in their story generation process, information about entities is learnt from a large collection of stories, in forms such as "dogs_bark", "dogs_bark_at_cats", etc. A sentence is produced by choosing an entity and a likely event. Consecutive sentences borrow entities from the previous ones, and events are chosen based on both the previous event and the entity currently in focus. The generated stories are then ranked by entity coherence and story likelihood to select the best one. 
</p> <h4>References:</h4> <ol> <li>Mirella Lapata, &quot;Probabilistic Text Structuring: Experiments with Sentence Ordering&quot;, ACL 2003.</li> <li>Regina Barzilay and Lillian Lee, &quot;Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization&quot;, NAACL-HLT 2004.</li> <li>Radu Soricut and Daniel Marcu, &quot;Discourse Generation Using Utility-Trained Coherence Models&quot;, ACL 2006.</li> <li>Harr Chen, S.R.K. Branavan, Regina Barzilay and David Karger, &quot;Global Models of Document Structure Using Latent Permutations&quot;, NAACL 2009.</li> <li>Nate Chambers and Dan Jurafsky, &quot;Unsupervised Learning of Narrative Schemas and their Participants&quot;, ACL-IJCNLP 2009.</li> <li>Neil McIntyre and Mirella Lapata, &quot;Learning to Tell Tales: A Data-driven Approach to Story Generation&quot;, ACL-IJCNLP 2009.</li> <li>Regina Barzilay and Mirella Lapata, &quot;Modeling Local Coherence: An Entity-based Approach&quot;, Computational Linguistics, 2008.</li> <li>Barbara Grosz, Aravind Joshi, Scott Weinstein, &quot;Centering: A Framework for Modeling the Local Coherence of Discourse&quot;, Computational Linguistics, 1995.</li> </ol> <hr> <!-- TITLE --><h2>Students and Researchers Come Together for "Dialogs on Dialogs"</h2> <!-- BY-LINE --><h3>Matthew Marge</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- Carnegie Mellon University --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>"<a href="http://www.cs.cmu.edu/~dgroup/" id="na-9" target="_blank" title="DialogsOnDialogs">Dialogs on Dialogs</a>" offers students and researchers an opportunity to regularly discuss their research on an international scale.</p> <!-- BODY. Use <p>...</p> - this will ensure paragraphs are formatted correctly. 
--> <p>There are several opportunities available for spoken dialogue researchers to meet and discuss research on an international scale, but most of them meet no more than once a year (e.g., SIGdial, Young Researchers' Roundtable on Spoken Dialogue Systems). Also, several of these avenues generally provide feedback on research that is in complete, polished form. Feedback on research during the initial stages can also have its advantages. One such forum in the community available for students and young researchers is "Dialogs on Dialogs" (DoD), a group run by students in the field of spoken dialogue research and related areas. Although the group originally started out at Carnegie Mellon University, it has since expanded into an international forum for spoken dialogue research, with students from across the world joining in on the conversation.</p> <p>The meetings are meant to be informal - many talks are on research that is very much in its nascent stages. In this way, members can actively present and receive feedback on their work regardless of the maturity of their results. DoD is also a great venue for practice talks, and is often used by members before they present their work at upcoming workshops or conferences. Other meetings focus on the "reading group" aspect of DoD, where members alternate presenting recent work by others in the field and having an open forum for discussion.</p> <p>Jason D. Williams, editor-in-chief of the IEEE SLTC Newsletter and Principal Member of Technical Staff at AT&amp;T Research, actively participated in Dialogs on Dialogs during his PhD studies at Cambridge University. When asked about the communication at DoD meetings, Williams said communication was "very easy" and described the group as "really welcoming of new students and participants." Williams added that DoD "worked hard to accommodate remote participants," with the group being based half a world away in Pittsburgh.</p> <p>Williams found the group to be very beneficial to his research. 
"For me there were three main benefits," said Williams. "First, it was a great forum to get feedback on research ideas. Second, it was an ideal place to give practice talks. Finally, it was a great place to build professional relationships." He also said that he is "still in touch with a number of people from the original DoD group today!"</p> <p>Originally, the forum required members to phone in using landline telephones. With the emergence of VoIP, we now conduct weekly meetings over Skype. This lets members join the conversation without the costs typically associated with international calls. Meetings typically last from one to two hours.</p> <p>The group is always looking for and open to new members. Recent topics have ranged from conversational agents in multi-party situations to human-robot dialogue management. Dialogs on Dialogs traditionally meets on Fridays at 10:30am EST. For more information on DoD, please contact <a href="http://mailhide.recaptcha.net/d?k=01HyWcsX1f1R-6Vo_B-iy7ag==&amp;c=zn11es_RkWxd5fVvbZWNng==" onclick="window.open('http://mailhide.recaptcha.net/d?k=01HyWcsX1f1R-6Vo_B-iy7ag==&amp;c=zn11es_RkWxd5fVvbZWNng==', '', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=500,height=300'); return false;" title="Reveal this e-mail address">Matthew Marge</a>. We also have a <a href="http://www.linkedin.com/groups?home=&amp;gid=148287" id="na-9" target="_blank" title="LinkedIn">LinkedIn</a> group.</p> <h4>Acknowledgements</h4> <p>Thanks to Jason D. Williams for providing input to this article.</p> <hr> <!-- TITLE --><h2>The Association for Voice Interaction Design, or <i>AVIxD</i></h2> <!-- BY-LINE --><h3>Jenni McKienzie, Peter Krogh</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- AVIxD.org --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. 
This is used both in the article itself and also on the table of contents --> <p><a href="http://www.avixd.org">AVIxD</a> (the Association for Voice Interaction Design) recently hosted its 8th workshop on voice interaction design. These workshops are held once or twice a year as an opportunity for voice interaction professionals to come together, put companies and competition aside, and tackle issues facing the industry.</p> <h3>Customization: How Does Your App Adapt?</h3> <p>The Summer 2009 topic was <i>Customization: How Does Your App Adapt?</i> As is often the case in the speech industry, there is a lack of consensus on the definition of terms. "Customization" and "personalization" are used interchangeably by some, and held as completely different by others. Some think neither term extends beyond greeting a caller by name. During the course of the workshop, participants identified three levels of customization:</p> <h4>Situational Awareness</h4> <p>This is customization that can occur without knowing the identity of the caller and without the caller opting for custom behavior. A simple example is a hurricane affecting both the volume of and reason for calls to a travel company. Situational awareness does not depend on new technology but leverages existing knowledge about the organization, the domain, and caller populations.</p> <h4>Individualized call flow changes</h4> <p>Identifying methods for managing different levels of customization was the objective of this group's session. Adapting call flow functionality for a specific individual and incorporating their known preferences and past behavior can provide a highly personal experience, but can cross the line into a sense of invasion. What are appropriate uses of known user behavior? What are the current best practices? What is the goal for using historical criteria?</p> <h4>Data-adaptive dialog</h4> <p>This group looked at a couple of different but related issues. 
First was the use of the large amount of data available from in-service dialogs to better optimize them. Second was how maintaining accurate estimates of uncertainty within a dialog can be used to make it more robust to potential errors.</p> <p>Look for in-depth articles on each of these three topics to be published at <a href="http://www.avixd.org">AVIxD.org</a> in the coming months.</p> <h3>The Maturation of VUI</h3> <p>Already published on the website are articles from the 2008 and 2007 workshops. The 2008 topic was <i>The Maturation of VUI: What I Wish I Knew Back Then.</i> Participants discussed many topics around how VUI design practices changed over the last 10 years and produced <a href="http://www.avixd.org/publications.php">five articles:</a></p> <h4>Naturalness: How Closely Should IVRs Mirror Human-to-Human Speech</h4> <p>An early mantra of voice interaction design was that if a person wouldn't say it, neither should an IVR. But IVRs aren't people, and the same conversational techniques don't necessarily work in both situations. It is also a mistake to discard wholesale the abundance of linguistic knowledge that may be utilized in designing effective IVRs. This article examines which things transfer well and which ones don't.</p> <h4>Designing for the Overall User Experience</h4> <p>A single phone call to an IVR is only part of a customer's interaction with a company. Designers must consider all possible interaction points a user may have with the company and design their application as part of that whole. This paper covers caller-agent interaction, repeat caller behavior, marketing campaigns, etc.</p> <h4>The Evolution of VUI Design Methodology</h4> <p>Various groups and companies have tried to come up with a design process methodology. Nobody has really found one that has taken hold. 
What's been found is that there is a broad array of tools available to the designer, and that different situations require different tools, sometimes in a different order. Process patterns have emerged over the years and some of these are covered in this paper.</p> <h4>The Role of Data in VUI Design</h4> <p>There is more and more data available, and it should be leveraged. There is more research being done and published about specific design principles. Data is there to drive design decisions on new projects. Post-deployment production data is the most valuable, under-utilized source of data.</p> <h4>Positioning IVR Self-Service</h4> <p>IVR has not enjoyed a level of popularity commensurate with its promise or potential. Bad speech IVRs still outnumber the good ones. To ensure that a system falls into the good camp, the reward of using self-service must outweigh the effort. If this objective is not met, callers will "zero out" and end up disinclined to use IVR in the future.</p> <p>In August of 2007, the workshop focused on the profession of voice interaction design. What does the job entail? What are the qualifications? If someone wants to design voice applications, what educational opportunities are available? <a href="http://www.avixd.org/publications.php">These three papers</a> are also available at AVIxD.org:</p> <ul> <li>The Role of the Voice User Interface Designer in Speech Technology Projects</li> <li>Education, Mentoring, and Training for Voice User Interface Designers</li> <li>Where will this Career Lead? Voice User Interface Career Paths</li> </ul> <p>In February of 2007, the focus was on error handling. This workshop led to a series of four articles published in <a href="http://www.speechtechmag.com/">Speech Technology Magazine,</a> beginning in April 2008. The workshop has provided a valuable opportunity to further and refine the practice of voice interaction design. 
Watch <a href="http://www.avixd.org">AVIxD.org</a> in the months to come for information on the next workshops. The possibility exists for one in London next May. The topic currently proposed for August, 2010 in New York is:</p> <h3>2010: Voice Interaction in a Multi-Modal, Multi-Channel World</h3> <p>As additional modes of interaction gain traction, voice interaction designers must understand and design to the concept of a "meta session" where time, space and memory are respected regardless of which medium is in use. This topic will be discussed in detail at the August, 2010 workshop in New York.</p> <p>Watch <a href="http://www.avixd.org">AVIxD.org</a> for details on <b>membership</b>. If you incorporate the voice channel in your work, or are interested in speech design methodology, you will want to join AVIxD!</p> <p><i>Jenni McKienzie is Senior VUI Designer at Travelocity. Email: Jenni.McKienzie@travelocity.com. Peter Krogh is Director of Solutions Architecture at SpeechCycle. Email: peter@speechcycle.com</i></p> <hr> <!-- TITLE --><h2>Proposed Closure of Phonetics and CL in Bonn</h2> <!-- BY-LINE --><h3>Bernd M&ouml;bius</h3> <!-- INSTITUTION. This is only used on the front page, so put it in a comment here --><!-- University of Bonn --> <!-- DATE (month) --><p>SLTC Newsletter, October 2009</p> <!-- ABSTRACT. This is used both in the article itself and also on the table of contents --> <p>The University of Bonn, Germany, is proposing to close down the division of Language and Speech, formerly known as the "Institut f&uuml;r Kommunikationsforschung und Phonetik". The division comprises the programs in Phonetics and Computational Linguistics. The proposed move is also likely to terminate the teaching of General Linguistics in Bonn. 
A decision will be made in October 2009.</p> <h4>Background</h4> <p>According to the dean's office, the motivation for this unexpected decision is the financial situation of the University, which is planning to assign the two professor positions (Phonetics and Linguistics) to a new teacher-training program, and in any case not to the area of Linguistics. I am concerned that the division was targeted because the search processes for filling the two professor positions have been plagued by some resistance from within the University in the case of Linguistics, and by obstruction in the form of a lawsuit by a competitor against the University in the case of Phonetics.</p> <p>The reason for the move to close down Language and Speech cannot be that the groups are underperforming. Phonetics and Computational Linguistics in Bonn have been very strong in research and have attracted funding to an extent that has put them at the top of the humanities in Bonn. Moreover, the pertinent teaching programs have always attracted a large number of students.</p> <p>Delicate negotiations are currently underway to retain a minimal program in General Linguistics. Even so, it is likely that Phonetics will be abandoned, as will Computational Linguistics. A decision will be made in October 2009.</p> <h4>Facts about Phonetics and Computational Linguistics in Bonn</h4> <ul> <li>Phonetics in Bonn has a long tradition. It is the second oldest phonetics institute in Germany (after Hamburg) and has enjoyed a good scientific reputation and worldwide recognition. The first Chair of Computational Linguistics in Germany was at Bonn.</li> <li>Wolfgang Hess is Emeritus Professor of Phonetics in Bonn. The current Chair is Bernd M&ouml;bius (formerly IMS Stuttgart).</li> <li>Winfried Lenders is Emeritus Professor of Computational Linguistics in Bonn. The current Chair is Berthold Crysmann (formerly DFKI Saarbr&uuml;cken).</li> <li>Hannes Kniffka is Emeritus Professor of General Linguistics in Bonn. 
No interim Chair has been appointed.</li> <li>Phonetics and Computational Linguistics in Bonn have produced a significant number of professors and researchers who are now active in academia and industry. Most recently, Petra Wagner from Bonn took the position of Professor of Phonetics and Phonology at the University of Bielefeld.</li> <li>Phonetics and Computational Linguistics in Bonn have a continuous tradition of attracting research funding and have been involved in large-scale (e.g., Verbmobil) and smaller research projects, in the fields of language and speech technology, phonetics, natural language processing, and forensic linguistics. The BOSS (Bonn Open Source Synthesis) TTS system is a recent prominent contribution to the speech technology community.</li> <li>Despite its small size, the division has a large number of students. Currently, there are about 400 students in the Magister program in Phonetics and Computational Linguistics and another 400 in General Linguistics. There are also 400 Language and Speech students in the BA program in Communication Sciences.</li> <li>Inside the University, Phonetics and Computational Linguistics have been cooperating with other disciplines, including Computer Science, Signal Processing, and Medicine.</li> <li>Department members are actively involved in supporting the profession internationally, producing successful textbooks (e.g. Kniffka, Working in Language and Law; Hess, Digitale Signalverarbeitung, Pitch Determination), serving as members of editorial boards (e.g. Hess, Kniffka, Lenders, Crysmann, M&ouml;bius), and taking leading roles in international professional organizations (e.g. 
Hess and M&ouml;bius on the ISCA Board).</li> <li><a href="http://www3.uni-bonn.de/the-university">"The University of Bonn is one of the world's leading research based Universities"</a> - I hope you will agree that this vision statement is in stark contrast to the proposal to close down one of the strongest research groups in the Faculty of Philosophy.</li> <!-- jdw: added "I hope you will agree that" --><!-- bm: removed 2nd "that", so now it reads "that this vision statement is in stark contrast" --> </ul> <p>For more information, see: <a href="http://www.speechtechbonn.blogspot.com/">Speech and Language in Bonn (blogspot)</a></p> <h4>To support Phonetics and CL in Bonn</h4> <p>Those who would like to voice an opinion about the University's decision should send their letters to:</p> <ul> <li>University of Bonn<br /> Rektor Prof. Dr. J&uuml;rgen Fohrmann<br /> Regina-Pacis-Weg 3<br /> 53113 Bonn<br /> Germany<br /> rektor@uni-bonn.de <br /> fax +49 228 737262</li> <li>University of Bonn<br /> Dean of the Faculty of Philosophy<br /> Prof. Dr. G&uuml;nter Schulz<br /> Am Hof 1<br /> 53113 Bonn<br /> Germany<br /> phildek@uni-bonn.de<br /> fax +49 228 735986</li> <li>Chair of Phonetics<br /> Bernd M&ouml;bius<br /> IfK, Poppelsdorfer Allee 47<br /> 53115 Bonn<br /> Germany<br /> moebius@ifk.uni-bonn.de<br /> fax +49 228 735639</li> </ul> <!-- OPTIONAL: INFORMATION ABOUT THE AUTHOR(S). OK TO INCLUDE LITTLE BIO, CONTACT INFO, RECENT WORK, RECENT PUBLICATIONS, ETC. USE <p><i> ... </i></p> --> <p><i>Bernd M&ouml;bius is interim Chair of Phonetics and Speech Communication at the University of Bonn. He is also an Associate Professor of Phonetics at the Institute of Natural Language Processing (IMS), University of Stuttgart. Email: moebius@ifk.uni-bonn.de</i></p> </body> </html>