Welcome to the Spring 2013 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter! This issue of the newsletter includes 9 articles from 17 guest contributors, and our own staff reporters and editors. Thank you all for your contributions!
We'd like to thank the retiring editor Martin Russell, and welcome our new editor Haizhou Li and staff reporter Navid Shokouhi.
We believe the newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions. You can submit job postings here, and reach us at speechnewseds [at] listserv (dot) ieee [dot] org.
We'd like to recruit more reporters: if you are a PhD student or a recent graduate and are interested in contributing to our newsletter, please email us (speechnewseds [at] ...) with your application. The workload includes helping with reviews of submissions and writing occasional reports for the Newsletter. Finally, to subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.
Dilek Hakkani-Tür, Editor-in-chief
William Campbell, Editor
Haizhou Li, Editor
Patrick Nguyen, Editor
Advances over the last decade in speech recognition and NLP have fueled the widespread use of spoken dialog systems, including telephony-based applications, multimodal voice search, and voice-enabled smartphone services designed to serve as mobile personal assistants. Key limitations of the systems fielded to date frame opportunities for new research on physically situated and open-world spoken dialog and interaction. Such opportunities are made especially salient for such goals as supporting efficient communication at a distance with Xbox applications and avatars, collaborating with robots in a public space, and enlisting assistance from in-car information systems while driving a vehicle.
The 26th annual Conference on Neural Information Processing Systems (NIPS) took place in Lake Tahoe, Nevada, in December 2012. The conference covers a wide variety of research topics, ranging from synthetic neural systems, machine learning, and artificial intelligence algorithms to the analysis of natural neural processing systems. This article summarizes selected talks on recent developments in neural networks and deep learning presented at NIPS 2012.
Researchers at Carnegie Mellon University’s Silicon Valley Campus and Honda Research Institute have brought together many of today’s visual and audio technologies to build a cutting-edge in-car interface. Ian Lane, Research Assistant Professor at CMU Silicon Valley, and Antoine Raux, Senior Scientist at Honda Research Institute, spoke to us regarding the latest news surrounding AIDAS: An Intelligent Driver Assistive System.
In this article we describe the "Spoken Web Search" task within MediaEval, which tries to foster research on language-independent search of "real-world" speech data, with a special emphasis on low-resourced languages. In addition, we review the main approaches proposed in 2012 and issue a call for participation in the 2013 evaluation.
My voice tells who I am. No two individuals sound identical because their vocal tract shapes and other parts of their voice production organs differ. With speaker verification technology, we extract speaker traits, or a voiceprint, from speech samples to establish a speaker's identity. Among different forms of biometrics, voice is believed to be the most straightforward for telephone-based applications because the telephone is built for voice communication. The recent release of the Baidu-Lenovo A586 marks an important milestone in the mass-market adoption of speaker verification technology in mobile applications. The voice-unlock feature in the smartphone allows users to unlock their phone screens using spoken passphrases.
Cleft Lip and Palate (CLP) is among the most frequent congenital abnormalities [1]: facial development is abnormal during gestation, leading to insufficient closure of lip, palate and jaw and affected articulation. Due to the huge variety of malformations, speech production is affected differently from patient to patient.
Previous research in our group focused mostly on text-wide scores like speech intelligibility [2, 3]. In current projects we focus on a more detailed automatic analysis. The goal is to provide an in-depth diagnosis with direct feedback on articulation deficits.
This article gives a brief overview of the 8th International Symposium on Chinese Spoken Language Processing (ISCSLP), which was held in Hong Kong during 5-8 December 2012. ISCSLP is a major scientific conference for scientists, researchers, and practitioners to report and discuss the latest progress in all theoretical and technological aspects of Chinese spoken language processing. The working language of ISCSLP is English.
SLTC Newsletter, February 2013
Welcome to the first SLTC Newsletter of 2013. We are now well into the preparations for our major event of the year: ICASSP (the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing), which will be held in Vancouver, Canada, May 26-31.
First, allow me to pass along our many thanks to all those in the speech and language processing community for your submissions to ICASSP-2013. The program will soon be finalized, so please visit the website www.icassp2013.com in a few weeks to see the list of accepted papers. The program will include a Keynote Talk and three other Plenary Talks from internationally recognized leaders in signal processing.
The conference will start with one and a half days of tutorials, including:
T2: Auditory Transform and Features for Robust Speech and Speaker Recognition
T8: Speech Translation: Theory and Practice
T11: Dictionary Learning for Sparse Representations: Algorithms and Applications
T14: Graph-based Semi-Supervised Learning Algorithms for Speech & Spoken Language Processing
With 4 of the 15 tutorials dealing with speech and language issues, our areas are well represented this year.
In addition, ICASSP 2013 will offer the following as one of eight special sessions: "New types of deep neural network learning for speech recognition and related applications" (organized by Li Deng, Geoff Hinton and Brian Kingsbury). To assist those who may not have adequate funding to attend ICASSP, several Travel Grants are available; prospective applicants may soon submit an application via the ICASSP-2013 website. Each SPS Travel Grant application must be accompanied by a recommendation letter from the applicant's supervisor or a senior colleague familiar with his or her work. Speech and language processing continues to be the largest single technical concentration area within ICASSP, and our speech and language area received a record 753 submitted papers this year. We hope you will all visit Vancouver to participate in this year's conference.
I also remind you that we will meet late in 2013 for the ASRU (IEEE Automatic Speech Recognition and Understanding) workshop, which meets every two years and has a tradition of bringing together researchers from academia and industry in an intimate and collegial setting to discuss problems of common interest in automatic speech recognition and understanding. This year it will be held December 8-12 in Olomouc, Czech Republic (www.asru2013.org).
ASRU 2013 will focus on three important topics in current machine recognition of speech:
Neural networks in ASR: from history to present successes in applications
Building ASR systems with limited resources: development of systems with restricted or no in-domain resources
ASR in applications: interaction with other sources of information
Submitted papers are encouraged to focus on the areas above, but papers in other typical ASRU areas (LVCSR Systems, Language Modeling, Acoustic Modeling, Decoders & Search, Spoken Language Understanding, Spoken Dialog Systems, Robustness in ASR, Spoken Document Retrieval, Speech-to-Speech Translation, Speech Summarization, New Applications of ASR and Speech Signal Processing) are welcome. The submission deadline is July 1, 2013.
The ASRU workshop alternates with the SLT (Spoken Language Technology) Workshop, which had a successful meeting in Miami in December 2012. The SLT 2012 program featured four keynote talks, a set of tutorials, and several panel discussions.
In closing, I hope you will join the SLTC this year by participating in both IEEE ICASSP-2013 and ASRU-2013. We look forward to meeting friends and colleagues and seeing the great scenery in beautiful Vancouver and Olomouc.
Best wishes,
Douglas O'Shaughnessy
Douglas O'Shaughnessy is the Chair of the Speech and Language Processing Technical Committee.
SLTC Newsletter, February 2013
In this issue of the SLTC Newsletter, I have the pleasure of highlighting a number of accomplishments from some of our members in the speech and language processing community.
The field of speech and language processing continues to be the largest Technical Area for IEEE ICASSP submissions, and with the proliferation of smart-phones, tablets, and other mobile/smart interactive systems, speech processing including recognition and dialog interaction continues to grow at a rapid pace. In recognition of the significant contributions to the field of speech and audio processing, the IEEE established the "IEEE James L. Flanagan Speech and Audio Processing award" in 2002. The purpose of this award is to recognize one (or more) individual(s) for outstanding contributions to the advancement of speech and/or audio signal processing. The past recipients are listed below, and reflect some of the most significant contributions to the field over the past forty years.
Congratulations to Dr. VICTOR ZUE for being recognized as the IEEE James L. Flanagan Speech and Audio Processing award winner for 2013!
2013 - Victor Zue, Massachusetts Institute of Technology, Cambridge, MA, USA
2012 - James Baker, CMU, and Entrepreneur, Maitland, FL, USA
Janet Baker, Saras Institute, Maitland, FL, USA
2011 - Julia Hirschberg, Columbia University, New York, NY, USA
2010 - Sadaoki Furui, Tokyo Institute of Technology, Tokyo, Japan
2009 - John Makhoul, BBN Technologies, Cambridge, MA, USA
2008 - Raj Reddy, Carnegie Mellon University West Coast Campus, Pittsburgh, PA, USA
2007 - Allen Gersho, University of California - Santa Barbara, Santa Barbara, CA, USA
2006 - James D. Johnston, Microsoft Corporation, Redmond, WA, USA
2005 - Frederick R. Jelinek, Johns Hopkins University, Baltimore, MD, USA
In addition to this award, there is also an award for best paper in the IEEE Transactions on Audio, Speech, and Language Processing, as well as the Young Author Best Paper Award. This year, the recipient of the Young Author Best Paper Award is Najim Dehak, for his paper:
NAJIM DEHAK (co-authored with Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet), "Front-End Factor Analysis for Speaker Verification", IEEE Transactions on Audio, Speech, and Language Processing, Volume: 19, No. 4, May 2011
Congratulations to Najim Dehak for this accomplishment!
The next IEEE ICASSP conference is quickly approaching (http://www.icassp2013.com/), so I hope you will join me in congratulating this year's award winners. I look forward to seeing you all in Vancouver, Canada!
Best wishes…
John H.L. Hansen is serving as "Past-Chair" of SLTC (Speech and Language Processing Technical Committee).
Call for Proposals - SLT-2014
SPS-SLTC Workshop Sub-Committee: Nick Campbell, George Saon, Geoffrey Zweig
Following on the tremendous success of SLT 2012, the SPS-SLTC invites proposals to host the 2014 IEEE Workshop on Spoken Language Technology (SLT-2014). Past SLT workshops have fostered a collegial atmosphere through a thoughtful selection of venues, thus offering a unique opportunity for researchers to interact and learn.
The proposal should include the information outlined below.
If you would like to be the organizer(s) of SLT-2014, please send the Workshop Sub-Committee a draft proposal before April 1, 2013 (point of contact: gsaon@us.ibm.com). Proposals will be evaluated by the SPS SLTC, with a decision expected in June.
The organizers of the SLT workshop do not have to be SLTC members, and we encourage submissions from all potential organizers. Please distribute this call for proposals far and wide to invite members of the speech and language community at large to submit a proposal to organize the next SLT workshop.
For more information on the most recent workshops, please see:
In Pursuit of Situated Spoken Dialog and Interaction
Dan Bohus and Eric Horvitz
SLTC Newsletter, February 2013
Advances over the last decade in speech recognition and NLP have fueled the widespread use of spoken dialog systems, including telephony-based applications, multimodal voice search, and voice-enabled smartphone services designed to serve as mobile personal assistants. Key limitations of the systems fielded to date frame opportunities for new research on physically situated and open-world spoken dialog and interaction. Such opportunities are made especially salient for such goals as supporting efficient communication at a distance with Xbox applications and avatars, collaborating with robots in a public space, and enlisting assistance from in-car information systems while driving a vehicle.
When conversing and collaborating, we leverage in subtle yet deep ways relevant details of the physical and social settings in which we are immersed, including configurations of spaces, proximal objects, and people. Efforts to extend spoken language interactive systems to open-world settings are marked by key departures from assumptions made in the design and development of traditional spoken language interfaces, and hinge on the ability to sense and reason about the physical and social world. First and foremost, we need to design explicitly for physically situated interactions: the surrounding environment provides a continuously streaming situational context that can be relevant for the tasks at hand. Second, these efforts must assume as fundamental the fact that the world contains multiple actors that may come and go and interleave and coordinate their interactions with other activities. Thus, appropriate and effective interactions depend critically on multiple competencies beyond speech recognition and language understanding. Numerous challenges arise in the open world with managing focus of attention, engagement, turn-taking, and multiparty interaction planning. Successful dialogs require reasoning that takes into account the broader situational context: the who, where, what, and why of the scene. At the lower levels, efforts on this path include developing methods that provide for basic physical awareness and reasoning about relevant actors, objects, and events in the environment, including their locations, physical characteristics and relationships, and topologies and trajectories. At a higher level, research goals include developing representations and reasoning machinery for interpreting the semantic context about the dynamics of the activities of the relevant human and computational actors, and about the long-term goals, plans, desires, knowledge, and intentions that can generate and explain these activities—often under inescapable uncertainty.
As an example, a foundational competency required for open-world interaction is the ability to manage conversational engagement: the process by which the interactions are initiated, maintained, and closed [1,2,3]. Traditional approaches to signaling engagement, such as pressing push-to-talk buttons, or simple heuristics such as listening for the start of an utterance following a system’s verbal prompt, are inadequate for systems that operate in open-world settings, where participants might initiate or break interactions from a distance, interact with each other and with the system, and interleave their interactions with other activities. Creating models that support fluid natural engagement and disengagement in these settings requires multimodal sensing and reasoning about trajectories, proxemics, and geometric relationships. Such competencies are important for recognizing and interpreting formations of people, non-verbal behaviors, body pose, and the signals about human attention and intention from the dynamics of peoples’ gaze and eye contact. Decisions about when to initiate or break engagement must also take into account higher-level inferences about the long-term goals and (joint) activities of the agents in the scene. Interesting design challenges come to the fore in seeking to both recognize and generate signals in a transparent manner, with the goal of mutual grounding on engagement state and intentions. For instance, in virtually or physically embodied conversational systems (e.g. avatars and robots), complex low-level behavioral models that control placement, pose, gesture, and facial expressions are required.
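To make the flavor of such reasoning concrete, here is a deliberately simplistic sketch that fuses a few of the cues mentioned above (proximity, body orientation, gaze, approach) into a scalar engagement score. It is illustrative only, with made-up weights and thresholds; it is not a model from the literature or from any deployed system.

```python
import math

# Toy illustration only: combine a few situational cues into a crude
# "engagement intention" score for one person in the scene.
def engagement_score(distance_m, facing_angle_deg, gaze_on_system, approaching):
    """Higher score = person more likely to be seeking to engage."""
    proximity = math.exp(-distance_m / 2.0)                    # closer people score higher
    frontal = max(0.0, 1.0 - abs(facing_angle_deg) / 90.0)     # facing the system
    gaze = 1.0 if gaze_on_system else 0.2
    approach = 0.3 if approaching else 0.0
    return 0.4 * proximity + 0.3 * frontal + 0.3 * gaze + approach

# A system might open engagement when the score exceeds a tuned threshold
# and close engagement when the score stays low for several seconds.
if engagement_score(1.5, 10.0, True, approaching=True) > 0.7:
    print("initiate engagement (e.g., greet the user)")
```

A real system, of course, replaces these hand-set weights with learned models and adds the higher-level inferences about goals and activities discussed above.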
Once a communication channel has been established, systems must coordinate with people on when to talk and, more broadly, when to act. Traditional spoken interface designs focus exclusively on utterances (linguistic actions) and manage turn taking by using push-to-talk buttons, or by making a “single user at a time” assumption and taking a “you speak then I speak—you understand that?” approach. This attempt at simplifying what is more naturally a complex, highly coordinated mixed-initiative process often leads to interaction breakdowns even in dyadic settings. Commonly used turn-taking heuristics are insufficient in open-world settings, where multiple participants may vie for the floor and may address contributions to the system or to each other, and where events external to the conversation can influence the meaning and urgency of contributions. We see multiple opportunities ahead in developing and integrating new competencies in sensing, decision making, and production to support more seamless turn-taking [4,5,6,7]. A key direction is developing robust models for tracking conversational dynamics that harness the richer audio-visual, physical, temporal, and interactional context. Inescapable uncertainties in sensing and the paramount importance of accurate timing define critical tradeoffs between timely decisions and the greater accuracies promised in return for delays to collect additional evidence. As with managing engagement, key design and control challenges also arise in developing more refined behavioral models for signaling the system’s own turn-taking actions and intentions, e.g., modulating gaze and prosody on the fly, producing backchannels, etc.
Beyond coordinating the production of turns and actions, important challenges also arise at the higher levels in understanding and planning interactions. We believe that attaining the dream of fluid, seamless spoken language interaction with machines in open-world settings also requires fundamental shifts in how we view the problem of dialog management. In settings where the environment provides rich, continuously streaming evidence that is important in guiding decisions, turn-based sequential action planning models typically used for dialog management are no longer sufficient in themselves. Because the situational context can evolve asynchronously with respect to turns in the conversation, physically situated interaction requires interaction models that can understand and plan incrementally [8,9], in stream, rather than on a per turn basis. Discourse understanding models need to be extended to track and leverage the streaming situational context, to extract important clues from noisy observations. In addition, interaction models must go beyond sequential action planning and must reason about parallel and coordinated actions and about their ongoing influences on states of the world.
Managing engagement, turn-taking, and interaction planning are only a few of the core competencies required for fluid physically situated interaction. Other areas of enabling research include work on using proxemics models for interaction [e.g.,10,11], spatial language understanding [e.g.,12,13], situated reference resolution [e.g.,14,15], issues of grounding [e.g.,16,17,18] and alignment [e.g.,19], short and long-term memory models [e.g.,20,21], affective and cultural issues, etc. The confluence of a body of theoretical and empirical work accumulated over decades in areas like psycholinguistics, sociolinguistics, conversational analysis, with increased sensing and computational capabilities leads to interesting new research opportunities in physically situated interaction. While taking incremental, focused steps has been instrumental and continues to be important for making advances in the development of spoken language interfaces, we believe that the conceptual borders of research in spoken dialog systems can and should be further enlarged to encompass the larger challenges of physically situated spoken language interaction. We believe that embracing these larger goals will lead to significant progress on the struggles with the simpler ones, and that the investments in solving challenges with physically situated collaboration will push the field forward and enable future advances towards the long-standing AI dream of fluid interaction with machines.
[1] C.L. Sidner, C. Lee, C.D. Kidd, N. Lesh, and C. Rich, 2005, Explorations in engagement for humans and robots, Artificial Intelligence, 166 (1-2), pp. 140-164
[2] Bohus D., and Horvitz, E., 2009, Models for Multiparty Engagement in Open-World Dialog, in Proceedings of SIGdial 2009, London, UK
[3] Szafir, D., and Mutlu, B., 2012, Pay Attention! Designing Adaptive Agents that Monitor and Improve User Engagement, In Proceedings of CHI 2012, Austin, TX
[4] Traum, D., and Rickel, J., 2002, Embodied Agents for Multiparty Dialogue in Immersive Virtual World, in Proceedings of AAMAS 2002
[5] Thorisson, K.R., 2002, Natural Turn-Taking Needs no Manual: Computational Theory and Model, From Perceptions to Action, Multimodality in Language and Speech Systems.
[6] Selfridge, E., and Heeman, P., 2010, Importance-Driven Turn-Bidding for Spoken Dialogue Systems, in Proceedings of ACL 2010, Uppsala, Sweden.
[7] Bohus, D., and Horvitz, E., 2011, Decisions about Turns in Multiparty Conversation: From Perception to Action, in Proceedings of ICMI 2011, Alicante, Spain
[8] Schlangen, D., and Skantze, G., 2009, A General, Abstract Model of Incremental Dialogue Processing, in Proceedings of EACL 2009, Athens, Greece
[9] Traum, D., DeVault, D., Lee, J., Wang, Z., Marsella, S., 2012, Incremental Dialogue Understanding and Feedback for Multiparty, Multimodal Conversation, Intelligent Virtual Agents, Lecture Notes in Computer Science, Vol. 7502, 2012
[10] Michalowski, M.P., Sabanovic, S., and Simmons, R., 2006, A spatial model of engagement for a social robot, in 9th IEEE Workshop on Advanced Motion Control, 2006
[11] Mutlu, B., and Mumm, J., 2011, Human-Robot Proxemics: Physical and Psychological Distancing in Human-Robot Interaction, in Proceedings of HRI 2011, Lausanne, Switzerland
[12] Ma, Y., Raux, A., Ramachandran, D., and Gupta, R., 2012, Landmark-based Location Belief Tracking in a Spoken Dialog System, in Proceedings of SIGdial 2012, Seoul, South Korea.
[13] Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N., 2011, Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation, in Proceedings of AAAI 2011
[14] Giuliani, M., Foster, M.E., Isard, A., Matheson, C., Oberlander, J., Knoll, A., 2010, Situated Reference in a Hybrid Human-Robot Interaction System, in Proceedings of INLG-2010, Dublin, Ireland
[15] Chai, J.Y., and Prasov, Z., 2010, Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric Reference in Situated Dialogue, in Proceedings of EMNLP 2010, MIT, MA
[16] Clark, H.H., and Schaefer, E.F., 1989, Contributing to Discourse, Cognitive Science, 13:256-294, 1989.
[17] Clark, H.H., and Brennan, S.E., 1991, Grounding in Communication, Perspectives on Socially Shared Cognition, 13:127-149, 1991
[18] Traum, D., 1994, A Computational Theory of Grounding in Natural Language Conversation, TR 545 and Ph.D. Thesis, U. Rochester, 1994
[19] Pickering, M.J., and Garrod, S., 2004, Toward a Mechanistic Psychology of Dialogue, Behavioral and Brain Sciences, Volume 27, Nr.2, 169-189
[20] Zacks, J.M., Tversky, B, Iyer, G., 2001, Perceiving, remembering, and communicating structure in events. Journal of Experimental Psychology: General, Vol 130(1), Mar 2001, 29-58. doi: 10.1037/0096-3445.130.1.29
[21] Horvitz, E., Dumais, S., Koch. P., 2004, Learning Predictive Models of Memory Landmarks, CogSci 2004: 26th Annual Meeting of the Cognitive Science Society, Chicago, August 2004.
An Overview of Selected Talks at NIPS 2012 Conference/Workshop
Tara N. Sainath
SLTC Newsletter, February 2013
The 26th annual Conference on Neural Information Processing Systems (NIPS) took place in Lake Tahoe, Nevada, in December 2012. The conference covers a wide variety of research topics, ranging from synthetic neural systems, machine learning, and artificial intelligence algorithms to the analysis of natural neural processing systems. This article summarizes selected talks on recent developments in neural networks and deep learning presented at NIPS 2012.
Stephane Mallat from LIENS, Ecole Normale Supérieure presented his talk "Classification with deep invariant scattering networks". The talk discusses learning informative invariants from high-dimensional data representations. Most high-dimensional data representations suffer from intra-class variability, and finding an invariant representation is a challenging task. The variability of the input data can come from translations, rotations or frequency transpositions of the input space. The talk proposes using convolutional networks to reduce this variability in the signal representation by scattering data in high-dimensional spaces using wavelet filters. Mallat shows that the proposed scattering-coefficient methodology outperforms convolutional neural networks on an MNIST digit recognition task. In addition, on CUReT, a texture classification database, it outperforms other invariant representations such as Fourier coefficients [1].
Geoffrey Hinton from the University of Toronto gave a talk titled "Dropout: A simple and effective way to improve neural networks". When a neural network with a large number of parameters is trained on a small data set, it often performs poorly on held-out data, a phenomenon known as overfitting. The talk introduces a technique termed "dropout" to address this problem. Dropout is the process of randomly omitting half of the hidden units on each training case. This prevents the co-adaptation of feature detectors (i.e., hidden activations), in which a feature detector is only helpful in the presence of particular other feature detectors. Using dropout, each feature detector separately helps to improve the classification accuracy of the network. Dropout can also be seen as a way to do model averaging with neural networks without explicitly training separate networks. State-of-the-art results are achieved with the proposed technique on both the TIMIT phone recognition task and the ImageNet image classification task [2].
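As an illustration of the idea, the snippet below is a minimal numpy sketch of a single feed-forward layer with dropout applied to its activations. It is not Hinton's implementation; the ReLU activation, layer sizes, and dropout rate are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_forward(x, W, b, p_drop=0.5, train=True):
    """One hidden layer with dropout applied to its activations."""
    h = np.maximum(0.0, x @ W + b)                # hidden unit activations (ReLU)
    if train:
        mask = rng.random(h.shape) >= p_drop      # randomly omit each unit with prob p_drop
        return h * mask                           # dropped units contribute nothing this case
    return h * (1.0 - p_drop)                     # test time: keep all units, scale down

# Toy usage: a batch of 4 inputs through a 16 -> 32 layer.
x = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 32))
b = np.zeros(32)
print(layer_forward(x, W, b, train=True).shape)   # (4, 32)
```

Scaling the activations by (1 - p_drop) at test time approximates averaging over the exponentially many "thinned" networks sampled during training, which is the model-averaging view mentioned above.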
Andrew Ng from Stanford University presented "Deep Learning: Machine learning and AI via large scale brain simulations". The talk discusses large-scale distributed training of a deep network with billions of parameters using thousands of CPU cores. The software framework, known as DistBelief, explores using two distributed training methods, namely an asynchronous stochastic gradient descent method and a distributed L-BFGS technique. Networks roughly 30x larger than those previously reported in the literature are trained on ImageNet, achieving state of the art performance. Ng also discusses training a large autoencoder with 1 billion parameters on 10 million images downloaded from the web. Using just unlabeled data, the network is able to distinguish the presence vs. absence of faces in images. Furthermore, a controlled experiment was performed and the neural network was trained in an unsupervised fashion with images containing cats, human bodies and random backgrounds. The network can achieve state-of-the art recognition results of around 75% on cats and human bodies [3,4].
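The following is a toy, single-machine sketch of the asynchronous-update idea behind this style of training; it is not DistBelief itself, and the linear-regression objective, learning rate, and shard counts are invented for illustration. In the real system, both the model and the data are partitioned across many machines, with a separate parameter server.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Toy asynchronous SGD: several workers, each with its own data shard, apply
# lock-free updates to a shared parameter vector (a stand-in for a parameter
# server). The model here is plain linear regression on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
true_w = rng.normal(size=10)
y = X @ true_w
w = np.zeros(10)                                   # shared parameters

def worker(indices, w, steps=2000, lr=0.01):
    local_rng = np.random.default_rng()
    for _ in range(steps):
        i = indices[local_rng.integers(len(indices))]
        grad = (X[i] @ w - y[i]) * X[i]            # squared-error gradient, one example
        w -= lr * grad                             # in-place update of the shared array

shards = np.array_split(np.arange(len(y)), 4)      # one data shard per worker
with ThreadPoolExecutor(max_workers=4) as pool:
    for shard in shards:
        pool.submit(worker, shard, w)

print("distance to true weights:", np.linalg.norm(w - true_w))
```

The point of the sketch is only that updates from different workers interleave without coordination; stale or conflicting gradients are tolerated rather than synchronized away.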
References
[1] J. Bruna and S. Mallat, "Invariant Scattering Convolution Networks," in Arxiv, 2012.
[2] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," in Arxiv, 2012.
[3] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M.A. Ranzato, A. Senior, P. Tucker, K. Yang and A. Y. Ng, "Large Scale Distributed Deep Networks," in Proc. NIPS, 2012.
[4] Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean and A.Y. Ng, "Building high-level features using large scale unsupervised learning," in Proc. ICML, 2012.
If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.
Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com
Interview: Developing the Next Generation of In-Car Interfaces
Matthew Marge
SLTC Newsletter, February 2013
Researchers at Carnegie Mellon University’s Silicon Valley Campus and Honda Research Institute have brought together many of today’s visual and audio technologies to build a cutting-edge in-car interface. Ian Lane, Research Assistant Professor at CMU Silicon Valley, and Antoine Raux, Senior Scientist at Honda Research Institute, spoke to us regarding the latest news surrounding AIDAS: An Intelligent Driver Assistive System.
In-car “infotainment” systems have made it out of labs and into consumers’ hands. Such systems are capable of playing music, making phone calls, and providing GPS directions (hence the combination of information and entertainment). In fact, many car companies have made them a primary selling point for their vehicles. Recently, their popularity culminated in a Super Bowl commercial featuring such a system. Currently, however, the infotainment systems on the market have limitations: often they cannot interpret more than a few words of speech, and functionality is driven primarily by speech and touchpads.
In one sense, today’s systems do not take full advantage of all aspects of multimodality; capabilities not prominent in today’s devices include visual gesture recognition and gaze tracking. Researchers at Carnegie Mellon University’s Silicon Valley Campus and Honda Research Institute are investigating this very issue: bringing more aspects of multimodality into in-car interactive systems. The technologies at work in their system, AIDAS, include speech recognition, natural language understanding, computer vision, and belief tracking.
We had a chance to interview the two primary investigators of the project, Ian Lane and Antoine Raux.
SLTC: Tell me a bit about AIDAS. What are the project’s goals?
Ian Lane and Antoine Raux: AIDAS is a joint project between Honda Research Institute USA and Carnegie Mellon University. Our main purpose is to develop technologies for natural, situated human-machine interaction, that is to say, interaction that happens in the real world. Specifically, we are investigating immersive in-car interaction as a realistic and useful application that features all the complexity of situated interaction. Towards this general goal, the project supports research on many different components, from speech recognition and head pose tracking in real-world conditions (i.e. in noisy environments with highly variable lighting conditions) to multimodal belief tracking and dialog management.
SLTC: What makes AIDAS a step toward “context-aware” and “natural” interactions with in-car systems?
Lane and Raux: The first step in AIDAS was to develop a hardware platform to support this research. We built AIDAS using a standard high-performance PC with a specially designed power unit to run off the 12V battery within a vehicle. The PC has access to a wide range of sensors throughout the car, including in-car and external cameras, depth sensors, headset microphones, microphone arrays, and high-precision GPS, as well as information from the car such as speed, chassis vibration, etc. Information from all these sensors can be accessed from a unified framework, which allows a rich representation of context.
So far, our demonstration systems have focused on speech, GPS and head pose information, combined with a geographic database of points of interest to let users ask questions about their direct environment.
SLTC: How can the system talk about things as you pass them by (in the car)?
Lane and Raux: Simply by looking at (turning their face towards) a particular building, the driver can ask questions about objects within their field of view, such as "Is that restaurant any good?", "What are people saying about that clothing store?" or "When was that church built?". Using a semantic database of points of interest, the system combines the geo-located head pose information with the spoken language understanding result to estimate which point of interest the user is talking about and what she wants to know about it. The system can also infer the point of interest from dialog context, in the case of pronoun anaphora (e.g. "Is it expensive?").
SLTC: What were the challenges you had in bringing together technologies from various disciplines?
Lane and Raux: The most significant challenge was to get the individual components to work robustly in the vehicle. For example, while off-the-shelf modules from ROS (Robot Operating System) worked well for head tracking in indoor environments, in the car these modules were all unsatisfactory. We ended up having to develop a head-tracking module in-house that would work in the challenging lighting conditions that are present in a moving vehicle.
SLTC: What have been the most exciting parts of the project so far?
Lane and Raux: What's most exciting is to see the potential of our approach. The individual technologies have made so much progress that we can now envision modes of interaction and levels of naturalness that were simply in the domain of science fiction a few years ago. It is very rewarding to see the effect of our demos on the audience, showing that beyond the technical achievements, there is something that people can relate to and see value in.
SLTC: I’ve heard that you’ve adapted the popular ROS (Robot Operating System) framework from the robotics community to your system. How has your experience been with ROS?
Lane and Raux: Although AIDAS is not a robot in the traditional sense, it shares many of the characteristics of interactive robots: heterogeneous sensors whose information must be fused, a need for spatial representations and transformations (e.g. from head pose to car and from car to map, which allows the system to infer which points of interest are in the field of view of the driver), as well as high-level inference and planning. Therefore, ROS was a natural (and excellent) choice for the development middleware. We had previously begun development on HRItk, a toolkit for speech and dialog interaction that leveraged ROS for human-robot interaction. When we started working on AIDAS, we could basically reuse the ASR, SLU, and TTS components of HRItk as-is. For GPS processing and depth-based head-pose tracking, there were existing libraries that made our life much easier. An intern from the University of Science and Technology of China, Yanjing Zhang, spent the summer developing an interactive visualization node for AIDAS using Javascript ROS bindings. The visualization helped us significantly during system testing and was essential given the complexity of the contextual information and belief tracking components we were developing.
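For readers unfamiliar with ROS, the sketch below shows the general pattern of a small rospy node that subscribes to two topics and republishes a fused message. The topic names and payloads are invented for illustration; they are not AIDAS's actual interfaces or message types.

```python
#!/usr/bin/env python
# Minimal rospy sketch: subscribe to two (hypothetical) topics and republish a
# fused message. Topic names and payloads are invented for illustration only.
import json
import rospy
from std_msgs.msg import String

class FusionNode(object):
    def __init__(self):
        self.last_yaw = None
        self.pub = rospy.Publisher('/fusion/hypothesis', String, queue_size=10)
        rospy.Subscriber('/head_pose/yaw_deg', String, self.on_pose)
        rospy.Subscriber('/slu/result', String, self.on_slu)

    def on_pose(self, msg):
        self.last_yaw = msg.data                       # e.g. "-12.5"

    def on_slu(self, msg):
        fused = json.dumps({'slu': msg.data, 'yaw': self.last_yaw})
        self.pub.publish(String(data=fused))           # consumed by a downstream tracker

if __name__ == '__main__':
    rospy.init_node('fusion_sketch')
    FusionNode()
    rospy.spin()
```

The appeal of this publish/subscribe pattern is exactly what the answer describes: sensing, understanding, and visualization components can be developed and swapped independently as long as they agree on topics and message types.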
SLTC: What is the role of context in the understanding components of AIDAS?
Lane and Raux: There are primarily two kinds of context that AIDAS takes into account. First is local geographic context, i.e. where the car is in the world and where the driver is looking. This context allows the system to understand and ground utterances such as "How expensive is that restaurant over there?". Second, dialog context is used to resolve anaphora, as mentioned earlier. This is necessary to interpret pronouns in sentences like "What is its rating?". The two forms of context are combined with SLU output before being passed to our belief tracker, which uses a single Bayesian Network to infer the most likely user intention. The belief tracker's network encodes prior knowledge such as "Sino is a Chinese restaurant", so that if there are several restaurants within the driver's gaze and the user asks "What is that Chinese restaurant?", the probability of that restaurant being "Sino", as opposed to the Italian restaurant next to it, is high. This is a crucial aspect of our system that allows it to make robust inferences about the user's intention based on incomplete and noisy information from several modalities.
SLTC: Does AIDAS do any sort of long-term interaction monitoring?
Lane and Raux: At this point, AIDAS does not remember anything about the interaction beyond a single dialog. In the future, there is clear potential for personalizing the interaction, since only one or two people typically drive a given car and driving patterns closely follow people's lives and routines.
SLTC: Where do you see the project going? Could anything have a quick impact on the car market?
Lane and Raux: While it started mainly as a personal collaboration between researchers, the project has garnered significant attention at conferences, and Honda sees the potential value of these technologies. There are ongoing efforts to push the system in several promising research directions. On the practical side, we are working on further demonstrating the practicality and usefulness of the technology by building real-car prototypes that exploit the wealth of web APIs providing local information. Once both simulator-based and real-car prototypes robust enough to be used by naive users are ready, we will be able to conduct user studies to validate our results and identify areas for improvement.
Consumers have come to expect more and more high technology in their cars. We would not be surprised if, in the near future, mass produced cars featured technologies comparable to those developed in the AIDAS project.
Thanks to Ian and Antoine for their thoughtful answers! We look forward to hearing how the AIDAS project progresses.
AIDAS was a multi-disciplinary initiative that consisted of the following contributors:
PIs: Ian Lane (CMU) & Antoine Raux (HRI US)
Contributors: Yi Ma, Vishwanath Raman, Yanjing Zhang
If you have comments, corrections, or additions to this article, please contact the author: Matthew Marge, mrma...@cs.cmu.edu.
Matthew Marge is a doctoral student in the Language Technologies Institute at Carnegie Mellon University. His interests are spoken dialogue systems, human-robot interaction, and crowdsourcing for natural language research.
Xavier Anguera, Florian Metze, Andi Buzo, Igor Szoke and Luis J. Rodriguez-Fuentes
SLTC Newsletter, February 2013
In this article we describe the "Spoken Web Search" task within MediaEval [5], which tries to foster research on language-independent search of "real-world" speech data, with a special emphasis on low-resourced languages. In addition, we review the main approaches proposed in 2012 and issue a call for participation in the 2013 evaluation.
Recently, there has been great interest in algorithms that allow for rapid and robust development of speech technology independent of language, with a particular focus on search and retrieval applications. Today's technology has mostly been developed for transcription of English, and still covers only a small subset of the world's languages, often with markedly lower performance. A key reason for this is that it is often impractical to collect and transcribe a sufficient amount of training data under well-defined conditions for current state-of-the-art techniques to work well. The "Spoken Web Search" task, which has been running within MediaEval since 2011, provides an evaluation corpus and baseline for research on language-independent search of "real-world" speech data, with a special emphasis on low-resourced languages, and provides a forum for dissemination of original research ideas. MediaEval [4] is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval.
The 2012 "Spoken Web Search" task [2] involved searching for audio content using an audio query. It required researchers to build a language-independent audio search system so that, given a spoken query, it would find the appropriate audio file(s) and the (approximate) location where the query term is spoken within the audio file(s). The 2012 dataset was extracted from the Lwazi ASR corpus [3] and contained speech from four different African languages (isiNdebele, Siswati, Tshivenda, and Xitsonga), with utterances spoken by different people over the phone. A development set and an evaluation set were provided, each with an accompanying set of 100 spoken query terms. Both sets consisted of around 1600 utterances, with a total of around 4 hours of speech. No phonetic dictionaries were provided, although they can be found at the official Lwazi website and could be used for comparative experiments. Evaluation was performed using standard NIST metrics for spoken term detection, such as ATWV (Average Term Weighted Value), which computes an average weighted combination of missed detections and false alarms across all search terms [1]. Performing language identification followed by standard speech-to-text is usually not appropriate in this setting, because recognizers are typically not available for these languages or dialects.
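For reference, the term-weighted value underlying this metric is commonly written as follows (this is the standard NIST spoken term detection formulation, not something specific to this evaluation):

\[
\mathrm{TWV}(\theta) \;=\; 1 \;-\; \frac{1}{|\mathcal{T}|}\sum_{t \in \mathcal{T}}
\Big[ P_{\mathrm{miss}}(t,\theta) \;+\; \beta \, P_{\mathrm{FA}}(t,\theta) \Big]
\]

where \(\mathcal{T}\) is the set of query terms, \(P_{\mathrm{miss}}\) and \(P_{\mathrm{FA}}\) are the miss and false-alarm probabilities for term \(t\) at decision threshold \(\theta\), and \(\beta\) weights false alarms according to their assumed prior and cost. ATWV is the TWV obtained at the system's actual decision threshold, so a perfect system scores 1 and a system that outputs nothing scores 0.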
The "open" systems were mainly based on multilingual phone recognizers or tokenizers that converted the speech signal for both query and content into a sequence of symbols or posteriorgrams, later used for searching. The methods differed in the speech features used, acoustic modeling and searching paradigms. Five systems used a "symbol-matching" architecture based on separate indexing (conversion from speech signal to symbol sequence) and searching components. Three of these approaches used Dynamic Time Warping (DTW) based methods for searching, while the other two performed the search on phone lattices that resulted from the phone recognition process, by using an Acoustic Keyword Spotting (AKWS) algorithm. The best results on the evaluation data were obtained by a DTW based searching method using posteriorgrams (ATWV=0.76), while the best AKWS based method also obtained good results (ATWV=0.53). The "restricted" systems were all based on "frame-matching" or "pattern-matching" approaches, where no external data is used, and search is performed on a frame-by-frame basis. Most systems implemented variations of the DTW algorithm by using different input features. Most of these systems performed similarly by obtaining an ATWV ranging from 0.29 to 0.37. In comparison, one system obtained an ATWV of 0.64, which may be explained by the use of Vocal Tract Length Normalization [6] (VTLN) before applying the search based on DTW. Overall, in the 2012 evaluation it seems that the careful use of extra data from languages other than the provided "African" ones, combined with a DTW-based search tended to provide the best results. Feature normalization techniques such as VTLN and a careful per-query score normalization also helped to improve results. Further analyses and system descriptions are currently being published in a number of papers, in addition to the working note proceedings available online [4]. A follow-up evaluation with data from additional languages will be performed in conjunction with the MediaEval 2013 benchmark initiative. Results will be reported in a workshop to be held in Barcelona in October 2013. We intend to increase the number of languages, acoustic conditions and the amount of data. To make it easier for first-time participants to participate, we will release a virtual machine with development data and a baseline system. For further information, and to sign up (until May 31st, 2013), please refer to [5], or contact Xavier Anguera. References: Xavier Anguera, Florian Metze, Andi Buzo, Igor Szoke and Luis J. Rodriguez-Fuentes are the organizers of Mediaeval 2013 SWS task evaluation. Contact email: xanguera@tid.es SLTC Newsletter, February 2013 My voice tells who I am. No two individuals sound identical because their vocal tract shapes and other parts of their voice production organs are different. With speaker verification technology, we extract speaker traits, or voiceprint, from speech samples to establish speaker's identity. Among different forms of biometrics, voice is believed to be the most straightforward for telephone-based applications because telephone is built for voice communication. The recent release of Baidu-Lenovo A586 marks an important milestone of mass market adoption of speaker verification technology in mobile applications. The voice-unlock featured in the smartphone allows users to unlock their phone screens using spoken passphrases.
As noted in [1, 2], voice characterization provides an additional layer of security over a conventional password or screen swipe, which is based on "what you know", to put it in biometric terminology. In November 2012, the Institute for Infocomm Research of the Agency for Science, Technology and Research (A*STAR) in Singapore and the internet giant Baidu (NASDAQ: BIDU) jointly introduced voice biometrics into a smartphone [3]. The Baidu-Lenovo A586, in what is called an industry first, incorporates voice-unlock as a built-in feature of the smartphone operating system.
Voice is a combination of physical and behavioral biometric characteristics [4]. The physical features of an individual's voice are based on the shape and size of the vocal tract, mouth, nasal cavities, and lips that are involved in producing speech sounds. The behavioral aspects, which include the use of a particular accent, intonation style, pronunciation pattern, choice of vocabulary and so on, are associated more with the words or lexical content of the spoken utterances.
Speaker verification technology comes in two flavors: text-dependent and text-independent. Text-dependent speaker verification requires users to pronounce certain pre-determined passphrases, while a text-independent system recognizes speakers independently of what has been spoken. A text-dependent system is typically more efficient than its text-independent counterpart, as both physical and behavioral aspects can be taken into consideration. For instance, spoken passphrases in any language with a nominal duration of 2 seconds are used for the voice-unlock feature in the A586.
The recent proliferation of mobile devices, coupled with the need for a convenient and non-intrusive way of authenticating a person, has been the major driving factor behind the deployment of speaker verification technology on smartphones. Figure 1 shows a screenshot of the Baidu-Lenovo A586 smartphone. A new user is required to register a voiceprint by speaking a passphrase three times in any language. S/he is then able to unlock the screen by speaking the same passphrase. In this system, Hidden Markov Models are used to encode one's voiceprint and a reference voiceprint for the population [5], while a likelihood ratio is used in the decision logic. The availability of a powerful processor and I/O devices like the microphone and touch-screen display makes the enrollment and verification process relatively straightforward and allows model training and matching to be performed directly on the smartphone.
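As a point of reference, the general likelihood-ratio decision (not the specific implementation in the A586) can be written as:

\[
\Lambda(X) \;=\; \log p(X \mid \lambda_{\mathrm{spk}}) \;-\; \log p(X \mid \lambda_{\mathrm{ref}})
\quad
\begin{cases}
\ge \theta & \Rightarrow \text{accept} \\
< \theta & \Rightarrow \text{reject}
\end{cases}
\]

where \(X\) is the test utterance, \(\lambda_{\mathrm{spk}}\) the claimed speaker's voiceprint model, \(\lambda_{\mathrm{ref}}\) the reference (population) model, and \(\theta\) the decision threshold. Raising \(\theta\) tightens security at the cost of convenience, which is exactly the trade-off discussed next.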
The speaker verification system was calibrated on a database developed by the Institute for Infocomm Research [6]. Under clean conditions, it attains an equal error rate (EER) of less than 1% using test segments of less than 3 seconds [6, 7]. Increasing the threshold above the EER operating point leads to tighter security, while reducing the threshold favors convenience for the users. For voice-unlock, the threshold was set to favor convenience, considering the lower prior probability and cost of imposture. Smartphones are typically used in mobile conditions. To cope with different noise conditions, the signal-to-noise ratio (SNR) is estimated for each utterance and used as side information in the decision logic.
Incorporating the voice-unlock feature in the smartphone was part of the Baidu Cloud initiative [8]. It opens up opportunities for other uses of the feature, for example, second-factor authentication in mobile payments, online banking, and access control for other online services, where speech and speaker verification can be combined for enhanced security [5].
Figure 1: Screen shot of A586.
[1] http://shouji.baidu.com/platform/
[2] http://www.bbc.co.uk/news/technology-20675227
[3] http://www.news.gov.sg/public/sgpc/en/media_releases/agencies/astar/press_release/P-20121130-1.html
[4] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, Jan. 2010.
[5] K. A. Lee, A. Larcher, H. Thai, B. Ma and H. Li, "Joint application of speech and speaker recognition for automation and security in smart home," In Proc. INTERSPEECH, 2011, pp 3317-3318.
[6] A. Larcher, K. A. Lee, B. Ma and H. Li, "RSR2015: database for text-dependent speaker verification using multiple pass-phrases," in Proc. INTERSPEECH, 2012, paper 364.
[7] A. Larcher, K. A. Lee, B. Ma and H. Li, "Phonetically-constrained PLDA modeling for text-dependent speaker verification with multiple short utterances," submitted, ICASSP, 2013.
[8] E. Strickland, “Chinese search engine goes mobile,” IEEE Spectrum, vol. 49, no. 8, pp. 9-10, Aug. 2012.
Kong Aik Lee is a Scientist in the Department of Human Language Technology at Institute for Infocomm Research, Singapore
Bin Ma is a Scientist in the Department of Human Language Technology at Institute for Infocomm Research, Singapore
Haizhou Li is the Head of the Department of Human Language Technology at Institute for Infocomm Research, Singapore. He is also a co-Director of Baidu-I2R Research Centre.
Spoken language disorders: from screening to analysis
Tobias Bocklet, Elmar Nöth
SLTC Newsletter, February 2013
Cleft Lip and Palate (CLP) is among the most frequent congenital abnormalities [1]: facial development is abnormal during gestation, leading to insufficient closure of lip, palate and jaw and affected articulation. Due to the huge variety of malformations, speech production is affected differently from patient to patient.
Previous research in our group focused mostly on text-wide scores like speech intelligibility [2, 3]. In current projects we focus on a more detailed automatic analysis. The goal is to provide an in-depth diagnosis with direct feedback on articulation deficits.
In clinical routine, each child produces a standard speech test and is rated by a speech therapist at the phoneme level with regard to different articulation processes, such as diminished tension in the phoneme, nasality or pharyngeal backing. This detailed analysis is used for therapy control and therapy planning. However, the rating process is rather time-consuming for speech therapists, approximately two to three hours per child. This time could better be spent on actual speech therapy instead of rating the patients. Thus, there is a strong demand for an automatic approach that allows a fast, objective and rater-independent analysis.
In a current research project between the Pattern Recognition Lab of the University of Erlangen-Nuremberg and the Phoniatrics and Pediatrics Department of the University Clinics Erlangen, we recorded 400 children with CLP. They have been rated perceptually by speech therapists at the phoneme level. For therapy purposes, the goal was to build an automatic system that allows a detailed analysis for each articulation process. For the articulation process of pharyngeal backing, the automatic analysis would then look like:
The text contains 178 plosives; 165 have been uttered with a pharyngeal backing.
Speaker identification approaches have been shown to be very useful for automatic speech screening [2, 4] and can also be applied to the more detailed speech analysis task. The use of speaker-adapted ASR systems, employing their parameters as an articulatory model of a speaker, has shown huge improvements. For both approaches we assume that the acoustics of children with CLP differ from those of normally speaking children. The severity of articulation problems, i.e., how strongly an articulatory process is affected, is measured as the distance between the pathologic speaker model and a reference speaker model. Speaker models are either Gaussian Mixture Models (GMMs) or Maximum Likelihood Linear Regression (MLLR) transformation matrices of adapted speech recognizers. With the MLLR-based approach we achieved a correlation of r = 0.8 with the perceptual ratings of the speech therapists.
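Schematically, the scoring idea can be sketched as below. The arrays are placeholders (random vectors), and the real systems in [2, 4] derive the per-speaker parameters from GMM adaptation or MLLR transforms estimated on actual recordings; with real data, the MLLR-based variant reaches the r = 0.8 reported above.

```python
import numpy as np

# Schematic sketch: score each speaker by the distance between his/her adapted
# model parameters (e.g. flattened MLLR transforms or GMM means) and a reference
# model from typically developing speakers, then correlate the scores with the
# therapists' perceptual ratings. All arrays below are placeholders.
rng = np.random.default_rng(0)
n_speakers, dim = 50, 200
reference = rng.normal(size=dim)                                  # reference speaker model
speaker_models = reference + rng.normal(size=(n_speakers, dim))   # per-speaker adapted models
perceptual = rng.normal(size=n_speakers)                          # therapist ratings (placeholder)

severity = np.linalg.norm(speaker_models - reference, axis=1)     # distance used as severity score
r = np.corrcoef(severity, perceptual)[0, 1]                       # Pearson correlation with ratings
print(f"correlation r = {r:.2f}")
```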
Currently, our approach can automatically determine the number of affected phones for the various articulation processes. The system can be used for diagnosis and therapy control with sufficient accuracy. The next step will be to integrate these approaches into an interactive training tool, where words are assessed directly. Decisions at the word (or even phone) level are very challenging, but they are the logical next step towards a successful use of speech analysis in the medical environment.
[1] P. A. Mossey, J. Little, R. G. Munger, M. J. Dixon, and W. C. Shaw. "Cleft lip and palate". Lancet, Vol. 374, No. 9703, pp. 1773-1785, Nov. 2009.
[2] T. Bocklet, K. Riedhammer, E. Nöth, U. Eysholdt, and T. Haderlein. "Automatic intelligibility assessment of speakers after laryngeal cancer by means of acoustic modeling". Journal of Voice, Vol. 26, No. 3, pp. 390-397, 2012.
[3] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M. Schuster, and E. Nöth. "PEAKS - A system for the automatic evaluation of voice and speech disorders". Speech Communication, Vol. 51, No. 5, pp. 425-437, 2009.
[4] C. Middag, T. Bocklet, J.-P. Martens, and E. Nöth. "Combining Phonological and Acoustic ASR-Free Features for Pathological Speech Intelligibility Assessment". In: Proceedings of the 12th Annual Conference of the International Speech Communication Association, Interspeech 2011, pp. 3005-3008, 2011.
Tobias Bocklet and Elmar Nöth are with the Pattern Recognition Lab (Informatik 5), Friedrich-Alexander-University Erlangen-Nuremberg.
E-Mail: elmar.noeth@fau.de, tobias.bocklet@fau.de
Overview of the 8th International Symposium on Chinese Spoken Language Processing
Helen Meng
SLTC Newsletter, February 2013
This article gives a brief overview of the 8th International Symposium on Chinese Spoken Language Processing (ISCSLP), which was held in Hong Kong during 5-8 December 2012. ISCSLP is a major scientific conference for scientists, researchers, and practitioners to report and discuss the latest progress in all theoretical and technological aspects of Chinese spoken language processing. The working language of ISCSLP is English.
The 8th International Symposium on Chinese Spoken Language Processing (ISCSLP 2012) was held during 5-8 December 2012 at the InnoCentre, Kowloon Tong, Hong Kong SAR of China. This biennial event rotates among mainland China, Hong Kong, Singapore and Taiwan and is the flagship conference organized by the ISCA SIG-CSLP (International Speech Communication Association, Special Interest Group on Chinese Spoken Language Processing). ISCSLP 2012 received 148 submissions authored by over 300 researchers from 15 countries/regions. 128 delegates and students participated in the conference, presenting 30 oral papers and 67 posters. The scope of the conference includes theoretical and technological aspects of spoken language processing that are related or applicable to the Chinese language.
ISCSLP 2012 had an impressive lineup of invited talks, including keynotes given by Professor Mark Gales on discriminative models for speech recognition, Professor Fan-Gang Zeng on hearing and speech enhancement, Professor Daniel Hirst on analysis and synthesis of speech prosody, and Dr. Eric Chang on transforming research into products, as well as tutorials given by Professor Khe Chai Sim on trajectory modeling for robust speech recognition, Professor Tomoki Toda on voice conversion and Dr. Dong Yu on deep neural networks for speech recognition. A special feature this year was that the tutorial talks were freely open to all delegates and were very well attended. This created a dynamic exchange with much cross-referencing between the keynote and tutorial lectures, which generated very interesting questions and discussions throughout the conference.
ISCSLP 2012 was hosted by The Chinese University of Hong Kong (CUHK), sponsored by the IEEE Signal Processing Society (Hong Kong Chapter) and supported by ISCA, Dolby, The Hong Kong Applied Science and Technology Research Institute Company Ltd., The Hong Kong Polytechnic University, The Hong Kong University of Science and Technology, CUHK's Department of Systems Engineering & Engineering Management, CUHK's United College and CUHK's Shun Hing Institute of Advanced Engineering.
Selected photos from ISCSLP2012 are included below. More photos can also be found at: https://picasaweb.google.com/113416656606646035486/ISCSLP201258Dec
Professor Mark Gales (left) and Professor Fan-Gang Zeng (right)
Professor Daniel Hirst (left) and Dr. Eric Chang (right)
Helen Meng is Professor and Chairman of the Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong. Her interests are in multilingual speech and language technologies and multimodal systems. She was the General Chair of ISCSLP2012.