Speech and Language Processing Technical Committee Newsletter

February 2014

Welcome to the Spring 2014 edition of the IEEE Speech and Language Processing Technical Committee's Newsletter! This issue of the newsletter includes 9 articles and announcements from 12 contributors, including our own staff reporters and editors. Thank you all for your contributions! This issue includes news about IEEE journals and recent workshops, SLTC call for nominations, and individual contributions.

We believe the newsletter is an ideal forum for updates, reports, announcements and editorials which don't fit well with traditional journals. We welcome your contributions, as well as calls for papers, job announcements, comments and suggestions.

To subscribe to the Newsletter, send an email with the command "subscribe speechnewsdist" in the message body to listserv [at] listserv (dot) ieee [dot] org.

Florian Metze, Editor-in-chief
William Campbell, Editor
Haizhou Li, Editor
Patrick Nguyen, Editor

From the SLTC and IEEE

From the IEEE SLTC Chair

Douglas O'Shaughnessy

Call for Proposals - ASRU 2015

Geoffrey Zweig

Following on the tremendous success of ASRU 2013, the SPS-SLTC invites proposals to host the Automatic Speech Recognition and Understanding Workshop in 2015. Past ASRU workshops have fostered a collegial atmosphere through a thoughtful selection of venues, thus offering a unique opportunity for researchers to interact.

Speech Synthesis Perfects Everyone's Singing

Minghui Dong, Nancy Chen, and Haizhou Li

Singing is more expressive than speaking. While singing is popular, singing well is nontrivial. This is especially true for songs that require high vocal skills. A singer needs to overcome two challenges among others - to sing in the right tune and at the correct rhythm. Even professional singers need intensive practice to perfect their vocal skills and to proficiently present particular singing styles, such as vibrato and resonance tuning. Recently, the Institute for Infocomm Research (I2R) in Singapore has developed a technology called Speech2Singing, which converts the singing voice of non-professional singers (or even spoken utterances) into perfect singing.

1 Billion Word Language Modeling Benchmark

Ciprian Chelba

Ciprian Chelba and colleagues released a language model benchmark and would like to advertise it to the speech community. The purpose of the project is to make available a standard training and test setup for language modeling experiments.

An Overview of ASRU 2013

Tara N. Sainath and Jan (Honza) Cernocky

The Automatic Speech Recognition and Understanding Workshop (ASRU) was recently hosted in Olomouc, Czech Republic from December 8-12, 2013. Each day of the workshop focused on a specific current theme that has become popular amongst ASR researchers. In this report, we highlight the 4 days of the workshop, and touch on some papers in more detail.

Speech and Audio Highlights from MediaEval 2013

Gareth J. F. Jones and Martha Larson

MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. While it emphasizes the 'multi' in multimedia and focuses on human and social aspects of multimedia tasks, speech and audio processing is a key component of several MediaEval tasks each year. MediaEval 2013 featured a total of 12 tasks exploring aspects of multimedia indexing, search and interaction, five of which involved significant elements of speech and audio processing, which we will discuss in this article.

Asia Information Retrieval Societies conference 2013

Rafael E. Banchs, Min Zhang, and Ming Hui Dong

AIRS 2013, the ninth edition of the Asia Information Retrieval Societies conference, took place from 9th to 11th December 2013 in Singapore. The conference was attended by a total of 85 participants from more than 20 countries. The technical program comprised 45 papers, of which 27 were selected for oral presentations and 18 for poster presentations.

From the SLTC Chair

Douglas O'Shaughnessy

SLTC Newsletter, February 2014

Welcome to the first SLTC Newsletter of 2014, under the new management of Florian Metze. We heartily thank Dilek Hakkani-Tur for her excellent editorship and her team (William Campbell, Patrick Nguyen, Haizhou Li) for their great work. We also look forward to continued excellence this year with the new team of William Campbell, Patrick Nguyen, and Haizhou Li.

Last September, we elected 17 members to replace those whose terms expired in December, as well as a new Vice Chair for the committee, Bhuvana Ramabhadran; she will take over the reins as Chair in 2015. We also say a fond thank you to all departing SLTC members: John Hansen, Masami Akamine, Abeer Alwan, Antonio Bonafonte, Honza Cernocky, Eric Fosler-Lussier, Pascale Fung, Dilek Hakkani-Tur, Qi Li, Hermann Ney, and Frank Soong. It will be difficult to replace all these excellent past members, but the new committee looks forward to the challenges. It is also not too soon to start thinking of this autumn’s election to renew the membership of the SLTC, so please encourage colleagues to submit a nomination this summer. Details will be forthcoming in the next SLTC newsletter.

Our technical committee this year will have the following subcommittees (and members): Language Processing (Geoffrey Zweig, Sanjeev P. Khudanpur, Haizhou Li), Electronic Newsletter (William M. Campbell, Patrick Nguyen, Haizhou Li, Florian Metze, Svetlana Stoyanchev), Fellows (Rainer Martin, Malcolm Slaney, John Hershey), Workshops (Nick Campbell, George Saon, Israel Cohen), EDICS (Alexandros Potamianos, Yifan Gong, Shinji Watanabe), Policies and Procedures (Tiago Falk, Frank Seide), Education (Takayuki Arai, Christine Shadle, Kay Berkling, Tom Bäckström), Nominations and Awards (Peder Olsen, Julia Hirschberg, Satoshi Nakamura, Pedro A. Torres-Carrasquillo), Communications (Korin Richmond, Heiga Zen, Tomoki Toda), Industry (Ananth Sankar, Dong Yu, Hagen Soltau), External Relations (Junichi Yamagishi, Gernot Kubin, Maurizio Omologo), Student Awards (Deep Sen, Mike Seltzer, Mark Hasegawa-Johnson, Svetlana Stoyanchev), Member Election (Larry Heck, Fabrice Lefevre, Bhiksha Raj, Andreas Stolcke), Meeting (Najim Dehak, Nicholas Evans, Panayiotis Georgiou), ICASSP 2014 Area Chairs (Björn Schuller, Bowen Zhou, Tim Fingscheidt, Michiel Bacchiani, Haizhou Li, Karen Livescu), HLT-ACL board liaison (Julia Hirschberg). If you have any needs in these areas, please contact one of these subcommittee members.

At this time, we look forward eagerly to ICASSP in Florence May 4-9. Our speech and language areas received 694 paper submissions, of which 50% were accepted. We thank all 497 reviewers for their 2800 reviews. The overall level of quality of the submissions was very good, which made deciding which papers to take, and which not, very hard at times. Among the tutorials at ICASSP is “Deep learning for natural language processing and related applications” (Xiaodong He, Jianfeng Gao, Li Deng of Microsoft Research). As these speakers note, deep learning techniques have enjoyed tremendous success in the speech and language processing community in recent years. The tutorial focuses on deep learning approaches to problems in language or text processing, with particular emphasis on important applications including spoken language understanding (SLU), machine translation (MT), and semantic information retrieval (IR) from text.

To give just a sampling of what we will see at ICASSP this May, four sessions will examine deep neural networks (DNNs) as acoustic models for automatic speech recognition (ASR), e.g., bootstrapping DNN training without Gaussian Mixture Models, generating a stacked bottleneck feature representation for low-resource ASR, replacing optimization by stochastic gradient descent with second-order stochastic optimization, etc. While DNNs can extract high-level features from speech for ASR tasks, there are many possible forms of DNN features, and some papers will explore how effective different DNN features are, including vectors extracted from both output and hidden layers in the DNN. Context-dependent acoustic modelling for DNNs suffers from data sparsity; papers at ICASSP will address this by using decision tree state clusters as training targets.

Other ASR papers at ICASSP will deal with convolutional neural networks, Deep Scattering Spectrum, and Stacked Bottle-Neck neural networks, as well as time-frequency masking to improve noise-robust ASR (in which regions of the spectrogram dominated by noise are attenuated). Recent years have seen an increasing emphasis on fast development of ASR using limited resources, to reduce the need for in-domain data. Discriminative models, such as support vector machines (SVMs), have been successfully applied to ASR, and will be examined further at ICASSP.

Other upcoming ICASSPs are scheduled for Brisbane (2015), Shanghai (2016), New Orleans (2017), and Seoul (2018). We are also looking forward to the next IEEE Spoken Language Technology (SLT) Workshop, to be held at Harvey’s Lake Tahoe Hotel in Lake Tahoe, Nevada (Dec. 7-10, 2014).

Anyone wishing to help organize the 2015 ASRU (IEEE Automatic Speech Recognition and Understanding) workshop is advised to contact the Workshops Subcommittee; bids are due by April 25th. The biennial meeting will follow up the recent successful ASRU held in December 2013 in the Czech Republic.

In closing, I hope you will consider participating this year at ICASSP and SLT. We look forward to meeting friends and colleagues in beautiful Florence and Tahoe.

Best wishes,

Douglas O'Shaughnessy

Douglas O'Shaughnessy is the Chair of the Speech and Language Processing Technical Committee.

Call for Proposals - ASRU 2015

SPS-SLTC Workshop Sub-Committee: Nick Campbell, George Saon, Geoffrey Zweig

SLTC Newsletter, February 2014

The proposal should include the information outlined below.

Workshop location and practicalities
- Geographical location
- Workshop venue (facilities, meeting rooms, network access during the workshop, audio/visual equipment)
- Accommodation -- hotel availability and pricing
- Meals
- Transportation options -- major airports, logistics, visas
- Climate
Approximate workshop dates
- Previous workshops have been held in the month of December.
- Please be aware of any other related conferences in the same time frame.
Rough budget and expected sponsorship.
- Approximately how much will participants need to pay to attend, including accommodation and meals as well as registration?
- Estimated budget for e.g. 100/150 participants and expected sponsorships. This should include venue costs, administration, banquet, coffee breaks, publication costs, etc. (Note: unlike larger conferences such as ICASSP and Interspeech, smaller workshops do not have sufficient registration fees to cover the total workshop costs. The IEEE Signal Processing Society is particularly motivated to see that both workshop and conference costs are held to levels which allow the diverse membership equal opportunities to participate without registration costs being a major barrier.)
Organizing / Technical committee
- Typically 15-20 people with a mix of academic and industrial researchers
- Program committees are encouraged to reach a good balance of SLTC and non-SLTC members, for example approximately a 50/50 split
Local arrangements
- Who will be in charge of organizing the workshop, and how will finances be handled (e.g., will participants be able to pay by credit card)?
Tentative schedule
- Paper submission, notification of acceptance, proposals for demonstrations, early registration
- Reception, talks, posters, demo session, banquet, etc.

If you would like to be the organizer(s) of ASRU 2015, please send the Workshop Sub-Committee a draft proposal before April 25, 2014. (Point of contact: Nick Campbell or George Saon). Proposals will be evaluated by the SPS SLTC, with a decision expected in June.

The organizers of the ASRU workshop do not have to be SLTC members, and we encourage submissions from all potential organizers. So we encourage you to distribute this call for proposals far and wide to invite members of the speech and language community at large to submit a proposal to organize the next ASRU workshop.

For more information on the most recent workshops, please see the following:

Speech Synthesis Perfects Everyone’s Singing

Minghui Dong, Nancy Chen, Haizhou Li

SLTC Newsletter, February 2014

The human voice includes three essential elements: content, prosody and timbre. Content is concerned with the literal meaning of language conveyed by the voice. Prosody consists of pitch, duration (timing) and loudness of voice. Prosody characterizes the emotion and expressiveness of one’s voice. For the case of singing voice, prosody is often referred to as melody (a combination of pitch and rhythm). Timbre, on the other hand, determines the identity of a person’s voice. I2R’s Speech2Singing technology keeps the content and timbre of the voice unchanged, but modifies the prosody of the voice into the correct melody to perfect the singing voice of the user.

I2R’s Speech2Singing works as follows: Singing voices of professional singers are recorded as model voice templates and stored in the database. When a user subsequently sings a song or reads the lyrics, the recorded voice of each line is compared with the corresponding line stored in the database. The user’s vocal signal is first decomposed into feature (including pitch) sequences. An enhanced singing voice is then synthesized from the adjusted feature sequence, which contains the correct pitch and timing information. To obtain the correct timing for the user’s voice, speech recognition technology is used to identify the phonetic units from both the model voice and the user’s voice. The timing information of the user’s voice is adjusted to match that of the model voice by means of dynamic time warping [1]. The correct pitch information is directly derived from the model singing voice. Finally, the reconstructed time-synchronous singing voice is overlaid with background music.
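The timing-alignment step can be illustrated with a minimal sketch of dynamic time warping. This is a generic textbook formulation and not I2R's implementation; the function name `dtw_path`, the Euclidean frame distance, and the toy one-dimensional features are assumptions made purely for illustration.

```python
import numpy as np

def dtw_path(user, model):
    """Align two feature sequences (frames x dims) with classic DTW.

    Returns the total alignment cost and the list of (user_frame,
    model_frame) index pairs on the optimal warping path.
    """
    n, m = len(user), len(model)
    # Euclidean distance between every user frame and every model frame.
    dist = np.linalg.norm(user[:, None, :] - model[None, :, :], axis=2)
    # cost[i, j] = cheapest way to align user[:i] with model[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # advance both sequences
                cost[i - 1, j],      # advance the user sequence only
                cost[i, j - 1],      # advance the model sequence only
            )
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(cost[n, m]), path

# Toy example: the "model" holds its first frame for twice as long.
user_feats = np.array([[0.0], [1.0], [2.0]])
model_feats = np.array([[0.0], [0.0], [1.0], [2.0]])
total_cost, path = dtw_path(user_feats, model_feats)
```

In this toy case the path maps the first user frame onto the first two model frames, which is the kind of stretching needed to make spoken syllables follow a sung rhythm. In a real system the frames would be spectral features, and the alignment could be constrained by the recognized phonetic units.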

Two existing technologies related to Speech2Singing are Auto-Tune [2] and score-based conversion [3]. The popular method of Auto-Tune alters the user’s pitch to the closest note in a pre-defined scale. While it generally works well for singing voice where the melody is not far off-tune, it is not suitable for converting spoken voice. In contrast to Auto-Tune, I2R's method uses a reference melody to guide the conversion, so that corrections can still be made even when the melody is completely off-tune.
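The idea of moving a pitch to the closest note in a pre-defined scale can be sketched as follows. This is only an illustration of the principle, not the commercial Auto-Tune algorithm; the function name `snap_to_scale`, the choice of C major, and the A4 = 440 Hz tuning are assumptions.

```python
import math

A4_HZ = 440.0                      # assumed reference tuning
C_MAJOR = (0, 2, 4, 5, 7, 9, 11)   # pitch classes allowed by the scale

def snap_to_scale(freq_hz, scale=C_MAJOR):
    """Return the frequency of the scale note nearest to freq_hz."""
    # Convert frequency to a (fractional) MIDI note number.
    midi = 69 + 12 * math.log2(freq_hz / A4_HZ)
    base = round(midi)
    # Search one octave around the input for notes belonging to the scale.
    candidates = [n for n in range(base - 6, base + 7) if n % 12 in scale]
    nearest = min(candidates, key=lambda n: abs(n - midi))
    # Convert the chosen note back to a frequency in Hz.
    return A4_HZ * 2 ** ((nearest - 69) / 12)

slightly_sharp_a = snap_to_scale(450.0)   # pulled back to A4
between_notes = snap_to_scale(470.0)      # A#4 is not in C major, so B4 wins
```

Because each frame is pulled only to its nearest scale note, the output follows whatever contour the input already has. This is why the approach needs input that is roughly in tune and cannot impose a melody on speech, whereas a reference melody supplies the target note directly.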

Although the score-based conversion method in [3] uses musical scores as reference for the change of melody (pitch and timing), the reference melody is generated with mathematical models, making the synthesized singing voice less natural than when using human melody as a reference. I2R’s approach uses a professional singer’s melody as a reference model so that every single detail of the melody, such as the pitch envelope and vibrato (a regular, pulsating change in pitch), is perfectly preserved and imposed onto the synthesized singing voice. In addition, since the user’s voice can be mapped to the professional singer’s voice, I2R’s method allows the timing change of each syllable of the user’s voice to match that of the professional singer’s voice.

I2R has implemented this Speech2Singing technology in mobile devices such as smart-phones and tablets. This is the first software that automatically changes a user’s speech into natural singing voice. The technology has been showcased on various occasions such as I2R’s annual TechFest [4] and A*STAR’s MediaExploit [5]; it has also drawn attention from local and international media, such as AFP [6], C-Net [7], and MediaCorp [8]. ‘Sing for Singapore’ was the first public release of Speech2Singing, during Singapore’s 2013 National Day (Figure 1) [9,10], with an iOS version in the App Store [11] and an Android version in Google Play [12].

In 1961, an IBM 7094 became the first computer to sing (the song was “Daisy Bell”). Singing synthesis technology has progressed tremendously since then. Just as Photoshop perfects graphics, Speech2Singing technology helps perfect singing vocals.

Figure 1: Screen shots of the NDP 2013 App.

References:

[1] L Cen, M Dong, P Chan, Template-based Personalized Singing Voice Synthesis, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.

[2] http://www.antarestech.com/

[3] T. Saitou, M. Goto, M. Unoki and M. Akagi, "Speech-to-Singing Synthesis: Vocal conversion from speaking voices to singing voices by controlling acoustic features unique to singing voices," Proc. 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA2007), pp. 215–218, 2007.

[4] http://asia.cnet.com/scientists-showcase-innovations-at-singapores-techfest-62218326.htm

[5] https://www.facebook.com/photo.php?fbid=491751920852612&set=a.491445364216601.122313.134973009863840&type=1&theater

[6] https://www.facebook.com/photo.php?fbid=491462187548252&set=a.491462180881586.122323.134973009863840&type=1&theater

[7] http://asia.cnet.com/videos/iframe-embed/46728161/

[8] http://www.etpl.sg/innovation-offerings/technologies-for-license/tech-offers/1908

[9] https://www.facebook.com/i2r.research/posts/679594635401672

[10] http://www.etpl.sg/find-us-at/news/in-the-news/article/119

[11] https://itunes.apple.com/sg/app/ndp-2013-mobile-app/id524388683?mt=8

[12] https://play.google.com/store/apps/details?id=com.massiveinfinity.slidingmenu&hl=en

Minghui Dong is a Scientist in the Department of Human Language Technology at Institute for Infocomm Research, Singapore. His research interests include speech synthesis, singing voice synthesis, and voice conversion.

Nancy Chen is a Scientist in the Department of Human Language Technology at Institute for Infocomm Research, Singapore. Her research interests include keyword search, pronunciation modeling, speech summarization, and computer-assisted language learning. For more information: http://alum.mit.edu/www/nancychen.

Haizhou Li is the Head of the Department of Human Language Technology at Institute for Infocomm Research, Singapore. He is also a Conjoint Professor at the University of New South Wales, Australia.

1 Billion Word Language Modeling Benchmark

Ciprian Chelba

SLTC Newsletter, February 2014

We just released an LM benchmark at: https://code.google.com/p/1-billion-word-language-modeling-benchmark/ and would like to advertise it to the speech community.

The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

unpruned Katz (1.1B n-grams),
pruned Katz (~15M n-grams),
unpruned Interpolated Kneser-Ney (1.1B n-grams),
pruned Interpolated Kneser-Ney (~15M n-grams)
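
Per-word log-probability values of this kind are what one needs to compute perplexity for a held-out set and compare a new model against the baselines. A minimal sketch follows; the function name `perplexity` is illustrative, and both the logarithm base and the file layout of the released values are assumptions here, so check the project's documentation before reusing it.

```python
def perplexity(log_probs, base=10.0):
    """Perplexity of a word sequence from its per-word log-probabilities.

    `base` is the base of the logarithms (assumed to be 10 here; pass
    math.e if the values are natural logs).
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Average log-probability per word, then invert the logarithm.
    avg_log_prob = sum(log_probs) / len(log_probs)
    return base ** (-avg_log_prob)

# Three words with base-10 log-probabilities -1, -2 and -3.
toy_ppl = perplexity([-1.0, -2.0, -3.0])
```

For the toy values the average log-probability is -2, giving a perplexity of 100. Lower perplexity on the same held-out words indicates a better model.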

ArXiv paper: http://arxiv.org/abs/1312.3005

Happy benchmarking!

An Overview of ASRU 2013

Tara N. Sainath and Jan (Honza) Cernocky

SLTC Newsletter, February 2014

The Automatic Speech Recognition and Understanding Workshop (ASRU) was recently hosted in Olomouc, Czech Republic from December 8-12, 2013. Each day of the workshop focused on a specific current theme that has become popular amongst ASR researchers. Below, we highlight the 4 days of the workshop, and touch on some interesting papers in more detail.

Day 1 – Neural Network

In the past few years, deep learning has become the de-facto approach for acoustic modeling in ASR, showing improvements of 10-30% relative over alternative acoustic modeling approaches across a variety of LVCSR tasks [1].

The stage was set by a keynote on the Physiological Basis of Speech Processing: From Rodents to Humans by Christoph Schreiner (UC San Francisco), followed by invited talks on Multilayer perceptrons for speech recognition: There and Back Again by Nelson Morgan (International Computer Science Institute), and Context-Dependent Deep Neural Networks for Large Vocabulary Speech Recognition: From Discovery to Practical Systems by Frank Seide (Microsoft Research Asia).

There were numerous interesting papers in this section as well. For example, [2] explored using speaker-identity vectors with DNNs, obtaining impressive results on Switchboard. In addition, [3] introduced a semi-supervised training strategy for DNNs. On the surprise language task (Vietnamese) from the BABEL project, they were able to obtain a 2.2% absolute improvement in WER compared to a system built only on fully transcribed data. Finally, [4] looked at moving benefits of DNNs back into GMM modeling techniques, including having deep (multiple layers) and wide (multiple) models and by sharing model parameters. Making these changes, the authors find that the performance of GMMs comes closer to DNNs on TIMIT.
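Since this report quotes both relative and absolute WER improvements, a small sketch of how WER is computed and how the two kinds of improvement differ may be useful. The function names and all numbers below are hypothetical; in particular, the 50% baseline is not taken from any of the cited papers.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer

# One substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical: a 2.2% absolute gain from a 50% baseline is 4.4% relative.
rel = relative_reduction(0.50, 0.478)
```

The same absolute gain corresponds to a larger relative gain when the baseline WER is lower, which is why it matters which of the two is being reported.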

Day 2 – Limited Resource

The second day focused on limited resources for ASR.

Invited talks were given on Building Speech Recognition Systems with Low Resources (by Tanja Schultz, Karlsruhe Institute of Technology) and Unsupervised Acoustic Model Training with Limited Linguistic Resources (by Lori Lamel, CNRS-LIMSI). Mary Harper (IARPA) delivered a keynote on The Babel Program and Low Resource Speech Technology, which was followed by invited talks on Zero to One Hour of Resources: A Self-organizing Unit Approach to Training Speech Recognizers (by Herb Gish, Raytheon BBN Technologies), and Recent Progress in Unsupervised Speech Processing, by Jim Glass (Massachusetts Institute of Technology).

The work in [5] looked at automatically learning a pronunciation lexicon, by starting with a small seed lexicon and then learning pronunciations of words by transcribing speech at the word level. Experiments on a Switchboard task show that the proposed lexicon learning method achieves a WER similar to using a fully handcrafted lexicon. In addition, [6] proposes a framework which discovers acoustic units by clustering together context-dependent grapheme models, and then generates an associated pronunciation lexicon from the initial grapheme-based recognition system. Results on WSJ show the proposed approaches allow for a 13% reduction in WER, and have many implications for low-resourced languages such as those in the Babel dataset.

Day 3 – ASR in Applications

The third day focused on the impact ASR is making in various applications.

A keynote was delivered on Utilization of ASRU technology - present and future (Joseph Olive), while invited talks were given on the topics of Augmenting conversations with a speech understanding anticipatory search engine (Marsal Gavalda, Expect Labs), Calibration of binary and multiclass probabilistic classifiers in automatic speaker and language recognition (Niko Brummer, Agnitio), Speech technologies for data mining, voice analytics and voice biometry (Petr Schwarz, Phonexia and Brno University of Technology), From the Lab to the Living Room: The Challenges of Building Speech-Driven Applications for Children (Brian Langner, ToyTalk), and The growing role of speech in Google products, by Pedro Moreno.

One interesting paper dealing with spoken language understanding was [7], which looked at a joint model for intent detection and slot filling based on convolutional neural networks (CNN). The proposed architecture shows promising results on a variety of real-world ASR applications. In addition, [8] looks at using linguistic knowledge for query understanding by extracting a set of syntactic structural features and semantic dependency features from query parse trees to enhance inference model learning. Experiments on real natural language queries indicate that using additional linguistic knowledge can improve query understanding results across various real-world tasks.

Day 4 – What’s Wrong with ASR?

Finally, Jordan Cohen and Steve Wegmann jointly delivered a keynote discussing what incorrect assumptions our current modeling approaches make, and what research directions could potentially be pursued to improve ASR performance. For example, HMMs have been around for 40 years, but make poor assumptions such as frame independence. Wegmann also had an interesting paper related to the theme of the day. In this paper [9], Wegmann applies a diagnostic analysis to the performance metric, actual term weighted value (ATWV), used in the Babel task. His analysis looks at the large ATWV gains that often occur due to system combination by increasing the number of true hits, and questions if gains can be obtained without huge expenditures needed by system combination.
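For readers unfamiliar with the metric, the sketch below shows how a term weighted value is usually computed, assuming the definition commonly given for the NIST spoken term detection evaluations: one trial per second of speech and a false-alarm weight of 999.9. The function name, the data layout and the counts are hypothetical. ATWV is this value measured at the system's actual decision threshold.

```python
def term_weighted_value(counts, speech_seconds, beta=999.9):
    """TWV = 1 - mean over terms of (P_miss + beta * P_false_alarm).

    counts maps each term to (n_true, n_correct, n_false_alarm).
    Terms with no true occurrences are skipped, since P_miss is
    undefined for them.
    """
    penalties = []
    for n_true, n_correct, n_false_alarm in counts.values():
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one per second of speech, minus the true hits.
        p_false_alarm = n_false_alarm / (speech_seconds - n_true)
        penalties.append(p_miss + beta * p_false_alarm)
    return 1.0 - sum(penalties) / len(penalties)

# Hypothetical counts for two keywords over 10,010 seconds of audio.
toy_counts = {
    "keyword_a": (10, 8, 1),   # 2 misses and 1 false alarm
    "keyword_b": (5, 5, 0),    # perfect detection
}
toy_twv = term_weighted_value(toy_counts, speech_seconds=10010)
```

With these counts a single false alarm costs keyword_a about half as much as its two misses combined. Every term contributes equally to the average regardless of how often it occurs, so additional true hits on rare terms can move the score considerably.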

Also, for the first time at ASRU, posters were left hanging throughout the conference and were presented in "authors-be-at-their-posters" sessions. This led to many in-depth discussions over coffee, during lunch, or in the evening, simply because the poster was still there to refresh one's memory or clarify an idea.

References

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[2] G. Saon, H. Soltau, M. Picheny, D. Nahamoo, “Speaker Adaptation of Neural Network Acoustic Models using I-Vectors,” in Proc. ASRU, 2013.

[3] K. Vesely, M. Hannemann, L. Burget, “Semi-supervised Training of Deep Neural Networks,” in Proc. ASRU, 2013.

[4] K. Demuynck and F. Triefenbach, “Porting Concepts from DNNs back to GMMs,” in Proc. ASRU, 2013.

[5] L. Lu, A. Ghoshal and S. Renals, “Acoustic Data-Driven Pronunciation Lexicon for Large Vocabulary Speech Recognition,” in Proc. ASRU, 2013.

[6] W. Hartmann, A. Roy, L. Lamel and J.L. Gauvain, “Acoustic Unit Discovery and Pronunciation Generation from a Grapheme-Based Lexicon,” in Proc. ASRU, 2013.

[7] P. Xu and R. Sarikaya, “Convolutional Neural Network Based Triangular CRF for Joint Intent Detection and Slot Filling,” in Proc. ASRU, 2013.

[8] J. Liu, P. Pasupat, Y. Wang, S. Cyphers, and J. Glass, "Query Understanding Enhanced by Hierarchical Parsing Structures," in Proc. ASRU, 2013.

[9] S. Wegmann, A. Faria, A. Janin, K. Riedhammer, N. Morgan, “The TAO of ATWV: Probing Mysteries of Keyword Search Performance,” in Proc. ASRU, 2013.

Speech and Audio Highlights from MediaEval 2013

Gareth J. F. Jones, Martha Larson

SLTC Newsletter, February 2014

MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. While it emphasizes the 'multi' in multimedia and focuses on human and social aspects of multimedia tasks, speech and audio processing is a key component of several MediaEval tasks each year. This article gives an overview of the tasks with significant speech and audio elements from the MediaEval 2013 multimedia evaluation benchmark and looks ahead to the MediaEval 2014 campaign [1].

MediaEval 2013 featured a total of 12 tasks exploring aspects of multimedia indexing, search and interaction, five of which involved significant elements of speech and audio processing.

The “Spoken Web Search” (SWS) Task aimed to perform audio search in multiple languages and acoustic conditions where there are very few resources available to develop a solution for each individual language. The operational setting of the task was to imagine that you want to build a simple speech recognition system, or at least a spoken term detection (STD) or keyword spotting (KWS) system in a new dialect, language or acoustic condition, for which only a small number of audio examples are available. The research aimed to explore the question of whether it is possible to do something useful (e.g. identify the topic of a query) by using only the very limited resources available.

The task involved searching for audio content within audio content using an audio query, and contained audio for 9 different languages. Participants were required to build a language-independent audio search system that, given an audio query, could find the appropriate audio file(s) and the exact location(s) of a query term within these audio file(s). Evaluation was performed using standard NIST metrics and some other indicators. The SWS task at MediaEval 2013 expanded the size of the test dataset and number of languages over similar tasks held in 2011 and 2012. In addition, a baseline system was offered to first-time participants as a virtual kitchen application [2].

In 2014, the task will continue under the new name of "QUESST": Query by Example Search on Speech.

The Search and Hyperlinking (S&H) Task consisted of two sub-tasks: (i) answering known-item queries from a collection of broadcast TV material, and (ii) automatically linking anchors within the known-item to other parts of the video collection. The S&H Task envisioned the following scenario: a user is searching for a segment of video that they know to be contained in a video collection. If the user finds the segment, they may wish to see further information about some aspect of this segment. This use scenario is a refinement of the previous S&H task at MediaEval 2012.

The dataset for both subtasks was a collection of 1,260 hours of video provided by the BBC. The average length of a video was roughly 30 minutes and most videos were in the English language. The collection was used both for training and testing of systems. Known-items and queries to locate them were created by volunteer subjects in sessions at the BBC offices, and relevant links were identified using crowdsourcing with Amazon Mechanical Turk. The BBC kindly provided human generated textual metadata and manual transcripts for each video. Participants were also provided with the output of two automatic speech recognition (ASR) systems and features created using automatic visual analysis.

The Similar Segments in Social Speech Task was a new task at MediaEval 2013. The task involved finding segments similar to a query segment, in a multimedia collection of informal, unstructured dialogs among members of a small community. The task was motivated by the following scenario. With users’ growing willingness to share personal activity information, the eventual expansion of social media to include social multimedia, such as video and audio recordings of casual interactions, seems inevitable. To unlock the potential value of this material, new methods need to be developed for searching such records. This requires the development of reliable models of the similarity between dialogue-region pairs. The specific motivating task was as follows: A new member has joined an organization or social group that has a small archive of conversations among members. He starts to listen, looking for any information that can help him better understand, participate in, enjoy, find friends in, and succeed in this group. As he listens to the archive (perhaps at random, perhaps based on some social tags, perhaps based on an initial keyword search), he finds something of interest. He marks this region of interest and requests “more like this”. The system returns a set of “jump-in” points, places in the archive to which he could jump and start listening/watching with the expectation of finding something similar.

One dimension of MediaEval’s interest in audio processing is tasks relating to music. MusiClef was a new task at MediaEval 2013, having previously formed part of the CLEF evaluation benchmark [3]. The MusiClef 2013: Soundtrack Selection for Commercials Task aimed at analyzing music usage in TV commercials and determining music that fits a given commercial video. This task is usually carried out by music consultants, who select a song to advertise a particular brand or a given product. By contrast, the MusiClef 2013 task aimed at automating this process by taking into account both context- and content-based information about the video, the brand, and the music.

Music is composed to be emotionally expressive. The Emotion in Music Task was another new task at MediaEval 2013, and sought to develop tools for navigating today’s vast digital music libraries, based on the assumption that emotional associations provide an especially natural domain for indexing and recommendation. Because there are a myriad of challenges to such a task, powerful tools are required for the development of systems that automate the prediction of emotion in music. As such, a considerable amount of work was dedicated to the development of automatic music emotion recognition (MER) systems. The corpus used for this task employed Creative Commons (CC) licensed music from the Free Music Archive (FMA), which enabled the content to be redistributed to the participants with annotations created via crowdsourcing using Amazon Mechanical Turk.

Other tasks at MediaEval 2013 included the Placing Task, which seeks to place images on a world map, that is, to automatically estimate the latitude/longitude coordinates at which a photograph was taken. The main Placing Task has featured at MediaEval for several years; a newly introduced secondary task for 2013 was Placeability Prediction, which asked participants to estimate the error of their predicted location. Annotating images with this kind of geographical location tag, or geotag, has a number of applications in personalization, recommendation, crisis management and archiving. Currently, the vast majority of images online are not labelled with this kind of data. The data for this task was drawn from Flickr. In comparison to previous editions of this task, the test set has not only increased drastically in size, but has also been derived according to different assumptions in order to model a more realistic use-case scenario.

The Violent Scenes Detection Task, which ran for the third time at MediaEval 2013, derives directly from a Technicolor use case that aims at easing a user’s selection process from a movie database. This task was to automatically analyse movie content with the objective of identifying violent actions in the content. Another returning task was the Social Event Detection (SED) task, which required participants to discover social events and organize the related media items in event-specific clusters within a collection of Web multimedia content. Social events are defined as events that are planned by people, attended by people and for which the social multimedia are also captured by people. The Visual Privacy Task (VPT) aimed at exploring how image processing, computer vision and scrambling techniques can deliver technological solutions to some visual privacy problems. The evaluation was performed using both video analytics algorithms and user studies so as to provide both subjective and objective evaluation of privacy protection techniques.

The MediaEval 2013 campaign culminated in a very energetic and successful two-day workshop in Barcelona, Spain, in October 2013, attended by 100 task organisers and participants.

The tasks for each MediaEval campaign are chosen following an open call, based on the results of a public questionnaire exploring the research community’s interest in them. The questionnaire for MediaEval 2014 has recently concluded, and the selection and details of the tasks to be offered are currently being finalised. Task registration will open in March 2014; details will be available from the MediaEval website [1].

Gareth J. F. Jones and Martha Larson are coordinators of the MediaEval Benchmarking Initiative for Multimedia Evaluation.

Full proceedings of MediaEval 2013 are available from: http://ceur-ws.org/Vol-1043/. More information can be found at http://www.multimediaeval.org/.

Asia Information Retrieval Societies conference 2013

Rafael E. Banchs, Min Zhang, Ming Hui Dong

SLTC Newsletter, February 2014

The Chinese and Oriental Languages Information Processing Society (COLIPS) hosted the ninth edition of the Asia Information Retrieval Societies conference (AIRS 2013) from December 9 to 11, 2013 in Singapore. AIRS aims to bring together researchers and developers to share new ideas and recent advances in the field of Information Retrieval (IR) and its applications in text, speech, image, multimedia and social data.

Given the increasing adoption of mobile platforms for information services in general, such as social media, information retrieval and e-commerce, the use of speech in the context of the information society is rapidly gaining prominence. New service paradigms, as well as new business models, are emerging around the use of speech and natural language interfaces. In this sense, AIRS 2013 provided an excellent environment for technical discussion and debate on the most relevant and current trends for the modern information society, igniting the engines and preparing Singapore’s scientific landscape for the upcoming INTERSPEECH 2014.

AIRS 2013 received a total of 109 submissions, which were reviewed by a program committee of 143 specialists. After peer review, 45 submissions were accepted for inclusion in the proceedings and presentation at the conference. Of these, 27 papers were selected for oral presentation and 18 for poster presentation. The proceedings were published as Volume 8281 of Springer’s LNCS series.

The conference was attended by 85 participants from more than 20 countries. About 30% of the registered participants were students. The program also included an Invited Keynote Speech: “Information technologies as innovation drivers in the financial and banking industry” given by Ketan Samani, who is the Executive Director for Regional eBusiness of the Group Consumer Banking Department of DBS Bank in Singapore.

The Best Paper Award was conferred on Laure Soulier, Lynda Tamine and Wahiba Bahsoun for their paper “A Collaborative Document Ranking Model for a Multi-Faceted Search”. In addition, best oral and best poster presentation awards were voted on by the participants. The best oral presentation award was conferred on Alistair Moffat, for his paper “Seven Numeric Properties of Effectiveness Metrics”. Finally, due to a tie in the collected votes, the best poster award was conferred on two teams: Qianli Xing, Yiqun Liu, Min Zhang, Shaoping Ma and Kuo Zhang for their paper “Characterizing Expertise of Search Engine Users” and Rajendra Prasath, Aidan Duane and Philip O'Reilly for their paper “Topic Assisted Fusion to Re-Rank Texts for Multi-Faceted Information Retrieval”.

In addition to the technical program, AIRS 2013 featured two main social events: a cocktail session that ran in parallel with the poster presentations at the Rendezvous Grand Hotel Singapore, a modern upscale hotel that is located in the heart of the city; and a conference banquet, which was held at The Halia at Raffles, an urban, casual-chic restaurant and sibling of the award-winning Halia at Singapore Botanic Gardens.

Acknowledgements and more Information

Rafael E. Banchs (rembanchs@i2r.a-star.edu.sg) is a Research Scientist at the HLT department of I2R in Singapore. His research interests are in the areas of Information Retrieval, Machine Translation and Dialogue Systems.

Min Zhang (zhangminmt@hotmail.com) is a professor at the School of Computer Science and Technology, Soochow University, Suzhou, China. His research interests are in the areas of Machine Translation and Information Retrieval.

Ming Hui Dong (mhdong@i2r.a-star.edu.sg) is vice-president of COLIPS and a Research Scientist at the HLT department of I2R in Singapore. His research interests are in the areas of Speech Processing and Synthesis.