Interspeech 2017 was hosted in Stockholm. First off, I can’t emphasize enough how stunning Sweden is in the summer. Highlights for me included lunch at the idyllic Botanic Gardens, swimming in the nearby lake between sessions (lots of locals swim before going to work!), casually running into the King and Queen, and being very generously welcomed by a local friend to his family’s summer house. Stockholmers clearly take pride in their city, and I can see why.
The conference took place in relaxing surroundings at Stockholm University. The Great Hall, which hosted the keynotes and larger sessions, was a wooden work of art. The session locations were quite spread out, though, and it was tricky to navigate the single long, thin corridor when it clogged with people at popular times. There was also the perennial issue of overcrowding in the more popular oral sessions, which were held in smaller rooms. I found the lack of any provided drinking water alarming, but perhaps I’ve been living in the States too long! In general, though, the conference felt well organized, with attention to detail. The welcome reception was held at the stately City Hall, site of the Nobel Banquet, and I’m sure I wasn’t alone in fantasizing about walking down the wide staircase on another night! Unfortunately there was little or no food at the reception, and there were lots of extremely hungry people by the end. The banquet was fun and relatively informal. Highlights included silly jumping games and a strangely funny stand-up comedy skit with a talking head.
The technical program was diverse. Broader trends in machine learning were well represented, with generative adversarial networks applied to voice conversion among other tasks, and several applications of variational autoencoders. There continued to be many papers on attention mechanisms, and both end-to-end modeling and sequence-to-sequence models saw a large uptick in interest (rising from roughly 5 to 15 and from 2 to 12 papers, respectively, between last year and this year). There was an interesting discussion in the Conversational Telephone Speech Recognition session, with consecutive papers from IBM and Microsoft both demonstrating impressive recent improvements on the challenging Switchboard task but disagreeing on how close we are to achieving human-level performance. One thing that became clear is that flawless human speech recognition is an illusion: even with extremely careful attention, human transcribers make errors. The gap between human and current machine performance was also investigated for language modeling. There were a number of papers on domain adaptation, possibly reflecting the fledgling potential for speech recognition to become commoditized and truly widely deployed after the substantial algorithmic and modeling improvements of recent years. The keynote on Conversing with Social Agents That Smile and Laugh by Catherine Pelachaud was fascinating, and it was bizarre, intimate and hilarious to watch people laughing uncontrollably while instrumented in a lab setting. But far and away the most arresting and impressive presentation was on the use of real-time MRI to investigate dynamic articulator positions, complete with live beatboxing demonstrations courtesy of Nimisha Patil.
I had a huge amount of fun at Interspeech 2017, and it was fantastic to see and sing with old friends.