Satoshi Nakamura received his B.S. degree in electronics engineering from Kyoto Institute of Technology, Kyoto, in 1981, and his Ph.D. in informatics from Kyoto University in 1992. He was Department Head and Director of the ATR Spoken Language Communication Research Laboratories, Kyoto, Japan, from 2000 to 2008, and Director-General of Keihanna Research Laboratories and Executive Director of the Knowledge Creating Communication Research Center, National Institute of Information and Communications Technology, from 2009 to 2010. He is currently Director of the Data Science Center and a full professor in the Augmented Human Communication Laboratory, Graduate School of Science and Technology, Nara Institute of Science and Technology. He is also Team Leader of the Tourism Information Analytics Team at the RIKEN Center for Advanced Intelligence Project (AIP) and an Honorary Professor at Karlsruhe Institute of Technology, Germany. His research interests include modeling and developing systems for spoken language processing, speech processing, natural language processing, and data science. He is one of the world leaders in speech-to-speech translation research and has served in a wide range of research projects in that field. He has published more than 120 peer-reviewed journal articles and 500 peer-reviewed international conference papers. He has received various international and domestic awards, including the ELRA Antonio Zampolli Prize and a commendation for science and technology from the Minister of Education, Culture, Sports, Science and Technology, Japan. He was an elected Board Member of the International Speech Communication Association (ISCA) from June 2011 to 2019, an IEEE Signal Processing Magazine Editorial Board member from 2012 to 2015, and an IEEE SPS Speech and Language Technical Committee member from 2013 to 2015. He is an ATR Fellow, IPSJ Fellow, ISCA Fellow, and IEEE Fellow.
We approached Satoshi Nakamura with a few questions:
1. Why did you choose to become faculty in the field of Signal Processing?
In the early 1980s, I started working in an industrial research division to develop a speech recognition system; however, the technologies were not mature enough to be deployed in real systems. Eventually, I got the chance to work at a newly launched research laboratory funded by the Japanese ministry and industry, the Advanced Telecommunications Research Institute International (ATR) in Kyoto, where one of the main research topics was automatic speech interpretation.
Much innovative research was conducted at ATR with international researchers and in collaboration with international research laboratories. Several years later, I took an associate professor position at the Nara Institute of Science and Technology (NAIST), a newly established national university consisting only of graduate schools. NAIST provided a good environment and maximum freedom to start new research, and in the mid-1990s I began working on hands-free ASR and multimodal ASR. However, our research progress was much slower than that of the DARPA speech recognition and understanding projects in the United States. When ATR offered me a department head position, I decided to return there to push speech-related research forward. At ATR we developed a speech-to-speech translation system for the travel domain and led the commercialization of a network-based speech-to-speech translation service for mobile phones in Japan. Our group then moved to the National Institute of Information and Communications Technology to continue the development. At that time, the most important way to enhance system performance was to collect more data, while I wanted to focus on developing automatic simultaneous speech interpretation technology, which required various new lines of research. Therefore, I moved back to NAIST, seeking the freedom to do fundamental research with motivated colleagues and graduate students. With the advent of deep learning, all kinds of information, including natural language text, came to be mapped into vector spaces; thus, all the modalities related to human information processing had by then fallen under the umbrella of signal processing.
2. How does your work affect society?
At ATR, we established an international consortium for speech translation called C-STAR, the Consortium for Speech Translation Advanced Research, in collaboration with US, EU, and Asian research institutions in the early 1990s. In 2004, this consortium launched the International Workshop on Spoken Language Translation (IWSLT). IWSLT has also run various shared tasks to drive the development of speech translation-related research.
In Japan, the speech translation technologies developed at ATR and NICT have been licensed to various industries and deployed in various services, including speech translation services for inbound and outbound tourists. Further research on automatic speech interpretation technologies is expected to enable online speech interpretation services and speech dubbing into multiple languages.
3. What challenges have you had to face to get where you are today?
- The most difficult part of my research career was securing sustainable research funding and retaining excellent research colleagues. It was nice to have many motivated graduate students at NAIST, but they left the lab after graduation; in the long run, those graduates could become strong competitors. Another difficulty was that research laboratories such as ATR and NICT are unable to provide enough tenured researcher positions.
- The second challenge was finding where to work. I was lucky because I had chances to move whenever I needed to. We have to find the optimal place to work on what we want to do, and accumulating small successes when young is very useful for finding the next position for new research.
- The third challenge was balancing novelty and performance. Continuous improvements bring solid performance and more publications, both of which are indispensable for surviving in the research world. On the other hand, an innovative idea does not always outperform existing state-of-the-art systems. Only a good combination of novelty and performance can create a breakthrough to a new paradigm. I prefer to work on innovative ideas, but that takes time, especially for preparing original initial data and new evaluation metrics; the workload is much heavier while the chance of success is lower. I would say that balancing these two is a very difficult problem.
4. What advice would you give to scientists/engineers in signal processing?
To young scientists/engineers, I would say it is important to think about what could happen in 10 or 20 years and then to consider what we ourselves can do for that future world. Backcasting from that future world and its technologies provides a very informative basis for designing your life and career. We should not be too afraid to take risks for innovative research.
I don't think of myself as working solely on signal processing, though it was the starting point of my research career. Nowadays, everything related to my research area is becoming part of signal processing. Most signal processing technologies can be applied to natural language processing, and the approaches of higher-order semantic processing in natural language processing might be applied to signal processing as well. Signal processing has become a very general and common technology across many research areas.
5. Anything else that you would like to add?
In speech and natural language processing, shared tasks are one of the main approaches to improving performance. The improvements driven by shared tasks are very useful and important for research. Sometimes, however, participants seem to misunderstand, believing that they have solved the problem by achieving a top score. Every shared task is designed artificially by its organizer. It is necessary and important for the organizer to design the task so that it helps participants approach the real problems, and to explain how each task contributes to solving the final problems. Likewise, participants need to understand the design and meaning of each task.
The other issue is interdisciplinarity. Both speech signal processing and spoken language processing deal with signals produced by human beings. These signals are produced as language containing intent and emotion so that listener and speaker can understand each other. From an information theory point of view, the information source is very complex and difficult to model. Therefore, we need to learn more about human information processing, including brain activity.
My lab is currently working on developing computer therapists to support autistic children in learning communication, in collaboration with psychiatrists. We are also working with brain scientists to examine similarities and differences in language learning between humans and primates.
There are still open scientific questions related to human spoken language processing and cognitive processing. The engineering question of how to model and mimic them computationally is yet to be answered in future studies.
To learn more about Satoshi Nakamura and for more information, visit his webpage.