When you first hear about Audio and Acoustic Signal Processing (AASP), you probably don’t think of the movies, music, and modern smartphones that it brings to life. As a sub-category of signal processing, AASP deals specifically with the analysis, processing, and synthesis of sound that has been either recorded by one or more microphones or artificially generated by, for example, a computer program or synthesizer.
And there’s one technical device that has made a profound impact on our daily lives, in part thanks to recent advances in AASP: the smartphone. AASP performs two major tasks on a smartphone that in most cases require distinct signal processing algorithms: pre- and post-processing for speech applications such as cellular telephony, VoIP, and speech recognition for virtual assistants, and post-processing for applications related to media consumption, such as music playback or gaming.
Can you hear me now? AASP for mobile voice communication applications
Telephony applications drove the early advancements in AASP technology. To ensure a high-quality end-to-end audio experience, the ITU (International Telecommunication Union) has specified a set of minimum requirements that AASP techniques have to meet. AASP techniques in the telephony use-case include:
- Acoustic echo control: acoustic echo is the result of the loudspeaker signal (far-end talker) being picked up by the microphone and sent back to the far-end. Acoustic echo control aims at virtually eliminating this annoying acoustic feedback, which is particularly problematic in the speakerphone use-case. Without acoustic echo control, the person calling the smartphone user would hear him-/herself (echo/feedback). A conversation would be virtually impossible.
- Noise control: the microphone doesn’t only pick up the desired speech signal, but often also unwanted background noise. Noise control tries to minimize those unwanted signals, which would otherwise be sent to the far-end. Traditionally, AASP for noise control was applied to single-microphone setups only, which cannot deal effectively with strong directional interferers. The advent of multi-microphone AASP, however, has enabled the suppression of directional interferers. Most modern smartphones use at least two microphones for this purpose. For example, let’s assume the smartphone user sits in a noisy restaurant. Without (particularly multi-microphone) noise control, the person on the other end would barely be able to distinguish the smartphone user’s voice from the background noise. Again, a conversation would be very difficult.
- Gain control: the ITU guidelines define how loud a speech signal should be when leaving a telephony transmitter as well as when it is being played back at the receiver. Gain control can be implemented either statically during the handset design stage or automatically/adaptively during operation in real-time. To protect the listener’s hearing at one end of the spectrum and to ensure a minimum level of audibility at the other, any telephone equipment must play back the audio within a specified loudness range.
- Linear filtering: the ITU also sets requirements on how the speech signal should sound by defining an acceptable timbre range for optimum speech intelligibility. AASP in the form of linear filtering can help the handset manufacturer meet these requirements, which apply to all telephone equipment to ensure a consistent level of intelligibility. The ITU-T standard takes human hearing into account.
- Speech coding: AASP played an instrumental role during the transition from analog “plain old telephone service” (POTS) to end-to-end digital telephony in the 1970s by supplying the G.711 narrowband (approximately 300 Hz to 3.4 kHz) speech coder, which helped increase call capacity. Since then, a whole host of speech coders with varying tradeoffs between compression ratio, speech quality, and computational complexity have been made available. Thanks to more recent advancements made in speech coding, mobile telephony providers are now moving to support higher-quality wideband speech (approximately 50 Hz to 7 kHz). Note that some voice-over-IP (VoIP) systems such as Microsoft Skype or Apple FaceTime have been supporting wideband speech for quite some time, as they do not have to provide legacy support if a call is routed only within their respective ecosystems. Without speech coding, cellular (or any digital) telephony would not have been feasible, especially in its early days.
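To make the acoustic echo control item in the list above a little more concrete, here is a minimal sketch of one common textbook approach, a normalized least-mean-squares (NLMS) adaptive filter. It is not taken from any ITU specification or any handset implementation. The function name, the 8-tap filter length, the step size, and the four-tap echo path are all assumptions chosen purely for illustration, and the simulated call has no near-end talker and no background noise.

```python
import random


def nlms_echo_canceller(far_end, mic, num_taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `far_end` from `mic`.

    Returns the echo-reduced signal and the final filter weights.
    All parameter defaults are illustrative assumptions.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent far-end samples, newest first
    cleaned = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is sent back to the far end
        norm = sum(h * h for h in history) + eps
        # NLMS update: step scaled by the recent far-end signal energy
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned, weights


def power(signal):
    return sum(s * s for s in signal) / len(signal)


# Simulated speakerphone: the microphone only hears the loudspeaker echo.
random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.0, 0.6, 0.3, -0.2]  # hypothetical loudspeaker-to-mic response
mic = [
    sum(echo_path[k] * far_end[n - k]
        for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

residual, weights = nlms_echo_canceller(far_end, mic)
echo_power = power(mic[-500:])
residual_power = power(residual[-500:])
print("echo power before:", echo_power)
print("echo power after: ", residual_power)
```

In this idealized setting the filter weights converge toward the simulated echo path, so the residual echo becomes negligible. Real systems also have to cope with double-talk, background noise, and echo paths that change over time, which this sketch ignores.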
The sound of music: AASP and mobile media consumption
While AASP algorithms for telephony applications are, in most cases, optimized exclusively for speech signals, AASP algorithms targeted at media consumption on a modern smartphone operate on both speech and music signals. Just as the improved smartphone camera has replaced the dedicated camera for the casual photographer, the smartphone has become the only device many people rely on for their mobile media consumption, such as listening to music, watching videos, and gaming. In this context, AASP is used to provide audio post-processing and audio decoding capabilities:
- Post-processing: some of the most basic audio post-processing techniques are based on equalization and filtering, which allow the user to adjust the timbre of the audio to his or her liking. Examples include bass boost and parametric equalization. Another popular post-processing technique based on filtering is the ability to add reverberation for recreating an audio event as if it were recorded in a particular venue. Other post-processing options offered in modern smartphones thanks to AASP include pitch shift and time stretching, sometimes used to speed up or slow down the playback of podcast material, for instance.
- Audio (de)coding: AASP, particularly in the area of audio coding with formats such as MP3 and AAC, has enabled a revolution in how music is distributed, stored, and consumed. Online music streaming services have put virtually every piece of music ever recorded at the consumer’s fingertips, ready to be listened to anywhere on his or her smartphone.
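As an illustration of the equalization mentioned in the post-processing item above, the following sketch implements a single parametric EQ band as a second-order (biquad) peaking filter, using the widely circulated "Audio EQ Cookbook" formulas. The function names, the 100 Hz center frequency, the 6 dB boost, the Q of 0.7, and the 48 kHz sample rate are assumptions for the example and do not describe any particular smartphone's equalizer.

```python
import cmath
import math


def peaking_eq_coeffs(center_hz, gain_db, q, sample_rate):
    """Biquad peaking-EQ coefficients, normalized so that a[0] == 1."""
    amp = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * amp, -2 * math.cos(w0), 1 - alpha * amp]
    a = [1 + alpha / amp, -2 * math.cos(w0), 1 - alpha / amp]
    return [c / a[0] for c in b], [c / a[0] for c in a]


def biquad_filter(samples, b, a):
    """Apply the filter sample by sample (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out


def gain_db_at(freq_hz, b, a, sample_rate):
    """Magnitude response in dB at one frequency."""
    z = cmath.exp(-1j * 2 * math.pi * freq_hz / sample_rate)
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20 * math.log10(abs(h))


sample_rate = 48000
b, a = peaking_eq_coeffs(100.0, 6.0, 0.7, sample_rate)  # illustrative bass band

# Filter half a second of a unit-amplitude 100 Hz tone and measure the
# settled peak level over the last full period (480 samples).
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(24000)]
boosted = biquad_filter(tone, b, a)
peak = max(abs(v) for v in boosted[-480:])

print("gain at 100 Hz (dB):", gain_db_at(100.0, b, a, sample_rate))
print("gain at 10 kHz (dB):", gain_db_at(10000.0, b, a, sample_rate))
print("settled peak of boosted tone:", peak)
```

The band lifts content around the center frequency by the requested amount and leaves frequencies far from it essentially untouched. A full parametric equalizer would typically chain several such bands.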
His master’s voice: AASP and the virtual assistant
A new and continually improving technology utilized on modern smartphones is the virtual assistant, e.g., Apple’s Siri, Microsoft’s Cortana, or Google Now, which responds to the user’s spoken commands. The typical AASP concepts used by virtual assistants are:
- Speech enhancement: acoustic front-ends often include multi-microphone speech pickup using beamforming technology as well as noise suppression to isolate the desired speech prior to forwarding it to the speech recognition engine.
- Speech recognition (speech-to-text): this draws ideas from multiple disciplines, including linguistics, computer science, and AASP. Ongoing work in acoustic modeling is one of AASP’s major contributions to improving recognition accuracy.
- Speech synthesis (text-to-speech): this technology has come a very long way from its robotic-sounding introduction in the 1930s, and synthesized speech now sounds more and more natural.
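The beamforming mentioned in the speech enhancement item above can be sketched in its simplest form, a delay-and-sum beamformer: each microphone signal is time-aligned toward the talker and the aligned signals are averaged, so the speech adds up coherently while uncorrelated noise partly cancels. The two-microphone setup, the 3-sample inter-microphone delay, the noise level, and all names below are assumptions for illustration. Practical front-ends use fractional delays and adaptive methods.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Advance each channel by its integer delay (in samples) and average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def power(signal):
    return sum(s * s for s in signal) / len(signal)


random.seed(1)
sample_rate = 16000
num_samples = 8000
mic_delay = 3  # hypothetical: the talker's voice reaches mic B 3 samples late

# A 200 Hz tone stands in for the desired speech signal.
speech = [math.sin(2 * math.pi * 200 * n / sample_rate)
          for n in range(num_samples)]
noise_a = [random.gauss(0.0, 0.5) for _ in range(num_samples)]
noise_b = [random.gauss(0.0, 0.5) for _ in range(num_samples)]

mic_a = [s + v for s, v in zip(speech, noise_a)]
mic_b = [
    (speech[n - mic_delay] if n >= mic_delay else 0.0) + noise_b[n]
    for n in range(num_samples)
]

enhanced = delay_and_sum([mic_a, mic_b], [0, mic_delay])

noise_in = power(noise_a)
noise_out = power([y - s for y, s in zip(enhanced, speech)])
print("noise power at one microphone:", noise_in)
print("noise power after beamforming:", noise_out)
```

With two microphones and independent noise, the residual noise power is roughly halved, about a 3 dB improvement, while the speech passes through unchanged.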
Red pill and blue pill: AASP and virtual reality applications
Virtual reality (VR) is currently making great strides toward becoming the “next big thing” in the mobile applications space. VR is supposed to fool your senses to make you think you’re in a different physical location. Users typically wear a VR headset for the visual experience and headphones for the audio experience. Even though the audio and visual experiences should go hand-in-hand when a user moves around in the virtual world, the spatial acoustic representation often fails to follow the visuals. Many current VR systems/applications present audio via a simple fixed stereo image to the user, thereby offering an incomplete representation of the three-dimensional virtual space.
AASP spurred innovation in three-dimensional soundfield acquisition and representation, which ultimately led to first-order Ambisonics (also known as B-format). Even though Ambisonics, first introduced in the 1970s, hasn’t yet been widely adopted, it seems perfectly suited for VR applications.
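For readers curious what a first-order Ambisonics signal looks like, here is a minimal sketch that encodes a mono sample into the four B-format channels (W, X, Y, Z) and then listens to the soundfield through a virtual cardioid microphone pointed in a chosen direction. It assumes the traditional B-format convention in which W is scaled by 1/√2. Other Ambisonics conventions differ in channel order and normalization, and the function names are made up for this example.

```python
import math


def encode_b_format(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample arriving from the given direction.

    Azimuth is measured counterclockwise from straight ahead and
    elevation upward from the horizontal plane (assumed convention).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2)                   # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)    # front-back
    y = sample * math.sin(az) * math.cos(el)    # left-right
    z = sample * math.sin(el)                   # up-down
    return w, x, y, z


def virtual_cardioid(b_format, azimuth_deg, elevation_deg):
    """Output of a virtual cardioid microphone aimed at the given direction."""
    w, x, y, z = b_format
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return 0.5 * (
        math.sqrt(2) * w
        + x * math.cos(az) * math.cos(el)
        + y * math.sin(az) * math.cos(el)
        + z * math.sin(el)
    )


# A unit-amplitude source directly to the listener's left.
frame = encode_b_format(1.0, azimuth_deg=90.0, elevation_deg=0.0)
toward = virtual_cardioid(frame, azimuth_deg=90.0, elevation_deg=0.0)
away = virtual_cardioid(frame, azimuth_deg=-90.0, elevation_deg=0.0)

print("B-format frame (W, X, Y, Z):", frame)
print("cardioid aimed at the source:", toward)
print("cardioid aimed away from it: ", away)
```

Because the direction is applied at playback time rather than baked into the recording, the same four channels can be re-rendered as the listener turns their head, which is what makes the format attractive for VR.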
This year, YouTube began supporting first-order Ambisonics encoded audio content. Three-dimensional rendering has since been enabled in the Android YouTube app as well as via certain browsers on the desktop. This development paves the way for consumers to experience realistic audio/visual VR right on their mobile device.
Currently, enhancements are being worked on to enable higher spatial resolution and increased realism by using higher-order Ambisonics. However, for widespread adoption of VR outside the realms of gaming and simulators, content creation needs to be simplified by giving users the technology and tools to generate VR content on their mobile device rather than having to rely on dedicated products. And once again, AASP will come to the rescue.