Digital media, particularly audio, plays a prominent role in criminal investigation, whether a recording captures someone admitting to a crime or exposes illegal dealings between multiple parties. Yet society is quick to dismiss potential evidence from mainstream media, social media and other platforms. The major reason is the difficulty of authenticating such recordings. So how can audio recordings be validated? The first step is to assess the authenticity of the recording itself.
The authenticity of visual recordings produced during times of political instability or released by terrorist organizations often needs to be tested for location and time of origin and for agreement between audio and video. During attacks on Syrian military units in July 2012, there were claims that a Qatari company was releasing staged videos filmed in replicas of Syrian cities such as Damascus and Aleppo. Such videos reinforce prejudices and may spread negative sentiment. Sometimes videos from terrorist organizations resurface years later after slight retouching. In these scenarios, one must determine not only the location and time of origin of the video, but also whether the audio and video are in sync.
Electrical Network Frequency (ENF) is a strong tool for establishing the authenticity and origin of audio. It is a signature of the power grid, analogous to a person's fingerprint. To understand how it works, consider an electrical appliance such as a refrigerator. When someone uses the appliance, the power system supplying it generates a signal at the power system frequency: 60 Hz in the U.S. and 50 Hz in most other parts of the world. This value is referred to as the nominal frequency. When people record audio or video, whether on a mobile device or on a device plugged into the mains (such as CCTV), nearby electrical activity induces this signal in the recording. Although we describe the signal by its nominal frequency (60 Hz or 50 Hz), the actual frequency usually deviates from it by a very small margin, say 0.1 Hz. This deviation depends on the mechanical properties of the power grid that supplies the electricity. Since these properties vary over time, the frequency of the generated signal also varies slightly over time. The deviation can be extracted using signal processing algorithms such as frequency transformation techniques: over a short duration, a dominant frequency can be estimated. Repeating this process over successive chunks of the recording yields one frequency estimate per chunk. The result is the variation around the nominal frequency over time, which can be used as a signature of the power grid.
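The chunk-by-chunk estimation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a forensic-grade extractor: the function names (`dominant_frequency`, `estimate_enf`, `tone`), the 2-second chunk length, the 0.5 Hz search band on either side of the nominal frequency, the 0.01 Hz step and the synthetic test signal are all illustrative choices rather than part of any specific published method. The sketch uses only the standard library and evaluates a Fourier transform on a grid of frequencies near the nominal value; a practical implementation would more likely rely on an FFT library and band-pass filtering.

```python
import cmath
import math


def dominant_frequency(chunk, sample_rate, nominal=60.0,
                       search_width=0.5, step=0.01):
    """Return the frequency near `nominal` with the largest spectral magnitude."""
    n_steps = int(round(2 * search_width / step))
    best_freq, best_mag = nominal, -1.0
    for k in range(n_steps + 1):
        freq = nominal - search_width + k * step
        # Correlate the chunk with a complex exponential at this frequency,
        # i.e. evaluate one point of its discrete-time Fourier transform.
        w = -2j * math.pi * freq / sample_rate
        mag = abs(sum(x * cmath.exp(w * n) for n, x in enumerate(chunk)))
        if mag > best_mag:
            best_freq, best_mag = freq, mag
    return best_freq


def estimate_enf(samples, sample_rate, chunk_seconds=2.0, **kwargs):
    """Split the signal into non-overlapping chunks; one estimate per chunk."""
    size = int(chunk_seconds * sample_rate)
    return [dominant_frequency(samples[i:i + size], sample_rate, **kwargs)
            for i in range(0, len(samples) - size + 1, size)]


def tone(freq, seconds, sample_rate):
    """Synthetic stand-in for mains hum (assumption: a clean sine wave)."""
    count = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(count)]


SAMPLE_RATE = 400  # Hz; hypothetical, chosen to keep the example fast
# Two seconds slightly above 60 Hz, then two seconds slightly below it.
hum = tone(60.05, 2.0, SAMPLE_RATE) + tone(59.95, 2.0, SAMPLE_RATE)
enf = estimate_enf(hum, SAMPLE_RATE)
print([round(f, 2) for f in enf])
```

The list `enf` is the kind of frequency-versus-time trace that serves as the signature. Shorter chunks give finer time resolution but coarser frequency resolution, so the chunk length is a trade-off rather than a fixed value.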
When the authenticity or origin of a recording is in question, its ENF signature can be extracted and compared against the grid. By plugging a circuit directly into the mains, one can record the power signal itself. Since this power signal is the same kind of signal that gets induced in a recording, the ENF estimated from it should match the recording's signature if the recording was made within that power grid. In this way the origin may be determined. If the recording has been tampered with, the ENF estimate typically shows a drastic deviation at the point where the audio was manipulated. Hence, a sudden spike in the ENF trace may indicate a forgery.
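As a rough illustration of the tampering check, the sketch below flags positions where consecutive ENF estimates jump by more than a threshold. The function name `find_discontinuities`, the 0.05 Hz threshold and the sample trace are assumptions made for this example; a real analysis would need a threshold calibrated to the grid and to the noise in the recording.

```python
def find_discontinuities(enf_trace, max_step_hz=0.05):
    """Return indices where the ENF jumps by more than `max_step_hz`.

    The threshold is an illustrative assumption, not a standard value.
    """
    return [i for i in range(1, len(enf_trace))
            if abs(enf_trace[i] - enf_trace[i - 1]) > max_step_hz]


# Made-up trace: a smooth drift, then an abrupt drop between index 3 and 4,
# as might appear where two unrelated segments were spliced together.
trace = [60.01, 60.02, 60.02, 60.03, 59.90, 59.91, 59.91]
suspects = find_discontinuities(trace)
print(suspects)  # [4]
```

A flagged index is only a hint that the chunk deserves closer inspection, since noise can also produce abrupt changes in the estimate.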
To determine the time of recording, one can match the audio fingerprint against a reference fingerprint obtained from the power signal. Some countries, such as the UK, have been recording the power signal for several years and using it as evidence in court. Since the ENF varies over time, the recording will match only one segment of the reference fingerprint. The segment where the match is best gives the time of recording.
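The matching step can be sketched as a sliding comparison: the short fingerprint from the recording is compared with every segment of the same length in the reference, and the segment with the smallest mean squared error is taken as the best match. The function name `locate_in_reference`, the choice of mean squared error as the similarity measure and the sample values are assumptions for illustration only.

```python
def locate_in_reference(query, reference):
    """Return (offset, error) of the reference segment closest to `query`.

    Closeness is measured by mean squared error (an assumed choice).
    """
    best_offset, best_error = None, float("inf")
    for offset in range(len(reference) - len(query) + 1):
        segment = reference[offset:offset + len(query)]
        error = sum((q - r) ** 2 for q, r in zip(query, segment)) / len(query)
        if error < best_error:
            best_offset, best_error = offset, error
    return best_offset, best_error


# Made-up values: one ENF estimate per chunk of the reference power signal.
reference = [60.00, 60.01, 60.03, 60.02, 59.98, 59.96, 59.99, 60.02, 60.04, 60.01]
# Fingerprint extracted from the questioned recording (also made up).
query = [59.98, 59.96, 59.99]

offset, error = locate_in_reference(query, reference)
print(offset)  # 4
```

If the start time of the reference and the chunk duration are known, the offset multiplied by the chunk duration gives the estimated time of recording relative to that start.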
Recent progress in this field is attributed to the work of Professor Min Wu (ECE Department, University of Maryland) on ENF extraction from video. This extraction is much more complex than ENF extraction from audio, but the team has shown some impressive applications. Where the video and audio of a recording appear to disagree, the ENF can be extracted from both and the two signatures compared. If they were recorded at different times, this shows up as a low match between the two signatures. If they were recorded simultaneously, the corresponding fingerprints are highly correlated.
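One simple way to quantify the agreement between two fingerprints is the Pearson correlation coefficient, sketched below. The function name `pearson` and all sample traces are made up for this example, and the sketch assumes both ENF traces have already been extracted and aligned chunk by chunk; it does not cover the video extraction itself.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up ENF traces, one value per chunk.
audio_enf = [60.01, 60.03, 60.02, 59.98, 59.97, 60.00]
video_enf = [60.02, 60.03, 60.01, 59.99, 59.97, 60.01]  # follows the audio trace
other_enf = [59.97, 59.98, 60.02, 60.03, 60.01, 59.99]  # from a different time

same_time = pearson(audio_enf, video_enf)
different_time = pearson(audio_enf, other_enf)
print(same_time > different_time)  # True
```

A coefficient close to 1 is consistent with simultaneous recording, while a low or negative value suggests the tracks come from different times. Where to draw the line is a judgment that depends on the noise in each trace.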
The ENF signature is increasingly becoming a robust fingerprint that is admissible in court, and forensic agencies commonly employ the technique to investigate audio and video recordings. However, it does not work well in noisy environments, and one must take this into account when analyzing a recording. With the advent of deep learning, there is considerable scope for improving how this fingerprint is estimated. For example, one may learn the frequency transformation itself using deep learning architectures and use it to estimate the dominant frequency in each chunk of the signal with better precision. This could in turn improve matching performance, leading to better determination of authenticity, origin and time of recording.
A.V. Subramanyam is an assistant professor in the Department of Computer Science and Engineering at Indraprastha Institute of Information Technology-Delhi. His major research interests are multimedia security, image processing and vision, visual surveillance and deep learning. He earned his Ph.D. from the School of Computer Engineering, Nanyang Technological University, Singapore.