Researchers use fluid dynamics to detect artificial imposter voices in deepfake audio.
Consider the following scenario: A phone call comes in. An office worker answers and hears his boss, in a panic, tell him that she forgot to transfer money to the new contractor before leaving for the day and needs him to do it. She gives him the wire transfer information, and with the money sent, the crisis is averted.
The employee reclines in his chair, takes a deep breath, and watches as his boss walks into the room. The person on the other end of the line was not his boss. It wasn’t even a human being. He had been listening to an audio deepfake, a machine-generated audio sample designed to sound exactly like his boss.
Such attacks using recorded audio have already occurred, and conversational audio deepfakes may not be far behind.
Deepfakes, both audio and video, have become possible only recently with the development of sophisticated machine-learning technologies. They have added a new layer of uncertainty to the digital media landscape. To detect deepfake videos, many researchers have turned to analyzing the visual artifacts – minute glitches and inconsistencies – found in them.
[Video: “This is not Morgan Freeman – but how would you know if you weren’t told?”]
Audio deepfakes may pose an even greater threat because people frequently communicate verbally without using video, such as through phone calls, radio broadcasts, and voice recordings. These voice-only communications greatly expand attackers’ ability to use deepfakes.
To detect audio deepfakes, we and our University of Florida colleagues devised a technique that compares the acoustic and fluid dynamic differences between voice samples generated organically by human speakers and those generated synthetically by computers.
Natural vs. synthetic voices
Humans vocalize by forcing air through the vocal tract’s various structures, which include the vocal folds, tongue, and lips. By rearranging these structures, you can change the acoustical properties of your vocal tract and produce over 200 distinct sounds, or phonemes. However, the acoustic behavior of these different phonemes is fundamentally limited by human anatomy, resulting in a relatively small range of correct sounds for each.
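For readers who like to see the idea in code, here is a minimal sketch of the classic source-filter view of vocalization, written in Python. It is illustrative only: the formant frequencies, bandwidth and pulse rate are assumed textbook-style values, not measurements, but it shows how the same glottal excitation shaped by different vocal tract resonances yields different vowel-like sounds.

```python
# A minimal source-filter sketch of human vocalization. The formant values
# below are illustrative assumptions, not measured data: the same glottal
# "buzz" passed through different vocal tract resonance settings produces
# different vowel-like sounds.
import numpy as np
from scipy.signal import lfilter

SR = 16000          # sample rate in Hz
DUR = 0.5           # seconds of audio per vowel
F0 = 120            # fundamental (glottal pulse) frequency in Hz

def glottal_source(sr=SR, dur=DUR, f0=F0):
    """Impulse train standing in for the airflow pulses from the vocal folds."""
    n = int(sr * dur)
    src = np.zeros(n)
    src[::int(sr / f0)] = 1.0
    return src

def vocal_tract_filter(signal, formants, bandwidth=80, sr=SR):
    """Model the vocal tract as a cascade of resonators, one per formant."""
    out = signal
    for f in formants:
        r = np.exp(-np.pi * bandwidth / sr)          # pole radius from bandwidth
        theta = 2 * np.pi * f / sr                   # pole angle from formant frequency
        a = [1, -2 * r * np.cos(theta), r ** 2]      # two-pole resonator
        out = lfilter([1.0], a, out)
    return out

# Roughly /a/-like vs. /i/-like formant settings (illustrative values only).
source = glottal_source()
vowel_a = vocal_tract_filter(source, formants=[730, 1090, 2440])
vowel_i = vocal_tract_filter(source, formants=[270, 2290, 3010])
```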
Audio deepfakes, on the other hand, are created by first allowing a computer to listen to audio recordings of a specific victim speaker. Depending on the exact techniques used, the computer may only need to hear 10 to 20 seconds of audio. This audio is used to extract important information about the victim’s voice.
The attacker then chooses a phrase for the deepfake to speak and uses a modified text-to-speech algorithm to generate an audio sample that sounds like the victim saying that phrase. Creating a single deepfaked audio sample takes only a few seconds, potentially giving attackers enough flexibility to use the deepfake voice in a conversation.
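To give a sense of how little effort this now takes, the sketch below shows publicly documented usage of the open-source Coqui TTS package and its XTTS voice-cloning model; the model name, file names and phrase are assumptions for illustration, not a description of any particular attack.

```python
# A minimal voice-cloning sketch (assumed: the open-source Coqui TTS package and
# its XTTS v2 model; file names and text are placeholders). A short reference
# clip of the target speaker is enough to condition the synthesized phrase on
# that person's voice.
from TTS.api import TTS

# Load a pretrained multilingual voice-cloning model (model id per the Coqui docs).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesize a chosen phrase in the reference speaker's voice.
tts.tts_to_file(
    text="The quick brown fox jumps over the lazy dog.",
    speaker_wav="reference_clip.wav",   # roughly 10 to 20 seconds of speech
    language="en",
    file_path="cloned_phrase.wav",
)
```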
Deepfake audio detection
Understanding how to acoustically model the vocal tract is the first step in distinguishing human speech from deepfake speech. Scientists, thankfully, have techniques for estimating what someone – or some being, such as a dinosaur – would sound like based on anatomical measurements of its vocal tract.
We did the opposite: by inverting many of these same techniques, we were able to extract an approximation of a speaker’s vocal tract shape during a segment of speech. This effectively let us peer into the anatomy of the speaker who created the audio sample.
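As a rough illustration of what inverting these techniques can look like in practice, the sketch below uses the classic linear predictive coding (LPC) lossless-tube approximation, in which reflection coefficients estimated from a speech frame map to relative cross-sectional areas of a concatenated-tube model of the vocal tract. This is a simplified stand-in for our pipeline, not the exact method; the frame sizes, model order, sign convention and file name are assumptions.

```python
# A minimal sketch of estimating relative vocal tract geometry from speech by
# inverting a standard acoustic model (LPC lossless-tube approximation).
# This is NOT the authors' exact pipeline; values and file names are placeholders.
import numpy as np
import librosa

def lpc_to_reflection(a):
    """Step-down (backward Levinson) recursion: LPC polynomial -> reflection coefficients."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    k = np.zeros(p)
    for i in range(p, 0, -1):
        k[i - 1] = a[i]
        if abs(k[i - 1]) >= 1:            # numerically unstable frame; skip it
            return None
        a = np.concatenate(([1.0],
                            (a[1:i] - k[i - 1] * a[i - 1:0:-1]) / (1 - k[i - 1] ** 2)))
    return k

def tube_areas(k, lips_area=1.0):
    """Concatenated-tube model: each reflection coefficient fixes an area ratio.
    Sign conventions differ between texts; this sketch uses A_next = A * (1 - k) / (1 + k)."""
    areas = [lips_area]
    for ki in k:
        areas.append(areas[-1] * (1 - ki) / (1 + ki))
    return np.array(areas)

# Estimate per-frame relative tube areas for a speech sample (file name is a placeholder).
y, sr = librosa.load("speech_sample.wav", sr=16000)
frame_len, hop = 400, 200                           # 25 ms frames, 12.5 ms hop
estimates = []
for start in range(0, len(y) - frame_len, hop):
    frame = y[start:start + frame_len] * np.hamming(frame_len)
    if np.max(np.abs(frame)) < 1e-6:                # skip near-silent frames
        continue
    a = librosa.lpc(frame, order=12)                # 12th-order all-pole fit
    k = lpc_to_reflection(a)
    if k is not None:
        estimates.append(tube_areas(k))
```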
From there, we hypothesized that deepfake audio samples would not be bound by the same anatomical constraints that humans face. In other words, we expected that analyzing deepfaked audio samples would reveal estimated vocal tract shapes that do not exist in people.
Our test results not only confirmed our hypothesis but also revealed something new. When we extracted vocal tract estimations from deepfake audio, we discovered that they were frequently comically incorrect. Deepfake audio, for example, frequently produced vocal tracts with the same relative diameter and consistency as a drinking straw, as opposed to human vocal tracts, which are much wider and more variable in shape.
This realization shows that, even when convincing to human listeners, deepfake audio is far from indistinguishable from human-generated speech. It is possible to determine whether the audio was generated by a person or a computer by estimating the anatomy responsible for creating the observed speech.
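In that spirit, a detector can be reduced to a plausibility check: estimate the vocal tract frame by frame and flag a sample when too many of the estimates fall outside what human anatomy allows. The thresholds and decision rule below are hypothetical illustrations, not the values used in our study, and they assume the area estimates have been scaled to physical units.

```python
# A hypothetical plausibility check in the spirit of the approach described above:
# flag a sample as suspicious when many of its frame-level vocal tract estimates
# fall outside a band of diameters achievable by human anatomy. The band, the
# ratio and the physical scaling are illustrative assumptions.
import numpy as np

HUMAN_DIAMETER_RANGE_CM = (0.5, 5.0)   # assumed plausible band, for illustration only
SUSPICION_RATIO = 0.3                  # flag if more than 30% of frames look non-human

def frame_is_plausible(areas_cm2, lo=HUMAN_DIAMETER_RANGE_CM[0],
                       hi=HUMAN_DIAMETER_RANGE_CM[1]):
    """Convert tube cross-sectional areas (in cm^2) to equivalent diameters and range-check them."""
    diameters = 2.0 * np.sqrt(np.asarray(areas_cm2) / np.pi)
    return np.all((diameters >= lo) & (diameters <= hi))

def looks_like_deepfake(per_frame_areas):
    """per_frame_areas: list of per-frame area arrays, e.g. from the LPC sketch above,
    scaled to physical units."""
    implausible = sum(not frame_is_plausible(a) for a in per_frame_areas)
    return implausible / max(len(per_frame_areas), 1) > SUSPICION_RATIO
```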
Why is this important?
The digital exchange of media and information defines today’s world. Everything from news to entertainment to conversations with loved ones is usually done digitally. Deepfake video and audio, even in their infancy, undermine people’s trust in these exchanges, effectively limiting their usefulness.
Effective and secure techniques for determining the source of an audio sample are critical if the digital world is to remain a critical source of information in people’s lives.
Source Credit: The Conversation (theconversation.com)