Making deepfakes is getting easier, and they’re more convincing than ever. Cybercriminals are using video and audio deepfakes to extort money from victims by adding a credible “fake authenticity” to their scams.
Editing People’s Beliefs
Ever since the first person said “the camera never lies,” there have been people out to prove otherwise. Creative photographers in the late 19th century used simple tricks to create faked images.
The “person in a bottle” gag enjoyed its moment of popularity. Take a photograph of yourself in a suitable pose. Develop and print the photograph at the appropriate size. Cut out your image, pop it into a glass bottle, and take another photograph of yourself holding the bottle. Hey presto, you have an image of yourself holding a bottle containing a miniature version of yourself.
Of course, today we can do this sort of thing using Photoshop or GIMP. And we’re aware that with sufficient skills and creativity, experts can create images that look completely genuine and yet contain impossible elements. Paradoxically, that awareness can lead to us doubting genuine photographs. A startling image that by sheer fluke captures a once-in-a-lifetime occurrence is always greeted by the question “Is that real?”
A photograph of a miniature person in a bottle is obviously faked. But change a photograph so that it shows something that could be true and also looks real, and some people will believe it.
This isn’t about visual gags anymore. It’s weaponizing imagery. It’s social engineering.
RELATED: The Many Faces of Social Engineering
Moving Pictures and Talkies
As soon as motion pictures became a social spectacle, pioneering filmmakers used special effects and tricks to solve two problems. One was filming something that could really happen but was impractical to film, and the other was filming something that was simply impossible. The solution to this gave birth to the massive special effects industry that we have today.
The addition of sound and dialogue saw the demise of the silent movie and the rise of the talkies. Some silent stars didn’t make the transition. Their voice wasn’t right, or they couldn’t deliver lines with conviction and timing. Until overdubbing became a thing, there was no other solution than to cast someone else.
Today, we’re manipulating actors’ voices, too. Did George Clooney really sing in O Brother, Where Art Thou? No, that was Dan Tyminski’s voice, computer-lip-synced to George Clooney’s moving image.
The systems that can do this type of video and sound manipulation are large and expensive, and they need experts to drive them. But convincing end results can be achieved using easy-to-obtain and relatively simple software that will run on reasonable computer hardware.
The video and audio might not be of Hollywood quality, but it is certainly good enough to allow cybercriminals to add faked images, video, and audio to their arsenal of weapons.
The term deepfake was coined to describe digital footage that is manipulated so that someone in the video wears the face of another person entirely. The “deep” part of the name comes from “deep learning,” a branch of machine learning within artificial intelligence. Machine learning uses specialist algorithms and a lot of data to train artificial neural networks to achieve an objective. The more data you have to train the system with, the better the results.
Provide it with enough photographs of someone and a deep learning system will come to understand the physiognomy of that person’s face so well that it can work out what it would look like displaying any expression, from any angle. It can then create images of that person’s face that match all the expressions and head poses of the person in the video.
When those images are inserted into the video, the new face matches the action perfectly. Because the artificially created facial expressions, lip-syncing, and head movements mirror those of the original person when the real video was shot, the result can be a very convincing fake.
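The training setup behind this can be sketched at toy scale. The snippet below is a deliberately simplified illustration, not a production deepfake pipeline: random vectors stand in for face images, and plain linear maps stand in for deep neural networks. It shows the classic layout of one shared encoder with a separate decoder per identity; the "swap" is encoding a frame of person A and decoding it with person B's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "faces": random vectors in place of real image pixels.
DIM, LATENT, N = 32, 8, 200
faces_a = rng.normal(size=(N, DIM))   # frames of person A
faces_b = rng.normal(size=(N, DIM))   # frames of person B

# One shared encoder plus one decoder per identity: the classic
# deepfake autoencoder layout, reduced here to linear maps.
enc = rng.normal(scale=0.1, size=(LATENT, DIM))
dec_a = rng.normal(scale=0.1, size=(DIM, LATENT))
dec_b = rng.normal(scale=0.1, size=(DIM, LATENT))

def recon_loss(dec, faces):
    """Mean squared error of encode-then-decode reconstruction."""
    return np.mean((faces @ enc.T @ dec.T - faces) ** 2)

loss_before = recon_loss(dec_a, faces_a)

lr = 0.02
for step in range(1000):
    for dec, faces in ((dec_a, faces_a), (dec_b, faces_b)):
        z = faces @ enc.T                        # encode with shared encoder
        err = z @ dec.T - faces                  # reconstruction error
        dec -= lr * (err.T @ z) / N              # gradient step on this decoder
        enc -= lr * (dec.T @ err.T @ faces) / N  # and on the shared encoder

loss_after = recon_loss(dec_a, faces_a)

# The "swap": encode a frame of person A, decode with B's decoder.
swapped = faces_a[0] @ enc.T @ dec_b.T
```

Because the encoder is shared, it learns identity-independent features (pose, expression), while each decoder learns to render one specific face; that division of labor is what makes the swap possible.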
This is especially true when the two face shapes are similar. A well-known deepfake maps Lynda Carter’s face onto Gal Gadot’s body, merging two versions of Wonder Woman. Other high-profile examples featured Barack Obama and Tom Cruise. You can find them—and many other examples—on YouTube.
The same machine learning techniques can be applied to audio. With sufficient voice samples, you can train an artificial neural network to produce high-grade sound output replicating the sampled voice. And you can make it say anything you like. Want to hear Notorious B.I.G. rapping some of H. P. Lovecraft’s eldritch terrors? Again, YouTube’s the place. Actually, you’ll hear something that sounds a lot like Notorious B.I.G. rapping Lovecraft.
Beyond crazy mash-ups and summer blockbusters, these techniques are finding functional uses elsewhere. Descript is a sound and video editor that creates a text transcript of your recording. Edit the text document and the changes are made to the recording. If you don’t like the way you said something, just edit the text. Descript will synthesize any missing audio from your own voice. It can synthesize a voice from as little as one minute of original voice recording.
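Descript's internals aren't public, but the general idea of text-driven audio editing can be illustrated with a toy example (the data structure and function here are invented for illustration): align each transcript word to the span of audio samples it occupies, and when a word is deleted from the text, cut that span from the waveform.

```python
import numpy as np

SAMPLE_RATE = 16_000

# A transcript aligned to the recording: each word stores the span of
# audio samples it occupies (values here are invented for illustration).
transcript = [
    {"word": "please",  "start": 0,      "end": 6_400},
    {"word": "um",      "start": 6_400,  "end": 11_200},
    {"word": "send",    "start": 11_200, "end": 17_600},
    {"word": "payment", "start": 17_600, "end": 28_800},
]
audio = np.zeros(28_800)  # stand-in waveform, 1.8 seconds of audio

def delete_word(transcript, audio, target):
    """Remove a word from the transcript and cut its samples from the audio."""
    kept_words, kept_audio = [], []
    for w in transcript:
        if w["word"] == target:
            continue  # drop both the text and its audio span
        kept_words.append(w["word"])
        kept_audio.append(audio[w["start"]:w["end"]])
    return kept_words, np.concatenate(kept_audio)

words, edited = delete_word(transcript, audio, "um")
```

Deleting and rearranging words only needs this kind of alignment; it is synthesizing *new* words in the speaker's voice, as Descript's overdub feature does, that requires the voice-cloning machine learning described above.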
Korean television channel MBN has created a deepfake of Kim Joo-Ha, its news anchor. If a story that would usually be handled by Kim Joo-Ha breaks when she isn’t in the studio, the deepfake delivers it.
The Cybercrimes Have Already Started
Cybercriminals are always quick to leap onto any bandwagon they can use to improve or modernize their attacks. Audio fakes are becoming so good that it can take a spectrum analyzer to definitively identify them, and AI systems have had to be developed to spot deepfake videos. If manipulating images lets you weaponize them, imagine what you can do with sound and video fakes that are good enough to fool most people.
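The spectrum analysis involved can be sketched with a short-time Fourier transform. The toy example below builds a pure test tone and recovers its dominant frequency; examining a suspect voice recording for synthesis artifacts involves far more than this, but the underlying spectrogram computation is the same.

```python
import numpy as np

# A one-second test tone at 440 Hz; a real investigation would load a
# suspect voice recording here instead.
SAMPLE_RATE = 8_000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
signal = np.sin(2 * np.pi * 440 * t)

def spectrogram(signal, frame=256, hop=128):
    """Magnitude spectrogram via a windowed short-time Fourier transform."""
    window = np.hanning(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

spec = spectrogram(signal)

# The strongest frequency bin should sit near 440 Hz.
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * SAMPLE_RATE / 256
```

Analysts look in plots like this for telltale synthesis artifacts, such as unnaturally smooth frequency bands or missing high-frequency detail, that the ear alone can miss.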
Crimes involving faked images and audio have already happened. Experts predict that the next wave of deepfake cybercrime will involve video. The working-from-home, video-call-laden “new normal” might well have ushered in the new era of deepfake cybercrime.
An old phishing email attack involves sending the victim an email claiming you have video of them in a compromising or embarrassing position. Unless payment is received in Bitcoin, the footage will be sent to their friends and colleagues. Scared that such a video might exist, some people pay the ransom.
The deepfake variant of this attack involves attaching some images to the email. They are alleged to be blown-up stills from the video. The victim’s face—which fills most of the frame—has been digitally inserted into the images. To the uninitiated, they make the blackmail threat more compelling.
As deepfake systems become more efficient they can be trained with ever-smaller data sets. Social media accounts can often provide enough images to use as a learning database.
Email phishing attacks use a variety of techniques to generate a sense of urgency to try to get people to act in haste, or they play upon an employee’s desire to be seen as helpful and effective. Phishing attacks conducted by phone are called vishing attacks. They use the same social engineering techniques.
A lawyer in the U.S. received a phone call from his son, who was obviously distressed. He said he had hit a pregnant woman in an automobile accident and was now in custody. He told his father to expect a call from a public defender to organize $15,000 bail.
The call wasn’t from his son; it was from scammers using a text-to-speech system they had trained on sound clips of his son to create an audio deepfake. The lawyer didn’t question it for a moment. As far as he was concerned, he was talking to his own son. While he waited for the public defender to call, he rang his daughter-in-law and his son’s place of work to let them know about the accident. Word reached his son, who rang to tell him it was a scam.
A CEO in the UK wasn’t so lucky. He received a spear-phishing email purportedly from the chief executive of the firm’s German parent company. This asked for a payment of £243,000 (roughly $335,000) to be made to a Hungarian supplier within the hour. It was immediately followed by a phone call from the chief executive, confirming the payment was urgent and should be made immediately.
The victim says he not only recognized his boss’s voice and slight German accent, but also the cadence and careful enunciation. So he happily made the payment.
The potential threat of deepfakes has been recognized by the U.S. government. The Malicious Deep Fake Prohibition Act of 2018 and the Identifying Outputs of Generative Adversarial Networks Act (IOGAN Act) were created in direct response to the threats posed by deepfakes.
Companies need to add discussions of deepfakes to their cybersecurity awareness training. Cyber-awareness training should be part of a new starter’s induction and should be repeated periodically for all staff.
So far, the attacks that have been seen are polished versions of phishing and spear-phishing attacks. Simple procedures can help trap many of these.
- No transfer of finances should be actioned solely on receipt of an email.
- A follow-up phone call should be made by the recipient of the email to the sender, on a known, trusted number; an inbound call from the sender should not be treated as confirmation.
- Challenge phrases can be incorporated that an outside attacker would not know.
- Cross-reference and double-check everything that is out of the ordinary.