{"id":25887,"date":"2023-07-10T10:35:54","date_gmt":"2023-07-10T14:35:54","guid":{"rendered":"https:\/\/www.kaspersky.co.in\/blog\/audio-deepfake-technology\/25887\/"},"modified":"2023-07-10T20:55:53","modified_gmt":"2023-07-10T15:25:53","slug":"audio-deepfake-technology","status":"publish","type":"post","link":"https:\/\/www.kaspersky.co.in\/blog\/audio-deepfake-technology\/25887\/","title":{"rendered":"Don&#8217;t believe your ears: voice deepfakes"},"content":{"rendered":"<p>Have you ever wondered how we know who we\u2019re talking to on the phone? It\u2019s obviously more than just the name displayed on the screen. If we hear an unfamiliar voice when being called from a saved number, we know right away something\u2019s wrong. To determine who we\u2019re really talking to, we unconsciously note the timbre, manner and intonation of speech. But how reliable is our own hearing in the digital age of artificial intelligence? As the latest news shows, what we hear isn\u2019t always worth trusting \u2013 because voices can be a fake: deepfake.\n<\/p>\n<h2>Help, I\u2019m in trouble<\/h2>\n<p>\nIn spring 2023, scammers in Arizona <a href=\"https:\/\/www.independent.co.uk\/tech\/ai-voice-clone-scam-kidnapping-b2319083.html\" target=\"_blank\" rel=\"nofollow noopener\">attempted to extort money<\/a> from a woman over the phone. She heard the voice of her 15-year-old daughter begging for help before an unknown man grabbed the phone and demanded a ransom, all while her daughter\u2019s screams could still be heard in the background. The mother was positive that the voice was really her child\u2019s. Fortunately, she found out fast that everything was fine with her daughter, leading her to realize that she was a victim of scammers.<\/p>\n<p>It can\u2019t be 100% proven that the attackers used a deepfake to imitate the teenager\u2019s voice. 
Maybe the scam was of a more traditional nature, with the call quality, unexpectedness of the situation, stress, and the mother\u2019s imagination all playing their part to make her think she heard something she didn\u2019t. But even if neural network technologies weren\u2019t used in this case, deepfakes can and do indeed occur, and as their development continues they become increasingly convincing and more dangerous. To fight the exploitation of deepfake technology by criminals, we need to understand how it works.\n<\/p>\n<h2>What are deepfakes?<\/h2>\n<p>\nDeepfake (<em>\u201cdeep learning\u201d<\/em> + <em>\u201cfake\u201d<\/em>) artificial intelligence has been growing at a rapid rate over the past few years. Machine learning can be used to create compelling fakes of images, video, or audio content. For example, neural networks can be used in photos and videos to replace one person\u2019s face with another while preserving facial expressions and lighting. While initially these fakes were low quality and easy to spot, as the algorithms developed the results became so convincing that now it\u2019s difficult to distinguish them from reality. 
In 2022, the world\u2019s first <a href=\"https:\/\/www.youtube.com\/playlist?list=PLWTwWADrHvpkgv3cKyjomdfhESt5711OZ\" target=\"_blank\" rel=\"nofollow noopener\">deepfake TV show<\/a> was released in Russia, where deepfakes of Jason Statham, Margot Robbie, Keanu Reeves and Robert Pattinson play the main characters.<\/p>\n<div id=\"attachment_48592\" style=\"width: 2058px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2023\/07\/10200639\/audio-deepfake-technology-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-48592\" class=\"size-full wp-image-48592\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2023\/07\/10200639\/audio-deepfake-technology-01.jpg\" alt=\"Deepfake versions of Hollywood stars in the Russian TV series PMJason\" width=\"2048\" height=\"1278\"><\/a><p id=\"caption-attachment-48592\" class=\"wp-caption-text\">Deepfake versions of Hollywood stars in the Russian TV series PMJason. (<a href=\"https:\/\/xn--h1aax.xn--p1ai\/news\/v-rossii-vyshel-pervyy-v-mire-dipfeyk-veb-serial-\/\" target=\"_blank\" rel=\"nofollow noopener\">Source<\/a>)<\/p><\/div>\n<h2>Voice conversion<\/h2>\n<p>\nBut today our focus is on the technology used for creating voice deepfakes. This is also known as voice conversion (or \u201cvoice cloning\u201d if you\u2019re creating a full digital copy of it). Voice conversion is based on autoencoders \u2013 a type of neural network that first compresses input data (part of the <u>en<\/u>coder) into a compact internal representation, and then learns to decompress it back from this representation (part of the <u>de<\/u>coder) to restore the original data. 
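<\/p>
<p>To make the compress-then-restore idea concrete, here is a deliberately tiny, hand-rolled autoencoder sketch in pure Python (no audio and no real neural-network library; the dimensions, data and training loop are illustrative only): a linear encoder squeezes four numbers into a two-number code, a linear decoder reconstructs them, and crude finite-difference gradient descent trains both parts to shrink the reconstruction error.<\/p>

```python
# Toy autoencoder: encoder compresses, decoder reconstructs.
# Purely illustrative sketch; not a real voice model.
import random

random.seed(0)

IN_DIM, CODE_DIM = 4, 2   # compress 4 numbers into a 2-number code

# weight matrices with small random initial values
W_enc = [[random.uniform(-0.5, 0.5) for _ in range(IN_DIM)] for _ in range(CODE_DIM)]
W_dec = [[random.uniform(-0.5, 0.5) for _ in range(CODE_DIM)] for _ in range(IN_DIM)]

def matvec(W, x):
    # plain matrix-vector product
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def reconstruct(x):
    # encoder compresses x, decoder restores it
    return matvec(W_dec, matvec(W_enc, x))

def total_loss(data):
    # summed squared reconstruction error over the dataset
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, reconstruct(x)))
        for x in data
    )

# samples that really live on a 2-D subspace, so a 2-number code suffices
data = [[a, b, a + b, a - b] for a in (0.1, 0.5, 0.9) for b in (0.2, 0.6)]

def train_step(lr=0.05, eps=1e-4):
    # crude finite-difference gradient descent, enough for a demo
    for W in (W_enc, W_dec):
        for row in W:
            for j in range(len(row)):
                saved = row[j]
                row[j] = saved + eps
                up = total_loss(data)
                row[j] = saved - eps
                down = total_loss(data)
                row[j] = saved - lr * (up - down) / (2 * eps)

before = total_loss(data)
for _ in range(300):
    train_step()
after = total_loss(data)
print(f'reconstruction error went from {before:.3f} to {after:.3f}')
```

<p>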
This way the model learns to present data in a compressed format while highlighting the most important information.<\/p>\n<div id=\"attachment_48593\" style=\"width: 2143px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2023\/07\/10200717\/audio-deepfake-technology-02.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-48593\" class=\"size-full wp-image-48593\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2023\/07\/10200717\/audio-deepfake-technology-02.png\" alt=\"Autoencoder scheme\" width=\"2133\" height=\"1600\"><\/a><p id=\"caption-attachment-48593\" class=\"wp-caption-text\">Autoencoder scheme. (<a href=\"https:\/\/www.compthree.com\/blog\/autoencoder\/\" target=\"_blank\" rel=\"nofollow noopener\">Source<\/a>)<\/p><\/div>\n<p>To make voice deepfakes, two audio recordings are fed into the model, with the voice from the second recording converted to the first. The content encoder is used to determine <strong>what<\/strong> was said from the first recording, and the speaker encoder is used to extract the main characteristics of the voice from the second recording \u2013 meaning <strong>how<\/strong> the second person talks. The compressed representations of <strong>what<\/strong> must be said and <strong>how<\/strong> it\u2019s said are combined, and the result is generated using the decoder. 
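<\/p>
<p>The division of labor just described can be sketched with toy stand-ins (plain dictionaries instead of real audio and neural networks; every function and field name here is hypothetical): one function keeps only what was said, another keeps only how the speaker sounds, and the decoder merges the two codes.<\/p>

```python
# Toy stand-ins for the voice-conversion recipe; not a real model.

def content_encoder(recording):
    # compressed representation of WHAT was said
    return {'words': recording['words']}

def speaker_encoder(recording):
    # compressed representation of HOW the speaker sounds
    return {'speaker': recording['speaker'], 'timbre': recording['timbre']}

def decoder(content_code, speaker_code):
    # merge the two compressed representations into the output
    return {**content_code, **speaker_code}

recording_1 = {'words': 'please transfer the funds', 'speaker': 'alice', 'timbre': 'warm'}
recording_2 = {'words': 'hello there', 'speaker': 'bob', 'timbre': 'gravelly'}

# the words of recording 1, voiced like the speaker of recording 2
fake = decoder(content_encoder(recording_1), speaker_encoder(recording_2))
print(fake)
```

<p>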
Thus, what\u2019s said in the first recording is voiced by the person from the second recording.<\/p>\n<div id=\"attachment_48594\" style=\"width: 1288px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2023\/07\/10200750\/audio-deepfake-technology-03.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-48594\" class=\"size-full wp-image-48594\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2023\/07\/10200750\/audio-deepfake-technology-03.jpg\" alt=\"The process of making a voice deepfake\" width=\"1278\" height=\"435\"><\/a><p id=\"caption-attachment-48594\" class=\"wp-caption-text\">The process of making a voice deepfake. (<a href=\"http:\/\/cs230.stanford.edu\/projects_fall_2020\/reports\/55721255.pdf\" target=\"_blank\" rel=\"nofollow noopener\">Source<\/a>)<\/p><\/div>\n<p>There are other approaches that use autoencoders, for example those that use <a href=\"https:\/\/en.wikipedia.org\/wiki\/Generative_adversarial_network\" target=\"_blank\" rel=\"nofollow noopener\">generative adversarial networks (GAN)<\/a> or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Diffusion_model\" target=\"_blank\" rel=\"nofollow noopener\">diffusion models<\/a>. Research into how to make deepfakes is supported in particular by the film industry. Think about it: with audio and video deepfakes, it\u2019s possible to replace the faces of actors in movies and TV shows, and dub movies with synchronized facial expressions into any language.\n<\/p>\n<h2>How it\u2019s done<\/h2>\n<p>\nAs we were researching deepfake technologies, we wondered how hard it might be to make one\u2019s own voice deepfake? It turns out there are lots of free open-source tools for working with voice conversion, but it isn\u2019t so easy to get a high-quality result with them. It takes Python programming experience and good processing skills, and even then the quality is far from ideal. 
In addition to open source, there are also proprietary and paid solutions available.<\/p>\n<p>For example, in early 2023, Microsoft <a href=\"https:\/\/arstechnica.com\/information-technology\/2023\/01\/microsofts-new-ai-can-simulate-anyones-voice-with-3-seconds-of-audio\/\" target=\"_blank\" rel=\"nofollow noopener\">announced<\/a> an algorithm that could reproduce a human voice based on an audio example that\u2019s only three seconds long! This model also works with multiple languages, so you can even hear yourself speaking a foreign language. All this looks promising, but so far it\u2019s only at the research stage. But the ElevenLabs platform <a href=\"https:\/\/www.theverge.com\/2023\/1\/31\/23579289\/ai-voice-clone-deepfake-abuse-4chan-elevenlabs\" target=\"_blank\" rel=\"nofollow noopener\">lets users<\/a> make voice deepfakes without any effort: just upload an audio recording of the voice and the words to be spoken, and that\u2019s it. Of course, as soon as word got out, people started playing with this technology in all sorts of ways.\n<\/p>\n<h2>Hermione\u2019s battle and an overly trusting bank<\/h2>\n<p>\nIn full accordance with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Godwin%27s_law\" target=\"_blank\" rel=\"nofollow noopener\">Godwin\u2019s law<\/a>, Emma Watson was made to <a href=\"https:\/\/www.vice.com\/en\/article\/dy7mww\/ai-voice-firm-4chan-celebrity-voices-emma-watson-joe-rogan-elevenlabs\" target=\"_blank\" rel=\"nofollow noopener\">read \u201cMein Kampf\u201d<\/a>, and another user <a href=\"https:\/\/www.vice.com\/en\/article\/dy7axa\/how-i-broke-into-a-bank-account-with-an-ai-generated-voice\" target=\"_blank\" rel=\"nofollow noopener\">used<\/a> ElevenLabs technology to \u201chack\u201d his own bank account. Sounds creepy? 
It does to us \u2013 especially when you add to the mix the popular horror stories about scammers collecting samples of voices over the phone by having folks say \u201cyes\u201d or \u201cconfirm\u201d as they pretend to be a bank, government agency or poll service, and then steal money using voice authorization.<\/p>\n<p>But in reality things aren\u2019t so bad. Firstly, it takes about five minutes of audio recordings to create an artificial voice in ElevenLabs, so a simple \u201cyes\u201d isn\u2019t enough. Secondly, banks also know about these scams, so voice can only be used to initiate certain operations that aren\u2019t related to the transfer of funds (for example, to check your account balance). So money can\u2019t be stolen this way.<\/p>\n<p>To its credit, ElevenLabs reacted to the problem fast by rewriting the service rules, prohibiting free (i.e., anonymous) users from creating deepfakes based on their own uploaded voices, and blocking accounts that received complaints about \u201coffensive content\u201d.<\/p>\n<p>While these measures may be useful, they still don\u2019t solve the problem of using voice deepfakes for malicious purposes.\n<\/p>\n<h2>How else deepfakes are used in scams<\/h2>\n<p>\nDeepfake technology in itself is harmless, but in the hands of scammers it can become a dangerous tool with lots of opportunities for deception, defamation or disinformation. Fortunately, there haven\u2019t been any mass cases of scams involving voice alteration, but there have been several high-profile cases involving voice deepfakes.<\/p>\n<p>In 2019, scammers used this technology to <a href=\"https:\/\/www.wsj.com\/articles\/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402\" target=\"_blank\" rel=\"nofollow noopener\">shake down a UK-based energy firm<\/a>. 
In a telephone conversation, the scammer pretended to be the chief executive of the firm\u2019s German parent company, and requested the urgent transfer of \u20ac220,000 ($243,000) to the account of a certain supplier company. After the payment was made, the scammer called twice more \u2013 the first time to put the UK office staff at ease and report that the parent company had already sent a refund, and the second time to request another transfer. All three times the UK CEO was absolutely positive that he was talking with his boss because he recognized both his German accent and his tone and manner of speech. The second transfer wasn\u2019t sent only because the scammer messed up and called from an Austrian number instead of a German one, which made the UK CEO suspicious.<\/p>\n<p>A year later, in 2020, scammers used deepfakes to <a href=\"https:\/\/www.forbes.com\/sites\/thomasbrewster\/2021\/10\/14\/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions\/?sh=42bdebd47559\" target=\"_blank\" rel=\"nofollow noopener\">steal<\/a> up to $35,000,000 from an unnamed Japanese company (neither the company\u2019s name nor the exact amount stolen was disclosed by investigators).<\/p>\n<p>It\u2019s unknown which solutions (open source, paid, or even their own) the scammers used to fake voices, but in both the above cases the companies clearly suffered \u2013 badly \u2013 from deepfake fraud.\n<\/p>\n<h2>What\u2019s next?<\/h2>\n<p>\nOpinions differ about the future of deepfakes. Currently, most of this technology is in the hands of large corporations, and its availability to the public is limited. 
But as the history of much more popular generative models like <a href=\"https:\/\/openai.com\/dall-e-2\/\" target=\"_blank\" rel=\"nofollow noopener\">DALL-E<\/a>, <a href=\"https:\/\/www.midjourney.com\/\" target=\"_blank\" rel=\"nofollow noopener\">Midjourney<\/a> and <a href=\"https:\/\/stability.ai\/blog\/stable-diffusion-announcement\" target=\"_blank\" rel=\"nofollow noopener\">Stable Diffusion<\/a> shows, and even more so with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Large_language_model\" target=\"_blank\" rel=\"nofollow noopener\">large language models<\/a> (ChatGPT anybody?), similar technologies may well appear in the public domain in the foreseeable future. This is confirmed by a recent <a href=\"https:\/\/www.semianalysis.com\/p\/google-we-have-no-moat-and-neither\" target=\"_blank\" rel=\"nofollow noopener\">leak<\/a> of internal Google correspondence in which representatives of the internet giant fear they\u2019ll lose the AI race to open solutions. This will obviously result in an increase in the use of voice deepfakes \u2013 including for fraud.<\/p>\n<p>The most promising step in the development of deepfakes is real-time generation, which will ensure the explosive growth of deepfakes (and fraud based on them). Can you imagine a <a href=\"https:\/\/github.com\/iperov\/DeepFaceLive\" target=\"_blank\" rel=\"nofollow noopener\">video call<\/a> with someone whose face and voice are completely fake? <a href=\"https:\/\/blog.metaphysic.ai\/future-autoencoder-deepfakes\/\" target=\"_blank\" rel=\"nofollow noopener\">However<\/a>, this level of data processing requires huge resources only available to large corporations, so the best technologies will remain private and fraudsters won\u2019t be able to keep up with the pros. 
The high quality bar will also help users learn to identify fakes more easily.\n<\/p>\n<h2>How to protect yourself<\/h2>\n<p>\nNow back to our very first question: can we trust the voices we hear (that is \u2013 if they\u2019re not the voices in our head)? Well, being paranoid all the time and coming up with secret code words to use with friends and family is probably overdoing it; however, in more serious situations such caution might be appropriate. If the pessimistic scenario comes to pass, deepfake technology in the hands of scammers could grow into a formidable weapon in the future, but there\u2019s still time to get ready and build reliable methods of protection against counterfeiting: there\u2019s already a lot of <a href=\"https:\/\/arxiv.org\/abs\/2005.08781\" target=\"_blank\" rel=\"nofollow noopener\">research<\/a> into deepfakes, and large companies are developing <a href=\"https:\/\/venturebeat.com\/ai\/intel-unveils-real-time-deepfake-detector-claims-96-accuracy-rate\/\" target=\"_blank\" rel=\"nofollow noopener\">security solutions<\/a>. In fact, we\u2019ve already talked in detail about ways to combat video deepfakes <a href=\"https:\/\/www.kaspersky.com\/blog\/rsa2020-deepfakes-mitigation\/34006\/\" target=\"_blank\" rel=\"noopener nofollow\">here<\/a>.<\/p>\n<p>For now, protection against AI fakes is only just beginning, so it\u2019s important to keep in mind that deepfakes are just another kind of advanced social engineering. The risk of encountering fraud like this is small, but it\u2019s real, so it\u2019s worth being aware of. If you get a strange call, pay attention to the sound quality. Is the voice an unnatural monotone, is it unintelligible, or are there strange noises? 
Always double-check information through other channels, and remember that surprise and panic are what scammers rely on most.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Audio deepfakes that can mimic anyone&#8217;s voice are already being used for multi-million dollar scams. How are deepfakes made and can you protect yourself from falling victim?<\/p>\n","protected":false}}