{"id":596,"date":"2024-04-09T15:56:25","date_gmt":"2024-04-09T13:56:25","guid":{"rendered":"https:\/\/www.wolter.tech\/?p=596"},"modified":"2024-04-10T10:41:06","modified_gmt":"2024-04-10T08:41:06","slug":"on-deepfake-audio-fingerprints","status":"publish","type":"post","link":"https:\/\/www.wolter.tech\/?p=596","title":{"rendered":"On Deepfake audio fingerprints"},"content":{"rendered":"\n<p><a href=\"https:\/\/keithito.com\/LJ-Speech-Dataset\/\" data-type=\"link\" data-id=\"https:\/\/keithito.com\/LJ-Speech-Dataset\/\">LJSpeech<\/a> is a dataset of actual human voices. Let's start our own subjective quality evaluation by considering the example below.<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/LJ001-0001.wav\"><\/audio><\/figure>\n\n\n\n<p>Modern artificial neural networks can generate credible-sounding human speech. A MelGAN reproduction of the first LJSpeech sentence is hard to identify as such.<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/MEL_GAN_LJ001-0001_gen.wav\"><\/audio><\/figure>\n\n\n\n<p>The same is true for the HiFi-GAN version below.<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/HIFIGAN_LJ001-0001_generated.wav\"><\/audio><\/figure>\n\n\n\n<p>The audio is quite convincing. This ability will likely lead to creative new applications, for example in video games and movies. Unfortunately, the technology is also abused for <a href=\"https:\/\/edition.cnn.com\/2023\/04\/29\/us\/ai-scam-calls-kidnapping-cec\/index.html\" data-type=\"link\" data-id=\"https:\/\/edition.cnn.com\/2023\/04\/29\/us\/ai-scam-calls-kidnapping-cec\/index.html\">theft<\/a>. 
In response, our <a href=\"https:\/\/openreview.net\/pdf?id=RGewtLtvHz\" data-type=\"link\" data-id=\"https:\/\/openreview.net\/pdf?id=RGewtLtvHz\">paper<\/a> studies the automatic identification of synthetic speech. We found stable generator-specific fingerprints and trained networks that generalize well to unknown generators.<\/p>\n\n\n\n<p>The plot below illustrates mean level-14 wavelet packet coefficients for LJSpeech and MelGAN audio recordings from the <a href=\"https:\/\/arxiv.org\/pdf\/2111.02813.pdf\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/pdf\/2111.02813.pdf\">WaveFake dataset<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"325\" src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints-1024x325.png\" alt=\"\" class=\"wp-image-597\" srcset=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints-1024x325.png 1024w, https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints-300x95.png 300w, https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints-768x244.png 768w, https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints-1536x487.png 1536w, https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints-1568x497.png 1568w, https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/fingerprints.png 1737w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<p>The approach follows <a href=\"https:\/\/arxiv.org\/abs\/1812.11842\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/abs\/1812.11842\">Marra et al.<\/a>, who proposed a Fourier-transform-based method to extract deepfake-image fingerprints. 
We compute such fingerprints here for the audio deepfake samples collected in the <a href=\"https:\/\/arxiv.org\/pdf\/2111.02813.pdf\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/pdf\/2111.02813.pdf\">WaveFake dataset<\/a>. If we transform the fingerprints back into the time domain, we can listen to the result! Samples are available below. The results are exciting but not aesthetically pleasing, so please set the volume to a low value.<\/p>\n\n\n\n<p>Microphone fingerprint:<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/A_ljspeech_real.wav\"><\/audio><\/figure>\n\n\n\n<p>Deepfake (MelGAN) &#8211; fingerprint:<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/B_ljspeech_melgan.wav\"><\/audio><\/figure>\n\n\n\n<p>Deepfake (HiFi-GAN) &#8211; fingerprint:<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/C_ljspeech_hifiGAN.wav\"><\/audio><\/figure>\n\n\n\n<p>Deepfake (MelGAN-Large) &#8211; fingerprint:<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/D_ljspeech_melgan_large.wav\"><\/audio><\/figure>\n\n\n\n<p>Deepfake (Multi-Band MelGAN) &#8211; fingerprint:<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/E_ljspeech_multi_band_melgan.wav\"><\/audio><\/figure>\n\n\n\n<p>Deepfake (Avocodo) &#8211; fingerprint:<\/p>\n\n\n\n<figure class=\"wp-block-audio\"><audio controls src=\"https:\/\/www.wolter.tech\/wordpress\/wp-content\/uploads\/2024\/04\/I_ljspeech_avocodo.wav\"><\/audio><\/figure>\n\n\n\n<p>Interested? 
Our source code and full paper links are available below:<\/p>\n\n\n\n<p>Read more at: <a href=\"https:\/\/openreview.net\/pdf?id=RGewtLtvHz\">https:\/\/openreview.net\/pdf?id=RGewtLtvHz<\/a><\/p>\n\n\n\n<p>Code is available at: <a href=\"https:\/\/github.com\/gan-police\/audiodeepfake-detection\/tree\/main\">https:\/\/github.com\/gan-police\/audiodeepfake-detection\/tree\/main<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>LJSpeech is a dataset of actual human voices. Let's start our own subjective quality evaluation by considering the example below. Modern artificial neural networks can generate credible-sounding human speech. A MelGAN reproduction of the first LJSpeech sentence is hard to identify as such. The same is true for the HiFi-GAN version below. The audio is &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.wolter.tech\/?p=596\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;On Deepfake audio fingerprints&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24,5],"tags":[25,12,28],"class_list":["post-596","post","type-post","status-publish","format-standard","hentry","category-all","category-research-projects","tag-frequency-domain","tag-machine-learning","tag-wavelets","entry"],"_links":{"self":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts\/596","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=596"}],"version-history":[{"count":17,"href":"htt
ps:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts\/596\/revisions"}],"predecessor-version":[{"id":625,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts\/596\/revisions\/625"}],"wp:attachment":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=596"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=596"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=596"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}