3 seconds of recording are enough for this Microsoft AI to copy your voice

Written By tsboi team

Artificial intelligence is causing a stir at Microsoft: the firm has developed a tool called “Vall-E” that can create a replica of a voice from a three-second recording. Beyond simply reproducing a voice, this AI can also reproduce emotions.

Source: Turag Photography via Unsplash

At the beginning of 2023, the trend towards artificial intelligence and automatic generation tools is undeniable. For its part, Microsoft has created its own DALL-E 2 tool and reportedly wants to integrate ChatGPT into Bing to compete with Google. Microsoft is also said to be considering a $10 billion investment in OpenAI in order to integrate AI tools into the Office suite. The busy start to the year shows no sign of slowing: with Vall-E, Microsoft can reproduce a human voice from just three seconds of recording.

Vall-E: Microsoft’s artificial intelligence that can reproduce a voice

A few days ago, Microsoft published a scientific paper presenting “a language modeling approach to text-to-speech synthesis”. It describes a text-to-speech tool that converts text not into a robotic voice built from scratch, but into a voice modeled on a real human one. The developers say they trained their model on 60,000 hours of English speech. According to them, that is hundreds of times more than existing systems use.

Vall-E operating scheme // Source: Microsoft

With these capabilities, Vall-E “can be used to synthesize high-quality custom speech with just a 3-second recording of an unknown speaker as an acoustic prompt”. In other words, a voice can be made to say words that its owner never actually spoke. On top of that, the tool can “preserve the emotion of the speaker and the acoustic environment of the acoustic prompt in the synthesis”.

Obviously, the more samples there are, the more accurate the recreated voice will be. While the recordings generated and published by Microsoft are not entirely convincing, they were produced from only three seconds of audio. With more samples, one can imagine the AI performing better.

What can this voice-replicating speech synthesis be used for?

The presentation of Vall-E details some possible uses: “VALL-E directly enables various speech synthesis applications, such as TTS (text to speech), voice editing and content creation, in combination with other generative AI models such as GPT-3”.

However, Vall-E could be used for less than honest purposes. For several years now, deepfake technology has become increasingly accessible: it consists of altering videos or images to graft a person’s face onto a body that is not theirs, with the aim of deceiving. While Vall-E is not publicly available at the moment, Microsoft has not yet put anything in place to prevent these abuses.

The developers suggest that “speech editing models must be accompanied by relevant components, including a protocol to ensure that the speaker agrees to the editing and a system to detect edited speech”.

An explanatory diagram about Dall-E // Source: OpenAI

Now that the tool exists and the demos are encouraging, Microsoft’s biggest challenge is not technical but ethical. Public figures, some of whom are already victims of deepfakes, are likely to be the most affected. Furthermore, one can imagine Vall-E being used alongside a video deepfake tool to create scandalous fake videos.

Vall-E could also very well be used to impersonate someone over the phone. And just as automatic image generation AI has done for artists, Microsoft’s tool could jeopardize the work of many people: voice-over artists, dubbing actors, and so on.

Everyone is in the race for generative AI

At the same time, other automatic generation tools are being developed. A few weeks ago, OpenAI, the company behind ChatGPT, introduced Point-E, a tool for generating 3D models. Microsoft is far from the only tech giant in the game: Meta can generate videos from text, and Google is hard at work developing its own AI tools.

Result for “An astronaut riding a horse in a photorealistic style” // Source: OpenAI

Apple has gone even further: the company is marketing a series of audiobooks read by an artificial, AI-generated narrator. In the video game High on Life, one character was even voiced by an AI.

