Generative audio refers to the creation of audio files from databases of audio clips.[citation needed] This technology differs from synthesized voices such as Apple's Siri or Amazon's Alexa, which use a collection of fragments that are stitched together on demand.
Generative audio works by using neural networks to learn the statistical properties of an audio source and then reproduce those properties.[1]
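The idea of learning the statistical properties of a source and then reproducing them can be illustrated with a deliberately simplified sketch. It does not use a neural network: as a stand-in, it assumes a first-order Markov chain over quantized sample values, fitted to a synthetic sine wave. The function names (`quantize`, `fit_transitions`, `generate`) and the choice of eight quantization levels are illustrative assumptions, not part of any real system.

```python
import math
import random
from collections import Counter, defaultdict


def quantize(samples, levels=8):
    """Map floats in [-1, 1] to integer levels in [0, levels - 1]."""
    return [min(levels - 1, int((s + 1.0) / 2.0 * levels)) for s in samples]


def fit_transitions(symbols):
    """Count how often each level is followed by each other level."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(symbols, symbols[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Sample a new sequence whose transitions follow the learned counts."""
    out = [start]
    while len(out) < length:
        options = counts.get(out[-1])
        if not options:
            break  # no observed continuation for this level
        levels, weights = zip(*options.items())
        out.append(rng.choices(levels, weights=weights)[0])
    return out


# Toy "audio source": a sine wave standing in for recorded audio.
source = [math.sin(2 * math.pi * 5 * t / 200) for t in range(400)]
symbols = quantize(source)
model = fit_transitions(symbols)

# Fixed seed so the sketch is repeatable.
generated = generate(model, symbols[0], 100, random.Random(0))
```

The generated sequence is new, but every step in it is a transition that was observed in the source. Neural approaches replace the count table with a learned model that conditions on far longer context.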
With this technology, a person's voice can be replicated to speak phrases that they may never have spoken. This could lead to a synthetic version of a public figure's voice being used against them.[2]
Modern generative audio systems employ various deep learning architectures. One notable approach uses generative adversarial networks (GANs), in which two machine learning models, a generator and a discriminator, are trained against each other so that the generator learns to produce realistic audio. Other architectures include WaveNet, which uses dilated causal convolutions to model raw audio waveforms, and implementations like 15.ai, which demonstrated in 2020 the ability to clone voices using as little as 15 seconds of training data through specialized neural network architectures.[3][4]
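The dilated causal convolutions mentioned for WaveNet can be illustrated with a minimal sketch in plain Python. It is not WaveNet itself: it omits gated activations, residual connections, and learned weights, and it assumes a kernel size of 2 with the dilations doubling at each layer. The helper names (`causal_dilated_conv`, `receptive_field`, `stack`) are hypothetical.

```python
def causal_dilated_conv(signal, weights, dilation):
    """1-D causal convolution with gaps of `dilation` between taps.

    output[t] depends only on signal[t], signal[t - dilation],
    signal[t - 2 * dilation], ... and never on future samples.
    """
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start are treated as zero
                total += w * signal[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of past-and-present input samples one output sample can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


def stack(signal, weights, dilations):
    """Apply one causal dilated layer per entry in `dilations`."""
    x = list(signal)
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


dilations = [1, 2, 4, 8]  # assumed doubling schedule
impulse = [1.0] + [0.0] * 19
response = stack(impulse, [1.0, 1.0], dilations)

# The impulse at t = 0 influences exactly the next
# receptive_field(2, dilations) output samples and nothing before it.
print(receptive_field(2, dilations))
print(response)
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights grows only linearly, which is why this structure suits long raw waveforms.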