AICurious Logo

What is: Jukebox?

SourceJukebox: A Generative Model for Music
Year2000
Data SourceCC BY-SA - https://paperswithcode.com

Jukebox is a model that generates music with singing in the raw audio domain. It tackles the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. It can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.

Three separate VQ-VAE models are trained with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors h_t\mathbf{h}\_{t}, which are then quantized to the closest codebook vectors e_z_t\mathbf{e}\_{z\_{t}}. The code z_tz\_{t} is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since it is encoding longer audio per token while keeping the codebook size the same. Audio can be reconstructed using the codes at any one of the abstraction levels, where the least abstract bottom-level codes result in the highest-quality audio.