Transcription

Jukebox: A Generative Model for Music

Prafulla Dhariwal*, Heewoo Jun*, Christine Payne*, Jong Wook Kim, Alec Radford, Ilya Sutskever
*Equal contribution. OpenAI, San Francisco. Correspondence to: [email protected]

Abstract

We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and model those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non-cherry-picked samples, along with model weights and code.

1. Introduction

Music is an integral part of human culture, existing from the earliest periods of human civilization and evolving into a wide diversity of forms. It evokes a unique human spirit in its creation, and the question of whether computers can ever capture this creative process has fascinated computer scientists for decades. We have had algorithms generating piano sheet music (Hiller Jr & Isaacson, 1957; Moorer, 1972; Hadjeres et al., 2017; Huang et al., 2017), digital vocoders generating a singer's voice (Bonada & Serra, 2007; Saino et al., 2006; Blaauw & Bonada, 2017), and synthesizers producing timbres for various musical instruments (Engel et al., 2017; 2019). Each captures a specific aspect of music generation: melody, composition, timbre, and the human voice singing. However, a single system to do it all remains elusive.

The field of generative models has made tremendous progress in the last few years. One of the aims of generative modeling is to capture the salient aspects of the data and to generate new instances indistinguishable from the true data. The hypothesis is that by learning to produce the data we can learn the best features of the data [1]. We are surrounded by highly complex distributions in the visual, audio, and text domains, and in recent years we have developed advances in text generation (Radford et al.), speech generation (Xie et al., 2017), and image generation (Brock et al., 2019; Razavi et al., 2019). The rate of progress in this field has been rapid: only a few years ago we had algorithms producing blurry faces (Kingma & Welling, 2014; Goodfellow et al., 2014), but now we can generate high-resolution faces indistinguishable from real ones (Zhang et al., 2019b).

Generative models have been applied to the music generation task too. Earlier models generated music symbolically in the form of a pianoroll, which specifies the timing, pitch, velocity, and instrument of each note to be played (Yang et al., 2017; Dong et al., 2018; Huang et al., 2019a; Payne, 2019; Roberts et al., 2018; Wu et al., 2019). The symbolic approach makes the modeling problem easier by working in a lower-dimensional space. However, it constrains the music that can be generated to a specific sequence of notes and a fixed set of instruments to render with. In parallel, researchers have been pursuing the non-symbolic approach, where they try to produce music directly as a piece of audio. This makes the problem more challenging, as the space of raw audio is extremely high dimensional, with a large amount of information content to model.
There has been some success, with models producing piano pieces either in the raw audio domain (Oord et al., 2016; Mehri et al., 2017; Yamamoto et al., 2020) or in the spectrogram domain (Vasquez & Lewis, 2019). The key bottleneck is that modeling the raw audio directly introduces extremely long-range dependencies, making it computationally challenging to learn the high-level semantics of music. A way to reduce the difficulty is to learn a lower-dimensional encoding of the audio, with the goal of losing the less important information but retaining most of the musical information. This approach has demonstrated some success in generating short instrumental pieces restricted to a set of a few instruments (Oord et al., 2017; Dieleman et al., 2018).

In this work, we show that we can use state-of-the-art deep generative models to produce a single system capable of generating diverse high-fidelity music in the raw audio domain, with long-range coherence spanning multiple minutes. Our approach uses a hierarchical VQ-VAE architecture (Razavi et al., 2019) to compress audio into a discrete space, with a loss function designed to retain the maximum amount of musical information, while doing so at increasing levels of compression. We use an autoregressive Sparse Transformer (Child et al., 2019; Vaswani et al., 2017) trained with maximum-likelihood estimation over this compressed space, and also train autoregressive upsamplers to recreate the lost information at each level of compression.

We show that our models can produce songs from highly diverse genres of music like rock, hip-hop, and jazz. They can capture melody, rhythm, long-range composition, and timbres for a wide variety of instruments, as well as the styles and voices of singers to be produced with the music. We can also generate novel completions of existing songs. Our approach allows the option to influence the generation process: by swapping the top prior with a conditional prior, we can condition on lyrics to tell the singer what to sing, or on midi to control the composition. We release our model weights and training and sampling code at https://github.com/openai/jukebox.

[1] Richard Feynman famously said, "What I cannot create, I do not understand."

2. Background

We consider music in the raw audio domain represented as a continuous waveform x ∈ [−1, 1]^T, where the number of samples T is the product of the audio duration and the sampling rate, which typically ranges from 16 kHz to 48 kHz. For music, CD-quality audio (44.1 kHz samples stored in 16-bit precision) is typically enough to capture the range of frequencies perceptible to humans. As an example, a four-minute-long audio segment has an input length of about 10 million, where each position can hold 16 bits of information. In comparison, a high-resolution RGB image with 1024 × 1024 pixels has an input length of about 3 million, and each position has 24 bits of information. This makes learning a generative model for music extremely computationally demanding with increasingly longer durations; we have to capture a wide range of musical structures, from timbre to global coherence, while simultaneously modeling a large amount of diversity.

2.1. VQ-VAE

To make this task feasible, we use the VQ-VAE (Oord et al., 2017; Dieleman et al., 2018; Razavi et al., 2019) to compress raw audio to a lower-dimensional space. A one-dimensional VQ-VAE learns to encode an input sequence x = ⟨x_t⟩_{t=1..T} using a sequence of discrete tokens z = ⟨z_s ∈ [K]⟩_{s=1..S}, where K denotes the vocabulary size and we call the ratio T/S the hop length. It consists of an encoder E(x) which encodes x into a sequence of latent vectors h = ⟨h_s⟩_{s=1..S}, a bottleneck that quantizes h_s ↦ e_{z_s} by mapping each h_s to its nearest vector e_{z_s} from a codebook C = {e_k}_{k=1..K}, and a decoder D(e) that decodes the embedding vectors back to the input space. It is thus an auto-encoder with a discretization bottleneck. The VQ-VAE is trained using the following objective:

    L = L_recons + L_codebook + β L_commit                   (1)
    L_recons = (1/T) Σ_t ‖x_t − D(e_{z_t})‖₂²                (2)
    L_codebook = (1/S) Σ_s ‖sg[h_s] − e_{z_s}‖₂²             (3)
    L_commit = (1/S) Σ_s ‖h_s − sg[e_{z_s}]‖₂²               (4)

where sg denotes the stop-gradient operation, which passes zero gradient during backpropagation. The reconstruction loss L_recons penalizes the distance between the input x and the reconstructed output x̂ = D(e_z), and L_codebook penalizes the codebook for the distance between the encodings h and their nearest neighbors e_z from the codebook. To stabilize the encoder, we also add L_commit to prevent the encodings from fluctuating too much, where the weight β controls the contribution of this loss. To speed up training, the codebook loss L_codebook is replaced in practice with EMA updates over the codebook variables.
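To make the objective concrete, the following is a minimal PyTorch-style sketch of the quantization step and the three loss terms of Eq. (1)-(4) for a single level. It is an illustration rather than the released implementation: the tensor shapes, mean reductions, and argument names are assumptions, and the straight-through gradient trick and the EMA codebook updates used in practice are only noted in comments.

```python
import torch
import torch.nn.functional as F

def vqvae_losses(x, x_hat, h, codebook, beta):
    """Sketch of Eq. (1)-(4) for one VQ-VAE level.

    x, x_hat : (B, T) raw audio and its reconstruction D(e_z)
    h        : (B, S, d) encoder outputs E(x)
    codebook : (K, d) embedding vectors e_k
    beta     : weight of the commitment loss
    Reductions here are plain means (constant factors off Eq. 2-4); the paper
    further replaces L_codebook with EMA codebook updates, and the decoder is
    fed h + (e_z - h).detach() so gradients pass straight through the bottleneck.
    """
    # Nearest-neighbour quantization: z_s = argmin_k ||h_s - e_k||^2
    d2 = (h.pow(2).sum(-1, keepdim=True)
          - 2 * h @ codebook.t()
          + codebook.pow(2).sum(-1))          # (B, S, K) squared distances
    z = d2.argmin(dim=-1)                      # (B, S) discrete codes
    e_z = codebook[z]                          # (B, S, d) quantized vectors

    l_recons = F.mse_loss(x_hat, x)                    # Eq. (2)
    l_codebook = F.mse_loss(e_z, h.detach())           # Eq. (3), sg[h]
    l_commit = F.mse_loss(h, e_z.detach())             # Eq. (4), sg[e_z]
    return l_recons + l_codebook + beta * l_commit     # Eq. (1)
```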
Razavi et al. (2019) extend this to a hierarchical model where they train a single encoder and decoder but break up the latent sequence h into a multi-level representation [h^(1), · · · , h^(L)] with decreasing sequence lengths, each learning its own codebook C^(l). They use non-autoregressive encoder-decoders and jointly train all levels with a simple mean-squared loss.

3. Music VQ-VAE

Inspired by the results from the hierarchical VQ-VAE model (Razavi et al., 2019) for images, we consider applying the same technique to model raw audio using three different levels of abstraction, as illustrated in Figure 1. At each level, we use residual networks consisting of WaveNet-style non-causal 1-D dilated convolutions, interleaved with downsampling and upsampling 1-D convolutions to match different hop lengths. A detailed description of the architecture is provided in Appendix B.1. We make a number of modifications to our VQ-VAE compared to the ones in (Oord et al., 2017; Razavi et al., 2019), as described in the following subsections.

Figure 1. [Diagram: x_t → h_t = E(x_t) → quantization z_t = argmin_k ‖h_t − e_k‖ → codebook lookup e_{z_t} → decode x̂_t = D(e_{z_t}).] We first train three separate VQ-VAE models with different temporal resolutions. At each level, the input audio is segmented and encoded into latent vectors h_t, which are then quantized to the closest codebook vectors e_{z_t}. The code z_t is a discrete representation of the audio that we later train our prior on. The decoder takes the sequence of codebook vectors and reconstructs the audio. The top level learns the highest degree of abstraction, since it is encoding longer audio per token while keeping the codebook size the same. Audio can be reconstructed using the codes at any one of the abstraction levels, where the least abstract bottom-level codes result in the highest-quality audio, as shown in Figure 4. For the detailed structure of each component, see Figure 7.

3.1. Random restarts for embeddings

VQ-VAEs are known to suffer from codebook collapse, wherein all encodings get mapped to a single or few embedding vectors while the other embedding vectors in the codebook are not used, reducing the information capacity of the bottleneck. To prevent this, we use random restarts: when the mean usage of a codebook vector falls below a threshold, we randomly reset it to one of the encoder outputs from the current batch. This ensures all vectors in the codebook are being used and thus have a gradient to learn from, mitigating codebook collapse.
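A rough sketch of this heuristic is shown below; it is not the released code. It tracks an exponential moving average of per-code usage and re-initializes under-used entries from encoder outputs in the current batch. The decay and threshold values, and the exact usage statistic, are placeholder assumptions.

```python
import torch

@torch.no_grad()
def restart_dead_codes(codebook, usage_ema, h_batch, z_batch,
                       decay=0.99, threshold=1.0):
    """Randomly restart under-used codebook vectors (Sec. 3.1 sketch).

    codebook  : (K, d) embedding vectors, updated in place
    usage_ema : (K,) running estimate of how often each code is selected
    h_batch   : (N, d) encoder outputs from the current batch, flattened
    z_batch   : (N,) codes selected for this batch
    decay, threshold : placeholder hyperparameters, not the paper's values
    """
    K = codebook.size(0)
    counts = torch.bincount(z_batch, minlength=K).float()
    usage_ema.mul_(decay).add_(counts, alpha=1 - decay)

    dead = usage_ema < threshold              # codes whose mean usage fell too low
    n_dead = int(dead.sum())
    if n_dead > 0:
        # Reset each dead code to a randomly chosen encoder output from this batch
        idx = torch.randint(0, h_batch.size(0), (n_dead,))
        codebook[dead] = h_batch[idx]
        usage_ema[dead] = threshold           # give restarted codes a grace period
    return codebook, usage_ema
```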

3.2. Separated Autoencoders

When using the hierarchical VQ-VAE from (Razavi et al., 2019) for raw audio, we observed that the bottlenecked top level is utilized very little and sometimes experiences a complete collapse, as the model decides to pass all information through the less bottlenecked lower levels. To maximize the amount of information stored at each level, we simply train separate autoencoders with varying hop lengths. Discrete codes from each level can be treated as independent encodings of the input at different levels of compression.

3.3. Spectral Loss

When using only the sample-level reconstruction loss, the model learns to reconstruct low frequencies only. To capture mid-to-high frequencies, we add a spectral loss, defined as

    L_spec = ‖ |STFT(x)| − |STFT(x̂)| ‖₂

It encourages the model to match the spectral components without paying attention to phase, which is more difficult to learn. This is similar to the use of power loss (Oord et al., 2018) and spectral convergence (Arık et al., 2018b) when training parallel decoders for raw audio. One difference between the latter approach and ours is that we are no longer optimizing the spectral signal-to-noise ratio; dividing by the magnitude of the signal results in numerical instability for mostly silent inputs. To prevent the model from overfitting to a particular choice of STFT parameters, we use the sum of the spectral losses L_spec calculated over multiple STFT parameters that trade off time and frequency resolutions (Yamamoto et al., 2020).
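This multi-scale version of the loss could be computed roughly as in the sketch below, which compares STFT magnitudes at several resolutions for audio tensors of shape (batch, samples). The particular FFT sizes and hop lengths are placeholder choices, not the paper's settings.

```python
import torch

def multiscale_spectral_loss(x, x_hat, fft_sizes=(512, 1024, 2048)):
    """Sum of spectral losses over several STFT resolutions (Sec. 3.3 sketch).

    Only STFT magnitudes are compared, so phase (which is harder to learn)
    is ignored, and no division by the signal magnitude is performed,
    avoiding instability on mostly silent inputs.
    fft_sizes are illustrative, not the paper's exact parameters.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        window = torch.hann_window(n_fft, device=x.device)
        def spec(a):
            # Magnitude spectrogram at this resolution
            return torch.stft(a, n_fft=n_fft, hop_length=hop,
                              window=window, return_complex=True).abs()
        loss = loss + torch.norm(spec(x) - spec(x_hat), p=2)
    return loss
```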
4. Music Priors and Upsamplers

After training the VQ-VAE, we need to learn a prior p(z) over the compressed space to generate samples. We break up the prior model as

    p(z) = p(z^top, z^middle, z^bottom)                                    (5)
         = p(z^top) p(z^middle | z^top) p(z^bottom | z^middle, z^top)      (6)

and train separate models for the top-level prior p(z^top), and upsamplers p(z^middle | z^top) and p(z^bottom | z^middle, z^top). Each of these is an autoregressive modeling problem in the discrete token space produced by the VQ-VAE. We use Transformers with sparse attention (Vaswani et al., 2017; Child et al., 2019), as they are currently the SOTA in autoregressive modeling. We propose a simplified version, which we call the Scalable Transformer, that is easier to implement and scale (see Appendix A for details).

For the upsamplers, we need to provide the autoregressive Transformers with conditioning information from the codes of the upper levels. To do so, we use a deep residual WaveNet (Xie et al., 2017) followed by an upsampling strided convolution and a layer norm (Ba et al., 2016), and add the output as extra positional information to the embeddings of the current level. We condition the lower levels only on the chunk of upper-level codes that corresponds to the same segment of raw audio.
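One possible reading of this conditioning pathway is sketched below, assuming the upper-level codes have already been embedded: a stack of dilated residual 1-D convolutions, a strided transposed convolution that upsamples to the lower level's token rate, and a layer norm whose output is added to the lower-level token embeddings. The module widths, depth, stride, and class name are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class UpperLevelConditioner(nn.Module):
    """Sketch of the Sec. 4 upsampler conditioner: dilated residual 1-D convs
    over embedded upper-level codes, a strided transposed conv that upsamples
    to the lower level's token rate, and a LayerNorm. The result is added to
    the lower-level embeddings as extra positional information."""

    def __init__(self, width=512, depth=4, stride=4):
        super().__init__()
        self.res_convs = nn.ModuleList([
            nn.Conv1d(width, width, kernel_size=3,
                      dilation=3 ** i, padding=3 ** i)   # length-preserving
            for i in range(depth)
        ])
        self.upsample = nn.ConvTranspose1d(width, width,
                                           kernel_size=stride, stride=stride)
        self.norm = nn.LayerNorm(width)

    def forward(self, upper_emb):
        # upper_emb: (B, S_upper, width) embeddings of the upper-level codes
        x = upper_emb.transpose(1, 2)          # (B, width, S_upper) for conv1d
        for conv in self.res_convs:
            x = x + torch.relu(conv(x))        # simplified residual block
        x = self.upsample(x)                   # (B, width, S_upper * stride)
        return self.norm(x.transpose(1, 2))    # (B, S_lower, width)

# Usage sketch: lower_emb = token_emb + pos_emb + conditioner(upper_emb)
```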

At each level, we use Transformers over the same context length of discrete codes, which corresponds to increasing lengths of raw audio at the larger hop lengths; this models longer temporal dependencies at the higher levels while keeping the same computational footprint for training each level. As our VQ-VAE is convolutional, we can use the same VQ-VAE to produce codes for arbitrary lengths of audio.

[Figure panels: Conditioning Information, Top-Level Prior, Middle Upsampler]

4.1. Artist, Genre, and Timing Conditioning