Music and AI
This project explores the potential of generating long-form songs directly from raw audio waveforms using OpenAI’s Jukebox. One of the significant challenges in this domain is getting AI models to produce meaningful, coherent musical structure for songs that last 2 to 5 minutes or longer; most current AI systems struggle to maintain structure and continuity over such extended durations.
To address this, my primary objective is to compose a song that not only demonstrates the capabilities of Jukebox but also showcases a creative approach to overcoming the structural limitations inherent in current AI music generation models. This involves devising strategies that could help in managing the complexities of long-form audio generation, ensuring the output is both musically engaging and structurally coherent.
Due to the deprecation of certain dependencies, the official interactive Google Colab for Jukebox is no longer functional. However, I discovered a refactored version of the same implementation from a member of the AI music generation community. Consequently, I have restructured the code base and am utilizing it for my training.
For sampling, I chose Radiohead’s “15 Step” as the target song, primarily because of the band’s reputation for intricate compositions and their experimental nature. If I can leverage AI to produce a song that mirrors Radiohead’s distinctive style, it will be a noteworthy accomplishment. The hyper-parameters I set are listed below, followed by a sketch of how settings like these feed into the sampling code:
· Model: 5b_lyrics
· Genre: Indie Rock
· Artist: Radiohead
· sampling_temperature: 0.96, 0.98
· mode: primed
· speed_upsampling
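For reference, the snippet below is a condensed sketch of how such settings are typically wired into the sampling code. It is paraphrased from the official Jukebox Colab rather than from the exact refactored notebook I used, so names may differ slightly; the 90-second length and the lyrics placeholder are illustrative only.

```python
# Sketch of a Jukebox sampling setup (based on the official Colab; the
# refactored notebook may differ). Assumes a CUDA GPU, the jukebox package,
# and downloaded model checkpoints.
from jukebox.make_models import make_vqvae, make_prior, MODELS
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.utils.dist_utils import setup_dist_from_mpi

rank, local_rank, device = setup_dist_from_mpi()

model = "5b_lyrics"
hps = Hyperparams()
hps.sr = 44100
hps.n_samples = 3                      # 5b_lyrics only fits a few samples per batch
hps.name = "samples"
hps.levels = 3
hps.hop_fraction = [.5, .5, .125]

vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length=1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)

# Desired output length, rounded to a whole number of top-level tokens.
sample_length_in_seconds = 90
hps.sample_length = (int(sample_length_in_seconds * hps.sr)
                     // top_prior.raw_to_tokens) * top_prior.raw_to_tokens

# Artist / genre / lyrics conditioning is passed in through "metas".
metas = [dict(artist="Radiohead",
              genre="Indie Rock",
              total_length=hps.sample_length,
              offset=0,
              lyrics="""(lyrics go here)""")] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, "cuda")]

# One set of kwargs per level; the top level is where 0.96 / 0.98 matter most.
sampling_kwargs = [dict(temp=.99, fp16=True, max_batch_size=16, chunk_size=32),
                   dict(temp=.99, fp16=True, max_batch_size=16, chunk_size=32),
                   dict(temp=.98, fp16=True, max_batch_size=3, chunk_size=16)]
```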
In my initial test, I used ChatGPT to generate lyrics in the style of a Radiohead song and fed them in as lyrical conditioning during sampling. Fortunately, my first result (test1-GPT_lyrics/level_0/item_0.wav) was quite promising: the structure remained coherent and the melody was reasonably good. However, the song became entirely disjointed after the 50-second mark. The same issue appeared in the rest of the outputs, leading me to question whether Jukebox can really sustain a robust structure over longer pieces of music, as claimed in its research paper.
I also discovered that the lyrics play a crucial role in guiding the song's composition. To investigate this further, I carried out a few experiments focused on the lyrics. In my second test (referred to as test2-nonono), I inserted the word "no" 42 times to serve as the lyrics for the entire song. The results indicated that the song tended to veer towards different styles, presumably because the word "no" is a common element in a wide variety of songs. This seemed to disrupt the model's ability to generate a more consistent piece.
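Concretely, the lyrics string for this test is just the single word repeated, passed into the lyrics field shown in the setup sketch above:

```python
# Lyrics for test2-nonono: the word "no" repeated 42 times.
lyrics = " ".join(["no"] * 42)
```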
In my third test (referred to as test3-original-lyrics), I used the original lyrics from the target song to evaluate whether the model could replicate the exact song. As expected, it could not.
For my fourth test (referred to as test4-2.5min-structure), I aimed to generate a longer output, with the hope that a more extended time frame might lead to a more structured result. Regrettably, due to memory limitations, the model ended up splicing together several smaller clips. This approach didn't contribute to the overall coherence of the full song as I had hoped.
I also tried randomly generated lyrics to see what results the model would produce.
Drawing on all the previous tests, I devised a method that divides the original song into 10 segments and uses each segment as a prompt to sample a short output (under 30 seconds), circumventing the inconsistency problem. Essentially, I "restart" the song 10 times to guarantee structural diversity and make sure every part of the original structure is represented in the final output. I then compiled the generated segments into a full song in Ableton Live; please refer to test5-fragments/15_steps-finals.wav for the resulting piece.
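A minimal sketch of the slicing step might look like this; the source filename, segment count handling, and use of librosa/soundfile are my own assumptions and are not part of Jukebox.

```python
# Slice the target song into 10 equal segments, each to be used as a
# primed-mode prompt for one short (<30 s) generation run.
import librosa
import soundfile as sf

SRC = "15_steps.wav"       # assumed filename for the target song
N_SEGMENTS = 10
SR = 44100                 # Jukebox works with 44.1 kHz audio

audio, _ = librosa.load(SRC, sr=SR, mono=True)
seg_len = len(audio) // N_SEGMENTS

for i in range(N_SEGMENTS):
    segment = audio[i * seg_len:(i + 1) * seg_len]
    # Each slice becomes the audio prompt for one short generation run.
    sf.write(f"prompt_{i:02d}.wav", segment, SR)
```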
Building upon the materials and methodology from Assignment 5 for my final project, I dedicated substantial time to understanding how to generate long-format songs that preserve an intact musical structure, and delved deeply into the Jukebox codebase. My efforts fell into two broad categories: lyrics conditioning and song prompting.
When working with lyrics conditioning, I experimented with generating both long and short pieces, using lyrics produced by ChatGPT, the original song lyrics, repetitive single-word lyrics, and lyrics randomly generated from the most frequently used words across all of the artist's albums. After several trials, it became clear that Jukebox expects the output to follow a rudimentary musical structure of intro -> verse -> outro. Because the model tries to distribute the complete lyrics across the designated length, it appears to have an optimal generation length of roughly 1.5 to 1.8 minutes, based on my observations. When the requested length is too short, such as 30 seconds, the model tries to cram all the lyrics in, resulting in poor audio quality. Conversely, if the song exceeds 2 minutes, the model struggles to incorporate more complex structures like a chorus, since an intro -> verse -> chorus -> verse -> outro arrangement would demand a more advanced model and greater computational resources. Put simply, Jukebox extends the samples it is given, populating notes for the specified duration before concluding with an outro.
Consequently, addressing the structure problem became my primary objective. I sliced the original song into ten segments and used these as prompts for Jukebox to generate numerous brief outputs, then manually stitched the outputs for each part together in a Digital Audio Workstation (DAW), aiming for the best combination while introducing some AI-induced variation. This approach preserves the "intentional structure" of the original song. Although it worked, I was motivated to push the experiment even further: to generate a long-form song without any manual editing.
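As a rough illustration of the Jukebox side of this prompting step, the snippet below is again paraphrased from the official Colab and continues the setup sketch above (hps, top_prior, labels, and sampling_kwargs are assumed to be defined there); the prompt filename is hypothetical.

```python
# Primed-mode sampling from one of the sliced prompt files (continues the
# earlier setup sketch; names may differ in the refactored notebook I used).
from jukebox.sample import _sample, load_prompts

prompt_length_in_seconds = 12          # how much of the segment to feed in
duration = (int(prompt_length_in_seconds * hps.sr)
            // top_prior.raw_to_tokens) * top_prior.raw_to_tokens

# Load one sliced segment and encode it to top-level VQ-VAE codes so the
# prior continues from it instead of starting from silence.
x = load_prompts(["prompt_00.wav"], duration, hps)
zs = top_prior.encode(x, start_level=0, end_level=hps.levels, bs_chunks=x.shape[0])

# Generate the top-level continuation; upsampling to full audio happens later.
zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
```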
Lyrics generated using the word frequency of all Radiohead songs:
feel there we from me best m from by someone surprises get look of me with ve could better won by who on about nice do light wanna out oh enough better more soul at up now where of she or oh an hurt might no good up soul d place i but nothing way little lost at because children stand to walls though how they to of moon if wake got but him should when free nothing if all we how you its where the why gone used let it are take his time all new
be free dead keep m stand no think our someone good hell where love for big running one am say am say uptight no get am something myself by why could maybe d place its gone day case m case something we everything dead everything just world myself one place alarms right going how waiting quiet new walls got man uptight ve wish d big ve something wanna been oh what get the come or your t man feel get down head now have never i keep over running eyes let nothing everyone seen gonna a
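The attached analysis script (source 7) is the authoritative version of how these lyrics were produced; as a hedged illustration of the general idea of frequency-weighted random word sampling, a minimal sketch might look like the following, where the corpus filename and output length are my own assumptions.

```python
# Illustrative sketch only: sample words at random, weighted by how often they
# appear across a lyrics corpus.
import random
import re
from collections import Counter

CORPUS = "radiohead_lyrics.txt"   # assumed: all album lyrics in one text file
N_WORDS = 100                     # length of the generated pseudo-lyrics

with open(CORPUS, encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

freq = Counter(words)
vocab, weights = zip(*freq.items())

# Frequency-weighted sampling: very common words ("no", "me", "oh") show up
# often, which matches the character of the generated lyrics above.
generated = random.choices(vocab, weights=weights, k=N_WORDS)
print(" ".join(generated))
```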
I devised a resource-intensive solution: generate as many long, low-level (fast but noisy) samples as possible, in the hope of securing a "perfect" output that would eliminate the need for editing. To accomplish this, I found a Docker image that mirrors the Jukebox development environment (https://hub.docker.com/r/btrude/jukebox-docker), which opened up the opportunity to manipulate even more hyperparameters and potentially fine-tune the model. I also found a cost-effective GPU-sharing platform, Vast.ai, which granted me access to high-performance hardware.
Nonetheless, Jukebox is a considerably slow model. After 2-3 full days of GPU time, I was only able to generate thirty 2.5-minute songs. Of these, a mere 3-5 outputs met my satisfaction; the rest either started promisingly but quickly veered off course, or never formed a reasonable structure. Interestingly, despite setting the generation length to 2.5 minutes, most songs began to fade out after the 2-minute mark, and quality noticeably declined after 1 minute. This highlighted the model's limitations, and I ultimately had to resort to editing the output samples in Ableton Live. I managed to use three generated outputs, discovering some clever tactics for weaving different pieces together. While Jukebox excels at maintaining tempo and key, my final composition still exhibits some evident gaps between sections. Identifying these obstacles was an insightful part of the process.
Sources
1. “15 Step” – Radiohead: https://genius.com/Radiohead-15-step-lyrics
2. jukebox-docker: https://hub.docker.com/r/btrude/jukebox-docker/
3. btrude’s github: https://github.com/btrude/jukebox-docker
4. Vast.ai: https://vast.ai/
5. Jukebox: https://github.com/openai/jukebox
6. Google Colab: https://colab.research.google.com/drive/1YjOmczsWqPl3rIrBJ1I7PkzqUyRtOxVI
7. My Python lyrics analysis script (in the attachment)