Music and AI
This project explores the potential of generating long-form songs directly from raw audio waveforms using OpenAI’s Jukebox. One of the significant challenges in this domain is getting AI models to produce meaningful, coherent musical structure for songs that last 2 to 5 minutes or longer; most current AI systems struggle to maintain structure and continuity over such extended durations.
To address this, my primary objective is to compose a song that not only demonstrates the capabilities of Jukebox but also showcases a creative approach to overcoming the structural limitations inherent in current AI music generation models. This involves devising strategies that could help in managing the complexities of long-form audio generation, ensuring the output is both musically engaging and structurally coherent.
Due to the deprecation of certain dependencies, the official interactive Google Colab for Jukebox is no longer functional. However, I discovered a refactored version of the same implementation from a member of the AI music generation community. Consequently, I have restructured the code base and am utilizing it for my training.
For sampling, I chose Radiohead’s “15 Step” as the target song, primarily because of the band’s reputation for intricate compositions and their experimental nature. If I can leverage AI to produce a song that mirrors Radiohead’s distinctive style, it will be a noteworthy accomplishment. The hyper-parameters I set are listed below, followed by a sketch of how settings like these feed into the sampling code:
· Model: 5b_lyrics
· Genre: Indie Rock
· Artist: Radiohead
· sampling_temperature: 0.96, 0.98
· mode: primed
· speed_upsampling
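For reference, the snippet below is a condensed sketch of how such settings are typically wired into the sampling code. It is paraphrased from the official Jukebox Colab rather than from the exact refactored notebook I used, so names may differ slightly; the 90-second length and the lyrics placeholder are illustrative only.

```python
# Sketch of a Jukebox sampling setup (based on the official Colab; the
# refactored notebook may differ). Assumes a CUDA GPU, the jukebox package,
# and downloaded model checkpoints.
from jukebox.make_models import make_vqvae, make_prior, MODELS
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.utils.dist_utils import setup_dist_from_mpi

rank, local_rank, device = setup_dist_from_mpi()

model = "5b_lyrics"
hps = Hyperparams()
hps.sr = 44100
hps.n_samples = 3                      # 5b_lyrics only fits a few samples per batch
hps.name = "samples"
hps.levels = 3
hps.hop_fraction = [.5, .5, .125]

vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length=1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)

# Desired output length, rounded to a whole number of top-level tokens.
sample_length_in_seconds = 90
hps.sample_length = (int(sample_length_in_seconds * hps.sr)
                     // top_prior.raw_to_tokens) * top_prior.raw_to_tokens

# Artist / genre / lyrics conditioning is passed in through "metas".
metas = [dict(artist="Radiohead",
              genre="Indie Rock",
              total_length=hps.sample_length,
              offset=0,
              lyrics="""(lyrics go here)""")] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, "cuda")]

# One set of kwargs per level; the top level is where 0.96 / 0.98 matter most.
sampling_kwargs = [dict(temp=.99, fp16=True, max_batch_size=16, chunk_size=32),
                   dict(temp=.99, fp16=True, max_batch_size=16, chunk_size=32),
                   dict(temp=.98, fp16=True, max_batch_size=3, chunk_size=16)]
```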
In my initial test, I used ChatGPT to generate lyrics in the style of a Radiohead song and fed them in as lyrical conditioning during sampling. Fortunately, my first result (test1-GPT_lyrics/level_0/item_0.wav) was quite promising: the structure remained coherent and the melody was reasonably good. However, the song became entirely disjointed after the 50-second mark. The same issue appeared in the rest of the outputs, leading me to question whether Jukebox can really sustain a robust structure over longer pieces of music, as claimed in its research paper.
I also discovered that the lyrics play a crucial role in guiding the song's composition. To investigate this further, I carried out a few experiments focused on the lyrics. In my second test (referred to as test2-nonono), I inserted the word "no" 42 times to serve as the lyrics for the entire song. The results indicated that the song tended to veer towards different styles, presumably because the word "no" is a common element in a wide variety of songs. This seemed to disrupt the model's ability to generate a more consistent piece.
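Concretely, the lyrics string for this test is just the single word repeated, passed into the lyrics field shown in the setup sketch above:

```python
# Lyrics for test2-nonono: the word "no" repeated 42 times.
lyrics = " ".join(["no"] * 42)
```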
In my third test (referred to as test3-original-lyrics), I used the original lyrics from the target song to evaluate whether the model could replicate the exact song. As expected, it could not.
For my fourth test (referred to as test4-2.5min-structure), I aimed to generate a longer output, with the hope that a more extended time frame might lead to a more structured result. Regrettably, due to memory limitations, the model ended up splicing together several smaller clips. This approach didn't contribute to the overall coherence of the full song as I had hoped.
I also tried randomly generated lyrics to see what results the model would produce.
Drawing on all the previous tests, I devised a method that divides the original song into 10 segments and uses each segment as a prompt to sample a short output (under 30 seconds), circumventing the inconsistency problem. Essentially, I "restart" the song 10 times to guarantee structural diversity and make sure every part of the original structure is represented in the final output. I then compiled the generated segments into a full song in Ableton Live; please refer to test5-fragments/15_steps-finals.wav for the resulting piece.
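A minimal sketch of the slicing step might look like this; the source filename, segment count handling, and use of librosa/soundfile are my own assumptions and are not part of Jukebox.

```python
# Slice the target song into 10 equal segments, each to be used as a
# primed-mode prompt for one short (<30 s) generation run.
import librosa
import soundfile as sf

SRC = "15_steps.wav"       # assumed filename for the target song
N_SEGMENTS = 10
SR = 44100                 # Jukebox works with 44.1 kHz audio

audio, _ = librosa.load(SRC, sr=SR, mono=True)
seg_len = len(audio) // N_SEGMENTS

for i in range(N_SEGMENTS):
    segment = audio[i * seg_len:(i + 1) * seg_len]
    # Each slice becomes the audio prompt for one short generation run.
    sf.write(f"prompt_{i:02d}.wav", segment, SR)
```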
Building upon the materials and methodology from Assignment 5 for my final project, I dedicated substantial time to understanding how to generate long-format songs that preserve an intact musical structure, and delved deeply into the Jukebox codebase. My efforts fell into two broad categories: lyrics conditioning and song prompting.
When working with lyrics conditioning, I experimented with generating both long and short pieces, using lyrics produced by ChatGPT, the original song lyrics, repetitive single-word lyrics, and lyrics randomly generated from the most frequently used words across all of the artist's albums. After several trials, it became clear that Jukebox expects the output to follow a rudimentary musical structure of intro -> verse -> outro. Because the model tries to distribute the complete lyrics across the designated length, it appears to have an optimal generation length of roughly 1.5 to 1.8 minutes, based on my observations. When the requested length is too short, such as 30 seconds, the model tries to cram all the lyrics in, resulting in poor audio quality. Conversely, if the song exceeds 2 minutes, the model struggles to incorporate more complex structures like a chorus, since an intro -> verse -> chorus -> verse -> outro arrangement would demand a more advanced model and greater computational resources. Put simply, Jukebox extends the samples it is given, populating notes for the specified duration before concluding with an outro.
Consequently, addressing the structure problem became my primary objective. I sliced the original song into ten segments and used these as prompts for Jukebox to generate numerous brief outputs, then manually stitched the outputs for each part together in a Digital Audio Workstation (DAW), aiming for the best combination while introducing some AI-induced variation. This approach preserves the "intentional structure" of the original song. Although it worked, I was motivated to push the experiment even further: to generate a long-form song without any manual editing.
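As a rough illustration of the Jukebox side of this prompting step, the snippet below is again paraphrased from the official Colab and continues the setup sketch above (hps, top_prior, labels, and sampling_kwargs are assumed to be defined there); the prompt filename is hypothetical.

```python
# Primed-mode sampling from one of the sliced prompt files (continues the
# earlier setup sketch; names may differ in the refactored notebook I used).
from jukebox.sample import _sample, load_prompts

prompt_length_in_seconds = 12          # how much of the segment to feed in
duration = (int(prompt_length_in_seconds * hps.sr)
            // top_prior.raw_to_tokens) * top_prior.raw_to_tokens

# Load one sliced segment and encode it to top-level VQ-VAE codes so the
# prior continues from it instead of starting from silence.
x = load_prompts(["prompt_00.wav"], duration, hps)
zs = top_prior.encode(x, start_level=0, end_level=hps.levels, bs_chunks=x.shape[0])

# Generate the top-level continuation; upsampling to full audio happens later.
zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
```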
Lyrics generated using the word frequency of all Radiohead songs:
feel there we from me best m from by someone surprises get look of me with ve could better won by who on about nice do light wanna out oh enough better more soul at up now where of she or oh an hurt might no good up soul d place i but nothing way little lost at because children stand to walls though how they to of moon if wake got but him should when free nothing if all we how you its where the why gone used let it are take his time all new
be free dead keep m stand no think our someone good hell where love for big running one am say am say uptight no get am something myself by why could maybe d place its gone day case m case something we everything dead everything just world myself one place alarms right going how waiting quiet new walls got man uptight ve wish d big ve something wanna been oh what get the come or your t man feel get down head now have never i keep over running eyes let nothing everyone seen gonna a
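The attached analysis script (source 7) is the authoritative version of how these lyrics were produced; as a hedged illustration of the general idea of frequency-weighted random word sampling, a minimal sketch might look like the following, where the corpus filename and output length are my own assumptions.

```python
# Illustrative sketch only: sample words at random, weighted by how often they
# appear across a lyrics corpus.
import random
import re
from collections import Counter

CORPUS = "radiohead_lyrics.txt"   # assumed: all album lyrics in one text file
N_WORDS = 100                     # length of the generated pseudo-lyrics

with open(CORPUS, encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

freq = Counter(words)
vocab, weights = zip(*freq.items())

# Frequency-weighted sampling: very common words ("no", "me", "oh") show up
# often, which matches the character of the generated lyrics above.
generated = random.choices(vocab, weights=weights, k=N_WORDS)
print(" ".join(generated))
```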
I devised a resource-intensive solution: generate as many long, low-level (fast but noisy) samples as possible, in the hope of securing a "perfect" output that would eliminate the need for editing. To accomplish this, I found a Docker image that mirrors the Jukebox development environment (https://hub.docker.com/r/btrude/jukebox-docker), which opened up the opportunity to manipulate even more hyperparameters and potentially fine-tune the model. I also found a cost-effective GPU-sharing platform, Vast.ai, which granted me access to high-performance hardware.
Nonetheless, Jukebox is a considerably slow model. After 2-3 full days of GPU time, I was only able to generate thirty 2.5-minute songs. Of these, a mere 3-5 outputs met my satisfaction; the rest either started promisingly but quickly veered off course, or never formed a reasonable structure. Interestingly, despite setting the generation length to 2.5 minutes, most songs began to fade out after the 2-minute mark, and quality noticeably declined after 1 minute. This highlighted the model's limitations, and I ultimately had to resort to editing the output samples in Ableton Live. I managed to use three generated outputs, discovering some clever tactics for weaving different pieces together. While Jukebox excels at maintaining tempo and key, my final composition still exhibits some evident gaps between sections. Identifying these obstacles was an insightful part of the process.
Sources
1. “15 Step” – Radiohead: https://genius.com/Radiohead-15-step-lyrics
2. jukebox-docker: https://hub.docker.com/r/btrude/jukebox-docker/
3. btrude’s github: https://github.com/btrude/jukebox-docker
4. Vast.ai: https://vast.ai/
5. Jukebox: https://github.com/openai/jukebox
6. Google Colab: https://colab.research.google.com/drive/1YjOmczsWqPl3rIrBJ1I7PkzqUyRtOxVI
7. My Python lyrics analysis script (in the attachment)