
A Novel Approach To Generative Music as a Service
Aimi’s generative music service has been developed with three fundamental goals in mind:
Flexible output
For a generative music service to enable a broad range of applications, it needs to be flexible enough to support short and long-form audio while maintaining coherence regardless of length. Flexible output also means supporting multilayered audio, which is critical when mixing in non-musical elements such as voice-overs and sound effects alongside original audio. Flexibility also means control over every element of the generative audio, ensuring that the output can map to target media such as videos, games, and application environments, as well as to listener tastes.
Exceptional quality
Generative music has been a topic of interest for music nerds for decades, dating back to Brian Eno, who is widely credited with coining the term. And while the prior efforts we researched fell short on quality, our team knew that building a commercially successful generative music service would require pushing beyond music that is interesting from an academic perspective to music that meets the discerning requirements of creators, developers, enterprises, and music lovers. That meant developing a system that creates exceptional music.
Respect for art
Our team is made up of lifelong musicians. We love music. We love the humanity of it. We love the creativity and the uniquely human way music communicates emotion across cultures around the globe. For us, this means respecting the amazing artists who learn to play instruments, to compose, to sing, or to produce. Respecting these artists means respecting their art and protecting their ability to continue creating it. While copyright law is not perfect, it exists, in part, to protect the humanity of music. We understood early on that this respect is antithetical to scraping or pilfering songs to train models. So we chose a different path.
Separation of Content From Structure
The process starts with our sonic vault. Multiple proprietary machine learning models ingest samples and process them to understand their audio features. Our data ingestion pipeline can determine key, tempo, instrument, frequency space, beat density, MIDI, vocals (and vocal type), timbre, style, and hundreds of features that are not human-readable. These features are used at runtime by our novel programming language, Aimi Script, which supports arranging, mixing, and mastering samples in real time. To make samples usable in generative output, we have created a new form of audio representation that facilitates ad-hoc, real-time use of samples served from the vault. These samples are combined with generative instruments (smart algorithmic instruments that can jam along with other samples) to create a magical mix of human-generated and machine-generated content. And because we commission, license, and buy audio samples on a regular basis, we feed back into the artist economy, ensuring that the creative human element of music is not only represented in our generative output but can also thrive in a new generative music economy.
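To make this concrete, here is a minimal sketch of the kind of feature extraction described above, written with the open-source librosa library. Our production pipeline uses proprietary models and far richer features; the functions and the crude key estimate below are stand-ins for illustration only.

```python
# Illustrative only: approximates the ingestion features described above
# (tempo, beat density, a key estimate, timbre) using open-source tools.
import librosa
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def extract_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=None)
    duration = librosa.get_duration(y=y, sr=sr)

    # Tempo, plus beat density expressed as beats per second.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_density = len(beat_frames) / duration

    # A crude key proxy: the pitch class carrying the most chroma energy.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    tonic = PITCH_CLASSES[int(np.argmax(chroma.mean(axis=1)))]

    # Timbre summarized as mean MFCCs; like many learned features,
    # this vector is not human-readable.
    timbre = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    return {
        "duration_s": duration,
        "tempo_bpm": float(tempo),
        "beat_density": beat_density,
        "tonic_estimate": tonic,
        "timbre_vector": timbre.tolist(),
    }
```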
This solution allows us to express arrangement using code (Aimi Script) while expressing sound using original audio recordings and generative instruments. This separation of content from structure follows the same model as traditional music and, as such, allows us to create diverse and varied arrangements of arbitrary length. This kind of flexibility is not achievable with monolithic models, whose continuity and coherence break down after a few minutes. As a result, we can transcend the confines of the traditional ‘song’ structure and instead generate music in any form, in any style, for any medium, at any length. And since we are not trained on ‘songs,’ we are able to produce music the way a producer would: from the bottom up, incorporating any audio elements needed, optimally mixing them, and then generating a final stream that can include everything from original audio and voice-overs to music, vocals, and sound effects. This is the essence of our generative music service and the backbone of our API and our products.
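Aimi Script itself is proprietary, but the idea of structure-as-code can be sketched in a few lines of Python. Everything below (the Section and Vault names, the query shapes) is invented for illustration: the arrangement is pure structure, and the actual sound is resolved from the vault only at render time.

```python
# Illustrative sketch: structure as code, content as audio.
from dataclasses import dataclass

@dataclass
class Section:
    name: str
    bars: int
    layers: list  # feature queries, e.g. {"instrument": "bass"}

class Vault:
    """Holds samples as feature dicts produced by the ingestion pipeline."""
    def __init__(self, samples: list):
        self.samples = samples

    def pick(self, query: dict) -> dict:
        # A real vault would also filter on key/tempo compatibility and
        # key-shift or time-stretch the audio; that is elided here.
        for sample in self.samples:
            if all(sample.get(k) == v for k, v in query.items()):
                return sample
        raise LookupError(f"no sample matching {query}")

def render(vault: Vault, structure: list):
    """Walk the coded structure, resolving each layer to concrete content."""
    for section in structure:
        yield section.name, section.bars, [vault.pick(q) for q in section.layers]

# The same structure can repeat or be regenerated indefinitely,
# which is what makes arbitrary-length output coherent.
structure = [
    Section("intro", bars=8, layers=[{"instrument": "pad"}]),
    Section("groove", bars=16, layers=[{"instrument": "drums"},
                                       {"instrument": "bass"}]),
]
vault = Vault([{"instrument": "pad", "key": "A"},
               {"instrument": "drums", "tempo": 124},
               {"instrument": "bass", "key": "A"}])
for name, bars, content in render(vault, structure):
    print(name, bars, content)
```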
Research & Development at Aimi – A Look Ahead
The benefits of separating content from structure extend beyond the utility outlined above. This separation means we can apply machine learning solutions optimized for content separately from those optimized for structure. For example, we have developed models that can generate scripts on the fly. These scripts can serve as the basis for arrangements or facilitate large-scale forms for media such as film. We have also developed a novel pattern language that allows us to express musical patterns that can, in turn, be used to program generative instruments that jam along with traditional samples from our vault. And since we extract detailed features from every sample in our vault, we can also create new samples by transferring the sound of one sample onto the melodic form of another. This recombinatorial explosion allows us to use content in more ways, giving us more flexibility in the expressiveness of our platform.
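As a rough illustration of what a pattern language enables, here is a toy notation invented for this post (not our actual pattern language): ‘x’ triggers a note, ‘.’ is a rest, and a generative instrument expanding the pattern re-voices it against the key and tempo features of whatever samples it is jamming with.

```python
# Toy pattern notation, invented for illustration only.
def pattern_to_events(pattern: str, tonic_midi: int, tempo_bpm: float):
    """Expand a step pattern into (onset_seconds, midi_note) events."""
    step = 60.0 / tempo_bpm / 4      # a sixteenth-note grid
    minor_pent = [0, 3, 5, 7, 10]    # pentatonic intervals over the tonic
    events = []
    for i, ch in enumerate(pattern):
        if ch == "x":
            note = tonic_midi + minor_pent[i % len(minor_pent)]
            events.append((i * step, note))
    return events

# Jam along with a vault sample whose extracted features say:
# tonic A (MIDI 57) at 124 BPM. The same pattern re-voices itself
# for any key and tempo the mix calls for.
print(pattern_to_events("x..x..x.x...x..x", tonic_midi=57, tempo_bpm=124.0))
```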
While all of this research is exciting and promises to both expand the capabilities of Aimi’s platform and improve the quality of the music generated, our fundamental belief that music is a uniquely human creation will remain the same.
