Sound-and-Image-informed Music Artwork Generation Using Text-to-Image Models
This notebook presents a pipeline for generating music artwork that is informed by contextual semantic tags extracted from audio and visual works, and shaped by user prompt engineering.
Music and its accompanying artwork have a symbiotic relationship. While some artists work in both domains, creating music and creating visual artwork require different skill sets. Deep generative models for music and image generation have the potential to democratise these mediums and make multi-modal creation more accessible to casual creators and other stakeholders. This notebook contains a co-creative pipeline for generating images to accompany a musical piece. The pipeline chains state-of-the-art models for music-to-text tagging (MusiCNN), image-to-text captioning (CLIP Interrogator), and text-to-image generation (Stable Diffusion) to recommend, via generation, visuals for a piece of music. The generated artwork is grounded in the audio of the piece, a user-provided corpus of reference artworks, and the user's own prompts.
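As a rough illustration of how these components could be chained, the sketch below combines MusiCNN audio tags, CLIP Interrogator descriptions of reference artworks, and a user prompt into a single Stable Diffusion prompt. It is a minimal, assumed composition using the `musicnn`, `clip-interrogator`, and `diffusers` packages; the prompt-combination logic, model checkpoints, and parameters here are illustrative assumptions, not the implementation presented in the paper.

```python
# Minimal sketch of a sound-and-image-informed artwork pipeline.
# NOTE: model choices, topN, and the simple comma-joined prompt are assumptions for illustration.
import torch
from PIL import Image
from musicnn.tagger import top_tags                 # music-to-text: semantic audio tags
from clip_interrogator import Config, Interrogator  # image-to-text: prompts from reference art
from diffusers import StableDiffusionPipeline       # text-to-image: artwork generation

def generate_artwork(audio_path, reference_image_paths, user_prompt=""):
    # 1. Tag the audio with MusiCNN (genre / mood / instrumentation descriptors).
    audio_tags = top_tags(audio_path, model="MTT_musicnn", topN=5, print_tags=False)

    # 2. Describe the user-provided reference artworks with CLIP Interrogator.
    ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))
    image_prompts = [ci.interrogate(Image.open(p).convert("RGB")) for p in reference_image_paths]

    # 3. Combine the user's prompt, audio tags, and artwork descriptions into one text prompt.
    prompt = ", ".join(part for part in [user_prompt, *audio_tags, *image_prompts] if part)

    # 4. Generate a candidate artwork with Stable Diffusion.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt).images[0]

# Hypothetical usage: file paths are placeholders.
artwork = generate_artwork("song.mp3", ["cover_a.jpg", "cover_b.jpg"],
                           user_prompt="album cover, oil painting")
artwork.save("generated_artwork.png")
```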
This work is presented in:
Alexander Williams, Stefan Lattner, and Mathieu Barthet. 2023. Sound-and-Image-informed Music Artwork Generation Using Text-to-Image Models. In Music Recommender Systems Workshop at the 17th ACM Conference on Recommender Systems, September 18–22, 2023, Singapore. ACM, New York, NY, USA, 5 pages.
The paper is available here: https://www.researchgate.net/publication/374263758_Sound-and-Image-informed_Music_Artwork_Generation_Using_Text-to-Image_Models