Computer Generated Album Art
Abstract
Since K-pop albums are often strongly associated with their visual branding and concept, we believe it would be interesting to use the covers of top K-pop albums to generate new concepts (to clarify, we use "K-pop" to mean Korean "idol music" rather than all Korean pop songs). For this dataset, we plan on using the Gaon Album Chart to find the top-selling albums of the 2010s.
To achieve this, our technical process uses JupyterHub along with several libraries, including TensorFlow, Keras, Pillow, and scikit-learn, to build a deep convolutional generative adversarial network (DC-GAN) that generates new album art and saves the results as JPEG images.
Data and Model
Since our main dataset is limited to roughly 500 popular covers, we will also collect a second, larger and more general dataset for additional exploration by scraping K-pop album art from Google and Bing image search. To do this, we will use the icrawler Python library with English and Korean keywords such as 'kpop album art' to download around 500 additional covers.
Our generative adversarial model is based on MLWhiz's DC-GAN architecture, which was used to generate new anime characters. The idea behind the GAN originates from the paper "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Alec Radford, Luke Metz, and Soumith Chintala, which describes a more stable training setup in which a generator network creates fake images. The discriminator, in turn, consists of a stack of convolutional layers followed by a dense layer that predicts whether a given image is real or fake. This model provided the basis for our exploration and generative process.
Code
Data Acquisition
For the data acquisition portion of our code, we used icrawler's built-in crawlers, importing GoogleImageCrawler and BingImageCrawler. We ran these crawlers with specific keywords in both English and Korean, such as '한국 앨범 표지' ('Korean album cover'), to download a set of over 500 images. We ran the code twice, and because a few of the downloaded images were not actually album covers, we deleted those by hand.
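A minimal sketch of this crawling step is shown below. The exact keyword list, per-keyword download counts, and output directories are illustrative assumptions rather than our exact project values.

```python
# Sketch of the image-crawling step using icrawler's built-in crawlers.
# Keywords, max_num, and output directories are illustrative values.
from icrawler.builtin import GoogleImageCrawler, BingImageCrawler

keywords = ['kpop album art', '한국 앨범 표지']  # English and Korean search terms

for i, keyword in enumerate(keywords):
    # download up to 250 results per keyword from each search engine,
    # keeping each keyword in its own folder so filenames don't collide
    google_crawler = GoogleImageCrawler(storage={'root_dir': f'data/google/{i}'})
    google_crawler.crawl(keyword=keyword, max_num=250)

    bing_crawler = BingImageCrawler(storage={'root_dir': f'data/bing/{i}'})
    bing_crawler.crawl(keyword=keyword, max_num=250)
```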
Preprocessing and Neural Networks
For the analysis portion of our code, we first had to preprocess the datasets so that every album image had the same aspect ratio and resolution. To achieve this, we repurposed a thumbnail function to resize every image to 512 by 512 pixels, giving each one the square aspect ratio of a typical album cover. After normalizing the images, we imported Keras, imageio, Pillow, and TensorFlow to build helper functions for the overall generator and discriminator models.
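The snippet below is a minimal sketch of this preprocessing step, assuming the covers sit in a single directory. The directory path, file-type filter, and the [-1, 1] pixel scaling (the usual range for a tanh generator output) are our assumptions rather than exact project code.

```python
# Sketch of the preprocessing step: force every cover to a 512x512 square
# and stack the results into a single normalized NumPy array.
import os
import numpy as np
from PIL import Image

IMG_SIZE = 512

def load_album_covers(image_dir):
    images = []
    for fname in sorted(os.listdir(image_dir)):
        if not fname.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        img = Image.open(os.path.join(image_dir, fname)).convert('RGB')
        img = img.resize((IMG_SIZE, IMG_SIZE))  # square aspect ratio, like an album cover
        images.append(np.asarray(img, dtype=np.float32))
    # scale pixel values from [0, 255] to [-1, 1] to match a tanh generator output
    return (np.stack(images) - 127.5) / 127.5

dataset = load_album_covers('data/covers')
```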
The generator architecture uses transposed convolutional layers to upsample a noise vector into an image; we removed the dense layers to make the model fully convolutional, which suited our dataset better. The discriminator, in contrast, uses a sequence of convolutional layers followed by a final dense layer to predict whether an image is real or fake.
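Below is an illustrative Keras sketch of both networks. The filter counts and the 64 by 64 output size are assumptions chosen to keep the example short; the same pattern extends to larger covers by stacking more upsampling blocks.

```python
# Illustrative DC-GAN building blocks: a fully convolutional generator that
# upsamples a (1, 1, 100) noise vector, and a convolutional discriminator.
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(noise_shape=(1, 1, 100)):
    return tf.keras.Sequential([
        # upsample the 1x1x100 noise vector with transposed convolutions
        layers.Conv2DTranspose(512, 4, strides=1, padding='valid',
                               use_bias=False, input_shape=noise_shape),
        layers.BatchNormalization(),
        layers.ReLU(),                                                     # -> 4x4
        layers.Conv2DTranspose(256, 4, strides=2, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),                                                     # -> 8x8
        layers.Conv2DTranspose(128, 4, strides=2, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),                                                     # -> 16x16
        layers.Conv2DTranspose(64, 4, strides=2, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),                                                     # -> 32x32
        layers.Conv2DTranspose(3, 4, strides=2, padding='same',
                               activation='tanh'),                         # -> 64x64 RGB image
    ])

def build_discriminator(img_shape=(64, 64, 3)):
    return tf.keras.Sequential([
        # strided convolutions downsample the image...
        layers.Conv2D(64, 4, strides=2, padding='same', input_shape=img_shape),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 4, strides=2, padding='same'),
        layers.LeakyReLU(0.2),
        layers.Conv2D(256, 4, strides=2, padding='same'),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        # ...and a final dense layer predicts real (1) vs. fake (0)
        layers.Dense(1, activation='sigmoid'),
    ])
```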
To begin the training process, we sample a batch of images from the dataset and generate random noise, which the generator turns into "fake" images. Next, we train the discriminator on the generated (fake) images and the normalized (real) images along with their noisy labels, and finally we train the GAN on a random noise vector and its labels while keeping the discriminator untrainable. This entire process is repeated for num_steps iterations, which we set to around 10,000. A condensed sketch of this loop is given below.
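The sketch assumes the `dataset` array from the preprocessing step (resized to match the discriminator's input size) and the two builder functions from the architecture sketch above; the 0.9 / 0.1 label smoothing is one plausible form of the noisy labels, and saving sample images along the way is omitted.

```python
# Condensed sketch of the GAN training loop described above.
import numpy as np
import tensorflow as tf

BATCH_SIZE = 16
NOISE_SHAPE = (1, 1, 100)
NUM_STEPS = 10000

discriminator = build_discriminator()
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

# freeze the discriminator inside the combined model so that training the
# GAN only updates the generator (trainable is captured at compile time)
generator = build_generator(NOISE_SHAPE)
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer='adam', loss='binary_crossentropy')

for step in range(NUM_STEPS):
    # 1) sample a batch of real covers and generate a batch of fakes
    idx = np.random.randint(0, len(dataset), BATCH_SIZE)
    real_images = dataset[idx]
    noise = np.random.normal(0, 1, (BATCH_SIZE,) + NOISE_SHAPE)
    fake_images = generator.predict(noise, verbose=0)

    # 2) train the discriminator on real and fake batches with noisy labels
    discriminator.train_on_batch(real_images, np.full((BATCH_SIZE, 1), 0.9))
    discriminator.train_on_batch(fake_images, np.full((BATCH_SIZE, 1), 0.1))

    # 3) train the generator through the frozen discriminator, asking it to
    #    label freshly generated images as real
    noise = np.random.normal(0, 1, (BATCH_SIZE,) + NOISE_SHAPE)
    gan.train_on_batch(noise, np.ones((BATCH_SIZE, 1)))
```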
Results
After multiple training sessions with the GAN, we experimented with several factors to generate better, more recognizable results that represented the dataset more cohesively. For our first attempt on the popular-albums dataset, we used a batch size of 16, a noise shape of (1, 1, 100), and 10,000 training steps. The results were decent but mostly unrecognizable, or not specific enough to tell what they represent.