Music auto-tagger using Keras
- You need `keras` to run `example.py`.
- To use your own audio file, you need `librosa`.
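Both are on PyPI, so installing them is the usual (pinning versions from this era may matter; that choice is left to you):

```
$ pip install keras librosa
```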
- The input data shape is `(None, channel, height, width)`, i.e. it follows the Theano convention. If you're using TensorFlow as your backend, check `~/.keras/keras.json` to make sure `image_dim_ordering` is set to `th`, i.e. `"image_dim_ordering": "th",` as in the example below.
- 5-layer 2D Convolutions (see the sketch below)
  - num_parameter: 865,950
  - AUC score: 0.8654 (FYI: with 3M parameters, a deeper ConvNet showed 0.8595 AUC.)
- 4-layer 2D Convolutions + 2 GRUs
  - num_parameter: 396,786
  - AUC score: 0.8662
- Both are trained on 29.1-second music clips from the Million Song Dataset.
- Check out more details in this paper.
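To make the shapes concrete, here is a minimal Keras 1.x sketch in the spirit of the 5-layer ConvNet. The filter counts, pooling sizes, and the 96x1366 melgram input are illustrative assumptions (so the parameter count will not match 865,950); the real model is in `audio_convnet.py`:

```python
# Minimal sketch of a 5-layer 2D ConvNet tagger (Keras 1.x API, 'th' ordering).
# All concrete layer sizes are illustrative assumptions.
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='same', activation='relu',
                        input_shape=(1, 96, 1366)))   # (channel, mel bin, time frame)
model.add(MaxPooling2D(pool_size=(2, 4)))
for n_filters in [64, 64, 128, 128]:                  # conv layers 2-5
    model.add(Convolution2D(n_filters, 3, 3, border_mode='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 4)))
model.add(Flatten())
model.add(Dense(50, activation='sigmoid'))            # one probability per tag
model.compile(optimizer='adam', loss='binary_crossentropy')
print('parameters: %d' % model.count_params())
```

The output layer is a sigmoid rather than a softmax because tagging is multi-label: a track can be 'rock', '80s', and 'guitar' at the same time.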
- The tags are:

```python
['rock', 'pop', 'alternative', 'indie', 'electronic', 'female vocalists',
 'dance', '00s', 'alternative rock', 'jazz', 'beautiful', 'metal',
 'chillout', 'male vocalists', 'classic rock', 'soul', 'indie rock',
 'Mellow', 'electronica', '80s', 'folk', '90s', 'chill', 'instrumental',
 'punk', 'oldies', 'blues', 'hard rock', 'ambient', 'acoustic', 'experimental',
 'female vocalist', 'guitar', 'Hip-Hop', '70s', 'party', 'country', 'easy listening',
 'sexy', 'catchy', 'funk', 'electro', 'heavy metal', 'Progressive rock',
 '60s', 'rnb', 'indie pop', 'sad', 'House', 'happy']
```
- Training: the ConvNet is faster than the RecurrentNet (in wall-clock time).
- Prediction: the ConvNet is faster here, too (3 seconds vs. 8 seconds in the example output below).
- Memory usage: the RecurrentNet has fewer trainable parameters. You can even decrease the number of feature maps and it still works quite well; i.e., the current setting is a little bit rich (or redundant). With the ConvNet, you will see the performance drop if you reduce the number of parameters.

Therefore, if you just want to use the pre-trained weights, use the ConvNet. If you want to train it yourself, it's up to you; in general, I would use the RecurrentNet after downsizing it to roughly 0.2M parameters (then the training time would be similar to the ConvNet's). To reduce the size, change `nums_feat_maps` under `get_convBNeluMPdrop` in `recurrentnet.py`, roughly as in the sketch below.
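As a hedged illustration (the real function lives in `recurrentnet.py`, and the feature-map counts and pooling sizes below are assumptions), downsizing could look like this:

```python
# Sketch of a get_convBNeluMPdrop-style conv stack (Keras 1.x API, 'th' ordering),
# plus a parameter-count check after shrinking nums_feat_maps.
# Every concrete number here is an illustrative assumption.
from keras.models import Model
from keras.layers import Input, Convolution2D, MaxPooling2D, Dropout
from keras.layers.advanced_activations import ELU
from keras.layers.normalization import BatchNormalization

def get_convBNeluMPdrop(x, nums_feat_maps):
    for n_feat in nums_feat_maps:
        x = Convolution2D(n_feat, 3, 3, border_mode='same')(x)
        x = BatchNormalization(axis=1)(x)      # channel axis is 1 under 'th' ordering
        x = ELU()(x)
        x = MaxPooling2D(pool_size=(2, 4))(x)
        x = Dropout(0.25)(x)
    return x

x_in = Input(shape=(1, 96, 1366))                               # (channel, mel bin, frame)
x = get_convBNeluMPdrop(x_in, nums_feat_maps=[32, 64, 64, 64])  # smaller than default
print('parameters: %d' % Model(input=x_in, output=x).count_params())
```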
```
$ python example.py
```

Please take a look at the code; it's pretty simple. The printed results look like this:

```
$ python example.py
Running main() with network: cnn and backend: tensorflow
Loading weights of cnn...
Predicting...
Prediction is done. It took 3 seconds.
Printing top-15 tags for each track...
data/bensound-cute.mp3
[('folk', '0.222'), ('pop', '0.166'), ('jazz', '0.160'), ('female vocalists', '0.092'), ('acoustic', '0.075')]
[('rock', '0.070'), ('easy listening', '0.059'), ('indie', '0.055'), ('Mellow', '0.051'), ('beautiful', '0.036')]
[('alternative', '0.035'), ('soul', '0.034'), ('guitar', '0.033'), ('country', '0.032'), ('chillout', '0.027')]
data/bensound-actionable.mp3
[('rock', '0.592'), ('classic rock', '0.245'), ('pop', '0.119'), ('alternative', '0.109'), ('punk', '0.086')]
[('indie', '0.083'), ('80s', '0.076'), ('hard rock', '0.073'), ('female vocalists', '0.062'), ('indie rock', '0.051')]
[('alternative rock', '0.048'), ('blues', '0.047'), ('70s', '0.045'), ('90s', '0.039'), ('60s', '0.036')]
data/bensound-dubstep.mp3
[('electronic', '0.313'), ('Hip-Hop', '0.160'), ('electro', '0.116'), ('rock', '0.107'), ('pop', '0.085')]
[('dance', '0.078'), ('electronica', '0.077'), ('alternative', '0.064'), ('female vocalists', '0.047'), ('rnb', '0.047')]
[('indie', '0.035'), ('sexy', '0.031'), ('alternative rock', '0.031'), ('00s', '0.027'), ('hard rock', '0.024')]
data/bensound-thejazzpiano.mp3
[('jazz', '0.799'), ('instrumental', '0.420'), ('guitar', '0.042'), ('blues', '0.028'), ('rock', '0.023')]
[('Progressive rock', '0.021'), ('easy listening', '0.020'), ('experimental', '0.018'), ('oldies', '0.013'), ('chillout', '0.009')]
[('60s', '0.009'), ('alternative', '0.009'), ('folk', '0.009'), ('classic rock', '0.007'), ('indie', '0.007')]
Running main() with network: rnn and backend: tensorflow
Loading weights of rnn...
Predicting...
Prediction is done. It took 8 seconds.
Printing top-15 tags for each track...
data/bensound-cute.mp3
[('jazz', '0.167'), ('female vocalists', '0.165'), ('folk', '0.145'), ('pop', '0.117'), ('soul', '0.110')]
[('rock', '0.071'), ('acoustic', '0.057'), ('easy listening', '0.055'), ('country', '0.053'), ('oldies', '0.049')]
[('Mellow', '0.045'), ('blues', '0.045'), ('indie', '0.043'), ('beautiful', '0.032'), ('chillout', '0.031')]
data/bensound-actionable.mp3
[('rock', '0.480'), ('classic rock', '0.389'), ('hard rock', '0.216'), ('blues', '0.085'), ('70s', '0.074')]
[('80s', '0.071'), ('heavy metal', '0.053'), ('alternative', '0.040'), ('Progressive rock', '0.040'), ('60s', '0.032')]
[('alternative rock', '0.029'), ('punk', '0.025'), ('pop', '0.024'), ('guitar', '0.022'), ('90s', '0.017')]
data/bensound-dubstep.mp3
[('electronic', '0.513'), ('electro', '0.222'), ('dance', '0.166'), ('electronica', '0.134'), ('House', '0.098')]
[('indie', '0.087'), ('rock', '0.086'), ('pop', '0.055'), ('alternative', '0.054'), ('Hip-Hop', '0.044')]
[('experimental', '0.042'), ('indie rock', '0.033'), ('female vocalists', '0.024'), ('00s', '0.024'), ('party', '0.023')]
data/bensound-thejazzpiano.mp3
[('jazz', '0.915'), ('instrumental', '0.043'), ('female vocalists', '0.018'), ('guitar', '0.017'), ('easy listening', '0.014')]
[('blues', '0.013'), ('chillout', '0.008'), ('rock', '0.008'), ('Mellow', '0.007'), ('soul', '0.006')]
[('funk', '0.005'), ('chill', '0.005'), ('folk', '0.004'), ('pop', '0.004'), ('ambient', '0.004')]
```
- `example.py`: the example script
- `audio_convnet.py`: builds the ConvNet model
- `audio_conv_rnn.py`: builds the RecurrentNet model
- `audio_processor.py`: computes mel-spectrograms using librosa
- Under `data/`:
  - four `.mp3` files: test files
  - four `.npy` files: pre-computed melgrams for those who don't want to install librosa (see the sketch below)
  - `cnn_weights_tensorflow.h5`, `cnn_weights_theano.h5`: pre-trained ConvNet weights, so that you don't need to train it yourself
  - `rnn_weights_tensorflow.h5`, `rnn_weights_theano.h5`: the same, but for the conv+RNN model
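As a rough end-to-end sketch (not the actual `example.py`; `build_model` is a hypothetical stand-in for whichever constructor `audio_convnet.py` exposes, and the melgram parameters are assumptions):

```python
# Hedged sketch: tag one track either from audio (librosa) or from a
# pre-computed melgram in data/. Parameter values are assumptions chosen
# to yield a (1, 1, 96, 1366) melgram from ~29.1 s of audio.
import numpy as np
import librosa

# Option A: compute the melgram yourself (roughly what audio_processor.py does)
src, sr = librosa.load('data/bensound-cute.mp3', sr=12000, duration=29.12)
melgram = librosa.feature.melspectrogram(y=src, sr=sr, n_fft=512,
                                         hop_length=256, n_mels=96)
melgram = librosa.logamplitude(melgram ** 2)     # power_to_db in newer librosa
melgram = melgram[np.newaxis, np.newaxis, :, :]  # (1, 1, 96, 1366), 'th' ordering

# Option B: skip librosa and load a pre-computed melgram
melgram = np.load('data/bensound-cute.npy')

model = build_model()                                 # hypothetical constructor
model.load_weights('data/cnn_weights_tensorflow.h5')  # pick the file for your backend
preds = model.predict(melgram)[0]                     # 50 tag probabilities

tags = ['rock', 'pop', 'alternative', 'indie', 'electronic', 'female vocalists',
        'dance', '00s', 'alternative rock', 'jazz', 'beautiful', 'metal',
        'chillout', 'male vocalists', 'classic rock', 'soul', 'indie rock',
        'Mellow', 'electronica', '80s', 'folk', '90s', 'chill', 'instrumental',
        'punk', 'oldies', 'blues', 'hard rock', 'ambient', 'acoustic',
        'experimental', 'female vocalist', 'guitar', 'Hip-Hop', '70s', 'party',
        'country', 'easy listening', 'sexy', 'catchy', 'funk', 'electro',
        'heavy metal', 'Progressive rock', '60s', 'rnb', 'indie pop', 'sad',
        'House', 'happy']
for i in preds.argsort()[::-1][:5]:                   # print the top-5 tags
    print('%s: %.3f' % (tags[i], preds[i]))
```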
- More info - CNN:
  - See this paper or the blog post.
  - Also, please take a look at the slides from ISMIR 2016; they include some results that are not in the paper.
- More info - RNN:
  - Paper/slides coming soon.
- Please cite this paper: Keunwoo Choi, George Fazekas, and Mark Sandler, "Automatic Tagging using Deep Convolutional Neural Networks," 17th International Society for Music Information Retrieval Conference, New York, USA, 2016.
- Test music items are from http://www.bensound.com.