In this project, we try to train a neural network to discover the genre of a music track only by looking at its lyrics text.
Regarding the datasets used in this project, we found in two papers a link pointing to a compilation of
"55000+ song-lyrics in English from LyricsFreak", made by user Kuznetsov. Unfortunately, the songs in the dataset were
not labeled with its corresponding musical genres, so we had to fetch ourselves the labels from an external site,
namely the last.fm website, which provided us with the tags we were looking for.
Ultimately, we decided to discard too general genres like pop, rock etc., and focus only on genres that are a priori
much easier to distiguish, i.e. metal, hip-hop, country, religious and jazz. After this filtering process,
the original 55.000 dataset shrinked to less than 10.000. Because of this and because we thought a bigger dataset
not used in the above mentioned papers before would be ideal, we then searched for another corpus in kaggle and
found one named "Every song you have heard (almost). "Over 500,000 song lyric urls for over a million artists" of which
just the first part/half of 250.000 was considered, as it was very time-consuming to preprocess all of the data.
Once again the songs were not labelled, so this had to be done as well thanks to the api service of last.fm and after
filtering the data to obtain just the songs with the genres of interest for our use case, this dataset shrinked to
about 30.000 that in the end also showed better results in the project implementation.
Links to the two datasets:
- 55000+ Song Lyrics
- Every song you have heard (almost)! (this is the one we used in the end)
The preprocessing consists of several steps in order to convert the plaintext of the song-lyrics into a format that is processable by a neural network.
The .csv-file of the dataset contains around 250.000 songs. We only
want to work with english songs. Therefore, we detect the languages of the lyircs using the python package cld2
and
remove all non-english songs from the dataset. After filtering out genres which are not interesting for us,
the dataset has a size of 30.138 songs. The genres we are interested in are listed below with their respective
quantities of songs.
Genre | Number samples |
---|---|
Metal | 9.147 |
Country | 7.852 |
Jazz | 5.860 |
Hip-Hop | 5.420 |
Religious | 1.859 |
The filtered data set containing the genres can be found here.
In this step, all the lyrics texts are taken and the contractions are expanded.
The lyrics, which are still encoded as normal strings, are now tokenized ignoring the sentence- or newline-structure. The output of tokenization is a simple list of words.
In this step, the tokenized lyrics texts are taken, first POS tagged and then lemmatized. The POS tagging
converts the list of words from above to pairs of (word, POS tag)
. After the POS tagging, lemmatization is applied and
the text is transferred back to a normal list of words representation.
The last step is to filter out the stopwords from the lyrics texts as they don't contain any valuable information we need. Removing the stopwords makes the texts quite shorter than before
In the end, the preprocessed data set is written to a .npy-file, so that it can be easily used for later steps by simply loading it from the file, without having to preprocess the data again. The data set is written to the file as a numpy array, where the first column contains the genre and the second column contains the list of words representing the lyrics of the song.
In the feature extraction, the dataset from the preprocessing is read and numerical features are extracted.
A TF-IDF vectorizer object is created from the lyrics data as well as a word2vec model. Together, the two models are
used to generate a numerical vector representation of each lyric-text, using 75 dimensions.
Those vectors are again written to a .npy-file and can then be used by a neural network.
We tried two different approaches for training our neural network for lyrics-genre classification:
- use an LSTM network directly on the preprocessed text-data (implemented in
ClassificationLSTM.py
) - apply feature extraction to the preprocessed text-data and feed the numerical vectors into a fully connected
neural network (implemented in
Classification.py
)
After several tests and modifications, we came to the conclusion that the second mentioned approach yields a better accuracy and is also way faster to train.
The optimal parameters for the fully connected neural network were:
- Learning rate: 0.01
- Batch size: 256
- Number of steps: 20.000
- 2 layers of 32 and 64 neurons
The model achieved an average test-accuracy of 74% using 5-Fold cross-validation.
In the end, one should be able to make actual predictions with the created model. Therefore we stored the final model
in a file in order to allow to load it at a later point in time. I.e. the weights and biases are stored to later make
a prediction within the Prediction.py
file. In this file, a lyrics text can be defined. This text is then
preprocessed, converted to numerical vectors and can be fed into the neural network.
Due to time-reasons, we could not finish the very last step of loading the stored model and doing an actual prediction
for the defined lyrics-string.