# Computational Analysis of Sound and Music

# MIR 6 - Music Tagging

Dr.-Ing. Jakob Abeßer, jakob.abesser@idmt.fraunhofer.de

**Last update:** 18.05.2024

**Outline**

In this notebook, you will learn how to implement a simple **music tagging** and **music similarity** algorithm using **deep audio embeddings**.

## Preparation

In [None]:
!pip install wget

In [None]:
!pip install openl3

In [None]:
import glob
import os
import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import IPython.display as ipd
import wget
import seaborn as sns
import openl3
import zipfile
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, MaxPool2D, concatenate, UpSampling2D

## Dataset

In this notebook, we use a small subset of the **GTZAN dataset**, which was published in 2002 by George Tzanetakis and is one of the first genre classification datasets in the MIR field.
You can read more about it here:
  - https://music-classification.github.io/tutorial/part2_basics/dataset.html#gtzan-music-genre-2002
  - George Tzanetakis and Perry Cook. Musical genre classification of audio signals. Speech and Audio Processing, IEEE transactions on, 10(5):293–302, 2002.
  
The **genres_mini** dataset was created for this notebook. It includes
  - 10 music genres
  - 10 audio clips (2.5 s each) per genre, 100 clips in total


In [None]:
if not os.path.isfile('genres_mini.zip'):
    print('Please wait a couple of seconds ...')
    wget.download('https://github.com/machinelistening/machinelistening.github.io/blob/master/genres_mini.zip?raw=true', 
                      out='genres_mini.zip', bar=None)
    print('genres_mini downloaded successfully ...')
else:
    print('Files already exist!')
    
if not os.path.isdir('genres_mini'):
    print("Let's unzip the file ... ")
    assert os.path.isfile('genres_mini.zip')
    with zipfile.ZipFile('genres_mini.zip', 'r') as f:
        # Extract all contents into the specified directory
        f.extractall('.')
    assert os.path.isdir('genres_mini')
    print("All done :)")

dir_dataset = 'genres_mini'

Let's check our dataset:

In [None]:
# List files in the directory
files = os.listdir(dir_dataset)

# Print the list of files
print(files)

### Annotations

Let's load the **metadata.csv** file, which includes three columns:
- WAV file name (in the genres_mini dataset)
- Music genre
- Original WAV file name (from the GTZAN dataset)

In [None]:
df = pd.read_csv(os.path.join(dir_dataset, 'metadata.csv'), names=('fn_wav', 'genre', 'fn_wav_orig'))
df.head()

In [None]:
unique_genres = sorted(list(set(df['genre'])))
n_genres = len(unique_genres)
genre_to_id = {unique_genres[_]: _ for _ in range(n_genres)}
id_to_genre = {_: unique_genres[_] for _ in range(n_genres)}

print(unique_genres)
print(genre_to_id)
print(id_to_genre)

In [None]:
class_ids = [genre_to_id[_] for _ in df['genre']]
class_ids = np.array(class_ids)
print(class_ids)

Let's listen to some files

In [None]:
random_ids = [3,45,23,67,77]

for i in random_ids:
    fn_wav = os.path.join(dir_dataset, df["fn_wav"][i])
    genre = df["genre"][i]
    print(f"File: {fn_wav} - Music Genre: {genre}")
    x, fs = librosa.load(fn_wav, sr=44100)
    ipd.display(ipd.Audio(data=x, rate=fs))

## Feature Extraction

In this notebook, we want to compare two audio feature representations:

- the **Mel-Frequency Cepstral Coefficients (MFCCs)** as an example of a traditional audio feature, which characterizes mainly the spectral envelope and therefore the timbral properties of an audio recording

- the **OpenL3 deep audio embeddings** - these are extracted using a pre-trained DNN model, which has been trained in a self-supervised manner by solving an audio-video correspondence task.
  - you can find a tutorial on how to use the **openl3** Python package here: https://openl3.readthedocs.io/en/latest/tutorial.html
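
Both feature types are computed frame by frame, so each clip first yields a matrix of shape (number of features, number of frames). To obtain one fixed-size vector per clip, this notebook averages over the time axis. Here is a minimal sketch of that pooling step, using random numbers as a stand-in for real MFCC frames (the frame counts are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frame-wise features of two clips with different durations:
# rows are feature dimensions, columns are time frames (values are random).
frames_short = rng.normal(size=(40, 108))
frames_long = rng.normal(size=(40, 431))

# Averaging over the time axis (axis=1) collapses each matrix to one
# 40-dimensional vector, independent of the number of frames.
vec_short = frames_short.mean(axis=1)
vec_long = frames_long.mean(axis=1)

print(vec_short.shape, vec_long.shape)
```

The price of this simplicity is that all temporal structure within a clip is discarded.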

In [None]:
model = openl3.models.load_audio_embedding_model(input_repr="mel256", 
                                                 content_type="music",
                                                 embedding_size=512)

def compute_mfcc(fn_wav):
    """ Extract time-averaged MFCC features
    Args:
        fn_wav (str): WAV file name
    Returns:
        mfcc_mean (np.ndarray): 40-dimensional time-averaged MFCC vector
    """
    y, sr = librosa.load(fn_wav)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    # the shape of the MFCC matrix is (40, number_of_frames), we average this over time
    return np.mean(mfcc, axis=1)

def compute_openl3(fn_wav):
    """ Extract OpenL3 embeddings
    Args:
        fn_wav (str): WAV file name
    Returns:
        emb (np.ndarray): 512-dimensional embedding vector
    """
    y, sr = librosa.load(fn_wav)
    emb, _ = openl3.get_audio_embedding(y, sr, model=model, hop_size=0.5)
    emb = emb.T  # transpose to shape (512, number_of_frames)
    return np.mean(emb, axis=1)

Extract the time-averaged MFCC vector and the time-averaged OpenL3 embedding as feature representations for each audio clip (**this takes some seconds**).

In [None]:
n_files = df.shape[0]
mfcc = np.zeros((n_files, 40))
emb = np.zeros((n_files, 512))

# iterate over files and extract feature representations
for n in range(n_files):
    if n % 10 == 0:
        print(f"{n+1}/{n_files}")
    fn_wav = os.path.join(dir_dataset, df["fn_wav"][n])
    mfcc[n, :] = compute_mfcc(fn_wav)
    emb[n, :] = compute_openl3(fn_wav)

print("Feature extraction finished")  

## Neural Network Architecture

We'll create a simple Multilayer Perceptron (MLP) for genre classification and compare both feature representations

In [None]:
def create_model(n_in, n_classes=10):
    """ Simple MLP model """
    inp = tf.keras.layers.Input(shape=(n_in,))
    x = tf.keras.layers.Dense(128, activation="relu")(inp)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs=inp, outputs=out)
    return model
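
To get a feel for the model size, the number of trainable parameters can be worked out by hand. This is a small sketch, assuming the layer sizes from `create_model` above; `count_mlp_params` is a helper introduced here for illustration, and Dropout layers add no parameters:

```python
def count_mlp_params(n_in, hidden=(128, 32), n_classes=10):
    """Count weights and biases of a fully connected network."""
    sizes = [n_in, *hidden, n_classes]
    # each Dense layer has (inputs * outputs) weights plus one bias per output
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

print(count_mlp_params(40))   # MFCC input
print(count_mlp_params(512))  # OpenL3 input
```

With 512 inputs, the first layer alone accounts for 65,664 of the 70,122 parameters, so the OpenL3-based classifier is considerably larger than the MFCC-based one (9,706 parameters).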

## Music Genre Classification

We want to evaluate our two audio feature representations (MFCC, OpenL3) for the **music genre classification** task.

**Normalization**: We standardize both the training and test set using the **StandardScaler()** class from scikit-learn.

**Train-Test-Split**: We split our dataset by using the first 7 files per genre as **training set** and the remaining 3 files as **test set**.
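
A common refinement, not used in the cell below, is to estimate the standardization statistics on the training clips only and reuse them for the test clips, so that no test-set information enters the preprocessing. Here is a minimal NumPy sketch with random stand-in features (scikit-learn's `StandardScaler` supports the same pattern via `fit` on the training data and `transform` on both subsets):

```python
import numpy as np

rng = np.random.default_rng(0)
# random stand-in for a (100 clips x 40 features) matrix
features = rng.normal(loc=3.0, scale=2.0, size=(100, 40))

# first 7 of every 10 consecutive rows form the training set
file_num = np.tile(np.arange(10), 10)
is_train = file_num < 7

# estimate mean and standard deviation on the training rows only
mu = features[is_train].mean(axis=0)
sigma = features[is_train].std(axis=0)

# apply the same statistics to both subsets
train_std = (features[is_train] - mu) / sigma
test_std = (features[~is_train] - mu) / sigma

print(train_std.shape, test_std.shape)
```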

In [None]:
# feature standardization
mfcc = StandardScaler().fit_transform(mfcc)
emb = StandardScaler().fit_transform(emb)

# file indices within each genre
file_num = np.concatenate([np.arange(10) for _ in range(10)])

# training & test set indices
is_train = file_num < 7
is_test = file_num >= 7

mfcc_train = mfcc[is_train, :]
mfcc_test = mfcc[is_test, :]
emb_train = emb[is_train, :]
emb_test = emb[is_test, :]

# class IDs for both subsets
class_ids_train = class_ids[is_train]
class_ids_test = class_ids[is_test]

# one-hot-encoded training targets
target_train = to_categorical(class_ids_train, num_classes=10)


In [None]:
print("Evaluate MFCC")
# use separate names for the classifiers so that the OpenL3 `model`
# used inside compute_openl3() is not overwritten
clf_mfcc = create_model(n_in=mfcc_train.shape[1])
clf_mfcc.compile(loss='categorical_crossentropy',
                 optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                 metrics=['accuracy'])
hist_mfcc = clf_mfcc.fit(mfcc_train, target_train, batch_size=4, epochs=300, verbose=0)
target_pred_mfcc = clf_mfcc.predict(mfcc_test)
class_id_pred_mfcc = np.argmax(target_pred_mfcc, axis=1)
acc_mfcc = accuracy_score(class_ids_test, class_id_pred_mfcc)
cm_mfcc = confusion_matrix(class_ids_test, class_id_pred_mfcc)

print("Evaluate OpenL3")
clf_emb = create_model(n_in=emb_train.shape[1])
clf_emb.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                metrics=['accuracy'])
hist_emb = clf_emb.fit(emb_train, target_train, batch_size=4, epochs=300, verbose=0)
target_pred_emb = clf_emb.predict(emb_test)
class_id_pred_emb = np.argmax(target_pred_emb, axis=1)
acc_emb = accuracy_score(class_ids_test, class_id_pred_emb)
cm_emb = confusion_matrix(class_ids_test, class_id_pred_emb)
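
The step from softmax outputs to an accuracy value can be traced on a tiny example. The probabilities and labels below are made up for illustration:

```python
import numpy as np

# hypothetical softmax outputs for 4 clips and 3 classes (each row sums to 1)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
])
true_ids = np.array([0, 2, 0, 2])

# predicted class = index of the largest probability in each row
pred_ids = np.argmax(probs, axis=1)

# accuracy = fraction of clips whose predicted class matches the annotation
accuracy = np.mean(pred_ids == true_ids)

print(pred_ids, accuracy)
```

Note that with only 30 test clips, a single misclassified clip changes the accuracy by about 3.3 percentage points, so small differences between the two feature sets should be read with care.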


In [None]:
# plot training curves
pl.figure(figsize=(5,3))
pl.plot(hist_mfcc.history['accuracy'], label='MFCC')
pl.plot(hist_emb.history['accuracy'], label='OpenL3')
pl.legend()
pl.xlabel('Epoch')
pl.ylabel('Accuracy')
pl.tight_layout()
pl.show()

In [None]:
def plot_confusion_matrix(cm, class_labels):

    # Normalize each row (true class) so that it sums to 1
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    # Plotting
    sns.heatmap(cm_normalized, annot=True, ax=pl.gca(), cmap='Blues', cbar=False)

    # labels, title and ticks
    pl.gca().set_xlabel('Predicted')
    pl.gca().set_ylabel('True')
    pl.gca().xaxis.set_ticklabels(class_labels)
    pl.gca().yaxis.set_ticklabels(class_labels)
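
The normalization in `plot_confusion_matrix` divides each row by the number of test clips of that true class, so every row sums to 1 and the diagonal shows the per-class recall. Here is a small sketch with a made-up 3-class matrix:

```python
import numpy as np

# hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [3, 0, 0],
    [1, 2, 0],
    [0, 1, 2],
])

# divide every row by its sum; keepdims=True keeps the (3, 1) shape
# that is needed for broadcasting against the (3, 3) matrix
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

print(cm_normalized)
```

In this notebook every genre has three test clips, so all entries of the normalized matrix are multiples of 1/3.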


In [None]:
pl.figure(figsize=(12,8))
pl.subplot(1,2,1)
plot_confusion_matrix(cm_mfcc, unique_genres)
pl.title(f'MFCC (Accuracy = {acc_mfcc:.2f})')
pl.subplot(1,2,2)
plot_confusion_matrix(cm_emb, unique_genres)
pl.title(f'OpenL3 (Accuracy = {acc_emb:.2f})')
pl.tight_layout()
pl.show()


## Observation

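The classifier trained on OpenL3 embeddings outperforms the one trained on MFCCs. Both MLPs are trained **from scratch** on the same small training set, but the OpenL3 embedding model was pre-trained on a larger dataset, and the classifier benefits from the knowledge encoded in these embeddings.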

# Music Similarity

Let's try both feature representations to implement a **music recommendation** algorithm. 

The idea is as follows: given a **query song** (a randomly chosen clip from our dataset), we compute the **Euclidean distance** in the feature space (either MFCC or OpenL3) between the query song and all other files in the dataset.
Then, we look for the *N* **closest songs in the feature space**, which become the **recommended songs** most similar to the **query song**.
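
The retrieval step can be sketched as a small function. `recommend` is a helper name introduced here for illustration and is not part of the notebook; the toy points stand in for real feature vectors:

```python
import numpy as np

def recommend(features, query_id, n=7):
    """Return indices and distances of the n clips closest to the query."""
    # Euclidean distance between the query vector and every row
    dist = np.linalg.norm(features - features[query_id], axis=1)
    # sort by increasing distance and drop the query itself
    order = np.argsort(dist)
    order = order[order != query_id][:n]
    return order, dist[order]

# toy feature space: four 2-dimensional points
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
idx, dist = recommend(features, query_id=0, n=2)
print(idx, dist)
```

Removing the query by index, rather than skipping the first sorted entry, stays correct even if another clip happens to have distance zero to the query.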

In [None]:
# query song
random_id = 66
fn_wav = os.path.join(dir_dataset, df["fn_wav"][random_id])
genre = df["genre"][random_id]
print(f"File: {fn_wav} - Music Genre: {genre}")
x, fs = librosa.load(fn_wav, sr=44100)
ipd.display(ipd.Audio(data=x, rate=fs))

Let's compute the **Euclidean distance** between all songs and the query song...

In [None]:
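# Note: despite its name, dist_mfcc holds distances in the OpenL3 embedding space here;
# replace emb with mfcc to search in the MFCC feature space instead
dist_mfcc = np.sqrt(np.sum((emb - emb[random_id])**2, axis=1))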

... and sort the songs by distance (closer songs are presumably more similar).

In [None]:
idx = np.argsort(dist_mfcc)

## Music recommendation scenario

Let's take an arbitrary song and show the seven closest songs in our dataset (this would be our recommendation result).

You can **evaluate the results** with **two strategies**:
1. Listen and see if you think they are similar to the query (**subjective evaluation**)
2. Observe whether they come from the same music genre as the query (somewhat more **objective evaluation**), assuming that songs from the same genre are more similar to each other than songs from different genres...
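
The second strategy can be turned into a number: the share of recommended clips whose genre label matches that of the query. This is a minimal sketch, where `genre_precision` and the toy labels are illustrative assumptions:

```python
import numpy as np

def genre_precision(genres, retrieved_ids, query_id):
    """Fraction of retrieved clips that share the query's genre."""
    genres = np.asarray(genres)
    return float(np.mean(genres[retrieved_ids] == genres[query_id]))

# made-up annotations and a made-up retrieval result for query clip 0
genres = ["metal", "metal", "rock", "metal", "jazz", "metal"]
retrieved_ids = [1, 2, 3, 4]
print(genre_precision(genres, retrieved_ids, query_id=0))
```

In the notebook, the same function could be called with `df["genre"].values` and the first entries of the sorted index array (excluding the query).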

In [None]:
query_id = 66
fn_wav = os.path.join(dir_dataset, df["fn_wav"][query_id])
genre = df["genre"][query_id]
print(f"QUERY File: {fn_wav} - Music Genre: {genre}")

x, fs = librosa.load(fn_wav, sr=44100)
ipd.display(ipd.Audio(data=x, rate=fs))

for i in range(7):
    
    curr_idx = idx[i+1]  # idx[0] is the query itself (distance 0), so we skip it
    fn_wav = os.path.join(dir_dataset, df["fn_wav"][curr_idx])
    genre = df["genre"][curr_idx]
    print(f"File: {fn_wav} - Music Genre: {genre} - Distance {dist_mfcc[idx[i+1]]}")
    x, fs = librosa.load(fn_wav, sr=44100)
    ipd.display(ipd.Audio(data=x, rate=fs))
    

### Observation

- Most of the retrieved songs also come from the Metal genre and show clear similarities to the query file in terms of instrumentation and rhythm :)