Before you start¶

  • Go to "File" --> "Save a copy in Drive"
  • Open that copy (might open automatically)
  • Then continue below



Machine Listening Seminar 3: Sound event classification¶

What we are going to do:

  • Implement a full processing pipeline for sound event classification
  • Use a traditional classification technique (and compare it with deep-learning-based methods later).

1. Import libraries¶

  • We need a number of libraries. Import them once to use throughout the document.
In [ ]:
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import IPython

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from tqdm import tqdm

2. Fetch the Dataset¶

  • ESC-50: a dataset for Environmental Sound Classification (https://github.com/karolpiczak/ESC-50, https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf)
  • 50 classes, 40 files per class, 5 s clips
  • Download and unzip the dataset by running the cell below. This will take a minute. You will see the new files on the left (folder icon).
In [ ]:
!wget https://github.com/karolpiczak/ESC-50/archive/master.zip
!unzip master.zip

3. Metadata and analysis I¶

*Tasks:*

  • Use pandas to read the csv file in ESC-50-master/meta/
  • Print the first rows of the csv. Pandas has a standard function for this.
  • Print the list of unique class labels in the dataset, and check whether there really are 50 of them
In [ ]:
fn_csv = 'ESC-50-master/meta/esc50.csv'


### START CODING
df = pd. ...  # pd = pandas module, df = resulting DataFrame
...

unique_classes = df...  # you may use unique()
print(f'...')
### END CODING

Expected output:

    filename            fold  target  category        esc10   src_file  take
0   1-100032-A-0.wav    1     0       dog             True    100032    A
1   1-100038-A-14.wav   1     14      chirping_birds  False   100038    A
2   1-100210-A-36.wav   1     36      vacuum_cleaner  False   100210    A
3   1-100210-B-36.wav   1     36      vacuum_cleaner  False   100210    B
4   1-101296-A-19.wav   1     19      thunderstorm    False   101296    A
Unique classes: ['dog' 'chirping_birds' ... ]
Count: 50
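
As a hint for the pandas calls involved, here is a minimal sketch on a tiny hand-made table instead of the real esc50.csv. The rows and the names `toy_df` and `toy_classes` are invented for illustration; only the column names mirror the metadata file.

```python
import pandas as pd

# Hypothetical stand-in for the metadata table (not the real ESC-50 csv)
toy_df = pd.DataFrame({
    'filename': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'category': ['dog', 'rain', 'dog', 'rooster'],
})

print(toy_df.head(2))                      # first two rows of the table
toy_classes = toy_df['category'].unique()  # unique labels, in order of first appearance
print(f'Unique classes: {toy_classes}')
print(f'Count: {len(toy_classes)}')
```

The same calls should carry over to the DataFrame read from the csv file, where the count is expected to be 50.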

4. Metadata and analysis II¶

  • View and listen to some examples in the dataset to get a "feeling" for the sound classes.
In [ ]:
# Set up some file paths
path = 'ESC-50-master/audio/'
file0 = path + df['filename'][2]  # We use indices [2-4] here, feel free to choose other files
file1 = path + df['filename'][3]
file2 = path + df['filename'][4]

# Show audio player for each file
print(df['category'][2])
IPython.display.display(IPython.display.Audio(data=file0))
print(df['category'][3])
IPython.display.display(IPython.display.Audio(data=file1))
print(df['category'][4])
IPython.display.display(IPython.display.Audio(data=file2))

# Plot log-magnitude STFT spectrograms (linear frequency axis, not mel-scaled)
files = [file0, file1, file2]
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 6))
for i in range(3):
  y, sr = librosa.load(files[i])
  D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
  img = librosa.display.specshow(D, y_axis='linear', x_axis='time', sr=sr, ax=ax[i])
  ax[i].set(title='Power spec')
  ax[i].label_outer()
fig.colorbar(img, ax=ax, format="%+2.f dB")
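
The dB scaling used in the plotting cell can be reproduced by hand. The sketch below mirrors what `librosa.amplitude_to_db(..., ref=np.max)` computes, assuming librosa's default settings (an amplitude floor of 1e-5 and an 80 dB dynamic range). `to_db_sketch` and `toy_spec` are hypothetical names, not part of librosa.

```python
import numpy as np

def to_db_sketch(magnitude, amin=1e-5, top_db=80.0):
    """Convert a magnitude spectrogram to dB relative to its own maximum."""
    magnitude = np.abs(magnitude)
    ref = max(magnitude.max(), amin)                 # reference = loudest bin
    db = 20.0 * np.log10(np.maximum(magnitude, amin)) - 20.0 * np.log10(ref)
    return np.maximum(db, db.max() - top_db)         # clip to the top_db range

# Toy "spectrogram": each value is 10x smaller than the previous one, plus a zero
toy_spec = np.array([[1.0, 0.1], [0.01, 0.0]])
toy_db = to_db_sketch(toy_spec)
print(toy_db)  # 0, -20, -40 dB, and the zero bin clipped to -80 dB
```

With `ref=np.max`, the loudest bin always sits at 0 dB, which is why the colorbar in the plot only shows negative values.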

5. ESC-5: Curation¶

Let's select 5 classes (our_classes) from ESC-50 to make things a bit faster.

*Tasks:*

  • Collect all files that belong to our_classes.
  • Put the files and their respective classes in separate lists. Make sure their indices are equal (meaning: the value at index 3 of list A is related to the value at index 3 of list B).
    • Idea 1: Use df.values to iterate over the rows of the csv
    • Idea 2: Use df.query('category in @our_classes')
  • Print the first 5 elements of each list as (file, class)-tuples. Also, print the overall lengths of the lists.
In [ ]:
our_classes = ['crying_baby', 'dog', 'rain', 'rooster', 'sneezing']  # Note: This is also a class map for later.
esc5_X = []  # File list
esc5_y = []  # Class list
fn_csv = 'ESC-50-master/meta/esc50.csv'


### START CODING ###
df = ...

for ... in df.values:  # One simple (not ideal) way: loop over all rows and keep the files whose class is in our_classes.
  if ... :
    esc5_X.append( ... )  # filename column of df
    ...                   # class column of df

print( ...(esc5_X[:5], esc5_y[:5])... )
print(f'Lengths: ...')
### END CODING ###

Expected output:

[('1-100032-A-0.wav', 'dog'), ('1-110389-A-0.wav', 'dog'), ('1-17367-A-10.wav', 'rain'), ('1-187207-A-20.wav', 'crying_baby'), ('1-211527-A-20.wav', 'crying_baby')]
Lengths: esc5_X: 200, esc5_y: 200
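
Idea 2 from the task list can be tried out on a toy table first. This is a minimal sketch, not the reference solution: `select_rows`, `toy_df`, and `wanted` are hypothetical names, and the rows are invented.

```python
import pandas as pd

def select_rows(frame, wanted):
    """Keep only the rows whose category is in the list `wanted`."""
    # '@wanted' refers to the local Python variable of that name
    return frame.query('category in @wanted')

toy_df = pd.DataFrame({
    'filename': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'category': ['dog', 'cat', 'rain', 'dog'],
})

subset = select_rows(toy_df, ['dog', 'rain'])
toy_files = subset['filename'].tolist()
toy_labels = subset['category'].tolist()  # same row order, so indices stay aligned

print(list(zip(toy_files, toy_labels)))
print(f'Lengths: {len(toy_files)}, {len(toy_labels)}')
```

Because both lists are taken from the same filtered DataFrame, the value at a given index of one list belongs to the value at the same index of the other.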

6. ESC-5: Dataset splitting¶

*Tasks:*

  • Split the dataset into train and test subsets with an 80%/20% ratio and random state 1337.
    • Use a suitable, straightforward function from sklearn.
  • Print the first 3 elements of the resulting X_train.
  • Print the overall lengths of the resulting lists. Are they aligned with the ratio?

Result: ESC-5 is ready. We have a train and test set consisting of file lists and their respective classes.

In [ ]:
### START CODING HERE ###
X_train, X_test, ... = ...(..., random_state=1337)

print(...)
print(f'X: ...; y: ...')
### END CODING ###

Expected output:

['5-203128-A-0.wav', '4-181286-A-10.wav', '3-157615-A-10.wav']
X: 160, 40; y: 160, 40
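
To see how such a split keeps files and labels aligned, here is a small sketch on made-up lists. It assumes `train_test_split` is the function meant by the hint. The toy file names and labels are invented.

```python
from sklearn.model_selection import train_test_split

# Ten pretend files with alternating labels
toy_files = [f'clip_{i}.wav' for i in range(10)]
toy_labels = ['dog', 'rain'] * 5

# test_size=0.2 gives an 80%/20% split; random_state makes it reproducible
f_train, f_test, l_train, l_test = train_test_split(
    toy_files, toy_labels, test_size=0.2, random_state=1337)

print(f'X: {len(f_train)}, {len(f_test)}; y: {len(l_train)}, {len(l_test)}')
```

Both lists are shuffled with the same permutation, so each file still sits at the same index as its label after the split.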

7. ESC-5: Create mel spectrograms¶

We need to compute features and corresponding labels for each file in our ESC-5.

*Tasks:*

  • Define a function that does the following (in this order!):
    • takes input parameters: X_train (list of filenames), y_train (list of classes)
    • loops over X_train (hint: enumerate it), and loads each file (.wav) using librosa
    • creates the mel spectrogram from the loaded .wav samples
    • normalizes the mel spec by dividing it by the number of mel bands (mel_bands).
    • transposes the mel spec
    • appends the features (mel spec) to a feature tensor
    • creates a target vector consisting of as many values as there are frames
      • hint: use .shape to see which value you need
    • each value inside the vector must correspond to the index of the class in our_classes
      • hint: remember numpy.ones(...) ?
      • hint: use .index(...) here. Not ideal, but works here.
    • appends the targets to a target tensor
    • stacks the large feature and target lists appropriately
    • returns the tensors
  • Finally, print the shapes of all 4 arrays.
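
The target-building and stacking steps above can be sketched without any audio. In this illustration random arrays stand in for mel spectrograms, and all names (`toy_classes`, `toy_labels`, `X_toy`, `y_toy`) are hypothetical.

```python
import numpy as np

toy_classes = ['dog', 'rain', 'rooster']  # hypothetical class map
toy_labels = ['rain', 'dog']              # one label per pretend file
rng = np.random.default_rng(0)

feats, targs = [], []
for label in toy_labels:
    spec = rng.random((4, 6))             # stand-in for a (n_mels, frames) mel spec
    spec = spec.T                         # transpose to (frames, n_mels)
    feats.append(spec)
    # One target per frame, set to the index of the class in the class map
    frame_targets = np.ones(spec.shape[0]) * toy_classes.index(label)
    targs.append(frame_targets)

X_toy = np.vstack(feats)                  # stack frames of all files row-wise
y_toy = np.hstack(targs)                  # concatenate all target vectors
print(X_toy.shape, y_toy.shape)           # (12, 4) (12,)
print(y_toy)
```

Every frame becomes one training sample, so a single file contributes as many rows to the feature array as it has frames.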
In [ ]:
### START CODING HERE ###
def extract_mel_spec(...):
  X = []  # feature tensor
  y = []  # target tensor

  mel_bands = 128
  for ..., ... in tqdm(enumerate(...)):  # tqdm simply displays a progress bar.
    wav_data, sr = librosa...  # Load the audio and keep the returned sample rate sr. (The expected shapes below match librosa's default of 22050 Hz.)

    # Features (2D)
    mel_spec = librosa...  # Create mel spectrogram. Output shape: (128, 216) (n_mels, frames)
    mel_spec = mel_spec/... # Normalization
    mel_spec = mel_spec...  # Transposition. Output shape: (216, 128)
    X.append( ... )  # Append to feature tensor

    # Targets == class_name
    targets = np.ones( mel_spec... )  # Create a PLACEHOLDER target vector. Output shape: (216,) (Note: silent frames are not going to be labeled as "silent")
    targets = targets * our_classes...  # Convert values to actual class-index from 'our_classes'
    ...  # Append to target tensor

  # Stack tensors
  X = np.vstack(X)
  y = np.hstack(y)

  # Return the tensors
  return ...


# Call the function on our previously generated lists
X_train_ready, y_train_ready = extract_mel_spec(...)
X_test_ready, y_test_ready = ...
### END CODING HERE ###


print(f'\nShapes: X_train_ready: {X_train_ready.shape}, y_train_ready: {y_train_ready.shape}')
print(f'Shapes: X_test_ready: {X_test_ready.shape}, y_test_ready: {y_test_ready.shape}')

Expected output:

Shapes: X_train_ready: (34560, 128), y_train_ready: (34560,)
Shapes: X_test_ready: (8640, 128), y_test_ready: (8640,)
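
The expected shapes can be checked by hand. The arithmetic below assumes librosa's defaults: resampling to 22050 Hz on load, a hop length of 512 samples, and centered frames. Loading the clips at their native 44.1 kHz instead would give a different frame count (431 per clip under the same assumptions).

```python
# Assumed librosa defaults (not read from the notebook itself)
sr = 22050          # default sample rate of librosa.load
hop_length = 512    # default hop length of the mel spectrogram
clip_seconds = 5    # ESC-50 clip length

n_samples = sr * clip_seconds
frames_per_clip = 1 + n_samples // hop_length   # centered framing
print(frames_per_clip)                          # 216

n_train_files, n_test_files = 160, 40
print(n_train_files * frames_per_clip)          # 34560
print(n_test_files * frames_per_clip)           # 8640
```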

8. ESC-5: Train a nearest neighbor classifier¶

*Tasks:*

  • Standardize the features from step 7 using sklearn.
  • Use the features and targets to train (fit) a kNN-classifier from sklearn, with 5 neighbors and uniform weighting.
  • Print the scores on the train set and test set, rounded to 4 decimals. (This will take some time!)
In [ ]:
### START CODING HERE ###
# Feature scaling / Data standardization / Normalization
print('Scaling...')
scaler = StandardScaler()
scaler.fit(X_train_ready)
X_train_ready = scaler.transform(...)  # Standardize the features here for both train and test, using the scaler fitted on the train set
X_test_ready = ...

print('Fitting...')
model = ...(n_neighbors=..., weights=...)  # Call the kNN-classifier. Look at your imports again for a hint.
model...  # Fit/Train the classifier using our generated tensors.

print('Evaluating...')
print(f'Train score: {np.round(model.score(...), decimals=...)}')
print(f'Test score: ...')
### END CODING HERE ###

Expected output (might differ slightly):

Train score: 0.7624
Test score: 0.5812
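
The scale-fit-score pattern can be tried on synthetic data before running it on the full feature arrays. This is a sketch under invented data: two well-separated Gaussian clusters replace the mel features, and all `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic clusters (class 0 around 0, class 1 around 5) in 3 dimensions
toy_X_train = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
toy_y_train = np.hstack([np.zeros(50), np.ones(50)])
toy_X_test = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(5, 1, (10, 3))])
toy_y_test = np.hstack([np.zeros(10), np.ones(10)])

# Fit the scaler on the train set only, then apply it to both sets
toy_scaler = StandardScaler().fit(toy_X_train)
toy_X_train_s = toy_scaler.transform(toy_X_train)
toy_X_test_s = toy_scaler.transform(toy_X_test)

toy_model = KNeighborsClassifier(n_neighbors=5, weights='uniform')
toy_model.fit(toy_X_train_s, toy_y_train)

print(f'Train score: {np.round(toy_model.score(toy_X_train_s, toy_y_train), decimals=4)}')
print(f'Test score: {np.round(toy_model.score(toy_X_test_s, toy_y_test), decimals=4)}')
```

Fitting the scaler on the train set alone keeps test-set statistics out of the model. For kNN the train score is usually higher than the test score, because each training sample counts among its own nearest neighbors.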

9. ESC-5: Plot the confusion matrix¶

We want to learn more about our classifier. How well does it perform per class?

*Tasks:*

  • Using scikit-learn, create a confusion matrix of our classifier over the test set
  • Normalize the rows, use 'our_classes' as axes tick values
  • Display the plot
In [ ]:
### START CODING HERE ###
...(model, ..., ..., normalize=...)  # Call the confusion matrix plot function. Look at your imports again for a hint.
plt.xticks(ticks=np.arange(...), labels=...)  # hint: np.arange(5) gives [0, 1, 2, 3, 4]
plt.yticks(...)
plt...
### END CODING HERE ###

Expected output:

  • a coloured confusion matrix
  • each row should add up to 1
  • labels from our_classes on x-axis and y-axis
  • similar to this: [image.png: example confusion matrix plot]
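
To see what row normalization does to the numbers behind such a plot, here is a small sketch using `confusion_matrix` directly on made-up labels. The arrays `toy_true` and `toy_pred` are invented and unrelated to the classifier above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

toy_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])  # ground-truth class indices
toy_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])  # predicted class indices

# normalize='true' divides each row by the number of true samples in that class
cm = confusion_matrix(toy_true, toy_pred, normalize='true')
print(cm.round(2))
print(cm.sum(axis=1))  # every row sums to 1
```

Each row then reads as "of all samples that truly belong to this class, which fraction was predicted as each class", so the diagonal holds the per-class recall.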