Machine Hearing Research
From the research group of Professor I. V. McLoughlin
(with help from students Zhang Haomin and Xie Zhipeng)


This page contains full MATLAB code, along with details of the sound and background noise databases needed to reproduce the isolated sound event classifiers using DNN and CNN. Please cite the following paper if you use the code or find this page useful. Note: code for the continuous recognition experiments will be added here later.

McLoughlin I, Zhang H.-M., Xie Z.-P., Song Y., Xiao W., “Robust Sound Event Classification using Deep Neural Networks”, IEEE Trans. Audio, Speech and Language Processing, Jan 2015



1. Setup - obtain the required data and software

1.1

The test sounds and noises used in all of the published evaluations were chosen to exactly follow the experimental conditions of Jonathan Dennis (click here to view his PhD thesis).
Therefore, to perform comparable experiments it is necessary to obtain the same datasets:

  • Noise database: NOISEX-92 (specifically the following four files “Speech Babble”, “Destroyer Control Room”, “Factory Floor 1” and “Jet Cockpit 1”)

  • Sound database: Real World Computing Partnership (RWCP) Sound Scene Database (SSD) in Real Acoustical Environments, which is kindly made available by free mail order from http://research.nii.ac.jp/src/en/RWCP-SSD.html

1.2

Dennis chose 50 classes of sounds from RWCP, and used 80 files from each class: 50 for training and the remaining 30 for testing. Most of the experiments use mismatched conditions (i.e. training only with clean sounds, but testing with noise-corrupted sounds), while some use multi-condition training (i.e. training with both clean and noisy sounds).

First we create the clean sounds database:

  • data_wav – contains 50 subdirectories (one for each class, with a class name as label), each subdirectory contains 80 files (with numerical names)

We then create noise-corrupted versions, in three further subdirectories (one for each SNR level). To create these, the corrupting noise type is chosen at random (from the four NOISEX-92 files) and a random starting point within the noise recording is selected (so that the mix does not always start from the beginning of the noise file). Noise is added at 0, 10 and 20 dB SNR.

  • data_wav_mix0 – 50 subdirectories x 80 files with random NOISEX-92 noise added at 0dB SNR

  • data_wav_mix10 – 50 subdirectories x 80 files with random NOISEX-92 noise added at 10dB SNR

  • data_wav_mix20 – 50 subdirectories x 80 files with random NOISEX-92 noise added at 20dB SNR

These sound directories are then used for all subsequent training and testing. In practice, the code below uses five out of every eight files for training and the remaining three for testing. The sound database can easily be recreated with a different train/test split to randomise the conditions.
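The noise-mixing step described above is not included in the code on this page, so a minimal sketch of how one noisy copy of the database might be generated is given here. Note that this is our suggestion only, not the script used for the published results: the NOISEX-92 file names are placeholders, the noise is assumed to have already been resampled to 16 kHz mono, and every noise recording is assumed to be longer than every RWCP sound.

% Minimal sketch of noise mixing (placeholder noise file names; noise assumed 16 kHz mono)
snr_db = 0;                                    %target SNR: 0, 10 or 20
noise_files = {'babble.wav','destroyer.wav','factory1.wav','jetcockpit1.wav'};
out_dir = sprintf('data_wav_mix%d/', snr_db);
d = dir('data_wav/');
for class = 3:length(d)                        %+2 skips the '.' and '..' entries
  files = dir(['data_wav/', d(class).name, '/*.wav']);
  mkdir([out_dir, d(class).name]);
  for f = 1:length(files)
    [s, fs] = audioread(['data_wav/', d(class).name, '/', files(f).name]);
    n = audioread(noise_files{ceil(rand()*4)});      %random choice of noise type
    start = ceil(rand()*(length(n)-length(s)));      %random noise starting point
    n = n(start:start+length(s)-1);
    g = sqrt(sum(s.^2)/(sum(n.^2)*10^(snr_db/10)));  %noise gain for the target SNR
    %note: the mixture may clip at 0 dB SNR; rescale if audiowrite warns
    audiowrite([out_dir, d(class).name, '/', files(f).name], s + g*n, fs);
  end
end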

The 50 sound classes that we use for the published experiments are as follows: aircap, bank, bells5, book1, bottle1, bowl, buzzer, candybwl, cap1, case1, cherry1, clap1, clock1, coffcan, coffmill, coin1, crumple, cup1, cymbals, dice1, doorlock, drum, dryer, file, horn, kara, magno1, maracas, mechbell, metal05, pan, particl1, phone1, pipong, pump, punch, ring, sandpp1, saw1, shaver, snap, spray, stapler, sticks, string, teak1, tear, trashbox, whistle1, wood1

So, for example, the directory structure would contain:
data_wav/aircap/000.raw.wav
data_wav/aircap/001.raw.wav
data_wav/aircap/002.raw.wav
data_wav/aircap/003.raw.wav
...
data_wav/aircap/079.raw.wav

and later:
data_wav_mix10/buzzer/000.raw.wav
data_wav_mix10/buzzer/001.raw.wav

much later:
data_wav_mix20/wood1/078.raw.wav
data_wav_mix20/wood1/079.raw.wav

All of these subdirectories containing noise-mixed sounds are prepared in advance of any training or testing, and remain unchanged across test runs. The exception is cross-verification over different noise types, for which several versions of the entire setup are created with different noise mixes.

1.3

All testing is performed using MATLAB (or Octave). If you use MATLAB, you will also need the Signal Processing Toolbox (which provides the spectrogram function used below).

1.4

Although we have our own DNN implementation, the choice of implementation does not affect the performance scores (although it can affect the computation speed quite substantially). The easiest way to set up the experiments is to use the very convenient DeepLearnToolbox from Palm. If you use this, please ensure you also cite their paper. You can download the toolbox – which works in either MATLAB or Octave – from https://github.com/rasmusbergpalm/DeepLearnToolbox

Download the software, unpack the toolbox and then add it to your MATLAB path as shown in the documentation.
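For example, if the toolbox was unpacked into a directory called DeepLearnToolbox (the location is an assumption – use wherever you placed it):

addpath(genpath('DeepLearnToolbox'));   %adds the NN, DBN, util, ... subdirectories to the path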

Note: you will need a reasonably fast computer with at least 4GB of memory to do this. Training and testing for one condition (i.e. one noise mix) takes a couple of hours.

1.5

PASS 1: training a DNN

First we set up the system:

% Set up initial variables
clear all;

data_dir='data_wav/';  %this is where the clean sounds are stored
directory=dir(data_dir);

nclass=50;             %number of sound classes
nfile=80;              %number of files per class
ny=24;                 %frequency bands in each feature vector
nx=30;                 %time frames in each feature vector
winlen=2048;           %spectrogram window length (samples)
overlap=2048-16;       %spectrogram overlap (i.e. a 16-sample hop)
ntrain=0;              %running count of training feature vectors


The next step is to run through all of the training sounds and bring their feature vectors into memory:

for class=1:nclass
  sub_d=dir([data_dir,directory(class+2).name]);  %+2 skips the '.' and '..' entries
  for file=1:nfile
    if mod(file-1,8)>2  %select the specific files used for training (5 of every 8)
      [wave,fs]=audioread([data_dir,directory(class+2).name,'/',sub_d(file+2).name]);
      data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
      clear data;
      nchannel=size(data0,1);
      %average the spectrogram down to ny frequency bands
      for y=1:ny
        data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
      end;
      %subtract the minimum of each band (simple de-noising)
      for y=1:ny
        data(y,:)=data(y,:)-min(data(y,:));
      end;
      %number of 50% overlapped frames of nx columns in this file
      nFrames(80*(class-1)+file)=floor(size(data,2)/nx*2)-1;
      for frame=1:nFrames(80*(class-1)+file)
        ntrain=ntrain+1;
        train_data(ntrain,1:nx*ny)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
        energy=sum(train_data(ntrain,1:nx*ny));
        if energy~=0
          train_data(ntrain,1:nx*ny)=train_data(ntrain,1:nx*ny)/energy;  %energy-normalise the frame
        end;
        train_data(ntrain,nx*ny+1)=energy;  %append the frame energy as an extra feature element
        train_label(ntrain,:)=class;
      end;
      %normalise the energy element across the frames of this file
      train_data(ntrain-nFrames(80*(class-1)+file)+1:ntrain,end)=train_data(ntrain-nFrames(80*(class-1)+file)+1:ntrain,end)/sum(train_data(ntrain-nFrames(80*(class-1)+file)+1:ntrain,end));
    end;
  end;
  fprintf('Done reading %d class training files\n',class);
end;



The array train_data now contains all of the training data feature vectors: each row holds an ny x nx = 720-element spectrogram patch followed by a single energy term (721 elements in total). Please refer to our paper to see how the feature vectors are constructed from the spectrogram representations.

Before we continue, we must condition the data so that it is scaled appropriately:

%scale the energy column into the same range as the spectrogram elements
train_data(:,end)=train_data(:,end)/max(train_data(:,end))*max(max(train_data(:,1:end-1)));

%normalise everything into the range [0,1]
mi=min(min(train_data));
train_x=train_data-mi;
ma=max(max(train_x));
train_x=train_x/ma;
clear train_data;

%convert the class labels into one-hot target vectors
train_y=zeros(length(train_label),50);
for i=1:length(train_label)
  train_y(i,train_label(i))=1;
end;
clear train_label;
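As a quick sanity check before training (this check is our suggestion rather than part of the original procedure), each row of the conditioned data should hold nx*ny+1 = 721 values, all scaled into [0,1]:

assert(size(train_x,2)==nx*ny+1);                  %720 spectrogram values + 1 energy term
assert(min(train_x(:))>=0 && max(train_x(:))<=1);  %conditioning maps everything into [0,1]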



The next step is to set up the neural network parameters, using the settings recommended by Palm for the DeepLearnToolbox:

nnsize=210;    %hidden units per layer
dropout=0.10;  %dropout fraction used during fine-tuning
b=100;         %minibatch size
nt=size(train_x,1);

rand('state',0); %ensure the random number generator is set into 'reproducible' mode

%the input vectors need to be organised into batch subdivisions, so pad the
%training set up to a whole number of minibatches with randomly chosen duplicates
for i=1:(ceil(nt/b)*b-nt)
  np=ceil(rand()*nt);
  train_x=[train_x;train_x(np,:)];
  train_y=[train_y;train_y(np,:)];
end;



Now we start to create and stack RBM layers – as many as we want, to create a deep structure (but this example is not particularly deep, and was adapted from Palm's documentation):

% train a single hidden layer RBM with nnsize units
rand('state',0)
dbn.sizes = [nnsize];
opts.numepochs = 1;
opts.batchsize = b;
opts.momentum = 0;
opts.alpha = 1;
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);

%% train a two hidden layer (nnsize-nnsize) DBN and use its weights to initialise a NN
rand('state',0);
dbn.sizes = [nnsize nnsize];
opts.numepochs = 1;
opts.batchsize = b;
opts.momentum = 0;
opts.alpha = 1;
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);

%unfold the dbn into a nn with a 50-class output layer
nn = dbnunfoldtonn(dbn, 50);
nn.activation_function = 'sigm';


Now we treat this network as an NN, ready for fine-tuning using back-propagation:

%train nn
opts.numepochs = 1;
opts.batchsize = b;
nn.dropoutFraction = dropout;
nn.learningRate = 10;
for i=1:1000
  fprintf('Epoch=%d\n',i);
  nn = nntrain(nn, train_x, train_y, opts);
  %adapt the learning rate as training continues
  if i==100
    nn.learningRate = 5;
  end;
  if i==400
    nn.learningRate = 2;
  end;
  if i==800
    nn.learningRate = 1;
  end;
end;



The outcome of this process is a fairly large structure in MATLAB's memory called nn. This defines the DNN structure, and contains all of the weights and connections.
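Since training takes a couple of hours, you may wish to save the trained network at this point so that the testing pass below can be run later in a fresh session. The file name here is just a suggestion:

save('trained_dnn.mat','nn');   %reload later with: load('trained_dnn.mat')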

1.6

PASS 2: use the DNN for testing

The learned DNN (called nn) is now used for classification. Again, we first set up the system in the same way as before.

%% test files are denoised by subtracting the per-channel minimum (as in training)
clear test_x;
clear test_y;

data_dir='data_wav/';        %clean sounds (used here only for the directory listing)
noise_dir='data_wav_mix0/';  %this is for the 0dB mixture – change to the 10 or 20dB directories for the other scores
directory=dir(data_dir);

nclass=50;
nfile=80;
ny=24;
nx=30;
winlen=2048;
overlap=2048-16;



And again, we read in the data for testing in the same way as we did for training:

ntest=0;   %running count of test feature vectors
rand('state',0)
for class=1:nclass
  sub_d=dir([data_dir,directory(class+2).name]);
  for file=1:nfile
    if mod(file-1,8)<3  %select the specific files used for testing (3 of every 8)
      [wave,fs]=audioread([noise_dir,directory(class+2).name,'/',sub_d(file+2).name]);
      data0=abs(spectrogram(wave,winlen,overlap,winlen,16000));
      clear data;
      nchannel=size(data0,1);
      %average the spectrogram down to ny frequency bands
      for y=1:ny
        data(y,:)=mean(data0(ceil(nchannel/ny*(y-1))+1:ceil(nchannel/ny*y),:));
      end;
      %subtract the minimum of each band (simple de-noising)
      for y=1:ny
        data(y,:)=data(y,:)-min(data(y,:));
      end;
      nFrames(80*(class-1)+file)=floor(size(data,2)/nx*2)-1;
      for frame=1:nFrames(80*(class-1)+file)
        ntest=ntest+1;
        test_data(ntest,1:nx*ny)=reshape(data(:,(frame-1)*nx/2+1:(frame+1)*nx/2),1,nx*ny);
        energy=sum(test_data(ntest,1:nx*ny));
        if energy~=0
          test_data(ntest,1:nx*ny)=test_data(ntest,1:nx*ny)/energy;
        end;
        test_data(ntest,nx*ny+1)=energy;
        test_label(ntest,:)=class;
      end;
      %normalise the energy element across the frames of this file
      test_data(ntest-nFrames(80*(class-1)+file)+1:ntest,end)=test_data(ntest-nFrames(80*(class-1)+file)+1:ntest,end)/sum(test_data(ntest-nFrames(80*(class-1)+file)+1:ntest,end));
    end;
  end;
  fprintf('Done reading %d class test files\n',class);
end;



Next – as before – we also condition the files and ensure they are scaled appropriately:

%scale the energy column into the same range as the spectrogram elements
test_data(:,end)=test_data(:,end)/max(test_data(:,end))*max(max(test_data(:,1:end-1)));

%% normalise to [0,1] before presenting to the network
mi=min(min(test_data));
test_x=test_data-mi;
ma=max(max(test_x));
test_x=test_x/ma;
clear test_data;

%convert the class labels into one-hot target vectors
test_y=zeros(length(test_label),50);
for i=1:length(test_label)
  test_y(i,test_label(i))=1;
end;
clear test_label;



Now we execute the actual test:

correct=0;
test_now=0;
nfile=4000;   %50 classes x 80 files
for file=1:nfile
  if mod(file-1,8)<3  %the specific files that we selected for testing
    [label, prob] = nnpredict_p(nn,test_x(test_now+1:test_now+nFrames(file),:));
    if label==ceil(file/80)   %files 1-80 belong to class 1, files 81-160 to class 2, etc.
      correct=correct+1;
      fprintf('correct\n');
    else
      fprintf('NOT correct\n');
    end;
    test_now=test_now+nFrames(file);
  end;
end;



This makes use of a function called nnpredict_p to do the probability and energy scaling (one of the two output scoring options in our paper): it sums the per-frame network outputs over all frames of a file, then selects the class with the highest total.

Here is the nnpredict_p function:

function [label, maxp] = nnpredict_p(nn, x)
  nn.testing = 1;
  nn = nnff(nn, x, zeros(size(x,1), nn.size(end)));  %forward pass over all frames of the file
  nn.testing = 0;
  prob = sum(nn.a{end});           %sum the output activations across frames
  label = find(prob==max(prob));   %class with the highest summed score
  maxp = max(prob);
end



1.7

The above code will output either “correct” or “NOT correct” for each of the 1500 files in the test set (30 test files for each of the 50 classes).

The performance score for that particular test condition is simply the proportion of correct classifications, i.e. correct/1500.
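For example, once the test loop in section 1.6 has finished, the score can be printed directly from the correct counter accumulated there:

fprintf('Classification accuracy: %.2f%% (%d of 1500 test files)\n', 100*correct/1500, correct);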