Journal
The large data set has now been preprocessed and uploaded to the TitanX PC. There is a performance problem with the noise segments: the number of noise segments becomes quite large for the large dataset (~150,000 segments), and browsing this directory takes unnecessary time.
The Python library librosa seems to contain a lot of useful functions which could be used in the project, not least the mfcc, delta, and delta-delta methods.
Looking at other network models such as inception_v3 and xception, it seems that BN is used after Conv layers. Maybe this could be part of what is causing the problem.
Making the noise segments a class of their own, and training the network on noise as well, may be a good way of improving accuracy. In that way, segments that are classified as noise could be excluded from the mean of the predictions over a file, which could make it easier for the network to predict the true sound class. It would also allow the network to predict sound files where noise has not been removed as a preprocessing step.
It may be more interesting to use the available meta-data for each sound class than to try the multiple-width frequency-delta (MWFD) data augmentation method. This method diverges from using spectrograms as input, and instead uses variations of the MFCCs of the audio wave as input.
Pros of MWFD:
- lower dimension
- could make it easier to include meta-data
Cons of MWFD:
- very different from using spectrogram chunks, which could make it harder or less relevant to compare
- could take a lot of extra time to implement
- MFCCs are designed for human sounds
I have reimplemented the spectrogram calculations as specified by Elias Sprengel, and I have included a Residual Network model in the implementation. The data flow is now a modified version of Keras's own ImageDataGenerator instead of a manual implementation. The modified generator is called SoundDataGenerator, and supports same-class, noise, time, and pitch augmentations.
Feedback from Elias Sprengel:
- Do you re-sample the files? I use 22050 frames.
The files are now resampled to 22050 Hz instead of 16000 Hz.
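Resampling from 16000 Hz to 22050 Hz can be done with a polyphase filter; a minimal sketch using scipy (the pipeline may instead rely on librosa's loader, which resamples on load — this is just the idea):

```python
import numpy as np
from scipy.signal import resample_poly

orig_sr, target_sr = 16000, 22050
wave = np.random.randn(orig_sr).astype(np.float32)  # one second at 16 kHz

# The ratio 22050 / 16000 reduces to 441 / 320.
resampled = resample_poly(wave, up=441, down=320)
```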
- Have you checked that your noise/signal segments are correct?
The noise/signal extraction has been reimplemented as specified by Elias Sprengel in the paper, and the resulting noise/signal waves have been confirmed to be very similar to the noise/signal extracted from the BirdCLEF example shown in the paper, by manual inspection of the resulting log-amplitude spectrograms.
*(figure: Noise and Signal Extraction)*
- The data-augmentation is applied every time I get a new sample for the network. That means I could show sample1-signal + sample5-signal + sample10-noise + sample12-noise + sample28-noise in the first epoch, and in the second epoch use sample1-signal + sample3-signal + sample8-noise + sample12-noise + sample14-noise.
The new data flow now lets samples flow from a directory with each class in its own subdirectory. Each sample is loaded and then randomly augmented with another sample from the same subdirectory (class dir), as well as with three samples from the noise dir. This means that every new sample shown to the network reflects this feedback from Elias.
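The time-domain mixing behind this flow can be sketched as follows (a sketch of the mixing idea only; the function name and random weights are assumptions, not the actual SoundDataGenerator internals):

```python
import numpy as np

def augment_segment(signal, same_class, noise_segments, rng):
    """Additively mix a signal segment with one same-class segment and a
    few noise segments, each scaled by a random weight in [0, 1)."""
    out = signal + rng.uniform() * same_class
    for noise in noise_segments:
        out = out + rng.uniform() * noise
    return out

rng = np.random.default_rng(0)
sig = np.ones(8, dtype=np.float32)
mixed = augment_segment(sig, np.ones(8), [np.ones(8)] * 3, rng)
```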
- Technically I compute the STFT, then store the complex-valued matrix. I split that matrix into chunks and store those as samples. I then combine the samples (complex values) to get my augmented sample. Then, just before I pass it to the neural network, I call np.log(np.abs(x) ** 2) to get rid of the complex values (only magnitude) and to take the logarithm (log-e instead of your log-10).
The spectrograms are now computed using an STFT, and they are split into segments which are stored in their respective class dir. However, the samples are combined in the time domain rather than using the complex values of the STFT. When a sample is passed to the network, the complex spectrogram is computed, and then np.log(np.abs(X) ** 2) is applied to the complex spectrogram segment before it is passed on. Now using log-e rather than log-10.
However, the MAP still seems to be higher when training on amplitude spectrograms rather than on log-amplitude spectrograms.
Further testing is warranted.
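For reference, the conversion from a complex STFT segment to the log-amplitude network input described above can be sketched as (the epsilon guard against log(0) is an addition here, not part of the original code):

```python
import numpy as np

def log_amplitude(stft_segment, eps=1e-12):
    # Magnitude squared, then the natural log (log-e, as noted above);
    # eps avoids log(0) on completely silent frames.
    return np.log(np.abs(stft_segment) ** 2 + eps)

seg = np.array([[1 + 1j, 2.0], [0.5, 3j]])
out = log_amplitude(seg)
```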
Created a random subset of 20 species from the BirdCLEF2016 dataset, used to test the methods on varying data sample lengths. The methods and training scheme seem to be working quite well on this subset. What is needed to evaluate it now is a way of combining multiple predictions for one file into one prediction, as well as a proper validation set.
Todo:
- split into validation set and training set
- create an averaging multi-prediction method
- proper metric, e.g., Area Under Curve, or Mean Average Precision.
- pitch shift augmentation
- median filtering in the mask calculations
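One candidate for the averaging multi-prediction item is plain probability averaging over a file's segments; a sketch under that assumption (the final method may differ):

```python
import numpy as np

def file_prediction(segment_probs):
    """Combine per-segment class probabilities into one file-level
    prediction by taking the mean over segments."""
    return np.mean(np.asarray(segment_probs), axis=0)

probs = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
combined = file_prediction(probs)
```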
The F-score metric was readily available in the Keras library. It measures a weighted average of the precision and recall of the predictions. The accuracy metric has been changed to fbeta_score, and the Adadelta optimizer has been changed to SGD. The learning rate, momentum, decay, and Nesterov momentum need to be properly tuned.
Todo:
- tune SGD parameters
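For reference when tuning, the four parameters interact as in this plain-Python sketch of the classic momentum/decay SGD update rule (the values here are illustrative, not tuned):

```python
def sgd_step(w, grad, velocity, step, lr=0.01, momentum=0.9,
             decay=1e-6, nesterov=True):
    """One SGD parameter update with momentum, time-based learning-rate
    decay, and optional Nesterov momentum."""
    lr_t = lr / (1.0 + decay * step)              # learning-rate decay
    velocity = momentum * velocity - lr_t * grad  # momentum accumulation
    if nesterov:
        w = w + momentum * velocity - lr_t * grad
    else:
        w = w + velocity
    return w, velocity

w, v = sgd_step(1.0, 0.5, 0.0, step=0, decay=0.0, nesterov=False)
```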
The pipeline now reads compressed (gzip) files instead of raw .wav files, which reduces the on-disk size of the data set by a factor of three. An augmentation method called time_shift_signal has been added, which simply splits the signal in two at a random point and places the second part before the first. Everything seems to be running well on the GPU, and the memory usage is low and stable.
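The split-and-swap augmentation can be sketched as (a sketch of the behavior described above; the signature is assumed, not the actual implementation):

```python
import numpy as np

def time_shift_signal(wave, rng):
    """Split the signal at a random point and place the second part
    before the first."""
    split = rng.integers(1, len(wave))
    return np.concatenate([wave[split:], wave[:split]])

rng = np.random.default_rng(0)
shifted = time_shift_signal(np.arange(10), rng)
```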
A first experiment has been run with 10 mini-batches, each trained for 10 epochs, where the training set consists of 5000 randomly chosen, augmented signal segments, from which each mini-batch is drawn at random. The training loss seems to be slowly decreasing (from ~0.25 to ~0.20). A slow convergence is what we want, in the hope of a model that will generalize well. However, the validation loss does not seem to be decreasing at all. This may be due to the very low number of mini-batches and epochs, so a longer training run has been started to see if it changes the results.
Next implementation:
- proper metric, e.g., Area Under Curve, or Mean Average Precision.
Still not in the pipeline:
- pitch shift augmentation
- median filtering in the mask calculations
Implemented a data generator scheme which should be easy to use, and which can be configured to return a set of augmented samples in mini-batches. The mini-batches can then be used to fit the model, for a couple of epochs, in small steps (every mini-batch fits into memory). It seems to be working, and is running ok on the GPU.
Next steps:
- compress data set and read from compressed files if needed
- add time augmentation
- add pitch shift augmentation
Implemented a couple of data augmentation generator methods. The augmented samples are now described using only their filenames and stored as dicts. It should be possible to create a large data set of such unique samples, and then only load the files into memory for each mini-batch that is generated from the augmented data sample set.
- make sure that the mini-batch is removed from memory when it has been used
- compress the data set using gzip, and decompress on the fly in python
- connect the augmented data samples to the training scheme
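The gzip item above amounts to round-tripping each segment through a compressed file; a minimal sketch (the filename and on-disk format here are assumptions, not the pipeline's actual layout):

```python
import gzip
import os
import tempfile
import numpy as np

segment = np.arange(5, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "segment.npy.gz")

with gzip.open(path, "wb") as f:
    np.save(f, segment)      # compress on write
with gzip.open(path, "rb") as f:
    loaded = np.load(f)      # decompress on the fly
```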
Discussed how to load data for augmentation, and whether it makes sense to normalize the data in the spectral domain to zero mean and unit variance. The former should probably be done on the fly from a compressed data set file on disk, where each epoch is loaded at random and same-class + noise augmented in real time.
The next step is to write the relevant methods for loading and choosing the random batches of data, and then augmenting them on the fly. The augmented files are then used as input to the CNN.
- open channel to the compressed data set
- load as many segments as possible (memory constraints), at random without replacement, from the compressed data set
- randomly augment each segment with:
- three noise segments
- a same class signal segment
- and time/frequency shift it
- connect the augmented segments to the network model
- Added mask scaling
- Added benchmark file
- Added preprocessing step
- Fixed bottleneck from preprocessing step
The script pp.py can now preprocess a data set. The script is hard-coded to the mlsp2013 training data, but should generalize to any data set with 16-bit mono wave files with a sample rate of 16000 Hz. The script assumes that a file called file2labels.csv is present in the data set directory, and will
- read each wave file
- mask out the noise and signal part of the wave file
- split the noise and signal parts into equally sized segments
- save the segments in the specified output directory
- create a new file2labels.csv file in this directory with the labels for each signal segment
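The segment-splitting step in the list above can be sketched as (the function name and remainder handling are assumptions; the real pp.py may differ):

```python
import numpy as np

def split_into_segments(wave, segment_size):
    """Split a noise or signal wave into equally sized segments,
    dropping any trailing remainder."""
    n_segments = len(wave) // segment_size
    return [wave[i * segment_size:(i + 1) * segment_size]
            for i in range(n_segments)]

segments = split_into_segments(np.arange(10), 3)
```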
I have gotten first-hand experience with for-loop bottlenecks: simply by removing the for loops in the compute_binary_mask method and replacing them with numpy array operations, I got a speedup of around 30x in the wave file preprocessing.
Daily take away: Do not use iterative for loops
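As an illustration of the pattern (a simple thresholding stand-in, not the actual compute_binary_mask, which follows the paper's masking procedure), the same work written with explicit loops and as one vectorized comparison:

```python
import numpy as np

def binary_mask_loop(spectrogram, threshold):
    # Slow: an explicit Python loop over every cell.
    mask = np.zeros(spectrogram.shape, dtype=bool)
    for i in range(spectrogram.shape[0]):
        for j in range(spectrogram.shape[1]):
            mask[i, j] = spectrogram[i, j] > threshold
    return mask

def binary_mask_vectorized(spectrogram, threshold):
    # Fast: one broadcast comparison runs in compiled numpy code.
    return spectrogram > threshold

spec = np.random.default_rng(0).random((4, 4))
```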
Do tomorrow:
- Create a couple of preprocessing images and compare to Mario nips2013
- Fix the scaling error in reshape_binary_mask
- Start with data augmentation