
Few-shot bioacoustic event detection using a prototypical network ensemble with adaptive embedding functions

20 October 2022, 10:24

Climate change affects the whole planet, and both humans and other species face increasing challenges in the near future. We are currently in the midst of the sixth mass extinction of species, and trustworthy quantification of the state of biological systems may help us mitigate some of its effects. Detecting acoustic events in wildlife recordings has great potential to assist ecologists working with biodiversity and the tracking of acoustically active animals. In particular, automatically detecting events in long audio recordings, given only a few examples annotated by experts, makes this work much more efficient.

Our approach (which was recently accepted for presentation at the Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE) employs prototypical networks in an ensemble, using adaptive embedding functions, to solve the task of few-shot bioacoustic event detection. In few-shot sound event detection, you are given very few examples of each class (in DCASE task 5, five examples per class), and from these you must correctly label the start and end times of all subsequent events of the same class. In the DCASE test set, each few-shot class appears in a separate recording, where the first five annotations serve as training examples and the rest as test samples. An additional base training set with labels for other classes is provided and can be used for pretraining.
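To make the prototypical-network idea concrete, below is a minimal sketch in PyTorch of how the five annotated examples can be turned into a class prototype and used to score the remaining segments of a recording. The embedding network `embed`, the negative (background) support set and the tensor shapes are illustrative assumptions, not the exact setup of our system.

```python
import torch

def prototypical_scores(embed, support_pos, support_neg, queries):
    """Probability that each query segment belongs to the event class.

    support_pos: tensor with the five annotated positive segments
    support_neg: tensor with background segments from the same recording
    queries:     tensor with the segments to be labelled
    """
    with torch.no_grad():
        proto_pos = embed(support_pos).mean(dim=0)   # positive class prototype
        proto_neg = embed(support_neg).mean(dim=0)   # negative class prototype
        z = embed(queries)                           # (N, D) query embeddings

        # Squared Euclidean distance from each query to each prototype.
        d_pos = ((z - proto_pos) ** 2).sum(dim=1)
        d_neg = ((z - proto_neg) ** 2).sum(dim=1)

        # Softmax over negated distances gives per-segment event probabilities.
        logits = torch.stack([-d_pos, -d_neg], dim=1)
        return torch.softmax(logits, dim=1)[:, 0]
```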

 

Figure 1. A log Mel spectrogram of part of a sound recording (top) and example predictions (bottom) from an ensemble of prototypical networks (solid blue line) and a single prototypical network (dashed blue line), together with the given few-shot examples (purple line) and the remaining ground-truth events (green line). The decision threshold T of 0.5 is shown as a red line.

Passive acoustic monitoring devices, which record wildlife sounds, are increasingly used in wildlife preservation and tracking efforts. To go through large quantities of data efficiently, automatic or semi-automatic annotation of long recordings is a big help.

In this work, we start from a prototypical network, but instead of the regular embedding function we train an embedding function to solve a multi-class sound event detection task, since two different sound events can occur (partially) at the same time. At prediction time, we then adapt this embedding function to the bioacoustic events we want to detect, using the few-shot examples. The solution uses ensemble predictions from multiple models trained on different time-frequency transforms to reduce false positives, and in the paper we compare three different time-frequency transformations.
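As an illustration of the ensemble over input representations, the sketch below computes two candidate time-frequency transforms with librosa and averages the per-frame probabilities from the individual models. The choice of transforms and their parameters (FFT size, hop length, number of Mel bands), as well as averaging as the combination rule, are assumptions for illustration; they are not the exact configuration compared in the paper.

```python
import numpy as np
import librosa

def log_mel_spectrogram(y, sr, n_fft=1024, hop_length=256, n_mels=128):
    """Log Mel spectrogram, one candidate input representation."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)

def log_linear_spectrogram(y, n_fft=1024, hop_length=256):
    """Log-power linear-frequency spectrogram as an alternative input."""
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    return librosa.power_to_db(power)

def ensemble_probability(member_probs):
    """Average per-frame event probabilities from models trained on
    different input representations (and different random seeds)."""
    return np.mean(np.stack(member_probs, axis=0), axis=0)
```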

We pretrain the embedding functions on the base training data using a traditional multi-class classification objective, where each sound segment can belong to more than one class. The input to the embedding functions is a spectrogram representing the audio, and each embedding function is a 10-layer residual convolutional neural network. We train a set of C such embedding functions for a fixed window size T, each with a different random initialization of the network weights. We do this for 4 different window sizes T, the smallest corresponding to an audio sample of 0.09 seconds and the largest to 0.72 seconds.
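A minimal sketch of this pretraining stage is given below: a multi-label classification objective (binary cross-entropy over per-class logits) trained on spectrogram windows from the base training set. The backbone stands in for the 10-layer residual CNN, and the data loader, optimizer and hyperparameters are assumptions. Repeating this for C random initializations at each of the four window sizes yields the pool of embedding functions.

```python
import torch
import torch.nn as nn

def pretrain_embedding_function(backbone, head, loader, n_epochs=10, lr=1e-3):
    """Pretrain one embedding function on the base training set.

    backbone: CNN mapping a spectrogram window to an embedding vector
    head:     linear layer mapping the embedding to per-class logits
    loader:   yields (spectrogram_window, multi_hot_labels) batches
    """
    # Multi-label objective: a segment may belong to several classes at once.
    criterion = nn.BCEWithLogitsLoss()
    params = list(backbone.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(n_epochs):
        for x, y in loader:
            logits = head(backbone(x))
            loss = criterion(logits, y.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return backbone  # the classification head is not needed at test time
```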

At test time, we choose T to fit the five training examples by minimizing |T - l_min/2|, where l_min is the shortest event length among the five training examples. We then pick the embedding functions matching T and make predictions as with a prototypical network, where the embedding function consists of the ensemble described above.
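The window-size selection can be sketched as follows. Only the smallest (0.09 s) and largest (0.72 s) window sizes are stated above, so the intermediate values are assumptions; the selected T determines which pretrained embedding functions feed the prototypical scoring sketched earlier, whose ensemble-averaged probabilities are finally thresholded at the decision threshold.

```python
# Window sizes in seconds; only the end points are stated in the text,
# the two intermediate values are illustrative assumptions.
WINDOW_SIZES = [0.09, 0.18, 0.36, 0.72]

def select_window_size(support_event_lengths, window_sizes=WINDOW_SIZES):
    """Pick the window size T closest to half the shortest annotated event.

    support_event_lengths: durations (in seconds) of the five given events
    """
    l_min = min(support_event_lengths)
    return min(window_sizes, key=lambda T: abs(T - l_min / 2))
```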

This work improved the performance on DCASE task 5 from an F-score of 41.3% to an F-score of 60.0%, which resulted in a top-3 position in the DCASE challenge 2022. The results will be presented at the DCASE Workshop in France, November 3-4, 2022. The work has shed light on the need to vary the window size and to adapt the embedding function to the expected event lengths. This may not be very surprising; an embedding function trained to detect very short events may not be successful at detecting long events. We have also compared different spectrogram representations, and choosing the right input representation affects the results considerably. This warrants further research into how to choose, or learn, the best input representation for sound processing.

For more information, see the article.



Authors: John Martinsson, Martin Willbo, Aleksis Pirinen, Olof Mogren, Maria Sandsten

