-
A global evidence map of human well-being and biodiversity co-benefits and trade-offs of natural climate solutions
Authors:
Charlotte H. Chang,
James T. Erbaugh,
Paola Fajardo,
Luci Lu,
István Molnár,
Dávid Papp,
Brian E. Robinson,
Kemen Austin,
Susan Cook-Patton,
Timm Kroeger,
Lindsey Smart,
Miguel Castro,
Samantha H. Cheng,
Peter W. Ellis,
Rob I. McDonald,
Teevrat Garg,
Erin E. Poor,
Preston Welker,
Andrew R. Tilman,
Stephen A. Wood,
Yuta J. Masuda
Abstract:
Natural climate solutions (NCS) are critical for mitigating climate change through ecosystem-based carbon removal and emissions reductions. NCS implementation can also generate biodiversity and human well-being co-benefits and trade-offs ("NCS co-impacts"), but the volume of evidence on NCS co-impacts has grown rapidly across disciplines, is poorly understood, and remains to be systematically collated and synthesized. A global evidence map of NCS co-impacts would overcome key barriers to NCS implementation by providing relevant information on co-benefits and trade-offs where carbon mitigation potential alone does not justify NCS projects. We employ large language models to assess over two million articles, finding 257,266 relevant articles on NCS co-impacts. We analyze this large and dispersed body of literature using innovative machine learning methods to extract relevant data (e.g., study location, species, and other key variables), and create a global evidence map on NCS co-impacts. Evidence on NCS co-impacts has grown approximately ten-fold in three decades, although some of the most abundant evidence is associated with pathways that have less mitigation potential. We find that studies often examine multiple NCS pathways, indicating natural NCS pathway complements, and each NCS is often associated with two or more co-impacts. Finally, NCS co-impacts evidence and priority areas for NCS are often mismatched: some countries with high mitigation potential from NCS have few published studies on the broader co-impacts of NCS implementation. Our work advances and makes available novel methods and systematic and representative data of NCS co-impacts studies, thus providing timely insights to inform NCS research and action globally.
Submitted 30 April, 2024;
originally announced May 2024.
-
Dataset balancing can hurt model performance
Authors:
R. Channing Moore,
Daniel P. W. Ellis,
Eduardo Fonseca,
Shawn Hershey,
Aren Jansen,
Manoj Plakal
Abstract:
Machine learning from training data with a skewed distribution of examples per class can lead to models that favor performance on common classes at the expense of performance on rare ones. AudioSet has a very wide range of priors over its 527 sound event classes. Classification performance on AudioSet is usually evaluated by a simple average over per-class metrics, meaning that performance on rare classes is equal in importance to the performance on common ones. Several recent papers have used dataset balancing techniques to improve performance on AudioSet. We find, however, that while balancing improves performance on the public AudioSet evaluation data, it simultaneously hurts performance on an unpublished evaluation set collected under the same conditions. By varying the degree of balancing, we show that its benefits are fragile and depend on the evaluation set. We also do not find evidence indicating that balancing improves rare class performance relative to common classes. We therefore caution against blind application of balancing, as well as against paying too much attention to small improvements on a public evaluation set.
Submitted 30 June, 2023;
originally announced July 2023.
-
Description and analysis of novelties introduced in DCASE Task 4 2022 on the baseline system
Authors:
Francesca Ronchini,
Samuele Cornell,
Romain Serizel,
Nicolas Turpault,
Eduardo Fonseca,
Daniel P. W. Ellis
Abstract:
The aim of the Detection and Classification of Acoustic Scenes and Events Challenge Task 4 is to evaluate systems for the detection of sound events in domestic environments using a heterogeneous dataset. The systems need to be able to correctly detect the sound events present in a recorded audio clip, as well as localize the events in time. This year's task is a follow-up of DCASE 2021 Task 4, with some important novelties. The goal of this paper is to describe and motivate these new additions, and report an analysis of their impact on the baseline system. We introduced three main novelties: the use of external datasets, including recently released strongly annotated clips from AudioSet, the possibility of leveraging pre-trained models, and a new energy consumption metric to raise awareness about the ecological impact of training sound events detectors. The results on the baseline system show that leveraging open-source models pretrained on AudioSet improves the results significantly in terms of event classification but not in terms of event segmentation.
Submitted 14 October, 2022;
originally announced October 2022.
-
MuLan: A Joint Embedding of Music Audio and Natural Language
Authors:
Qingqing Huang,
Aren Jansen,
Joonseok Lee,
Ravi Ganti,
Judith Yue Li,
Daniel P. W. Ellis
Abstract:
Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
Submitted 25 August, 2022;
originally announced August 2022.
-
The Benefit Of Temporally-Strong Labels In Audio Event Classification
Authors:
Shawn Hershey,
Daniel P W Ellis,
Eduardo Fonseca,
Aren Jansen,
Caroline Liu,
R Channing Moore,
Manoj Plakal
Abstract:
To reveal the importance of temporal precision in ground truth audio event labels, we collected precise (~0.1 sec resolution) "strong" labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strong-labeled training subset of 67k clips (compared to the original dataset's 1.8M clips labeled at 10 sec resolution). We show that fine-tuning with a mix of weak and strongly labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels. For a ResNet50 architecture, d' on the strong evaluation data including explicit negatives improves from 1.13 to 1.41. The new labels are available as an update to AudioSet.
Submitted 14 May, 2021;
originally announced May 2021.
-
Self-Supervised Learning from Automatically Separated Sound Scenes
Authors:
Eduardo Fonseca,
Aren Jansen,
Daniel P. W. Ellis,
Scott Wisdom,
Marco Tagliasacchi,
John R. Hershey,
Manoj Plakal,
Shawn Hershey,
R. Channing Moore,
Xavier Serra
Abstract:
Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
Submitted 14 September, 2021; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Authors:
Efthymios Tzinis,
Scott Wisdom,
Aren Jansen,
Shawn Hershey,
Tal Remez,
Daniel P. W. Ellis,
John R. Hershey
Abstract:
Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
Submitted 29 May, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking
Authors:
Eduardo Fonseca,
Shawn Hershey,
Manoj Plakal,
Daniel P. W. Ellis,
Aren Jansen,
R. Channing Moore,
Xavier Serra
Abstract:
The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process. We find that a simple optimisation of the training label set improves recognition performance without additional computation. We discover that most of the improvement comes from ignoring a critical tiny portion of the missing labels. We also show that the damage done by missing labels is larger as the training set gets smaller, yet it can still be observed even when training with massive amounts of audio. We believe these insights can generalize to other large-scale datasets.
Submitted 25 July, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Orientational correlations in active and passive nematic defects
Authors:
D. J. G. Pearce,
J. Nambisan,
P. W. Ellis,
A. Fernandez-Nieves,
L. Giomi
Abstract:
We investigate the emergence of orientational order among +1/2 disclinations in active nematic liquid crystals. Using a combination of theoretical and experimental methods, we show that +1/2 disclinations have short-range antiferromagnetic alignment, as a consequence of the elastic torques originating from their polar structure. The presence of intermediate -1/2 disclinations, however, turns this interaction from anti-aligning to aligning at scales that are smaller than the typical distance between like-sign defects. No long-range orientational order is observed. Strikingly, these effects are insensitive to material properties and qualitatively similar to what is found for defects in passive nematic liquid crystals.
Submitted 5 November, 2021; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Improving Universal Sound Separation Using Sound Classification
Authors:
Efthymios Tzinis,
Scott Wisdom,
John R. Hershey,
Aren Jansen,
Daniel P. W. Ellis
Abstract:
Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic sources from an open domain, regardless of their class. In this paper, we utilize the semantic information learned by sound classifier networks trained on a vast amount of diverse sounds to improve universal sound separation. In particular, we show that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information. This approach is especially useful in an iterative setup, where source estimates from an initial separation stage and their corresponding classifier-derived embeddings are fed to a second separation network. By performing a thorough hyperparameter search consisting of over a thousand experiments, we find that classifier embeddings from clean sources provide nearly one dB of SNR gain, and our best iterative models achieve a significant fraction of this oracle performance, establishing a new state-of-the-art for universal sound separation.
Submitted 18 November, 2019;
originally announced November 2019.
-
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Authors:
Aren Jansen,
Daniel P. W. Ellis,
Shawn Hershey,
R. Channing Moore,
Manoj Plakal,
Ashok C. Popat,
Rif A. Saurous
Abstract:
Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.
Submitted 13 November, 2019;
originally announced November 2019.
-
Audio tagging with noisy labels and minimal supervision
Authors:
Eduardo Fonseca,
Manoj Plakal,
Frederic Font,
Daniel P. W. Ellis,
Xavier Serra
Abstract:
This paper introduces Task 2 of the DCASE2019 Challenge, titled "Audio tagging with noisy labels and minimal supervision". This task was hosted on the Kaggle platform as "Freesound Audio Tagging 2019". The task evaluates systems for multi-label audio tagging using a large set of noisy-labeled data, and a much smaller set of manually-labeled data, under a large vocabulary setting of 80 everyday sound classes. In addition, the proposed dataset poses an acoustic mismatch problem between the noisy train set and the test set due to the fact that they come from different web audio sources. This can correspond to a realistic scenario given by the difficulty in gathering large amounts of manually labeled data. We present the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network. All these resources are freely available.
Submitted 19 January, 2020; v1 submitted 7 June, 2019;
originally announced June 2019.
-
Learning Sound Event Classifiers from Web Audio with Noisy Labels
Authors:
Eduardo Fonseca,
Manoj Plakal,
Daniel P. W. Ellis,
Frederic Font,
Xavier Favory,
Xavier Serra
Abstract:
As sound event classification moves towards larger datasets, issues of label noise become inevitable. Web sites can supply large volumes of user-contributed audio and metadata, but inferring labels from this metadata introduces errors due to unreliable inputs, and limitations in the mapping. There is, however, little research into the impact of these errors. To foster the investigation of label noise in sound event classification we present FSDnoisy18k, a dataset containing 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. We characterize the label noise empirically, and provide a CNN baseline system. Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data. We also show that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
Submitted 7 March, 2019; v1 submitted 4 January, 2019;
originally announced January 2019.
-
AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
Authors:
Sourish Chaudhuri,
Joseph Roth,
Daniel P. W. Ellis,
Andrew Gallagher,
Liat Kaver,
Radhika Marvin,
Caroline Pantofaru,
Nathan Reale,
Loretta Guarino Reid,
Kevin Wilson,
Zhonghua Xi
Abstract:
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings, often tailored toward end applications. However, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches and understand their strengths and weaknesses. In this paper, we describe a new dataset which we will release publicly containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for this task. The labels in the dataset annotate three different speech activity conditions: clean speech, speech co-occurring with music, and speech co-occurring with noise, which enable analysis of model performance in more challenging conditions based on the presence of overlapping noise. We report benchmark performance numbers on AVA-Speech using off-the-shelf, state-of-the-art audio and vision models that serve as a baseline to facilitate future research.
Submitted 23 August, 2018; v1 submitted 1 August, 2018;
originally announced August 2018.
-
General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline
Authors:
Eduardo Fonseca,
Manoj Plakal,
Frederic Font,
Daniel P. W. Ellis,
Xavier Favory,
Jordi Pons,
Xavier Serra
Abstract:
This paper describes Task 2 of the DCASE 2018 Challenge, titled "General-purpose audio tagging of Freesound content with AudioSet labels". This task was hosted on the Kaggle platform as "Freesound General-Purpose Audio Tagging Challenge". The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline system.
Submitted 6 October, 2018; v1 submitted 25 July, 2018;
originally announced July 2018.
-
Geometrical control of active turbulence in curved topographies
Authors:
D. J. G. Pearce,
Perry W. Ellis,
Alberto Fernandez-Nieves,
L. Giomi
Abstract:
We investigate the turbulent dynamics of a two-dimensional active nematic liquid crystal constrained on a curved surface. Using a combination of hydrodynamic and particle-based simulations, we demonstrate that the fundamental structural features of the fluid, such as the topological charge density, the defect number density, the nematic order parameter and defect creation and annihilation rates, are simple linear functions of the substrate Gaussian curvature, which then acts as a control parameter for the chaotic flow. Our theoretical predictions are then compared with experiments on microtubule-kinesin suspensions confined on toroidal active droplets, finding excellent qualitative agreement.
Submitted 3 May, 2018;
originally announced May 2018.
-
Unsupervised Learning of Semantic Audio Representations
Authors:
Aren Jansen,
Manoj Plakal,
Ratheet Pandya,
Daniel P. W. Ellis,
Shawn Hershey,
Jiayang Liu,
R. Channing Moore,
Rif A. Saurous
Abstract:
Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
Submitted 6 November, 2017;
originally announced November 2017.
-
CNN Architectures for Large-Scale Audio Classification
Authors:
Shawn Hershey,
Sourish Chaudhuri,
Daniel P. W. Ellis,
Jort F. Gemmeke,
Aren Jansen,
R. Channing Moore,
Manoj Plakal,
Devin Platt,
Rif A. Saurous,
Bryan Seybold,
Malcolm Slaney,
Ron J. Weiss,
Kevin Wilson
Abstract:
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
Submitted 10 January, 2017; v1 submitted 29 September, 2016;
originally announced September 2016.
-
Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems
Authors:
Colin Raffel,
Daniel P. W. Ellis
Abstract:
We propose a simplified model of attention which is applicable to feed-forward neural networks and demonstrate that the resulting model can solve the synthetic "addition" and "multiplication" long-term memory problems for sequence lengths which are both longer and more widely varying than the best published results for these tasks.
Submitted 20 September, 2016; v1 submitted 29 December, 2015;
originally announced December 2015.
-
Stable nematic droplets with handles
Authors:
E. Pairam,
J. Vallamkondu,
V. Koning,
B. C. van Zuiden,
P. W. Ellis,
M. A. Bates,
V. Vitelli,
A. Fernandez Nieves
Abstract:
We stabilize nematic droplets with handles against surface-tension-driven instabilities using a yield-stress material as outer fluid and study the complex nematic textures and defect structures that result from the competition between topological constraints and the elasticity of the nematic liquid crystal. We uncover a surprisingly persistent twisted configuration of the nematic director inside the droplets when tangential anchoring is established at their boundaries, which we explain after considering the influence of saddle-splay on the elastic free energy. For toroidal droplets, we find that the saddle-splay energy screens the twisting energy resulting in a spontaneous breaking of mirror symmetry; the chiral twisted state persists for aspect ratios as large as ~20. For droplets with additional handles, we observe in experiments and computer simulations that there are two additional -1 surface defects per handle; these are located in regions with local saddle geometry to minimize the nematic distortions and hence the corresponding elastic free energy.
Submitted 25 April, 2013; v1 submitted 8 December, 2012;
originally announced December 2012.