Search | arXiv e-print repository

arXiv:2406.19363 [pdf, ps, other]

Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment

Authors: Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff

Abstract: Forced alignment (FA) plays a key role in speech research through the automatic time alignment of speech signals with corresponding text transcriptions. Despite the move towards end-to-end architectures for speech technology, FA is still dominantly achieved through a classic GMM-HMM acoustic model. This work directly compares alignment performance from leading automatic speech recognition (ASR) me… ▽ More Forced alignment (FA) plays a key role in speech research through the automatic time alignment of speech signals with corresponding text transcriptions. Despite the move towards end-to-end architectures for speech technology, FA is still dominantly achieved through a classic GMM-HMM acoustic model. This work directly compares alignment performance from leading automatic speech recognition (ASR) methods, WhisperX and Massively Multilingual Speech Recognition (MMS), against a Kaldi-based GMM-HMM system, the Montreal Forced Aligner (MFA). Performance was assessed on the manually aligned TIMIT and Buckeye datasets, with comparisons conducted only on words correctly recognized by WhisperX and MMS. The MFA outperformed both WhisperX and MMS, revealing a shortcoming of modern ASR systems. These findings highlight the need for advancements in forced alignment and emphasize the importance of integrating traditional expertise with modern innovation to foster progress. Index Terms: forced alignment, phoneme alignment, word alignment △ Less

Submitted 27 June, 2024; originally announced June 2024.

Journal ref: Interspeech 2024

arXiv:2402.14023 [pdf]

25-Fold Resolution Enhancement of X-ray Microscopy Using Multipixel Ghost Imaging

Authors: O. Sefi, A. Ben Yehuda, Y. Klein, S. Bloch, H. Schwartz, E. Cohen, S. Shwartz

Abstract: Hard x-ray imaging is indispensable across diverse fields owing to its high penetrability. However, the resolution of traditional x-ray imaging modalities, such as computed tomography (CT) systems, is constrained by factors including beam properties, the absence of optical components, and detection resolution. As a result, typical resolution in commercial imaging systems is limited to a few hundre… ▽ More Hard x-ray imaging is indispensable across diverse fields owing to its high penetrability. However, the resolution of traditional x-ray imaging modalities, such as computed tomography (CT) systems, is constrained by factors including beam properties, the absence of optical components, and detection resolution. As a result, typical resolution in commercial imaging systems is limited to a few hundred microns. This study advances high-photon-energy imaging by extending the concept of computational ghost imaging to multipixel ghost imaging with x-rays. We demonstrate a remarkable enhancement in resolution from 500 microns to approximately 20 microns for an image spanning 0.9 by 1 cm^2, comprised of 400,000 pixels and involving only 1000 realizations. Furthermore, we present a high-resolution CT reconstruction using our method, revealing enhanced visibility and resolution. Our achievement is facilitated by an innovative x-ray lithography technique and the computed tiling of images captured by each detector pixel. Importantly, this method can be scaled up for larger images without sacrificing the short measurement time, thereby opening intriguing possibilities for noninvasive high-resolution imaging of small features that are invisible with the present modalities. △ Less

Submitted 7 February, 2024; originally announced February 2024.

Comments: 9 pages, 4 figures

arXiv:2311.18604 [pdf, other]

doi 10.5334/tismir.167

Barwise Music Structure Analysis with the Correlation Block-Matching Segmentation Algorithm

Authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot

Abstract: Music Structure Analysis (MSA) is a Music Information Retrieval task consisting of representing a song in a simplified, organized manner by breaking it down into sections typically corresponding to ``chorus'', ``verse'', ``solo'', etc. In this work, we extend an MSA algorithm called the Correlation Block-Matching (CBM) algorithm introduced by (Marmoret et al., 2020, 2022b). The CBM algorithm is a… ▽ More Music Structure Analysis (MSA) is a Music Information Retrieval task consisting of representing a song in a simplified, organized manner by breaking it down into sections typically corresponding to ``chorus'', ``verse'', ``solo'', etc. In this work, we extend an MSA algorithm called the Correlation Block-Matching (CBM) algorithm introduced by (Marmoret et al., 2020, 2022b). The CBM algorithm is a dynamic programming algorithm that segments self-similarity matrices, which are a standard description used in MSA and in numerous other applications. In this work, self-similarity matrices are computed from the feature representation of an audio signal and time is sampled at the bar-scale. This study examines three different standard similarity functions for the computation of self-similarity matrices. Results show that, in optimal conditions, the proposed algorithm achieves a level of performance which is competitive with supervised state-of-the-art methods while only requiring knowledge of bar positions. In addition, the algorithm is made open-source and is highly customizable. △ Less

Submitted 30 November, 2023; originally announced November 2023.

Comments: 19 pages, 13 figures, 11 tables, 1 algorithm, published in Transactions of the International Society for Music Information Retrieval

ACM Class: H.5.5

Journal ref: Transactions of the International Society for Music Information Retrieval, 6(1), 2023, 167--185

arXiv:2212.11054 [pdf, other]

Polytopic Analysis of Music

Authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot

Abstract: Structural segmentation of music refers to the task of finding a symbolic representation of the organisation of a song, reducing the musical flow to a partition of non-overlap** segments. Under this definition, the musical structure may not be unique, and may even be ambiguous. One way to resolve that ambiguity is to see this task as a compression process, and to consider the musical structure a… ▽ More Structural segmentation of music refers to the task of finding a symbolic representation of the organisation of a song, reducing the musical flow to a partition of non-overlap** segments. Under this definition, the musical structure may not be unique, and may even be ambiguous. One way to resolve that ambiguity is to see this task as a compression process, and to consider the musical structure as the optimization of a given compression criteria. In that viewpoint, C. Guichaoua developed a compression-driven model for retrieving the musical structure, based on the "System and Contrast" model, and on polytopes, which are extension of nhypercubes. We present this model, which we call "polytopic analysis of music", along with a new opensource dedicated toolbox called MusicOnPolytopes (in Python). This model is also extended to the use of the Tonnetz as a relation system. Structural segmentation experiments are conducted on the RWC Pop dataset. Results show improvements compared to the previous ones, presented by C. Guichaoua. △ Less

Submitted 22 December, 2022; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: Work document

ACM Class: H.5.5

arXiv:2210.15356 [pdf, other]

Convolutive Block-Matching Segmentation Algorithm with Application to Music Structure Analysis

Authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot

Abstract: Music Structure Analysis (MSA) consists of representing a song in sections (such as ``chorus'', ``verse'', ``solo'' etc), and can be seen as the retrieval of a simplified organization of the song. This work presents a new algorithm, called Convolutive Block-Matching (CBM) algorithm, devoted to MSA. In particular, the CBM algorithm is a dynamic programming algorithm, applying on autosimilarity matr… ▽ More Music Structure Analysis (MSA) consists of representing a song in sections (such as ``chorus'', ``verse'', ``solo'' etc), and can be seen as the retrieval of a simplified organization of the song. This work presents a new algorithm, called Convolutive Block-Matching (CBM) algorithm, devoted to MSA. In particular, the CBM algorithm is a dynamic programming algorithm, applying on autosimilarity matrices, a standard tool in MSA. In this work, autosimilarity matrices are computed from the feature representation of an audio signal, and time is sampled on the barscale. We study three different similarity functions for the computation of autosimilarity matrices. We report that the proposed algorithm achieves a level of performance competitive to that of supervised State-of-the-Art methods on 3 among 4 metrics, while being unsupervised. △ Less

Submitted 26 September, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: 4 pages, 7 figures. Accepted for publication at WASPAA 2023. The associated toolbox is available at https://gitlab.inria.fr/amarmore/autosimilarity_segmentation/-/tree/WASPAA23

ACM Class: H.5.5

arXiv:2202.04989 [pdf, other]

Semi-Supervised Convolutive NMF for Automatic Piano Transcription

Authors: Haoran Wu, Axel Marmoret, Jérémy E. Cohen

Abstract: Automatic Music Transcription, which consists in transforming an audio recording of a musical performance into symbolic format, remains a difficult Music Information Retrieval task. In this work, which focuses on piano transcription, we propose a semi-supervised approach using low-rank matrix factorization techniques, in particular Convolutive Nonnegative Matrix Factorization. In the semi-supervis… ▽ More Automatic Music Transcription, which consists in transforming an audio recording of a musical performance into symbolic format, remains a difficult Music Information Retrieval task. In this work, which focuses on piano transcription, we propose a semi-supervised approach using low-rank matrix factorization techniques, in particular Convolutive Nonnegative Matrix Factorization. In the semi-supervised setting, only a single recording of each individual notes is required. We show on the MAPS dataset that the proposed semi-supervised CNMF method performs better than state-of-the-art low-rank factorization techniques and a little worse than supervised deep learning state-of-the-art methods, while however suffering from generalization issues. △ Less

Submitted 14 April, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: Published at the 2022 Sound and Music Computing (SMC) conference, 7 pages, 5 figures, 3 tables, code available at https://github.com/cohenjer/TransSSCNMF

ACM Class: H.5.5

arXiv:2202.04981 [pdf, other]

Barwise Compression Schemes for Audio-Based Music Structure Analysis

Authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot

Abstract: Music Structure Analysis (MSA) consists in segmenting a music piece in several distinct sections. We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song. More specifically, under the hypothesis that MSA is correlated with similarities occurring at the bar scale, this article… ▽ More Music Structure Analysis (MSA) consists in segmenting a music piece in several distinct sections. We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song. More specifically, under the hypothesis that MSA is correlated with similarities occurring at the bar scale, this article introduces the use of linear and non-linear compression schemes on barwise audio signals. Compressed representations capture the most salient components of the different bars in the song and are then used to infer the song structure using a dynamic programming algorithm. This work explores both low-rank approximation models such as Principal Component Analysis or Nonnegative Matrix Factorization and "piece-specific" Auto-Encoding Neural Networks, with the objective to learn latent representations specific to a given song. Such approaches do not rely on supervision nor annotations, which are well-known to be tedious to collect and possibly ambiguous in MSA description. In our experiments, several unsupervised compression schemes achieve a level of performance comparable to that of state-of-the-art supervised methods (for 3s tolerance) on the RWC-Pop dataset, showcasing the importance of the barwise compression processing for MSA. △ Less

Submitted 15 April, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: Published at the 2022 Sound and Music Computing (SMC) conference, 8 pages, 6 figures, 1 table, code available at https://gitlab.inria.fr/amarmore/barwisemusiccompression. arXiv admin note: substantial text overlap with arXiv:2110.14437

ACM Class: H.5.5

arXiv:2110.14437 [pdf, other]

Exploring single-song autoencoding schemes for audio-based music structure analysis

Authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot

Abstract: The ability of deep neural networks to learn complex data relations and representations is established nowadays, but it generally relies on large sets of training data. This work explores a "piece-specific" autoencoding scheme, in which a low-dimensional autoencoder is trained to learn a latent/compressed representation specific to a given song, which can then be used to infer the song structure.… ▽ More The ability of deep neural networks to learn complex data relations and representations is established nowadays, but it generally relies on large sets of training data. This work explores a "piece-specific" autoencoding scheme, in which a low-dimensional autoencoder is trained to learn a latent/compressed representation specific to a given song, which can then be used to infer the song structure. Such a model does not rely on supervision nor annotations, which are well-known to be tedious to collect and often ambiguous in Music Structure Analysis. We report that the proposed unsupervised auto-encoding scheme achieves the level of performance of supervised state-of-the-art methods with 3 seconds tolerance when using a Log Mel spectrogram representation on the RWC-Pop dataset. △ Less

Submitted 7 March, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: 4 pages, 4 figures, 2 tables. Rejected from ICASSP 2022, an extended version is available at arXiv:2202.04981

ACM Class: H.5.5

arXiv:2110.14434 [pdf, ps, other]

Nonnegative Tucker Decomposition with Beta-divergence for Music Structure Analysis of Audio Signals

Authors: Axel Marmoret, Florian Voorwinden, Valentin Leplat, Jérémy E. Cohen, Frédéric Bimbot

Abstract: Nonnegative Tucker decomposition (NTD), a tensor decomposition model, has received increased interest in the recent years because of its ability to blindly extract meaningful patterns, in particular in Music Information Retrieval. Nevertheless, existing algorithms to compute NTD are mostly designed for the Euclidean loss. This work proposes a multiplicative updates algorithm to compute NTD with th… ▽ More Nonnegative Tucker decomposition (NTD), a tensor decomposition model, has received increased interest in the recent years because of its ability to blindly extract meaningful patterns, in particular in Music Information Retrieval. Nevertheless, existing algorithms to compute NTD are mostly designed for the Euclidean loss. This work proposes a multiplicative updates algorithm to compute NTD with the beta-divergence loss, often considered a better loss for audio processing. We notably show how to implement efficiently the multiplicative rules using tensor algebra. Finally, we show on a music structure analysis task that unsupervised NTD fitted with beta-divergence loss outperforms earlier results obtained with the Euclidean loss. △ Less

Submitted 2 August, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: 4 pages, 2 figures, 1 table, 1 algorithm. To be published in GRETSI2022. The algorithm is available at https://gitlab.inria.fr/amarmore/nonnegative-factorization

MSC Class: 15-04 ACM Class: G.1.6; H.5.5

arXiv:2104.08580 [pdf, other]

Uncovering audio patterns in music with Nonnegative Tucker Decomposition for structural segmentation

Authors: Axel Marmoret, Jérémy E. Cohen, Nancy Bertin, Frédéric Bimbot

Abstract: Recent work has proposed the use of tensor decomposition to model repetitions and to separate tracks in loop-based electronic music. The present work investigates further on the ability of Nonnegative Tucker Decompositon (NTD) to uncover musical patterns and structure in pop songs in their audio form. Exploiting the fact that NTD tends to express the content of bars as linear combinations of a few… ▽ More Recent work has proposed the use of tensor decomposition to model repetitions and to separate tracks in loop-based electronic music. The present work investigates further on the ability of Nonnegative Tucker Decompositon (NTD) to uncover musical patterns and structure in pop songs in their audio form. Exploiting the fact that NTD tends to express the content of bars as linear combinations of a few patterns, we illustrate the ability of the decomposition to capture and single out repeated motifs in the corresponding compressed space, which can be interpreted from a musical viewpoint. The resulting features also turn out to be efficient for structural segmentation, leading to experimental results on the RWC Pop data set which are potentially challenging state-of-the-art approaches that rely on extensive example-based learning schemes. △ Less

Submitted 17 April, 2021; originally announced April 2021.

Comments: 7 pages, 6 figures; Code and experiments details available at https://gitlab.inria.fr/amarmore/musicntd/-/tree/0.1.0; Experiments details available at https://ax-le.github.io/resources/ISMIR2020/Notebooks_mainpage.html

Report number: ISBN: 978-0-9813537-0-8 ACM Class: H.5.5

Journal ref: 21st International Society for Music Information Retrieval Conference (ISMIR), Montréal, Canada, 2020, 788-794

arXiv:2102.01389 [pdf, other]

aura-net : robust segmentation of phase-contrast microscopy images with few annotations

Authors: Ethan Cohen, Virginie Uhlmann

Abstract: We present AURA-net, a convolutional neural network (CNN) for the segmentation of phase-contrast microscopy images. AURA-net uses transfer learning to accelerate training and Attention mechanisms to help the network focus on relevant image features. In this way, it can be trained efficiently with a very limited amount of annotations. Our network can thus be used to automate the segmentation of dat… ▽ More We present AURA-net, a convolutional neural network (CNN) for the segmentation of phase-contrast microscopy images. AURA-net uses transfer learning to accelerate training and Attention mechanisms to help the network focus on relevant image features. In this way, it can be trained efficiently with a very limited amount of annotations. Our network can thus be used to automate the segmentation of datasets that are generally considered too small for deep learning techniques. AURA-net also uses a loss inspired by active contours that is well-adapted to the specificity of phase-contrast images, further improving performance. We show that AURA-net outperforms state-of-the-art alternatives in several small (less than 100images) datasets. △ Less

Submitted 2 February, 2021; originally announced February 2021.

Comments: Accepted at ISBI 2021

arXiv:2011.08001 [pdf, other]

doi 10.1016/j.media.2021.102138

Deep-LIBRA: Artificial intelligence method for robust quantification of breast density with independent validation in breast cancer risk assessment

Authors: Omid Haji Maghsoudi, Aimilia Gastounioti, Christopher Scott, Lauren Pantalone, Fang-Fang Wu, Eric A. Cohen, Stacey Winham, Emily F. Conant, Celine Vachon, Despina Kontos

Abstract: Breast density is an important risk factor for breast cancer that also affects the specificity and sensitivity of screening mammography. Current federal legislation mandates reporting of breast density for all women undergoing breast screening. Clinically, breast density is assessed visually using the American College of Radiology Breast Imaging Reporting And Data System (BI-RADS) scale. Here, we… ▽ More Breast density is an important risk factor for breast cancer that also affects the specificity and sensitivity of screening mammography. Current federal legislation mandates reporting of breast density for all women undergoing breast screening. Clinically, breast density is assessed visually using the American College of Radiology Breast Imaging Reporting And Data System (BI-RADS) scale. Here, we introduce an artificial intelligence (AI) method to estimate breast percentage density (PD) from digital mammograms. Our method leverages deep learning (DL) using two convolutional neural network architectures to accurately segment the breast area. A machine-learning algorithm combining superpixel generation, texture feature analysis, and support vector machine is then applied to differentiate dense from non-dense tissue regions, from which PD is estimated. Our method has been trained and validated on a multi-ethnic, multi-institutional dataset of 15,661 images (4,437 women), and then tested on an independent dataset of 6,368 digital mammograms (1,702 women; cases=414) for both PD estimation and discrimination of breast cancer. On the independent dataset, PD estimates from Deep-LIBRA and an expert reader were strongly correlated (Spearman correlation coefficient = 0.90). Moreover, Deep-LIBRA yielded a higher breast cancer discrimination performance (area under the ROC curve, AUC = 0.611 [95% confidence interval (CI): 0.583, 0.639]) compared to four other widely-used research and commercial PD assessment methods (AUCs = 0.528 to 0.588). Our results suggest a strong agreement of PD estimates between Deep-LIBRA and gold-standard assessment by an expert reader, as well as improved performance in breast cancer risk assessment over state-of-the-art open-source and commercial methods. △ Less

Submitted 18 October, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

arXiv:2009.02264 [pdf, other]

Improving axial resolution in SIM using deep learning

Authors: Miguel Boland, Edward A. K. Cohen, Seth Flaxman, Mark A. A. Neil

Abstract: Structured Illumination Microscopy is a widespread methodology to image live and fixed biological structures smaller than the diffraction limits of conventional optical microscopy. Using recent advances in image up-scaling through deep learning models, we demonstrate a method to reconstruct 3D SIM image stacks with twice the axial resolution attainable through conventional SIM reconstructions. We… ▽ More Structured Illumination Microscopy is a widespread methodology to image live and fixed biological structures smaller than the diffraction limits of conventional optical microscopy. Using recent advances in image up-scaling through deep learning models, we demonstrate a method to reconstruct 3D SIM image stacks with twice the axial resolution attainable through conventional SIM reconstructions. We further evaluate our method for robustness to noise & generalisability to varying observed specimens, and discuss potential adaptions of the method to further improvements in resolution. △ Less

Submitted 18 February, 2021; v1 submitted 4 September, 2020; originally announced September 2020.

ACM Class: I.4.5; I.2.10

arXiv:2006.07553 [pdf, other]

Sparse Separable Nonnegative Matrix Factorization

Authors: Nicolas Nadisic, Arnaud Vandaele, Jeremy E. Cohen, Nicolas Gillis

Abstract: We propose a new variant of nonnegative matrix factorization (NMF), combining separability and sparsity assumptions. Separability requires that the columns of the first NMF factor are equal to columns of the input matrix, while sparsity requires that the columns of the second NMF factor are sparse. We call this variant sparse separable NMF (SSNMF), which we prove to be NP-complete, as opposed to s… ▽ More We propose a new variant of nonnegative matrix factorization (NMF), combining separability and sparsity assumptions. Separability requires that the columns of the first NMF factor are equal to columns of the input matrix, while sparsity requires that the columns of the second NMF factor are sparse. We call this variant sparse separable NMF (SSNMF), which we prove to be NP-complete, as opposed to separable NMF which can be solved in polynomial time. The main motivation to consider this new model is to handle underdetermined blind source separation problems, such as multispectral image unmixing. We introduce an algorithm to solve SSNMF, based on the successive nonnegative projection algorithm (SNPA, an effective algorithm for separable NMF), and an exact sparse nonnegative least squares solver. We prove that, in noiseless settings and under mild assumptions, our algorithm recovers the true underlying sources. This is illustrated by experiments on synthetic data sets and the unmixing of a multispectral image. △ Less

Submitted 12 June, 2020; originally announced June 2020.

Comments: 20 pages, accepted in ECML 2020

arXiv:1809.07824 [pdf, other]

Metric Learning for Phoneme Perception

Authors: Yair Lakretz, Gal Chechik, Evan-Gary Cohen, Alessandro Treves, Naama Friedmann

Abstract: Metric functions for phoneme perception capture the similarity structure among phonemes in a given language and therefore play a central role in phonology and psycho-linguistics. Various phenomena depend on phoneme similarity, such as spoken word recognition or serial recall from verbal working memory. This study presents a new framework for learning a metric function for perceptual distances amon… ▽ More Metric functions for phoneme perception capture the similarity structure among phonemes in a given language and therefore play a central role in phonology and psycho-linguistics. Various phenomena depend on phoneme similarity, such as spoken word recognition or serial recall from verbal working memory. This study presents a new framework for learning a metric function for perceptual distances among pairs of phonemes. Previous studies have proposed various metric functions, from simple measures counting the number of phonetic dimensions that two phonemes share (place-, manner-of-articulation and voicing), to more sophisticated ones such as deriving perceptual distances based on the number of natural classes that both phonemes belong to. However, previous studies have manually constructed the metric function, which may lead to unsatisfactory account of the empirical data. This study presents a framework to derive the metric function from behavioral data on phoneme perception using learning algorithms. We first show that this approach outperforms previous metrics suggested in the literature in predicting perceptual distances among phoneme pairs. We then study several metric functions derived by the learning algorithms and show how perceptual saliencies of phonological features can be derived from them. For English, we show that the derived perceptual saliencies are in accordance with a previously described order among phonological features and show how the framework extends the results to more features. Finally, we explore how the metric function and perceptual saliencies of phonological features may vary across languages. To this end, we compare results based on two English datasets and a new dataset that we have collected for Hebrew. △ Less

Submitted 20 September, 2018; originally announced September 2018.

Showing 1–15 of 15 results for author: Cohen, E