Showing 1–50 of 88 results for author: Widmer, G

Searching in archive eess.
  1. arXiv:2406.15897  [pdf, other]

    eess.AS cs.LG cs.SD

    Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval

    Authors: Paul Primus, Gerhard Widmer

    Abstract: Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadat…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: EUSIPCO 2024

  2. arXiv:2406.14850  [pdf, other]

    eess.AS

    DExter: Learning and Controlling Performance Expression with Diffusion Models

    Authors: Huan Zhang, Shreyan Chowdhury, Carlos Eduardo Cancino-Chacón, Jinhua Liang, Simon Dixon, Gerhard Widmer

    Abstract: In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. In this approach, performance parameters are represented in a continuous expression space and a diffusion model is trained to predict these continuous parameters while b…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: in submission to appsci special session

  3. arXiv:2406.08454  [pdf, other]

    cs.SD eess.AS

    Towards Musically Informed Evaluation of Piano Transcription Models

    Authors: Patricia Hu, Lukáš Samuel Marták, Carlos Cancino-Chacón, Gerhard Widmer

    Abstract: Automatic piano transcription models are typically evaluated using simple frame- or note-wise information retrieval (IR) metrics. Such benchmark metrics do not provide insights into the transcription quality of specific musical aspects such as articulation, dynamics, or rhythmic precision of the output, which are essential in the context of expressive performance analysis. Furthermore, in recent y…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2405.10018  [pdf, other]

    eess.AS cs.SD

    Data-Efficient Low-Complexity Acoustic Scene Classification in the DCASE 2024 Challenge

    Authors: Florian Schmid, Paul Primus, Toni Heittola, Annamaria Mesaros, Irene Martín-Morató, Khaled Koutini, Gerhard Widmer

    Abstract: This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop da…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: Task Description Page: https://dcase.community/challenge2024/task-data-efficient-low-complexity-acoustic-scene-classification

  5. arXiv:2405.09241  [pdf, other]

    cs.SD eess.AS

    SMUG-Explain: A Framework for Symbolic Music Graph Explanations

    Authors: Emmanouil Karystinaios, Francesco Foscarin, Gerhard Widmer

    Abstract: In this work, we present Score MUsic Graph (SMUG)-Explain, a framework for generating and visualizing explanations of graph neural networks applied to arbitrary prediction tasks on musical scores. Our system allows the user to visualize the contribution of input notes (and note features) to the network output, directly in the context of the musical score. We provide an interactive interface based…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: In Proceedings of the Sound and Music Computing Conference 2024 (SMC2024), Porto, Portugal

  6. arXiv:2405.09224  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Perception-Inspired Graph Convolution for Music Understanding Tasks

    Authors: Emmanouil Karystinaios, Francesco Foscarin, Gerhard Widmer

    Abstract: We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different musical understanding proble…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: Accepted at the 33rd International Joint Conference on Artificial Intelligence (IJCAI-24)

  7. arXiv:2401.14826  [pdf, other]

    cs.SD cs.IR eess.AS

    Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings

    Authors: Shreyan Chowdhury, Gerhard Widmer

    Abstract: This paper explores a specific sub-task of cross-modal music retrieval. We consider the delicate task of retrieving a performance or rendition of a musical piece based on a description of its style, expressive character, or emotion from a set of different performances of the same piece. We observe that a general purpose cross-modal system trained to learn a common text-audio embedding space does n…

    Submitted 26 January, 2024; originally announced January 2024.

    Comments: Presented at FIRE 2023 (Forum for Information Retrieval Evaluation) conference, Goa, India

  8. Sounding Out Reconstruction Error-Based Evaluation of Generative Models of Expressive Performance

    Authors: Silvan David Peter, Carlos Eduardo Cancino-Chacón, Emmanouil Karystinaios, Gerhard Widmer

    Abstract: Generative models of expressive piano performance are usually assessed by comparing their predictions to a reference human performance. A generative algorithm is taken to be better than competing ones if it produces performances that are closer to a human reference performance. However, expert human performers can (and do) interpret music in different ways, making for different possible references…

    Submitted 31 December, 2023; originally announced January 2024.

    Journal ref: 10th International Conference on Digital Libraries for Musicology, November 10, 2023, Milan, Italy

  9. arXiv:2310.15648  [pdf, other]

    cs.SD cs.LG eess.AS

    Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models

    Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popula…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. Source Code available at: https://github.com/fschmid56/EfficientAT

  10. arXiv:2310.14952  [pdf, other]

    cs.SD eess.AS

    8+8=4: Formalizing Time Units to Handle Symbolic Music Durations

    Authors: Emmanouil Karystinaios, Francesco Foscarin, Florent Jacquemard, Masahiko Sakai, Satoshi Tojo, Gerhard Widmer

    Abstract: This paper focuses on the nominal durations of musical events (notes and rests) in a symbolic musical score, and on how to conveniently handle these in computer applications. We propose the usage of a temporal unit that is directly related to the graphical symbols in musical scores and pair this with a set of operations that cover typical computations in music applications. We formalize this time…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: In Proceedings of the International Symposium on Computer Music Multidisciplinary Research (CMMR 2023), Tokyo, Japan

  11. arXiv:2309.12158  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Towards Robust and Truly Large-Scale Audio-Sheet Music Retrieval

    Authors: Luis Carvalho, Gerhard Widmer

    Abstract: A range of applications of multi-modal music information retrieval is centred around the problem of connecting large collections of sheet music (images) to corresponding audio recordings, that is, identifying pairs of audio and score excerpts that refer to the same musical content. One of the typical and most recent approaches to this task employs cross-modal deep learning architectures to learn j…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: Proceedings of the IEEE 6th International Conference on Multimedia Information Processing and Retrieval (MIPR)

  12. arXiv:2309.12134  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

    Authors: Luis Carvalho, Tobias Washüttl, Gerhard Widmer

    Abstract: Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the…

    Submitted 21 September, 2023; originally announced September 2023.

    Journal ref: Proceedings of the 14th ACM Multimedia Systems Conference (MMSys '23), June 7-10, 2023, Vancouver, BC, Canada

  13. arXiv:2309.12111  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval

    Authors: Luis Carvalho, Gerhard Widmer

    Abstract: Many applications of cross-modal music retrieval are related to connecting sheet music images to audio recordings. A typical and recent approach to this is to learn, via deep neural networks, a joint embedding space that correlates short fixed-size snippets of audio and sheet music by means of an appropriate similarity structure. However, two challenges that arise out of this strategy are the requ…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: In Proceedings of the 24th Conference of the International Society for Music Information Retrieval (ISMIR 2023), Milan, Italy

  14. arXiv:2309.02567  [pdf, other]

    eess.AS cs.MM cs.SD

    Symbolic Music Representations for Classification Tasks: A Systematic Evaluation

    Authors: Huan Zhang, Emmanouil Karystinaios, Simon Dixon, Gerhard Widmer, Carlos Eduardo Cancino-Chacón

    Abstract: Music Information Retrieval (MIR) has seen a recent surge in deep learning-based approaches, which often involve encoding symbolic music (i.e., music represented in terms of discrete note events) in an image-like or language-like fashion. However, symbolic music is neither an image nor a sentence, and research in the symbolic domain lacks a comprehensive overview of the different available represe…

    Submitted 10 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: To be published in the Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy

    Journal ref: Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy

  15. arXiv:2309.02399  [pdf, other]

    cs.SD cs.DL eess.AS

    The Batik-plays-Mozart Corpus: Linking Performance to Score to Musicological Annotations

    Authors: Patricia Hu, Gerhard Widmer

    Abstract: We present the Batik-plays-Mozart Corpus, a piano performance dataset combining professional Mozart piano sonata performances with expert-labelled scores at a note-precise level. The performances originate from a recording by Viennese pianist Roland Batik on a computer-monitored Bösendorfer grand piano, and are available both as MIDI files and audio recordings. They have been precisely aligned, no…

    Submitted 6 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: To be published in the Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy

  16. arXiv:2308.09454  [pdf, other]

    cs.SD cs.CL eess.AS

    Exploring Sampling Techniques for Generating Melodies with a Transformer Language Model

    Authors: Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer

    Abstract: Research in natural language processing has demonstrated that the quality of generations from trained autoregressive language models is significantly influenced by the used sampling strategy. In this study, we investigate the impact of different sampling techniques on musical qualities such as diversity and structure. To accomplish this, we train a high-capacity transformer model on a vast collect…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: 7 pages, 5 figures, 1 table, accepted at the 24th Int. Society for Music Information Retrieval Conf., Milan, Italy, 2023

  17. arXiv:2308.04258  [pdf, other]

    eess.AS cs.IR cs.LG cs.SD

    Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

    Authors: Paul Primus, Khaled Koutini, Gerhard Widmer

    Abstract: This work presents a text-to-audio-retrieval system based on pre-trained text and spectrogram transformers. Our method projects recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. Through a systematic analysis, we examine how each component of the system influences retrieval performance. As a result, we identify two k…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: submitted to DCASE Workshop 2023

  18. arXiv:2306.16955  [pdf, other]

    cs.SD cs.CL eess.AS

    Predicting Music Hierarchies with a Graph-Based Neural Decoder

    Authors: Francesco Foscarin, Daniel Harasim, Gerhard Widmer

    Abstract: This paper describes a data-driven framework to parse musical sequences into dependency trees, which are hierarchical structures used in music cognition research and music analysis. The parsing involves two steps. First, the input sequence is passed through a transformer encoder to enrich it with contextual information. Then, a classifier filters the graph of all possible dependency arcs to produc…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: To be published in the Proceedings of the International Society for Music Information Retrieval Conference (ISMIR)

  19. arXiv:2306.11764  [pdf, other]

    eess.AS cs.LG cs.SD

    On Frequency-Wise Normalizations for Better Recording Device Generalization in Audio Spectrogram Transformers

    Authors: Paul Primus, Gerhard Widmer

    Abstract: Varying conditions between the data seen at training and at application time remain a major challenge for machine learning. We study this problem in the context of Acoustic Scene Classification (ASC) with mismatching recording devices. Previous works successfully employed frequency-wise normalization of inputs and hidden layer activations in convolutional neural networks to reduce the recording de…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: EUSIPCO 2023

  20. arXiv:2306.08010  [pdf, other]

    cs.SD cs.AI eess.AS

    Domain Information Control at Inference Time for Acoustic Scene Classification

    Authors: Shahed Masoudian, Khaled Koutini, Markus Schedl, Gerhard Widmer, Navid Rekabsaz

    Abstract: Domain shift is considered a challenge in machine learning as it causes significant degradation of model performance. In the Acoustic Scene Classification task (ASC), domain shift is mainly caused by different recording devices. Several studies have already targeted domain generalization to improve the performance of ASC models on unseen domains, such as new devices. Recently, the Controllable Gat…

    Submitted 13 June, 2023; originally announced June 2023.

  21. arXiv:2305.09489  [pdf, other]

    cs.SD cs.AI eess.AS

    Discrete Diffusion Probabilistic Models for Symbolic Music Generation

    Authors: Matthias Plasser, Silvan Peter, Gerhard Widmer

    Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have made great strides in generating high-quality samples in both discrete and continuous domains. However, Discrete DDPMs (D3PMs) have yet to be applied to the domain of Symbolic Music. This work presents the direct generation of Polyphonic Symbolic Music using D3PMs. Our model exhibits state-of-the-art sample quality, according to current quantit…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI-23), Macau, China

  22. arXiv:2305.07499  [pdf, other]

    cs.SD cs.LG eess.AS

    Device-Robust Acoustic Scene Classification via Impulse Response Augmentation

    Authors: Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely wh…

    Submitted 27 June, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: In Proceedings of the 31st European Signal Processing Conference, EUSIPCO 2023. Source Code available at: https://github.com/theMoro/DIRAugmentation/

  23. arXiv:2304.14848  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Musical Voice Separation as Link Prediction: Modeling a Musical Perception Task as a Multi-Trajectory Tracking Problem

    Authors: Emmanouil Karystinaios, Francesco Foscarin, Gerhard Widmer

    Abstract: This paper targets the perceptual task of separating the different interacting voices, i.e., monophonic melodic streams, in a polyphonic musical piece. We target symbolic music, where notes are explicitly encoded, and model this task as a Multi-Trajectory Tracking (MTT) problem from discrete observations, i.e., notes in a pitch-time space. Our approach builds a graph from a musical piece, by creat…

    Submitted 28 April, 2023; originally announced April 2023.

    Comments: Accepted at the 32nd International Joint Conference on Artificial Intelligence (IJCAI-23)

  24. arXiv:2304.12939  [pdf, other]

    cs.SD cs.HC eess.AS

    The ACCompanion: Combining Reactivity, Robustness, and Musical Expressivity in an Automatic Piano Accompanist

    Authors: Carlos Cancino-Chacón, Silvan Peter, Patricia Hu, Emmanouil Karystinaios, Florian Henkel, Francesco Foscarin, Nimrod Varga, Gerhard Widmer

    Abstract: This paper introduces the ACCompanion, an expressive accompaniment system. Similarly to a musician who accompanies a soloist playing a given musical piece, our system can produce a human-like rendition of the accompaniment part that follows the soloist's choices in terms of tempo, dynamics, and articulation. The ACCompanion works in the symbolic domain, i.e., it needs a musical instrument capable…

    Submitted 30 May, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, China. The differences/extensions with the previous version include a technical appendix, added missing links, and minor text updates. 10 pages, 4 figures

  25. arXiv:2303.01879  [pdf, other]

    cs.SD eess.AS

    Low-Complexity Audio Embedding Extractors

    Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: Solving tasks such as speaker recognition, music classification, or semantic audio event tagging with deep learning models typically requires computationally demanding networks. General-purpose audio embeddings (GPAEs) are dense representations of audio signals that allow lightweight, shallow classifiers to tackle various audio tasks. The idea is that a single complex feature extractor would extra…

    Submitted 23 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: In Proceedings of the 31st European Signal Processing Conference, EUSIPCO 2023. Source Code available at: https://github.com/fschmid56/EfficientAT_HEAR

  26. arXiv:2303.01875  [pdf, other]

    cs.SD eess.AS

    Decoding and Visualising Intended Emotion in an Expressive Piano Performance

    Authors: Shreyan Chowdhury, Gerhard Widmer

    Abstract: Expert musicians can mould a musical piece to convey specific emotions that they intend to communicate. In this paper, we place a mid-level features based music emotion model in this performer-to-listener communication scenario, and demonstrate via a small visualisation music emotion decoding in real time. We also extend the existing set of mid-level features using analogues of perceptual speed an…

    Submitted 3 March, 2023; originally announced March 2023.

    Comments: Extended version of Late-Breaking Demo Session paper accepted at ISMIR 2022 (23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022)

  27. arXiv:2211.15524  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Differentiable Dictionary Search: Integrating Linear Mixing with Deep Non-Linear Modelling for Audio Source Separation

    Authors: Lukáš Samuel Marták, Rainer Kelz, Gerhard Widmer

    Abstract: This paper describes several improvements to a new method for signal decomposition that we recently formulated under the name of Differentiable Dictionary Search (DDS). The fundamental idea of DDS is to exploit a class of powerful deep invertible density estimators called normalizing flows, to model the dictionary in a linear decomposition method such as NMF, effectively creating a bijection betwe…

    Submitted 28 November, 2022; originally announced November 2022.

    Comments: Published in the Proceedings of the 24th International Congress on Acoustics (ICA 2022), Gyeongju, Korea, October 24-28, 2022

  28. Probabilistic Modelling of Signal Mixtures with Differentiable Dictionaries

    Authors: Lukáš Samuel Marták, Rainer Kelz, Gerhard Widmer

    Abstract: We introduce a novel way to incorporate prior information into (semi-) supervised non-negative matrix factorization, which we call differentiable dictionary search. It enables general, highly flexible and principled modelling of mixtures where non-linear sources are linearly mixed. We study its behavior on an audio decomposition task, and conduct an extensive, highly controlled study of its modell…

    Submitted 28 November, 2022; originally announced November 2022.

    Comments: Published in the Proceedings of the 29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland, August 23-27, 2021 (IEEE), 441-445

  29. arXiv:2211.13956  [pdf, other]

    cs.SD cs.LG eess.AS

    Learning General Audio Representations with Large-Scale Training of Patchout Audio Transformers

    Authors: Khaled Koutini, Shahed Masoudian, Florian Schmid, Hamid Eghbal-zadeh, Jan Schlüter, Gerhard Widmer

    Abstract: The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features, and learning a representation that can generalize onto unseen tasks and datasets that are from the same domain. Hence, these models can be used as powerful feature ex…

    Submitted 2 March, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: will appear in HEAR: Holistic Evaluation of Audio Representations, Proceedings of Machine Learning Research PMLR 166. Source code: https://github.com/kkoutini/passt_hear21

    Journal ref: Proceedings of Machine Learning Research v166 (2022) 65-89

  30. arXiv:2211.04772  [pdf, other]

    cs.SD cs.LG eess.AS

    Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

    Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer

    Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient…

    Submitted 23 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023. Source Code available at: https://github.com/fschmid56/EfficientAT

  31. arXiv:2208.14819  [pdf, other]

    cs.SD cs.LG eess.AS

    Cadence Detection in Symbolic Classical Music using Graph Neural Networks

    Authors: Emmanouil Karystinaios, Gerhard Widmer

    Abstract: Cadences are complex structures that have been driving music from the beginning of contrapuntal polyphony until today. Detecting such structures is vital for numerous MIR tasks such as musicological analysis, key detection, or music segmentation. However, automatic cadence detection remains challenging mainly because it involves a combination of high-level musical elements like harmony, voice lead…

    Submitted 31 August, 2022; originally announced August 2022.

    Comments: In proceedings of the International Society for Music Information Retrieval Conference 2022 (ISMIR)

  32. arXiv:2208.12485  [pdf, other]

    cs.SD cs.AI eess.AS

    Concept-Based Techniques for "Musicologist-friendly" Explanations in a Deep Music Classifier

    Authors: Francesco Foscarin, Katharina Hoedt, Verena Praher, Arthur Flexer, Gerhard Widmer

    Abstract: Current approaches for explaining deep learning systems applied to musical data provide results in a low-level feature space, e.g., by highlighting potentially relevant time-frequency bins in a spectrogram or time-pitch bins in a piano roll. This can be difficult to understand, particularly for musicologists without technical knowledge. To address this issue, we focus on more human-friendly explan…

    Submitted 29 August, 2022; v1 submitted 26 August, 2022; originally announced August 2022.

    Comments: In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), Bengaluru, India

  33. arXiv:2208.11460  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations

    Authors: Paul Primus, Gerhard Widmer

    Abstract: The absence of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses pre-trained embed…

    Submitted 29 October, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: accepted at DCASE Workshop 2022

  34. arXiv:2208.11402  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

    Authors: Paul Primus, Gerhard Widmer

    Abstract: Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the…

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: published in EUSIPCO 2022

  35. arXiv:2206.01104  [pdf, other]

    cs.SD cs.DL eess.AS

    The match file format: Encoding Alignments between Scores and Performances

    Authors: Francesco Foscarin, Emmanouil Karystinaios, Silvan David Peter, Carlos Cancino-Chacón, Maarten Grachten, Gerhard Widmer

    Abstract: This paper presents the specifications of match: a file format that extends a MIDI human performance with note-, beat-, and downbeat-level alignments to a corresponding musical score. This enables advanced analyses of the performance that are relevant for various tasks, such as expressive performance modeling, score following, music transcription, and performer classification. The match file inclu…

    Submitted 2 June, 2022; originally announced June 2022.

    Journal ref: Proceedings of the Music Encoding Conference (MEC), 2022, Halifax, Canada

  36. arXiv:2206.01071  [pdf, other]

    cs.SD cs.DL eess.AS

    Partitura: A Python Package for Symbolic Music Processing

    Authors: Carlos Cancino-Chacón, Silvan David Peter, Emmanouil Karystinaios, Francesco Foscarin, Maarten Grachten, Gerhard Widmer

    Abstract: Partitura is a lightweight Python package for handling symbolic musical information. It provides easy access to features commonly used in music information retrieval tasks, like note arrays (lists of timed pitched events) and 2D piano roll matrices, as well as other score elements such as time and key signatures, performance directives, and repeat structures. Partitura can load musical scores (in…

    Submitted 2 June, 2022; originally announced June 2022.

    Journal ref: Proceedings of the Music Encoding Conference (MEC), 2022, Halifax, Canada

  37. arXiv:2205.12032  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    Defending a Music Recommender Against Hubness-Based Adversarial Attacks

    Authors: Katharina Hoedt, Arthur Flexer, Gerhard Widmer

    Abstract: Adversarial attacks can drastically degrade performance of recommenders and other machine learning systems, resulting in an increased demand for defence mechanisms. We present a new line of defence against attacks which exploit a vulnerability of recommenders that operate in high dimensional data spaces (the so-called hubness problem). We use a global data scaling method, namely Mutual Proximity (…

    Submitted 24 May, 2022; originally announced May 2022.

    Comments: 6 pages, to be published in Proceedings of the 19th Sound and Music Computing Conference 2022 (SMC-22)

  38. arXiv:2111.06643  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Fully Automatic Page Turning on Real Scores

    Authors: Florian Henkel, Stephanie Schwaiger, Gerhard Widmer

    Abstract: We present a prototype of an automatic page turning system that works directly on real scores, i.e., sheet images, without any symbolic representation. Our system is based on a multi-modal neural network architecture that observes a complete sheet image page as input, listens to an incoming musical performance, and predicts the corresponding position in the image. Using the position estimation of…

    Submitted 12 November, 2021; originally announced November 2021.

    Comments: ISMIR 2021 Late Breaking/Demo

  39. Efficient Training of Audio Transformers with Patchout

    Authors: Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer

    Abstract: The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established C…

    Submitted 29 March, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Submitted to Interspeech 2022. Source code: https://github.com/kkoutini/PaSST

  40. arXiv:2110.02592  [pdf, other]

    eess.AS cs.SD

    Improving Real-time Score Following in Opera by Combining Music with Lyrics Tracking

    Authors: Charles Brazier, Gerhard Widmer

    Abstract: Fully automatic opera tracking is challenging because of the acoustic complexity of the genre, combining musical and linguistic information (singing, speech) in complex ways. In this paper, we propose a new pipeline for complete opera tracking. The pipeline is based on two trackers. A music tracker that has proven to be effective at tracking orchestral parts, will lead the tracking process. In add…

    Submitted 6 October, 2021; originally announced October 2021.

    Comments: 5 pages, In Proceedings of the 2nd Workshop on NLP for Music and Audio (NLP4MusA), Online, 2021

  41. arXiv:2107.14496  [pdf, other]

    eess.AS

    On-Line Audio-to-Lyrics Alignment Based on a Reference Performance

    Authors: Charles Brazier, Gerhard Widmer

    Abstract: Audio-to-lyrics alignment has become an increasingly active research task in MIR, supported by the emergence of several open-source datasets of audio recordings with word-level lyrics annotations. However, there are still a number of open problems, such as a lack of robustness in the face of severe duration mismatches between audio and lyrics representation; a certain degree of language-specificit…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: 8 pages, 1 figure, In Proceedings of the 22nd International Society for Music Information Retrieval (ISMIR) Conference, Online, 2021

  42. arXiv:2107.13231  [pdf, other]

    cs.SD eess.AS

    On Perceived Emotion in Expressive Piano Performance: Further Experimental Evidence for the Relevance of Mid-level Perceptual Features

    Authors: Shreyan Chowdhury, Gerhard Widmer

    Abstract: Despite recent advances in audio content-based music emotion recognition, a question that remains to be explored is whether an algorithm can reliably discern emotional or expressive qualities between different performances of the same piece. In the present work, we analyze several sets of features on their effectiveness in predicting arousal and valence of six different performances (by six famous…

    Submitted 28 July, 2021; originally announced July 2021.

    Comments: In Proceedings of the 22nd International Society for Music Information Retrieval (ISMIR) Conference, Online, 2021

  43. arXiv:2107.09045  [pdf, other]

    eess.AS cs.AI cs.SD

    On the Veracity of Local, Model-agnostic Explanations in Audio Classification: Targeted Investigations with Adversarial Examples

    Authors: Verena Praher, Katharina Prinz, Arthur Flexer, Gerhard Widmer

    Abstract: Local explanation methods such as LIME have become popular in MIR as tools for generating post-hoc, model-agnostic explanations of a model's classification decisions. The basic idea is to identify a small set of human-understandable features of the classified example that are most influential on the classifier's prediction. These are then presented as an explanation. Evaluation of such explanation…

    Submitted 6 September, 2021; v1 submitted 19 July, 2021; originally announced July 2021.

    Comments: 8 pages, 4 figures, to be published in Proceedings of the International Society for Music Information Retrieval Conference 2021 (ISMIR 2021)

  44. arXiv:2107.08933  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Over-Parameterization and Generalization in Audio Classification

    Authors: Khaled Koutini, Hamid Eghbal-zadeh, Florian Henkel, Jan Schlüter, Gerhard Widmer

    Abstract: Convolutional Neural Networks (CNNs) have been dominating classification tasks in various domains, such as machine vision, machine listening, and natural language processing. In machine listening, while generally exhibiting very good generalization capabilities, CNNs are sensitive to the specific audio recording device used, which has been recognized as a substantial problem in the acoustic scene…

    Submitted 19 July, 2021; originally announced July 2021.

    Comments: Presented at the ICML 2021 Workshop on Overparameterization: Pitfalls & Opportunities

  45. arXiv:2106.07787  [pdf, other]

    cs.SD cs.LG eess.AS

    Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities

    Authors: Shreyan Chowdhury, Verena Praher, Gerhard Widmer

    Abstract: Music emotion recognition is an important task in MIR (Music Information Retrieval) research. Owing to factors like the subjective nature of the task and the variation of emotional cues between musical genres, there are still significant challenges in developing reliable and generalizable models. One important step towards better models would be to understand what a model is actually learning from…

    Submitted 16 June, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: In Proceedings of the 18th Sound and Music Computing Conference (SMC 2021)

  46. arXiv:2105.12536  [pdf, other]

    eess.AS cs.SD

    Exploiting Temporal Dependencies for Cross-Modal Music Piece Identification

    Authors: Luis Carvalho, Gerhard Widmer

    Abstract: This paper addresses the problem of cross-modal musical piece identification and retrieval: finding the appropriate recording(s) from a database given a sheet music query, and vice versa, working directly with audio and scanned sheet music images. The fundamental approach to this is to learn a cross-modal embedding space with a suitable similarity structure for audio and sheet image snippets, usin…

    Submitted 26 May, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures

    Journal ref: Proceedings of the 29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland

  47. arXiv:2105.12395  [pdf, other]

    cs.SD cs.LG cs.NE eess.AS

    Receptive Field Regularization Techniques for Audio Classification and Tagging with Deep Convolutional Neural Networks

    Authors: Khaled Koutini, Hamid Eghbal-zadeh, Gerhard Widmer

    Abstract: In this paper, we study the performance of variants of well-known Convolutional Neural Network (CNN) architectures on different audio tasks. We show that tuning the Receptive Field (RF) of CNNs is crucial to their generalization. An insufficient RF limits the CNN's ability to fit the training data. In contrast, CNNs with an excessive RF tend to over-fit the training data and fail to generalize to…

    Submitted 26 May, 2021; originally announced May 2021.

    Comments: Accepted in IEEE/ACM Transactions on Audio, Speech, and Language Processing. Code available: https://github.com/kkoutini/cpjku_dcase20

  48. arXiv:2105.08531  [pdf, other]

    eess.AS cs.SD

    Handling Structural Mismatches in Real-time Opera Tracking

    Authors: Charles Brazier, Gerhard Widmer

    Abstract: Algorithms for reliable real-time score following in live opera promise a lot of useful applications such as automatic subtitles display, or real-time video cutting in live streaming. Until now, such systems were based on the strong assumption that an opera performance follows the structure of the score linearly. However, this is rarely the case in practice, because of different opera versions and…

    Submitted 18 May, 2021; originally announced May 2021.

    Comments: 5 pages, 1 figure, In Proceedings of the 29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland

  49. arXiv:2105.04309  [pdf, other]

    cs.SD cs.LG eess.AS

    Multi-modal Conditional Bounding Box Regression for Music Score Following

    Authors: Florian Henkel, Gerhard Widmer

    Abstract: This paper addresses the problem of sheet-image-based on-line audio-to-score alignment also known as score following. Drawing inspiration from object detection, a conditional neural network architecture is proposed that directly predicts x,y coordinates of the matching positions in a complete score sheet image at each point in time for a given musical performance. Experiments are conducted on a sy…

    Submitted 10 May, 2021; originally announced May 2021.

    Comments: Accepted for publication in the Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021

  50. arXiv:2102.13479  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Explaining Expressive Qualities in Piano Recordings: Transfer of Explanatory Features via Acoustic Domain Adaptation

    Authors: Shreyan Chowdhury, Gerhard Widmer

    Abstract: Emotion and expressivity in music have been topics of considerable interest in the field of music information retrieval. In recent years, mid-level perceptual features have been suggested as means to explain computational predictions of musical emotion. We find that the diversity of musical styles and genres in the available dataset for learning these features is not sufficient for models to gener…

    Submitted 26 February, 2021; originally announced February 2021.

    Comments: 5 pages, 3 figures; accepted for IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)