Showing 1–50 of 53 results for author: Herremans, D

  1. arXiv:2406.08820  [pdf, other]

    eess.AS cs.CL

    DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage

    Authors: Kyra Wang, Dorien Herremans

    Abstract: Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically-important paralanguage. Most speech datasets do not includ…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 4 pages, 1 figure, submitted to IEEE TENCON 2024

  2. arXiv:2406.08809  [pdf, other]

    cs.SD cs.AI eess.AS

    Are we there yet? A brief survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges

    Authors: Jaeyong Kang, Dorien Herremans

    Abstract: Deep learning models for music have advanced drastically in the last few years. But how good are machine learning models at capturing emotion these days and what challenges are researchers facing? In this paper, we provide a comprehensive overview of the available music-emotion datasets and discuss evaluation standards as well as competitions in the field. We also provide a brief overview of vario…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2406.02255  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    MidiCaps -- A large-scale MIDI dataset with text captions

    Authors: Jan Melechovsky, Abhinaba Roy, Dorien Herremans

    Abstract: Generative models guided by text prompts are increasingly becoming more popular. However, no text-to-MIDI models currently exist, mostly due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting the first large-scale MIDI dataset with text captions that is openly available: MidiCaps. MIDI (Musical Instrument Digital Interfac…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Under review

  4. arXiv:2406.01018  [pdf, other]

    eess.AS cs.LG cs.SD

    Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration while building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people of certain accent. We note that state-of-the-art Text-to-Speech (T…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Under review

  5. arXiv:2402.17467  [pdf, other]

    cs.IR cs.AI cs.SD eess.AS

    Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey

    Authors: Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller, Dorien Herremans

    Abstract: Several adaptations of Transformers models have been developed in various domains since its breakthrough in Natural Language Processing (NLP). This trend has spread into the field of Music Information Retrieval (MIR), including studies processing music data. However, the practice of leveraging NLP tools for symbolic music data is not novel in MIR. Music has been frequently compared to language, as…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 36 pages, 5 figures, 4 tables

  6. arXiv:2311.00968  [pdf, other]

    cs.SD cs.AI eess.AS

    Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

    Authors: Jaeyong Kang, Soujanya Poria, Dorien Herremans

    Abstract: Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that can match a provided video. We first curated a unique collection of music videos. Then, we analysed the music videos to obtain semantic, scene…

    Submitted 4 March, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

    Journal ref: Expert Systems with Applications 249 (2024): 123640

  7. arXiv:2306.13661  [pdf, other]

    q-fin.CP cs.LG q-fin.PM

    Constructing Time-Series Momentum Portfolios with Deep Multi-Task Learning

    Authors: Joel Ong, Dorien Herremans

    Abstract: A diversified risk-adjusted time-series momentum (TSMOM) portfolio can deliver substantial abnormal returns and offer some degree of tail risk protection during extreme market events. The performance of existing TSMOM strategies, however, relies not only on the quality of the momentum signal but also on the efficacy of the volatility estimator. Yet many of the existing studies have always consider…

    Submitted 8 June, 2023; originally announced June 2023.

    Journal ref: Expert Systems with Applications Volume 230, 15 November 2023, 120587

  8. arXiv:2302.00286  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Ju-Chiang Wang, Yun-Ning Hung, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilize…

    Submitted 1 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: arXiv admin note: text overlap with arXiv:2206.10805

  9. arXiv:2212.00973  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    A Domain-Knowledge-Inspired Music Embedding Space and a Novel Attention Mechanism for Symbolic Music Modeling

    Authors: Z. Guo, J. Kang, D. Herremans

    Abstract: Following the success of the transformer architecture in the natural language domain, transformer-like architectures have been widely applied to the domain of symbolic music recently. Symbolic music and text, however, are two different modalities. Symbolic music contains multiple attributes, both absolute attributes (e.g., pitch) and relative attributes (e.g., pitch interval). These relative attri…

    Submitted 2 December, 2022; originally announced December 2022.

    Comments: This paper is accepted at AAAI 2023

  10. arXiv:2211.08281  [pdf, other]

    q-fin.TR cs.AI cs.LG q-fin.CP q-fin.PM

    Forecasting Bitcoin volatility spikes from whale transactions and CryptoQuant data using Synthesizer Transformer models

    Authors: Dorien Herremans, Kah Wee Low

    Abstract: The cryptocurrency market is highly volatile compared to traditional financial markets. Hence, forecasting its volatility is crucial for risk management. In this paper, we investigate CryptoQuant data (e.g. on-chain analytics, exchange and miner data) and whale-alert tweets, and explore their relationship to Bitcoin's next-day volatility, with a focus on extreme volatility spikes. We propose a dee…

    Submitted 6 October, 2022; originally announced November 2022.

    Comments: Co-first authors

  11. arXiv:2211.07283  [pdf, other]

    eess.AS cs.SD

    SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

    Authors: Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

    Abstract: Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Thus, we propose training TTS models using decaying sparsity, i.e. a high initial sparsity to acc…

    Submitted 1 June, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

  12. arXiv:2211.03316  [pdf, other]

    eess.AS cs.LG cs.SD

    Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

    Authors: Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

    Abstract: Accent plays a significant role in speech communication, influencing one's capability to understand as well as conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. It has the ability to synthesize a selected speaker's voice, which is converted to any desired target accent. Ou…

    Submitted 3 June, 2024; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: preprint submitted to a conference, under review

  13. arXiv:2210.05148  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

    Authors: Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji

    Abstract: In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic looking piano rolls from pure Gaussian noise conditioned on spectrograms.…

    Submitted 20 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

    Journal ref: Proceedings of ICASSP - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023

  14. arXiv:2206.10805  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utiliz…

    Submitted 28 June, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: Submitted to ISMIR

  15. arXiv:2206.00648  [pdf, other]

    q-fin.ST cs.CL cs.LG q-fin.CP q-fin.TR

    PreBit -- A multimodal model with Twitter FinBERT embeddings for extreme price movement prediction of Bitcoin

    Authors: Yanzhao Zou, Dorien Herremans

    Abstract: Bitcoin, with its ever-growing popularity, has demonstrated extreme price volatility since its origin. This volatility, together with its decentralised nature, make Bitcoin highly subjective to speculative trading as compared to more traditional assets. In this paper, we propose a multimodal model for predicting extreme price fluctuations. This model takes as input a variety of correlated assets,…

    Submitted 21 October, 2023; v1 submitted 30 May, 2022; originally announced June 2022.

    Comments: 21 pages, submitted preprint to Elsevier Expert Systems with Applications

    Journal ref: Expert Systems with Applications, 233, 120838 (2023)

  16. arXiv:2204.11437  [pdf, other]

    cs.SD eess.AS eess.SP

    Understanding Audio Features via Trainable Basis Functions

    Authors: Kwan Yee Heung, Kin Wai Cheuk, Dorien Herremans

    Abstract: In this paper we explore the possibility of maximizing the information represented in spectrograms by making the spectrogram basis functions trainable. We experiment with two different tasks, namely keyword spotting (KWS) and automatic speech recognition (ASR). For most neural network models, the architecture and hyperparameters are typically fine-tuned and optimized in experiments. Input features…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: under review in Interspeech 2022

  17. arXiv:2203.03022  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS stat.ML

    HEAR: Holistic Evaluation of Audio Representations

    Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

    Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, in…

    Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

  18. arXiv:2202.10453  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses

    Authors: Phoebe Chua, Dimos Makris, Dorien Herremans, Gemma Roig, Kat Agres

    Abstract: Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived e…

    Submitted 19 February, 2022; originally announced February 2022.

    Comments: 16 pages with 9 figures

  19. arXiv:2202.05528  [pdf, other]

    cs.AI cs.MM

    MusIAC: An extensible generative framework for Music Infilling Applications with multi-level Control

    Authors: Rui Guo, Ivor Simpson, Chris Kiefer, Thor Magnusson, Dorien Herremans

    Abstract: We present a novel music generation framework for music infilling, with a user friendly interface. Infilling refers to the task of generating musical sections given the surrounding multi-track music. The proposed transformer-based framework is extensible for new control tokens as the added music control tokens such as tonal tension per bar and track polyphony level in this work. We explore the eff…

    Submitted 11 February, 2022; originally announced February 2022.

    Comments: preprint for The 11th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART) 2022

  20. arXiv:2202.04464  [pdf, other]

    cs.SD cs.LG eess.AS

    Conditional Drums Generation using Compound Word Representations

    Authors: Dimos Makris, Guo Zixun, Maximos Kaliakatsos-Papakostas, Dorien Herremans

    Abstract: The field of automatic music composition has seen great progress in recent years, specifically with the invention of transformer-based architectures. When using any deep learning model which considers music as a sequence of events with multiple complex dependencies, the selection of a proper data representation is crucial. In this paper, we tackle the task of conditional drums generation using a n…

    Submitted 21 February, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: Accepted for the 11th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART), 2022

  21. aiSTROM -- A roadmap for developing a successful AI strategy

    Authors: Dorien Herremans

    Abstract: A total of 34% of AI research and development projects fail or are abandoned, according to a recent survey by Rackspace Technology of 1,870 companies. We propose a new strategic framework, aiSTROM, that empowers managers to create a successful AI strategy based on a thorough literature review. This provides a unique and integrated approach that guides managers and lead developers through the vari…

    Submitted 15 November, 2021; v1 submitted 25 June, 2021; originally announced July 2021.

    MSC Class: 68Txx; 97Pxx ACM Class: K.5; K.6; C.5; D.m; H.2; K.7

    Journal ref: IEEE Access, 2021

  22. arXiv:2107.04954  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS

    ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data

    Authors: Kin Wai Cheuk, Dorien Herremans, Li Su

    Abstract: Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not presented in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlab…

    Submitted 29 July, 2021; v1 submitted 10 July, 2021; originally announced July 2021.

    Comments: Accepted in ACMMM 21. Camera ready version

  23. arXiv:2106.12174  [pdf, other]

    cs.LG cs.MM cs.SD eess.AS

    Deep Neural Network Based Respiratory Pathology Classification Using Cough Sounds

    Authors: Balamurali B T, Hwan Ing Hee, Saumitra Kapoor, Oon Hoe Teoh, Sung Shin Teng, Khai Pin Lee, Dorien Herremans, Jer Ming Chen

    Abstract: Intelligent systems are transforming the world, as well as our healthcare system. We propose a deep learning-based cough sound classification model that can distinguish between children with healthy versus pathological coughs such as asthma, upper respiratory tract infection (URTI), and lower respiratory tract infection (LRTI). In order to train a deep neural network model, we collected a new data…

    Submitted 23 June, 2021; originally announced June 2021.

    MSC Class: 62-XX; 92-XX; 68Txx; ACM Class: J.3; I.2

  24. arXiv:2104.13056  [pdf, other]

    cs.SD cs.LG eess.AS

    Generating Lead Sheets with Affect: A Novel Conditional seq2seq Framework

    Authors: Dimos Makris, Kat R. Agres, Dorien Herremans

    Abstract: The field of automatic music composition has seen great progress in the last few years, much of which can be attributed to advances in deep neural networks. There are numerous studies that present different strategies for generating sheet music from scratch. The inclusion of high-level musical characteristics (e.g., perceived emotional qualities), however, as conditions for controlling the generat…

    Submitted 27 April, 2021; originally announced April 2021.

    Comments: Accepted for the International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18-22 July 2021 (virtual)

  25. arXiv:2104.06607  [pdf, other]

    cs.SD eess.AS

    Revisiting the Onsets and Frames Model with Additive Attention

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

    Abstract: Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on the leverage of deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination o…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted in IJCNN 2021 Special Session S04. https://dr-costas.github.io/rlasmp2021-website/

  26. arXiv:2102.13397  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Underwater Acoustic Communication Receiver Using Deep Belief Network

    Authors: Abigail Lee-Leon, Chau Yuen, Dorien Herremans

    Abstract: Underwater environments create a challenging channel for communications. In this paper, we design a novel receiver system by exploring the machine learning technique--Deep Belief Network (DBN)-- to combat the signal distortion caused by the Doppler effect and multi-path propagation. We evaluate the performance of the proposed receiver system in both simulation experiments and sea trials. Our propo…

    Submitted 26 February, 2021; originally announced February 2021.

  27. arXiv:2010.11188  [pdf]

    cs.SD cs.CV eess.AS

    AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies

    Authors: Ha Thi Phuong Thao, Balamurali B. T., Dorien Herremans, Gemma Roig

    Abstract: In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying self-attention mechanism in a novel manner into the extracted features for emotion prediction. We compare it to the typically temporal integrati…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: 8 pages, 6 figures

    Journal ref: Proceedings of the International Conference on Pattern Recognition (ICPR2020)

  28. arXiv:2010.09969  [pdf, other]

    cs.SD cs.LG eess.AS

    The Effect of Spectrogram Reconstruction on Automatic Music Transcription: An Alternative Approach to Improve Transcription Accuracy

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Emmanouil Benetos, Dorien Herremans

    Abstract: Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated together and used as the input to train another model with the pitch labels to obtain the final transcription. We attempt to use only the pitc…

    Submitted 19 October, 2020; originally announced October 2020.

    Comments: Accepted in ICPR

  29. arXiv:2010.09489  [pdf, other]

    cs.SD cs.LG cs.MM

    Hit Song Prediction Based on Early Adopter Data and Audio Features

    Authors: Dorien Herremans, Tom Bergmans

    Abstract: Billions of USD are invested in new artists and songs by the music industry every year. This research provides a new strategy for assessing the hit potential of songs, which can help record companies support their investment decisions. A number of models were developed that use both audio data, and a novel feature based on social media listening behaviour. The results show that models based on ear…

    Submitted 16 October, 2020; originally announced October 2020.

    Journal ref: The 18th International Society for Music Information Retrieval Conference (ISMIR)2018 - LBD

  30. arXiv:2010.06230  [pdf, ps, other]

    cs.SD cs.SC eess.AS

    A variational autoencoder for music generation controlled by tonal tension

    Authors: Rui Guo, Ivor Simpson, Thor Magnusson, Chris Kiefer, Dorien Herremans

    Abstract: Many of the music generation systems based on neural networks are fully autonomous and do not offer control over the generation process. In this research, we present a controllable music generation system in terms of tonal tension. We incorporate two tonal tension measures based on the Spiral Array Tension theory into a variational autoencoder model. This allows us to control the direction of the…

    Submitted 14 October, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

    Comments: 2020 Joint Conference on AI Music Creativity

  31. arXiv:2009.04459  [pdf, other]

    cs.SD cs.LG eess.AS

    A dataset and classification model for Malay, Hindi, Tamil and Chinese music

    Authors: Fajilatun Nahar, Kat Agres, Balamurali BT, Dorien Herremans

    Abstract: In this paper we present a new dataset, with musical excerpts from the three main ethnic groups in Singapore: Chinese, Malay and Indian (both Hindi and Tamil). We use this new dataset to train different classification models to distinguish the origin of the music in terms of these ethnic groups. The classification models were optimized by exploring the use of different musical features as the input…

    Submitted 15 September, 2020; v1 submitted 9 September, 2020; originally announced September 2020.

    Comments: 4 pages

  32. arXiv:2007.15474  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling

    Authors: Hao Hao Tan, Dorien Herremans

    Abstract: High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-le…

    Submitted 29 July, 2020; originally announced July 2020.

    Journal ref: Proc. of 21st International Society of Music Information Retrieval Conference, ISMIR 2020

  33. arXiv:2007.00977  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding

    Authors: Kanish Garg, Ajeet kumar Singh, Dorien Herremans, Brejesh Lall

    Abstract: Generating an image from a provided descriptive text is quite a challenging task because of the difficulty in incorporating perceptual information (object shapes, colors, and their interactions) along with providing high relevancy related to the provided text. Current methods first generate an initial low-resolution image, which typically has irregular object shapes, colors, and interaction betwee…

    Submitted 2 July, 2020; originally announced July 2020.

    Comments: Proceedings of IEEE International Conference on Imaging, Vision & Pattern Recognition, (IVPR 2020, Japan)

    MSC Class: 68Txx; 68-XX ACM Class: I.4; I.5; I.3; I.2

    Journal ref: Proceedings of IEEE International Conference on Imaging, Vision & Pattern Recognition, (IVPR 2020, Japan)

  34. arXiv:2006.09833  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    Generative Modelling for Controllable Audio Synthesis of Expressive Piano Performance

    Authors: Hao Hao Tan, Yin-Jyun Luo, Dorien Herremans

    Abstract: We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follows temporal conditions of two essential style features for piano performances: articulation and dynamics. We demonstrate how the model is able to apply fine-grained style morphing over the course of syn…

    Submitted 12 July, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

    Journal ref: Published at ICML Workshop on Machine Learning for Media Discovery Workshop (ML4MD) 2020

  35. arXiv:2006.09016  [pdf, other]

    physics.comp-ph cs.LG stat.AP

    Acoustic prediction of flowrate: varying liquid jet stream onto a free surface

    Authors: Balamurali B T, Edwin Jonathan Aslim, Yun Shu Lynn Ng, Tricia Li, Chuen Kuo, Jacob Shihang Chen, Dorien Herremans, Lay Guat Ng, Jer-Ming Chen

    Abstract: Information on liquid jet stream flow is crucial in many real world applications. In a large number of cases, these flows fall directly onto free surfaces (e.g. pools), creating a splash with accompanying splashing sounds. The sound produced is supplied by energy interactions between the liquid jet stream and the passive free surface. In this investigation, we collect the sound of a water jet of v…

    Submitted 16 June, 2020; originally announced June 2020.

    MSC Class: 76-XX; 92C55; 92-XX ACM Class: J.2

    Journal ref: Proceedings of the IEEE International Conference on Signal Processing and Communications (SPCOM), 2020

  36. arXiv:2001.09989  [pdf, other]

    cs.SD eess.AS

    The impact of Audio input representations on neural network based music transcription

    Authors: Kin Wai Cheuk, Kat Agres, Dorien Herremans

    Abstract: This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, log-frequency spectrogram, Mel spectrogram, and constant-Q transform (CQT). Our results show that a $8.33$% increase in transcription accu…

    Submitted 21 July, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: Paper accepted in IJCNN 2020

    Journal ref: IJCNN 2020

  37. arXiv:2001.09988  [pdf, other]

    cs.SD eess.AS

    Regression-based music emotion prediction using triplet neural networks

    Authors: Kin Wai Cheuk, Yin-Jyun Luo, Balamurali B. T., Gemma Roig, Dorien Herremans

    Abstract: In this paper, we adapt triplet neural networks (TNNs) to a regression task, music emotion prediction. Since TNNs were initially introduced for classification, and not for regression, we propose a mechanism that allows them to provide meaningful low dimensional representations for regression tasks. We then use these new representations as the input for regression algorithms such as support vector…

    Submitted 21 July, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: Paper accepted in IJCNN 2020

    Journal ref: IJCNN 2020

  38. arXiv:1912.12055  [pdf, other]

    cs.SD cs.LG eess.AS

    nnAudio: An on-the-fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolution Neural Networks

    Authors: Kin Wai Cheuk, Hans Anderson, Kat Agres, Dorien Herremans

    Abstract: Converting time domain waveforms to frequency domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk space to store different frequency domain representations. This is especially true during the model development and tuning process, when exploring various types of spectrogr…

    Submitted 21 August, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

    Comments: Accepted In IEEE Access

  39. arXiv:1912.02613  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders

    Authors: Yin-Jyun Luo, Chin-Chen Hsu, Kat Agres, Dorien Herremans

    Abstract: We propose a flexible framework that deals with both singer conversion and singers vocal technique conversion. The proposed model is trained on non-parallel corpora, accommodates many-to-many conversion, and leverages recent advances of variational autoencoders. It employs separate encoders to learn disentangled latent representations of singer identity and vocal technique separately, with a joint…

    Submitted 24 February, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Comments: Accepted to ICASSP 2020

  40. arXiv:1910.02049  [pdf, ps, other

    cs.SD cs.IR cs.LG eess.AS

    Midi Miner -- A Python library for tonal tension and track classification

    Authors: Rui Guo, Dorien Herremans, Thor Magnusson

    Abstract: We present a Python library, called Midi Miner, that can calculate tonal tension and classify different tracks. MIDI (Musical Instrument Digital Interface) is a hardware and software standard for communicating musical events between digital music devices. It is often used for tasks such as music representation, communication between devices, and even music generation [5]. Tension is an essential ele… ▽ More

    Submitted 26 May, 2020; v1 submitted 3 October, 2019; originally announced October 2019.

    Comments: 2 pages. ISMIR - Late Breaking Demo, Delft, The Netherlands. November 2019

  41. arXiv:1910.01463  [pdf, other

    cs.SD cs.LG eess.AS

    Latent space representation for multi-target speaker detection and identification with a sparse dataset using Triplet neural networks

    Authors: Kin Wai Cheuk, Balamurali B. T., Gemma Roig, Dorien Herremans

    Abstract: We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the $i$-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to high classification accuracy with a relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neu… ▽ More

    Submitted 3 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

    Comments: Accepted for ASRU 2019

    MSC Class: 68T10; 68Txx

    Journal ref: Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019). Singapore. 2019

  42. arXiv:1909.06957  [pdf, other

    cs.CV

    Multimodal Deep Models for Predicting Affective Responses Evoked by Movies

    Authors: Ha Thi Phuong Thao, Dorien Herremans, Gemma Roig

    Abstract: The goal of this study is to develop and analyze multimodal models for predicting experienced affective responses of viewers watching movie clips. We develop hybrid multimodal prediction models based on both the video and audio of the clips. For the video content, we hypothesize that both image content and motion are crucial features for evoked emotion prediction. To capture such information, we e… ▽ More

    Submitted 17 September, 2019; v1 submitted 15 September, 2019; originally announced September 2019.

    Comments: 10 pages, 7 figures, Preprint accepted for publication in the Proceedings of the 2nd International Workshop on Computer Vision for Physiological Measurement as part of ICCV. Seoul, South Korea. 2019

    MSC Class: 97R40; 68T45; 68Txx; 92B20

    Journal ref: Proceedings of the 2nd International Workshop on Computer Vision for Physiological Measurement as part of ICCV. Seoul, South Korea. 2019

  43. arXiv:1909.02850  [pdf, other

    eess.SP cs.LG cs.SD stat.ML

    Doppler Invariant Demodulation for Shallow Water Acoustic Communications Using Deep Belief Networks

    Authors: Abigail Lee-Leon, Chau Yuen, Dorien Herremans

    Abstract: Shallow water environments create a challenging channel for communications. In this paper, we focus on the challenges posed by the frequency-selective signal distortion called the Doppler effect. We explore the design and performance of machine learning (ML) based demodulation methods --- (1) Deep Belief Network-feed forward Neural Network (DBN-NN) and (2) Deep Belief Network-Convolutional Neural… ▽ More

    Submitted 5 September, 2019; originally announced September 2019.

    Journal ref: Proceedings of 16th IEEE Asia Pacific Wireless Communications Symposium (APWCS). 2019. Singapore

  44. arXiv:1906.10428  [pdf, other

    cs.HC cs.MM

    A novel music-based game with motion capture to support cognitive and motor function in the elderly

    Authors: Kat Agres, Simon Lui, Dorien Herremans

    Abstract: This paper presents a novel game prototype that uses music and motion detection as preventive medicine for the elderly. Given the aging populations around the globe, and the limited resources and staff able to care for these populations, eHealth solutions are becoming increasingly important, if not crucial, additions to modern healthcare and preventive medicine. Furthermore, because compliance rat… ▽ More

    Submitted 25 June, 2019; originally announced June 2019.

    Journal ref: IEEE Conference on Games 2019, London, UK

  45. arXiv:1906.08152  [pdf, other

    cs.LG cs.SD eess.AS stat.ML

    Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

    Authors: Yin-Jyun Luo, Kat Agres, Dorien Herremans

    Abstract: In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively. For… ▽ More

    Submitted 29 June, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

    Comments: 20th Conference of the International Society for Music Information Retrieval

  46. arXiv:1905.12439  [pdf, other

    cs.SD cs.CR cs.LG cs.MM stat.ML

    Towards robust audio spoofing detection: a detailed comparison of traditional and learned features

    Authors: Balamurali BT, Kin Wah Edward Lin, Simon Lui, Jer-Ming Chen, Dorien Herremans

    Abstract: Automatic speaker verification, like every other biometric system, is vulnerable to spoofing attacks. Using only a few minutes of recorded voice of a genuine client of a speaker verification system, attackers can develop a variety of spoofing attacks that might trick such systems. Detecting these attacks using the audio cues present in the recordings is an important challenge. Most existing spoofi… ▽ More

    Submitted 18 June, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

    Journal ref: IEEE Access. 2019

  47. arXiv:1905.08076  [pdf, other

    cs.SD cs.IR cs.LG eess.AS stat.ML

    Dance Hit Song Prediction

    Authors: Dorien Herremans, David Martens, Kenneth Sörensen

    Abstract: Record companies invest billions of dollars in new talent around the globe each year. Gaining insight into what actually makes a hit song would provide tremendous benefits for the music industry. In this research we tackle this question by focussing on the dance hit song classification problem. A database of dance hit songs from 1985 until 2013 is built, including basic musical features, as well a… ▽ More

    Submitted 17 May, 2019; originally announced May 2019.

    Journal ref: Journal of New Music Research. 43:302 (2014)

  48. MorpheuS: generating structured music with constrained patterns and tension

    Authors: Dorien Herremans, Elaine Chew

    Abstract: Automatic music generation systems have gained in popularity and sophistication as advances in cloud computing have enabled large-scale complex computations such as deep models and optimization algorithms on personal devices. Yet, they still face an important challenge, that of long-term structure, which is key to conveying a sense of musical coherence. We present the MorpheuS music generation sys… ▽ More

    Submitted 12 December, 2018; originally announced December 2018.

    Comments: IEEE Transactions on Affective Computing. PP(99)

  49. arXiv:1812.04186  [pdf, other

    cs.SD cs.LG eess.AS

    A Functional Taxonomy of Music Generation Systems

    Authors: Dorien Herremans, Ching-Hua Chuan, Elaine Chew

    Abstract: Digital advances have transformed the face of automatic music generation since its beginnings at the dawn of computing. Despite the many breakthroughs, issues such as the musical tasks targeted by different machines and the degree to which they succeed remain open questions. We present a functional taxonomy for music generation systems with reference to existing systems. The taxonomy organizes sys… ▽ More

    Submitted 10 December, 2018; originally announced December 2018.

    Comments: survey, music generation, taxonomy, functional survey, automatic composition, algorithmic composition

    MSC Class: 68Txx; 68-XX

    Journal ref: ACM Computing Surveys (CSUR), 50(5), 69. https://dl.acm.org/citation.cfm?id=3145473.3108242

  50. arXiv:1812.01278  [pdf, other

    cs.SD cs.AI cs.LG eess.AS stat.ML

    Singing Voice Separation Using a Deep Convolutional Neural Network Trained by Ideal Binary Mask and Cross Entropy

    Authors: Kin Wah Edward Lin, Balamurali B. T., Enyan Koh, Simon Lui, Dorien Herremans

    Abstract: Separating a singing voice from its music accompaniment remains an important challenge in the field of music information retrieval. We present a unique neural network approach inspired by a technique that has revolutionized the field of vision: pixel-wise image classification, which we combine with cross entropy loss and pretraining of the CNN as an autoencoder on singing voice spectrograms. The p… ▽ More

    Submitted 4 December, 2018; originally announced December 2018.

    Comments: In Press, Neural Computing and Applications, Springer. 2019

    MSC Class: 68-XX; 68Txx