
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Authors: Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of model parameters. To address the challenges, we propose SpecMaskGIT, a light-weighted, efficient yet effective TTA model based on the masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10s audio clip by less than 16 iterations, an order-of-magnitude less than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models in the TTA benchmark, while being real-time with only 4 CPU cores or even 30x faster with a GPU. Next, built upon a latent space of Mel-spectrogram, SpecMaskGIT has a wider range of applications (e.g., the zero-shot bandwidth extension) than similar methods built on the latent wave domain. Moreover, we interpret SpecMaskGIT as a generative extension to previous discriminative audio masked Transformers, and shed light on its audio representation learning potential. We hope our work inspires the exploration of masked audio modeling toward further diverse scenarios.

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

Comments: 6 pages, 8 figures, 8 tables. Audio samples: https://zzaudio.github.io/SpecMaskGIT/index.html

arXiv:2406.01867 [pdf, other]

MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

Authors: Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji

Abstract: In motion generation, controllability as well as generation quality and speed is becoming more and more important. There are various motion editing tasks, such as in-betweening, upper body editing, and path-following, but existing methods perform motion editing with a data-space diffusion model, which is slow in inference compared to a latent diffusion model. In this paper, we propose MoLA, which provides fast and high-quality motion generation and also can deal with multiple editing tasks in a single framework. For high-quality and fast generation, we employ a variational autoencoder and latent diffusion model, and improve the performance with adversarial training. In addition, we apply a training-free guided generation framework to achieve various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.

Submitted 3 June, 2024; originally announced June 2024.

Comments: 12 pages, 6 figures

arXiv:2405.14598 [pdf, other]

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Abstract: In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation methods usually resort to huge large language model or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back showing a simple and lightweight generative transformer, which is not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner. After training, the classifier-free guidance could be deployed off-the-shelf achieving better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it could also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/

Submitted 24 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

Comments: 10 pages

arXiv:2309.09223 [pdf, other]

Zero- and Few-shot Sound Event Localization and Detection

Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable by embedding from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output track-wise CLAP embedding and corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with complete train data in an evaluation dataset.

Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

arXiv:2308.06981 [pdf, other]

The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7 dB. A significant source of this improvement was making the simulated data better match real cinematic audio, which we further investigate in detail.

Submitted 18 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

Comments: Accepted for Transactions of the International Society for Music Information Retrieval

arXiv:2306.10029 [pdf, other]

Pseudo Session-Based Recommendation with Hierarchical Embedding and Session Attributes

Authors: Yuta Sumiya, Ryusei Numata, Satoshi Takahashi

Abstract: Recently, electronic commerce (EC) websites have been unable to provide an identification number (user ID) for each transaction data entry because of privacy issues. Because most recommendation methods assume that all data are assigned a user ID, they cannot be applied to the data without user IDs. Recently, session-based recommendation (SBR) based on session information, which is short-term behavioral information of users, has been studied. A general SBR uses only information about the item of interest to make a recommendation (e.g., item ID for an EC site). Particularly in the case of EC sites, the data recorded include the name of the item being purchased, the price of the item, the category hierarchy, and the gender and region of the user. In this study, we define a pseudo-session for the purchase history data of an EC site without user IDs and session IDs. Finally, we propose an SBR with a co-guided heterogeneous hypergraph and globalgraph network plus, called CoHHGN+. The results show that our CoHHGN+ can recommend items with higher performance than other methods.

Submitted 5 August, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: 15 pages, 1 figure, 5 tables

arXiv:2306.09126 [pdf, other]

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotation of sound events. Sound scenes in STARSS23 are recorded with instructions, which guide recording participants to ensure adequate activity and occurrences of sound events. STARSS23 also serves human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.

Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

arXiv:2306.01764 [pdf, ps, other]

Data Science in an Agent-Based Simulation World

Authors: Satoshi Takahashi, Atushi Yoshikawa

Abstract: In data science education, the importance of learning to solve real-world problems has been argued. However, there are two issues with this approach: (1) it is very costly to prepare multiple real-world problems (using real data) according to the learning objectives, and (2) the learner must suddenly tackle complex real-world problems immediately after learning from a textbook using ideal data. To solve these issues, this paper proposes data science teaching material that uses agent-based simulation (ABS). The proposed teaching material consists of an ABS model and an ABS story. To solve issue 1, the scenario of the problem can be changed according to the learning objectives by setting the appropriate parameters of the ABS model. To solve issue 2, the difficulty level of the tasks can be adjusted by changing the description in the ABS story. We show that, by using this teaching material, the learner can simulate the typical tasks performed by a data scientist in a step-by-step manner (causal inference, data understanding, hypothesis building, data collection, data wrangling, data analysis, and hypothesis testing). The teaching material described in this paper focuses on causal inference as the learning objectives and infectious diseases as the model theme for ABS, but ABS is used as a model to reproduce many types of social phenomena, and its range of expression is extremely wide. Therefore, we expect that the proposed teaching material will inspire the construction of teaching material for various objectives in data science education.

Submitted 27 May, 2023; originally announced June 2023.

Comments: 9 pages, 10 figures

arXiv:2305.10734 [pdf, other]

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider a combined use of generative and predictive decoders. The predictive decoder allows us to use the further complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that jointly uses generative and predictive decoders across two levels. The encoder encodes both generative and predictive information at the shared encoding level. At the decoded feature level, we fuse the two decoded features by generative and predictive decoders. Specifically, the two SE modules are fused in the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE to improve convergence, and the final fusion combines the two complementary SE outputs to enhance SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores compared with other score-based diffusion SE (StoRM and SGMSE+).

Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

arXiv:2305.07855 [pdf, ps, other]

The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation

Authors: Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) without increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables taking advantage of the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source, hence we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX) and densely connected dilated DenseNet (D3Net) extended with our X-scheme, respectively called X-UMX and X-D3Net, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.

Submitted 13 May, 2023; originally announced May 2023.

Comments: Submitted to IEEE TASLP (under review), 11 pages, 8 figures

arXiv:2305.06701 [pdf, ps, other]

Extending Audio Masked Autoencoders Toward Audio Restoration

Authors: Zhi Zhong, Hao Shi, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., speech enhancement (SE). Previous works have shown that the features extracted by pretrained audio encoders are effective for SE tasks, but these speech-specialized encoder-only models usually require extra decoders to become compatible with SE, and involve complicated pretraining procedures or complex data augmentation. Therefore, in pursuit of a universal audio model, the audio masked autoencoder (MAE) whose backbone is the autoencoder of Vision Transformers (ViT-AE), is extended from audio classification to SE, a representative restoration task with well-established evaluation standards. ViT-AE learns to restore masked audio signal via a mel-to-mel mapping during pretraining, which is similar to restoration tasks like SE. We propose variations of ViT-AE for a better SE performance, where the mel-to-mel variations yield high scores in non-intrusive metrics and the STFT-oriented variation is effective at intrusive metrics such as PESQ. Different variations can be used in accordance with the scenarios. Comprehensive evaluations reveal that MAE pretraining is beneficial to SE tasks and helps the ViT-AE to better generalize to out-of-domain distortions. We further found that large-scale noisy data of general audio sources, rather than clean speech, is sufficiently effective for pretraining.

Submitted 17 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: WASPAA 2023. Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2305.05857 [pdf, other]

Diffusion-based Signal Refiner for Speech Separation

Authors: Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

Abstract: We have developed a diffusion-based speech refiner that improves the reference-free perceptual quality of the audio predicted by preceding single-channel speech separation models. Although modern deep neural network-based speech separation models have shown high performance in reference-based metrics, they often produce perceptually unnatural artifacts. The recent advancements made to diffusion models motivated us to tackle this problem by restoring the degraded parts of initial separations with a generative approach. Utilizing the denoising diffusion restoration model (DDRM) as a basis, we propose a shared DDRM-based refiner that generates samples conditioned on the global information of preceding outputs from arbitrary speech separation models. We experimentally show that our refiner can provide a clearer harmonic structure of speech and improves the reference-free metric of perceptual quality for arbitrary preceding model architectures. Furthermore, we tune the variance of the measurement noise based on preceding outputs, which results in higher scores in both reference-free and reference-based metrics. The separation quality can also be further improved by blending the discriminative and generative outputs.

Submitted 12 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Under review

arXiv:2303.01308 [pdf, ps, other]

In-the-wild vibrotactile sensation: Perceptual transformation of vibrations from smartphones

Authors: Keiko Yamaguchi, Satoshi Takahashi

Abstract: Vibrations emitted by smartphones have become a part of our daily lives. The vibrations can add various meanings to the information people obtain from the screen. Hence, it is worth understanding the perceptual transformation of vibration with ordinary devices to evaluate the possibility of enriched vibrotactile communication via smartphones. This study assessed the reproducibility of vibrotactile sensations via smartphone in the in-the-wild environment. To realize improved haptic design to communicate with smartphone users smoothly, we also focused on the moderation effects of the in-the-wild environments on the vibrotactile sensations: the physical specifications of mobile devices, the manner of device operation by users, and the personal traits of the users about the desire for touch. We conducted a Web-based in-the-wild experiment instead of a laboratory experiment to reproduce an environment as close to the daily lives of users as possible. Through a series of analyses, we revealed that users perceive the weight of vibration stimuli to be higher in sensation magnitude than intensity under identical conditions of vibration stimuli. We also showed that it is desirable to consider the moderation effects of the in-the-wild environments for realizing better tactile system design to maximize the impact of vibrotactile stimuli.

Submitted 2 March, 2023; originally announced March 2023.

Comments: 8 pages, 9 figures

arXiv:2302.08136 [pdf, ps, other]

An Attention-based Approach to Hierarchical Multi-label Music Instrument Classification

Authors: Zhi Zhong, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Although music is typically multi-label, many works have studied hierarchical music tagging with simplified settings such as single-label data. Moreover, there lacks a framework to describe various joint training methods under the multi-label setting. In order to discuss the above topics, we introduce hierarchical multi-label music instrument classification task. The task provides a realistic setting where multi-instrument real music data is assumed. Various hierarchical methods that jointly train a DNN are summarized and explored in the context of the fusion of deep learning and conventional techniques. For the effective joint training in the multi-label setting, we propose two methods to model the connection between fine- and coarse-level tags, where one uses rule-based grouped max-pooling, the other one uses the attention mechanism obtained in a data-driven manner. Our evaluation reveals that the proposed methods have advantages over the method without joint training. In addition, the decision procedure within the proposed methods can be interpreted by visualizing attention maps or referring to fixed rules.

Submitted 16 February, 2023; originally announced February 2023.

Comments: To appear at ICASSP 2023

arXiv:2210.17287 [pdf, ps, other]

doi 10.21437/Interspeech.2023-1547

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

Authors: Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a dataset consisting of clean speech only. Then, our refiner effectively mixes clean parts newly generated via denoising diffusion restoration into the degraded and distorted parts caused by a preceding SE method, resulting in refined speech. Once our refiner is trained on a set of clean speech, it can be applied to various SE methods without additional training specialized for each SE module. Therefore, our refiner can be a versatile post-processing module w.r.t. SE methods and has high potential in terms of modularity. Experimental results show that our method improved perceptual speech quality regardless of the preceding SE methods used.

Submitted 30 August, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted by Interspeech 2023

arXiv:2210.05148 [pdf, other]

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

Authors: Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji

Abstract: In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic looking piano rolls from pure Gaussian noise conditioned on spectrograms. This new AMT formulation enables DiffRoll to transcribe, generate and even inpaint music. Due to the classifier-free nature, DiffRoll is also able to be trained on unpaired datasets where only piano rolls are available. Our experiments show that DiffRoll outperforms its discriminative counterpart by 19 percentage points (ppt.) and our ablation studies also indicate that it outperforms similar existing methods by 4.8 ppt. Source code and demonstration are available at https://sony.github.io/DiffRoll/.

Submitted 20 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

Journal ref: Proceedings of ICASSP - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023

arXiv:2208.07301 [pdf, other]

A Survey on Computing Schematic Network Maps: The Challenge to Interactivity

Authors: Hsiang-Yun Wu, Benjamin Niedermann, Shigeo Takahashi, Martin Nöllenburg

Abstract: Schematic maps are in daily use to show the connectivity of subway systems and to help travellers plan their journeys effectively. This study surveys up-to-date algorithmic approaches in order to give an overview of the state of the art in schematic network mapping. The study investigates the hypothesis that the choice of algorithmic approach is often guided by the requirements of the mapping application. For example, an algorithm that computes globally optimal solutions for schematic maps is capable of producing results for printing, while it is not suitable for computing instant layouts due to its long running time. Our analysis and discussion, therefore, focus on the computational complexity of the problem formulation and the running times of the schematic map algorithms, including algorithmic network layout techniques and station labeling techniques. The correlation between problem complexity and running time is then visually depicted using scatter plot diagrams. Moreover, since metro maps are common metaphors for data visualization, we also investigate online tools and application domains using metro map representations for analytics purposes, and finally summarize the potential future opportunities for schematic maps.

Submitted 9 August, 2022; originally announced August 2022.

Comments: The 2nd Schematic Mapping Workshop

arXiv:2206.01948 [pdf, other]

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for the Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection and introduces significant new challenges for the task compared to the previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed including recording and annotation process, target classes and their presence, and details on the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge with emphasis on the differences with the baseline of the previous iterations; namely, introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6387880.

Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

arXiv:2205.07547 [pdf, other]

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

Authors: Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

Abstract: One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe a trend that the quantization is stochastic at the initial stage of the training but gradually converges toward a deterministic quantization, which we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without using common heuristics. Furthermore, we empirically show that SQ-VAE is superior to VAE and VQ-VAE in vision- and speech-related tasks.

Submitted 9 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

Comments: 25 pages with 10 figures, accepted for publication in ICML 2022 (Our code is available at https://github.com/sony/sqvae)

arXiv:2110.07124 [pdf, other]

Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

Authors: Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, Yuki Mitsufuji

Abstract: Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target using a single network. However, there is still a challenge in detecting the same event class from multiple locations. To overcome this problem while maintaining the advantages of the class-wise format, we extended ACCDOA to a multi one and proposed auxiliary duplicating permutation invariant training (ADPIT). The multi-ACCDOA format (a class- and track-wise output format) enables the model to solve the cases with overlaps from the same class. The class-wise ADPIT scheme enables each track of the multi-ACCDOA format to learn with the same target as the single-ACCDOA format. In evaluations with the DCASE 2021 Task 3 dataset, the model trained with the multi-ACCDOA format and with the class-wise ADPIT detects overlapping events from the same class while maintaining its performance in the other cases. Also, the proposed method performed comparably to state-of-the-art SELD methods with fewer parameters.
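
Illustrative note (not part of the listing): the sketch below shows one way an ACCDOA vector could be read back as an activity flag plus a direction, assuming the usual convention that the vector's length encodes activity and its orientation encodes the DOA. The function name decode_accdoa, the 0.5 threshold, and the sample vectors are hypothetical choices for illustration, not values from the paper.

```python
import math

def decode_accdoa(vec, threshold=0.5):
    """Return (active, azimuth_deg, elevation_deg) for one class's Cartesian vector.

    Assumption: the vector norm is the activity and the unit direction is the DOA.
    The threshold of 0.5 is an illustrative default, not a value from the paper.
    """
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    active = norm > threshold
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return active, azimuth, elevation

# Made-up example vectors: one clearly active source, one near-silent class.
loud = decode_accdoa((0.0, 0.9, 0.0))
quiet = decode_accdoa((0.1, 0.0, 0.0))
```

In a multi-ACCDOA setting as described in the abstract, each class would carry several such vectors (one per track), and the same decoding could be applied to each track separately.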

Submitted 27 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures, accepted for publication in IEEE ICASSP 2022

arXiv:2110.06501 [pdf, other]

Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection

Authors: Yuichiro Koyama, Kazuhide Shigemi, Masafumi Takahashi, Kazuki Shimada, Naoya Takahashi, Emiru Tsunoo, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Recording and annotating real sound events for a sound event localization and detection (SELD) task is time consuming, and data augmentation techniques are often favored when the amount of data is limited. However, how to augment the spatial information in a dataset, including unlabeled directional interference events, remains an open research question. Furthermore, directional interference events make it difficult to accurately extract spatial characteristics from target sound events. To address this problem, we propose an impulse response simulation framework (IRS) that augments spatial characteristics using simulated room impulse responses (RIR). RIRs corresponding to a microphone array assumed to be placed in various rooms are accurately simulated, and the source signals of the target sound events are extracted from a mixture. The simulated RIRs are then convolved with the extracted source signals to obtain an augmented multi-channel training dataset. Evaluation results obtained using the TAU-NIGENS Spatial Sound Events 2021 dataset show that the IRS contributes to improving the overall SELD performance. Additionally, we conducted an ablation study to discuss the contribution and need for each component within the IRS.
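
Illustrative note (not part of the listing): the convolution step mentioned in the abstract can be sketched as follows, assuming one RIR per microphone channel and a mono source signal. The helper names convolve and spatialize and all numeric values are hypothetical; the paper's RIR simulation and source extraction stages are not reproduced here.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two 1-D sequences (pure Python, no dependencies)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(source, rirs):
    """Convolve a mono source with one RIR per channel to get a multi-channel signal."""
    return [convolve(source, rir) for rir in rirs]

# Toy, made-up values: a 3-sample source and two 3-tap "room" responses.
source = [1.0, 0.5, 0.25]
rirs = [[1.0, 0.0, 0.3], [0.0, 0.8, 0.0]]
multichannel = spatialize(source, rirs)
```

Real signals would be far longer, so an FFT-based convolution from a numerical library would normally replace the nested loops; the direct form is kept here only to make the operation explicit.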

Submitted 28 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 5 pages, 2 figures, accepted for publication in IEEE ICASSP 2022

arXiv:2110.06494 [pdf, other]

Music Source Separation with Deep Equilibrium Models

Authors: Yuichiro Koyama, Naoki Murata, Stefan Uhlich, Giorgio Fabbro, Shusuke Takahashi, Yuki Mitsufuji

Abstract: While deep neural network-based music source separation (MSS) is very effective and achieves high performance, its model size is often a problem for practical deployment. Deep implicit architectures such as deep equilibrium models (DEQ) were recently proposed, which can achieve higher performance than their explicit counterparts with limited depth while keeping the number of parameters small. This makes DEQ also attractive for MSS, especially as it was originally applied to sequential modeling tasks in natural language processing and thus should in principle be also suited for MSS. However, an investigation of a good architecture and training scheme for MSS with DEQ is needed as the characteristics of acoustic signals are different from those of natural language data. Hence, in this paper we propose an architecture and training scheme for MSS with DEQ. Starting with the architecture of Open-Unmix (UMX), we replace its sequence model with DEQ. We refer to our proposed method as DEQ-based UMX (DEQ-UMX). Experimental results show that DEQ-UMX performs better than the original UMX while reducing its number of parameters by 30%.

Submitted 28 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2022

arXiv:2110.06126 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747312

Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection

Authors: Ricardo Falcon-Perez, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain. For sound event localization and detection (SELD) tasks several augmentation methods have been proposed, with most borrowing ideas from other domains such as images, speech, or monophonic audio. However, only a few exploit the spatial properties of a full 3D audio scene. We propose Spatial Mixup, as an application of parametric spatial audio effects for data augmentation, which modifies the directional properties of a multi-channel spatial audio signal encoded in the ambisonics domain. Similarly to beamforming, these modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced, thereby enabling deep learning models to achieve invariance to small spatial perturbations. The method is evaluated with experiments in the DCASE 2021 Task 3 dataset, where spatial mixup increases performance over a non-augmented baseline, and compares to other well known augmentation methods. Furthermore, combining spatial mixup with other methods greatly improves performance.

Submitted 12 October, 2021; originally announced October 2021.

Comments: 5 pages, 2 figures, 4 tables. Submitted to the 2022 International Conference on Acoustics, Speech, & Signal Processing (ICASSP)

arXiv:2110.05968 [pdf, ps, other]

Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

Authors: Ryosuke Sawata, Yosuke Kashiwagi, Shusuke Takahashi

Abstract: A deep neural network (DNN)-based speech enhancement (SE) aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed in this paper. In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metrics to evaluate the ASR system and generally non-differentiable, our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM). Then both DNNs are alternately optimized in the training phase. Even if the AM is a black-box, e.g., like one provided by a third-party, the proposed method enables the DNN-based SE model to be optimized in terms of the CER since the DNN mimicking the AM is differentiable. Consequently, it becomes feasible to build a CER-centric SE model that has no negative effect, e.g., additional calculation cost and changing network architecture, on the inference phase since our method is merely a training scheme for the existing DNN-based methods. Experimental results show that our method improved CER by 8.8% relative derived through a black-box AM although certain noise levels are kept.

Submitted 22 February, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Accepted by ICASSP 2022

arXiv:2110.03615 [pdf, ps, other]

doi 10.1016/j.imu.2022.101084

School Virus Infection Simulator for Customizing School Schedules During COVID-19

Authors: Satoshi Takahashi, Masaki Kitazawa, Atsushi Yoshikawa

Abstract: During the coronavirus disease 2019 (COVID-19) pandemic, schools continuously strive to provide consistent education to their students. Teachers and education policymakers are seeking ways to re-open schools, as it is necessary for community and economic development. However, in light of the pandemic, schools require customized schedules that can address the health concerns and safety of the students considering classroom sizes, air conditioning equipment, classroom systems, e.g., self-contained or compartmentalized. To solve this issue, we developed the School-Virus-Infection-Simulator (SVIS) for teachers and education policymakers. SVIS simulates the spread of infection at a school considering the students' lesson schedules, classroom volume, air circulation rates in classrooms, and infectability of the students. Thus, teachers and education policymakers can simulate how their school schedules can impact current health concerns. We then demonstrate the impact of several school schedules in self-contained and departmentalized classrooms and evaluate them in terms of the maximum number of students infected simultaneously and the percentage of face-to-face lessons. The results show that increasing classroom ventilation rate is effective; however, the impact is not stable compared to customizing school schedules. In addition, school schedules can differently impact the maximum number of students infected depending on whether classrooms are self-contained or compartmentalized. It was found that one of the school schedules had a higher maximum number of students infected, compared to schedules with a higher percentage of face-to-face lessons. SVIS and the simulation results can help teachers and education policymakers plan school schedules appropriately in order to reduce the maximum number of students infected, while also maintaining a certain percentage of face-to-face lessons.

Submitted 6 January, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: 10 pages, 10 figures, https://github.com/satoshi-takahashi-lab/school-virus-infection-simulator

Journal ref: Informatics in medicine unlocked, 101084 (2022)

arXiv:2107.08351 [pdf, ps, other]

doi 10.1007/978-3-030-91669-5_15

A Novel Approach to Analyze Fashion Digital Archive from Humanities

Authors: Satoshi Takahashi, Keiko Yamaguchi, Asuka Watanabe

Abstract: Fashion styles adopted every day are an important aspect of culture, and style trend analysis helps provide a deeper understanding of our societies and cultures. To analyze everyday fashion trends from the humanities perspective, we need a digital archive that includes images of what people wore in their daily lives over an extended period. In fashion research, building digital fashion image archives has attracted significant attention. However, the existing archives are not suitable for retrieving everyday fashion trends. In addition, to interpret how the trends emerge, we need non-fashion data sources relevant to why and how people choose fashion. In this study, we created a new fashion image archive called Chronicle Archive of Tokyo Street Fashion (CAT STREET) based on a review of the limitations in the existing digital fashion archives. CAT STREET includes images showing the clothing people wore in their daily lives during the period 1970–2017, which contain timestamps and street location annotations. We applied machine learning to CAT STREET and found two types of fashion trend patterns. Then, we demonstrated how magazine archives help us interpret how trend patterns emerge. These empirical analyses show our approach's potential to discover new perspectives to promote an understanding of our societies and cultures through fashion embedded in consumers' daily lives.

Submitted 10 September, 2021; v1 submitted 17 July, 2021; originally announced July 2021.

Comments: In Proceedings of 'The 23rd International Conference on Asia-Pacific Digital Libraries' 17 pages, 8 figures. arXiv admin note: text overlap with arXiv:2009.13395

Journal ref: In International Conference on Asian Digital Libraries (pp. 179-194). Springer, Cham (2021)

arXiv:2107.05326 [pdf, other]

Learning interaction rules from multi-animal trajectories via augmented behavioral models

Authors: Keisuke Fujii, Naoya Takeishi, Kazushi Tsutsui, Emyo Fujioka, Nozomi Nishiumi, Ryoya Tanaka, Mika Fukushiro, Kaoru Ide, Hiroyoshi Kohno, Ken Yoda, Susumu Takahashi, Shizuko Hiryu, Yoshinobu Kawahara

Abstract: Extracting the interaction rules of biological agents from movement sequences poses challenges in various domains. Granger causality is a practical framework for analyzing the interactions from observed time-series data; however, this framework ignores the structures and assumptions of the generative process in animal behaviors, which may lead to interpretational problems and sometimes erroneous assessments of causality. In this paper, we propose a new framework for learning Granger causality from multi-animal trajectories via augmented theory-based behavioral models with interpretable data-driven models. We adopt an approach for augmenting incomplete multi-agent behavioral models described by time-varying dynamical systems with neural networks. For efficient and interpretable learning, our model leverages theory-based architectures separating navigation and motion processes, and the theory-guided regularization for reliable behavioral modeling. This can provide interpretable signs of Granger-causal effects over time, i.e., when specific others cause the approach or separation. In experiments using synthetic datasets, our method achieved better performance than various baselines. We then analyzed multi-animal datasets of mice, flies, birds, and bats, which verified our method and obtained novel biological insights.

Submitted 25 October, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: 24 pages, 5 figures, to appear in NeurIPS 2021

arXiv:2106.10806 [pdf, other]

Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

Authors: Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji

Abstract: This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on activity-coupled Cartesian direction of arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system with efficient network architecture called RD3Net and data augmentation techniques outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. Using the ACCDOA-based system as a base, we perform model ensembles by averaging outputs of several systems trained with different conditions such as input features, training folds, and model architectures. We also use the event independent network v2 (EINV2)-based system to increase the diversity of the model ensembles. To generalize the models, we further propose impulse response simulation (IRS), which generates simulated multi-channel signals by convolving simulated room impulse responses (RIRs) with source signals extracted from the original dataset. Our systems significantly improved over the baseline system on the development dataset.

Submitted 20 June, 2021; originally announced June 2021.

Comments: 5 pages, 3 figures, submitted to DCASE2021 task3

arXiv:2106.02331 [pdf, other]

doi 10.21437/Interspeech.2021-1029

Manifold-Aware Deep Clustering: Maximizing Angles between Embedding Vectors Based on Regular Simplex

Authors: Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi

Abstract: This paper presents a new deep clustering (DC) method called manifold-aware DC (M-DC) that can enhance hyperspace utilization more effectively than the original DC. The original DC has a limitation in that a pair of two speakers has to be embedded having an orthogonal relationship due to its use of the one-hot vector-based loss function, while our method derives a unique loss function aimed at maximizing the target angle in the hyperspace based on the nature of a regular simplex. Our proposed loss imposes a higher penalty than the original DC when the speaker is assigned incorrectly. The change from DC to M-DC can be easily achieved by rewriting just one term in the loss function of DC, without any other modifications to the network architecture or model parameters. As such, our method has high practicability because it does not affect the original inference part. The experimental results show that the proposed method improves the performances of the original DC and its expansion method.

Submitted 16 October, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

Comments: Accepted by Interspeech 2021

arXiv:2103.06446 [pdf, ps, other]

Extracting candidate factors affecting long-term trends of student abilities across subjects

Authors: Satoshi Takahashi, Hiroki Kuno, Atsushi Yoshikawa

Abstract: Long-term student achievement data provide useful information to formulate the research question of what types of student skills would impact future trends across subjects. However, few studies have focused on long-term data. This is because the criteria of examinations vary depending on their designers; additionally, it is difficult for the same designer to maintain the coherence of the criteria of examinations beyond grades. To solve this inconsistency issue, we propose a novel approach to extract candidate factors affecting long-term trends across subjects from long-term data. Our approach is composed of three steps: Data screening, time series clustering, and causal inference. The first step extracts coherence data from long-term data. The second step groups the long-term data by shape and value. The third step extracts factors affecting the long-term trends and validates the extracted variation factors using two or more different data sets. We then conducted evaluation experiments with student achievement data from five public elementary schools and four public junior high schools in Japan. The results demonstrate that our approach extracts coherence data, clusters long-term data into interpretable groups, and extracts candidate factors affecting academic ability across subjects. Subsequently, our approach formulates a hypothesis and turns archived achievement data into useful information.

Submitted 10 March, 2021; originally announced March 2021.

arXiv:2103.05479 [pdf, ps, other]

PEAK SHIFT ESTIMATION: A novel method to estimate ranking of selectively omitted examination data

Authors: Satoshi Takahashi, Masaki Kitazawa, Ryoma Aoki, Atsushi Yoshikawa

Abstract: In this paper, we focus on examination results when examinees selectively skip examinations, to compare the difficulty levels of these examinations. We call the resultant data 'selectively omitted examination data'. Examples of this type of examination are university entrance examinations, certification examinations, and the outcome of students' job-hunting activities. We can learn the number of students accepted for each examination and organization but not the examinees' identity. No research has focused on this type of data. When we know the difficulty level of these examinations, we can obtain a new index to assess organization ability, how many students pass, and the difficulty of the examinations. This index would reflect the outcomes of their education corresponding to perspectives on examinations. Therefore, we propose a novel method, Peak Shift Estimation, to estimate the difficulty level of an examination based on selectively omitted examination data. First, we apply Peak Shift Estimation to the simulation data and demonstrate that Peak Shift Estimation estimates the rank order of the difficulty level of university entrance examinations very robustly. Peak Shift Estimation is also suitable for estimating a multi-level scale for universities, that is, A, B, C, and D rank university entrance examinations. We apply Peak Shift Estimation to real data of the Tokyo metropolitan area and demonstrate that the rank correlation coefficient between difficulty level ranking and true ranking is 0.844 and that the difference between 80 percent of universities is within 25 ranks. The accuracy of Peak Shift Estimation is thus low and must be improved; however, this is the first study to focus on ranking selectively omitted examination data, and therefore, one of our contributions is to shed light on this method.

Submitted 9 March, 2021; originally announced March 2021.

arXiv:2102.08663 [pdf, other]

Preventing Oversmoothing in VAE via Generalized Variance Parameterization

Authors: Yuhta Takida, Wei-Hsiang Liao, Chieh-Hsin Lai, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative. This is often related to the hyperparameter resembling the data variance. It can be shown that an inappropriate choice of this hyperparameter causes the oversmoothness in the linearly approximated case and can be empirically verified for the general cases. Moreover, determining such appropriate choice becomes infeasible if the data variance is non-uniform or conditional. Therefore, we propose VAE extensions with generalized parameterizations of the data variance and incorporate maximum likelihood estimation into the objective function to adaptively regularize the decoder smoothness. The images generated from proposed VAE extensions show improved Fréchet inception distance (FID) on MNIST and CelebA datasets.

Submitted 21 August, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

Comments: 35 pages with 12 figures, accepted for Neurocomputing

arXiv:2010.15306 [pdf, other]

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

Authors: Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task increases system complexity and network size. To address these problems, we propose an activity-coupled Cartesian DOA (ACCDOA) representation, which assigns a sound event activity to the length of a corresponding Cartesian DOA vector. The ACCDOA representation enables us to solve a SELD task with a single target and has two advantages: avoiding the necessity of balancing the objectives and model size increase. In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size. The ACCDOA-based SELD system also performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.

Submitted 14 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: 5 pages, 5 figures, accepted for publication in IEEE ICASSP 2021

arXiv:2010.04228 [pdf, ps, other]

All for One and One for All: Improving Music Separation by Bridging Networks

Authors: Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

Abstract: This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes. First, by using MDL we take advantage of the frequency and time domain representation of audio signals. Next, we utilize the relationship among instruments by jointly considering them. We do this on the one hand by modifying the network architecture and introducing a CrossNet structure. On the other hand, we consider combinations of instrument estimates by using a new combination loss (CL). MDL and CL can easily be applied to many existing DNN-based separation methods as they are merely loss functions which are only used during training and which do not affect the inference step. Experimental results show that the performance of Open-Unmix (UMX), a well-known and state-of-the-art open source library for music separation, can be improved by utilizing our above schemes. Our modifications of UMX are open-sourced together with this paper.

Submitted 11 May, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: Both implementations of our code, i.e., NNabla and PyTorch, are available with this latest paper

arXiv:2009.13395 [pdf, ps, other]

CAT STREET: Chronicle Archive of Tokyo Street-fashion

Authors: Satoshi Takahashi, Keiko Yamaguchi, Asuka Watanabe

Abstract: The analysis of daily-life fashion trends can provide us a profound understanding of our societies and cultures. However, no appropriate digital archive exists that includes images illustrating what people wore in their daily lives over an extended period. In this study, we propose a new fashion image archive, Chronicle Archive of Tokyo Street-fashion (CAT STREET), to shed light on daily-life fashion trends. CAT STREET includes images showing what people wore in their daily lives during 1970–2017, and these images contain timestamps and street location annotations. This novel database combined with machine learning enables us to observe daily-life fashion trends over a long term and analyze them quantitatively. To evaluate the potential of our proposed approach with the novel database, we corroborated the rules-of-thumb of two fashion trend phenomena that have been observed and discussed qualitatively in previous studies. Through these empirical analyses, we verified that our approach to quantify fashion trends can help in exploring unsolved research questions. We also demonstrate CAT STREET's potential to find new standpoints to promote the understanding of societies and cultures through fashion embedded in consumers' daily lives.

Submitted 29 April, 2021; v1 submitted 28 September, 2020; originally announced September 2020.

Comments: 19 pages, 17 figures

arXiv:2009.06257 [pdf, other]

doi 10.1142/S0218348X2150033X

A Comparison of Two Fluctuation Analyses for Natural Language Clustering Phenomena: Taylor and Ebeling & Neiman Methods

Authors: Kumiko Tanaka-Ishii, Shuntaro Takahashi

Abstract: This article considers the fluctuation analysis methods of Taylor and Ebeling & Neiman. While both have been applied to various phenomena in the statistical mechanics domain, their similarities and differences have not been clarified. After considering their analytical aspects, this article presents a large-scale application of these methods to text. It is found that both methods can distinguish real text from independently and identically distributed (i.i.d.) sequences. Furthermore, it is found that the Taylor exponents acquired from words can roughly distinguish text categories; this is also the case for Ebeling and Neiman exponents, but to a lesser extent. Additionally, both methods show some possibility of capturing script kinds.

Submitted 14 September, 2020; originally announced September 2020.

Journal ref: Fractals, in 2021, No.2. https://www.worldscientific.com/toc/fractals/0/ja

arXiv:2006.12014 [pdf, other]

Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net

Authors: Kazuki Shimada, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Our systems submitted to the DCASE2020 task 3: Sound Event Localization and Detection (SELD) are described in this report. We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines those results. As the single-stage system, we propose a unified training framework that uses an activity-coupled Cartesian DOA vector (ACCDOA) representation as a single target for both the SED and SEL tasks. To efficiently estimate sound event locations and activities, we further propose RD3Net, which incorporates recurrent and convolution layers with dense skip connections and dilation. To generalize the models, we apply three data augmentation techniques: equalized mixture data augmentation (EMDA), rotation of first-order Ambisonic (FOA) signals, and multichannel extension of SpecAugment. Our systems demonstrate a significant improvement over the baseline system.

Submitted 7 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: Submitted to DCASE2020 task3

arXiv:1906.09379 [pdf, other]

Evaluating Computational Language Models with Scaling Properties of Natural Language

Authors: Shuntaro Takahashi, Kumiko Tanaka-Ishii

Abstract: In this article, we evaluate computational models of natural language with respect to the universal statistical behaviors of natural language. Statistical mechanical analyses have revealed that natural language text is characterized by scaling properties, which quantify the global structure in the vocabulary population and the long memory of a text. We study whether five scaling properties (given by Zipf's law, Heaps' law, Ebeling's method, Taylor's law, and long-range correlation analysis) can serve for evaluation of computational models. Specifically, we test $n$-gram language models, a probabilistic context-free grammar (PCFG), language models based on Simon/Pitman-Yor processes, neural language models, and generative adversarial networks (GANs) for text generation. Our analysis reveals that language models based on recurrent neural networks (RNNs) with a gating mechanism (i.e., long short-term memory, LSTM; a gated recurrent unit, GRU; and quasi-recurrent neural networks, QRNNs) are the only computational models that can reproduce the long memory behavior of natural language. Furthermore, through comparison with recently proposed model-based evaluation methods, we find that the exponent of Taylor's law is a good indicator of model quality.

Submitted 21 June, 2019; originally announced June 2019.

Comments: 32 pages, accepted by Computational Linguistics

arXiv:1809.03776 [pdf, other]

Solving Non-identifiable Latent Feature Models

Authors: Ryota Suzuki, Shingo Takahashi, Murtuza Petladwala, Shigeru Kohmoto

Abstract: Latent feature models (LFMs) are widely employed for extracting latent structures of data. While offering high, parameter estimation is difficult with LFMs because of the combinational nature of latent features, and non-identifiability is a particularly difficult problem when parameter estimation is not unique and there exist equivalent solutions. In this paper, a necessary and sufficient condition for non-identifiability is shown. The condition is significantly related to dependency of features, and this implies that non-identifiability may often occur in real-world applications. A novel method for parameter estimation that solves the non-identifiability problem is also proposed. This method can be combined as a post-process with existing methods and can find an appropriate solution by hopping efficiently through equivalent solutions. We have evaluated the effectiveness of the method on both synthetic and real-world datasets.

Submitted 26 September, 2018; v1 submitted 11 September, 2018; originally announced September 2018.

Comments: Submitted to NIPS 2018 (https://nips.cc/). 15 pages, 4 figures

arXiv:1804.08881 [pdf, other]

Assessing Language Models with Scaling Properties

Authors: Shuntaro Takahashi, Kumiko Tanaka-Ishii

Abstract: Language models have primarily been evaluated with perplexity. While perplexity quantifies the most comprehensible prediction performance, it does not provide qualitative information on the success or failure of models. Another approach for evaluating language models is thus proposed, using the scaling properties of natural language. Five such tests are considered, with the first two accounting for the vocabulary population and the other three for the long memory of natural language. The following models were evaluated with these tests: n-grams, probabilistic context-free grammar (PCFG), Simon and Pitman-Yor (PY) processes, hierarchical PY, and neural language models. Only the neural language models exhibit the long memory properties of natural language, but to a limited degree. The effectiveness of every test of these models is also discussed.

Submitted 24 April, 2018; originally announced April 2018.

Comments: 14 pages, 16 figures

arXiv:1802.06564 [pdf, other]

A 4-Approximation Algorithm for k-Prize Collecting Steiner Tree Problems

Authors: Yusa Matsuda, Satoshi Takahashi

Abstract: This paper studies a 4-approximation algorithm for k-prize collecting Steiner tree problems. This problem generalizes both k-minimum spanning tree problems and prize collecting Steiner tree problems. Our proposed algorithm employs two 2-approximation algorithms for k-minimum spanning tree problems and prize collecting Steiner tree problems. Also, our algorithm framework can be applied to a special case of k-prize collecting traveling salesman problems.

Submitted 19 February, 2018; originally announced February 2018.

Comments: This article is under review

arXiv:1707.04848 [pdf, other]

doi 10.1371/journal.pone.0189326

Do Neural Nets Learn Statistical Laws behind Natural Language?

Authors: Shuntaro Takahashi, Kumiko Tanaka-Ishii

Abstract: The performance of deep learning in natural language processing has been spectacular, but the reasons for this success remain unclear because of the inherent complexity of deep learning. This paper provides empirical evidence of its effectiveness and of a limitation of neural networks for language engineering. Precisely, we demonstrate that a neural language model based on long short-term memory (LSTM) effectively reproduces Zipf's law and Heaps' law, two representative statistical properties underlying natural language. We discuss the quality of reproducibility and the emergence of Zipf's law and Heaps' law as training progresses. We also point out that the neural language model has a limitation in reproducing long-range correlation, another statistical property of natural language. This understanding could provide a direction for improving the architectures of neural networks.

Submitted 28 November, 2017; v1 submitted 16 July, 2017; originally announced July 2017.

Comments: 21 pages, 11 figures

arXiv:1206.1148 [pdf, other]

From individual to population: Challenges in Medical Visualization

Authors: Charl P. Botha, Bernhard Preim, Arie Kaufman, Shigeo Takahashi, Anders Ynnerman

Abstract: In this paper, we first give a high-level overview of medical visualization development over the past 30 years, focusing on key developments and the trends that they represent. During this discussion, we will refer to a number of key papers that we have also arranged on the medical visualization research timeline. Based on the overview and our observations of the field, we then identify and discuss the medical visualization research challenges that we foresee for the coming decade.

Submitted 7 August, 2012; v1 submitted 6 June, 2012; originally announced June 2012.

Comments: Improvements based on comments by reviewers: Typos and layout issues fixed. Added two more multi-modal volume rendering references to 2.1. Added more detail on Virtual Colonoscopy to 2.2

Showing 1–43 of 43 results for author: Takahashi, S