Search | arXiv e-print repository

doi 10.18653/v1/2023.emnlp-main.154

Learning Retrieval Augmentation for Personalized Dialogue Generation

Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang

Abstract: Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose $\textbf{L}$earning Retrieval $\textbf{A}$ugmentation for $\textbf{P}$ersonalized $\textbf{D}$ial$\textbf{O}$gue $\textbf{G}$eneration ($\textbf{LAPDOG}$), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration: https://github.com/hqsiswiliam/LAPDOG

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to EMNLP-2023

arXiv:2406.18187 [pdf, other]

Selective Prompting Tuning for Personalized Conversations with LLMs

Authors: Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang

Abstract: In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose \textbf{S}elective \textbf{P}rompt \textbf{T}uning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90\%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (https://github.com/hqsiswiliam/SPT) is publicly available for further exploration.

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 findings

arXiv:2405.19312 [pdf, other]

Design-based Causal Inference for Balanced Incomplete Block Designs

Authors: Taehyeon Koo, Nicole E. Pashley

Abstract: Researchers often turn to block randomization to increase the precision of their inference or due to practical considerations, such as in multi-site trials. However, if the number of treatments under consideration is large it might not be practical or even feasible to assign all treatments within each block. We develop novel inference results under the finite-population design-based framework for a natural alternative to the complete block design that does not require reducing the number of treatment arms, the balanced incomplete block design (BIBD). This includes deriving the properties of two estimators for BIBDs and proposing conservative variance estimators. To assist practitioners in understanding the trade-offs of using BIBDs over other designs, the precisions of resulting estimators are compared to standard estimators for the complete block, cluster-randomized, and completely randomized designs. Simulations and a data illustration demonstrate the trade-offs of using BIBDs. This work highlights BIBDs as practical and currently underutilized designs.

Submitted 1 July, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.16835 [pdf]

Superionic surface Li-ion transport in carbonaceous materials

Authors: Jianbin Zhou, Shen Wang, Chaoshan Wu, Ji Qi, Hongli Wan, Shen Lai, Shijie Feng, Tsz Wai Ko, Zhaohui Liang, Ke Zhou, Nimrod Harpak, Nick Solan, Mengchen Liu, Zeyu Hui, Paulina J. Ai, Kent Griffith, Chunsheng Wang, Shyue Ping Ong, Yan Yao, Ping Liu

Abstract: Unlike Li-ion transport in the bulk of carbonaceous materials, little is known about Li-ion diffusion on their surface. In this study, we have discovered an ultra-fast Li-ion transport phenomenon on the surface of carbonaceous materials, particularly when they have limited Li insertion capacity along with a high surface area. This is exemplified by a carbon black, Ketjen Black (KB). An ionic conductivity of 18.1 mS cm$^{-1}$ at room temperature is observed, far exceeding most solid-state ion conductors. Theoretical calculations reveal a low diffusion barrier for the surface Li species. The species is also identified as Li*, which features a partial positive charge. As a result, lithiated KB functions effectively as an interlayer between Li and solid-state electrolytes (SSE) to mitigate dendrite growth and cell shorting. This function is found to be electrolyte agnostic, effective for both sulfide and halide SSEs. Further, lithiated KB can act as a high-performance mixed ion/electron conductor that is thermodynamically stable at potentials near Li metal. A graphite anode mixed with KB instead of a solid electrolyte demonstrates full utilization with a capacity retention of ~85% over 300 cycles. The discovery of this surface-mediated ultra-fast Li-ion transport mechanism provides new directions for the design of solid-state ion conductors and solid-state batteries.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 21 pages, 6 figures

arXiv:2404.07362 [pdf, other]

doi 10.1145/3613905.3650756

"We Need Structured Output": Towards User-centered Constraints on Large Language Model Output

Authors: Michael Xieyang Liu, Frederick Liu, Alexander J. Fiannaca, Terry Koo, Lucas Dixon, Michael Terry, Carrie J. Cai

Abstract: Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adheres to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool.

Submitted 10 April, 2024; originally announced April 2024.

Journal ref: "We Need Structured Output": Towards User-centered Constraints on LLM Output. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), May 11-16, 2024, Honolulu, HI, USA

arXiv:2402.12647 [pdf, other]

DiffusionNOCS: Managing Symmetry and Uncertainty in Sim2Real Multi-Modal Category-level Pose Estimation

Authors: Takuya Ikeda, Sergey Zakharov, Tianyi Ko, Muhammad Zubair Irshad, Robert Lee, Katherine Liu, Rares Ambrus, Koichi Nishiwaki

Abstract: This paper addresses the challenging problem of category-level pose estimation. Current state-of-the-art methods for this task face challenges when dealing with symmetric objects and when attempting to generalize to new environments solely through synthetic data training. In this work, we address these challenges by proposing a probabilistic model that relies on diffusion to estimate dense canonical maps crucial for recovering partial object shapes as well as establishing correspondences essential for pose estimation. Furthermore, we introduce critical components to enhance performance by leveraging the strength of the diffusion models with multi-modal input representations. We demonstrate the effectiveness of our method by testing it on a range of real datasets. Despite being trained solely on our generated synthetic data, our approach achieves state-of-the-art performance and unprecedented generalization qualities, outperforming baselines, even those specifically trained on the target domain.

Submitted 5 March, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: 8 pages. 9 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2401.12487 [pdf]

Radio emission from SN 1181 hosting a white dwarf merger product

Authors: Takatoshi Ko, Daichi Tsuna, Bunyo Hatsukade, Toshikazu Shigeyama

Abstract: The remnant of the historical supernova 1181 is claimed to be associated with a white dwarf merger remnant J005311. The supernova remnant (SNR) shock, and a termination shock expected to be formed by the intense wind of J005311, are potential sites for radio emission via synchrotron emission from shock-accelerated electrons. In this paper, we estimate the radio emission from these two shocks, and find the peak radio flux to be 0.1--10 mJy (at 0.01--1 GHz) in the outer SNR shock and 0.01--0.1 mJy (at 1--10 GHz) in the inner termination shock. We also search for radio emission from this source in the archival data of the Karl G. Jansky Very Large Array (VLA) Sky Survey at 3 GHz, NRAO VLA Sky Survey at 1.4 GHz and the Canadian Galactic Plane Survey at 408 MHz, resulting in no significant detection. While targeted observations with higher sensitivity are desired, we particularly encourage those at higher frequency and angular resolution to probe the inner termination shock and its evolution.

Submitted 15 April, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

Comments: 8 pages, 4 figures, 1 Japanese movie (https://j005311.com/). Accepted for publication in PASJ

Report number: RESCEU-1/24

arXiv:2312.13585 [pdf, other]

Speech Translation with Large Language Models: An Industrial Practice

Authors: Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

Abstract: Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that the implementation of Chain-of-Thought (CoT) prompting can yield advantages in the context of LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/llm-st/.

Submitted 21 December, 2023; originally announced December 2023.

Comments: Technical report. 13 pages. Demo: https://speechtranslation.github.io/llm-st/

arXiv:2312.11804 [pdf, other]

Gravity-aware Grasp Generation with Implicit Grasp Mode Selection for Underactuated Hands

Authors: Tianyi Ko, Takuya Ikeda, Thomas Stewart, Robert Lee, Koichi Nishiwaki

Abstract: Learning-based grasp detectors typically assume a precision grasp, where each finger only has one contact point, and estimate the grasp probability. In this work, we propose a data generation and learning pipeline that can leverage power grasping, which has more contact points with an enveloping configuration and is robust against both positioning error and force disturbance. To train a grasp detector to prioritize power grasping while still keeping precision grasping as the secondary choice, we propose to train the network against the magnitude of disturbance in the gravity direction a grasp can resist (gravity-rejection score) rather than the binary classification of success. We also provide an efficient data generation pipeline for a dataset with gravity-rejection score annotation. In addition to thorough ablation studies, quantitative evaluation in both simulation and real-robot clarifies the significant improvement in our approach, especially when the objects are heavy.

Submitted 28 February, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2311.00088 [pdf, other]

Random coordinate descent: a simple alternative for optimizing parameterized quantum circuits

Authors: Zhiyan Ding, Taehee Ko, Jiahao Yao, Lin Lin, Xiantao Li

Abstract: Variational quantum algorithms rely on the optimization of parameterized quantum circuits in noisy settings. The commonly used back-propagation procedure in classical machine learning is not directly applicable in this setting due to the collapse of quantum states after measurements. Thus, gradient estimations constitute a significant overhead in a gradient-based optimization of such quantum circuits. This paper introduces a random coordinate descent algorithm as a practical and easy-to-implement alternative to the full gradient descent algorithm. This algorithm only requires one partial derivative at each iteration. Motivated by the behavior of measurement noise in the practical optimization of parameterized quantum circuits, this paper presents an optimization problem setting that is amenable to analysis. Under this setting, the random coordinate descent algorithm exhibits the same level of stochastic stability as the full gradient approach, making it as resilient to noise. The complexity of the random coordinate descent method is generally no worse than that of the gradient descent and can be much better for various quantum optimization problems with anisotropic Lipschitz constants. Theoretical analysis and extensive numerical experiments validate our findings.

Submitted 28 June, 2024; v1 submitted 31 October, 2023; originally announced November 2023.
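
Illustration: the abstract above describes an update that needs only one partial derivative per iteration. The sketch below is a minimal, noise-free version of that idea on a toy quadratic; it is not the authors' code. The function names, the objective, the step size, and the iteration count are all assumptions chosen for illustration, and a real parameterized quantum circuit would supply a noisy estimate of the partial derivative instead.

```python
import random

def random_coordinate_descent(partial, x0, step, iters, seed=0):
    """Minimize a function using one partial derivative per iteration.

    `partial(x, i)` is a hypothetical callback returning df/dx_i at x.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(len(x))      # pick one coordinate uniformly at random
        x[i] -= step * partial(x, i)   # update only that coordinate
    return x

# Toy anisotropic objective (an assumption): f(x) = 0.5 * sum(a_i * x_i**2),
# whose i-th partial derivative is a_i * x_i.
CURVATURES = [1.0, 10.0, 100.0]

def quadratic_partial(x, i):
    return CURVATURES[i] * x[i]

x_final = random_coordinate_descent(
    quadratic_partial, [1.0, 1.0, 1.0], step=0.009, iters=3000
)
```

Each iteration here costs a single partial-derivative evaluation, whereas a full gradient step on the same toy problem would cost three.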

arXiv:2309.00169 [pdf, other]

RepCodec: A Speech Representation Codec for Speech Tokenization

Authors: Zhichao Huang, Chutong Meng, Tom Ko

Abstract: With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs which reconstruct the raw audio, RepCodec learns a vector quantization codebook through reconstructing speech representations from speech encoders like HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. The extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.

Submitted 6 June, 2024; v1 submitted 31 August, 2023; originally announced September 2023.
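
Illustration: both a vector quantization codebook and a k-means tokenizer turn a continuous feature vector into a discrete token by looking up the nearest entry. The sketch below shows only that generic lookup step; it is not RepCodec's architecture or training procedure. The codebook values, the frame vectors, and the `quantize` name are made up for illustration.

```python
def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]

# Hypothetical 2-D codebook and speech-representation frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.4, 0.4]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 0]
```

The approaches differ in how the codebook entries are obtained (clustering versus learning through reconstruction), not in this lookup.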

arXiv:2308.10785 [pdf, other]

Simulating Hydrogen-poor Interaction-Powered Supernovae with CHIPS

Authors: Yuki Takei, Daichi Tsuna, Takatoshi Ko, Toshikazu Shigeyama

Abstract: We present the updated open-source code Complete History of Interaction-Powered Supernovae (CHIPS) that can be applied to modeling supernovae (SNe) arising from an interaction with massive circumstellar medium (CSM) as well as the formation process of the CSM. Our update mainly concerns extensions to hydrogen-poor SNe from stripped progenitors, targeting modeling of interaction-powered SNe Ibc such as Type Ibn and Icn SNe. We successfully reproduce the basic properties of the light curves of these types of SNe that occur after partial eruption of the outermost layer with a mass of $0.01$--$0.1\,M_\odot$ at $\lesssim 1$ year before explosion. We also find that the luminosity of the observed precursors can be naturally explained by the outburst that creates the dense CSM, given that the energy of the outburst is efficiently dissipated by collision with an external material, possibly generated by a previous mass eruption. We discuss possible scenarios causing eruptive mass-loss based on our results.

Submitted 18 November, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

Comments: 17 pages, 9 figures, accepted for publication in ApJ. The updates to the CHIPS code have been released as v2.0 (https://github.com/DTsuna/CHIPS)

Report number: RESCEU-25/23

arXiv:2307.13710 [pdf, other]

Robust Training of Machine Learning Interatomic Potentials with Dimensionality Reduction and Stratified Sampling

Authors: Ji Qi, Tsz Wai Ko, Brandon C. Wood, Tuan Anh Pham, Shyue Ping Ong

Abstract: Machine learning interatomic potentials (MLIPs) enable the accurate simulation of materials at larger sizes and time scales, and play increasingly important roles in the computational understanding and design of materials. However, MLIPs are only as accurate and robust as the data they are trained on. In this work, we present DImensionality-Reduced Encoded Clusters with sTratified (DIRECT) sampling as an approach to select a robust training set of structures from a large and complex configuration space. By applying DIRECT sampling on the Materials Project relaxation trajectories dataset with over one million structures and 89 elements, we develop an improved materials 3-body graph network (M3GNet) universal potential that extrapolates more reliably to unseen structures. We further show that molecular dynamics (MD) simulations with universal potentials such as M3GNet can be used in place of expensive \textit{ab initio} MD to rapidly create a large configuration space for target materials systems. Combined with DIRECT sampling, we develop a highly reliable moment tensor potential for the Ti-H system without the need for iterative optimization. This work paves the way towards robust high throughput development of MLIPs across any compositional complexity.

Submitted 24 July, 2023; originally announced July 2023.

arXiv:2307.07067 [pdf, other]

Implementation of the Density-functional Theory on Quantum Computers with Linear Scaling with respect to the Number of Atoms

Authors: Taehee Ko, Xiantao Li, Chunhao Wang

Abstract: Density-functional theory (DFT) has revolutionized computer simulations in chemistry and material science. A faithful implementation of the theory requires self-consistent calculations. However, this effort involves repeatedly diagonalizing the Hamiltonian, for which a classical algorithm typically requires a computational complexity that scales cubically with respect to the number of electrons. This limits DFT's applicability to large-scale problems with complex chemical environments and microstructures. This article presents a quantum algorithm that has a linear scaling with respect to the number of atoms, which is much smaller than the number of electrons. Our algorithm leverages the quantum singular value transformation (QSVT) to generate a quantum circuit to encode the density-matrix, and an estimation method for computing the output electron density. In addition, we present a randomized block coordinate fixed-point method to accelerate the self-consistent field calculations by reducing the number of components of the electron density that needs to be estimated. The proposed framework is accompanied by a rigorous error analysis that quantifies the function approximation error, the statistical fluctuation, and the iteration complexity. In particular, the analysis of our self-consistent iterations takes into account the measurement noise from the quantum circuit. These advancements offer a promising avenue for tackling large-scale DFT problems, enabling simulations of complex systems that were previously computationally infeasible.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2306.11646 [pdf, other]

Recent Advances in Direct Speech-to-text Translation

Authors: Chen Xu, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu

Abstract: Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and application issues. To tackle the problem of modeling burden, two main structures have been proposed, encoder-decoder framework (Transformer and the variants) and multitask frameworks. For the challenge of data scarcity, recent work resorts to many sophisticated techniques, such as data augmentation, pre-training, knowledge distillation, and multilingual modeling. We analyze and summarize the application issues, which include real-time, segmentation, named entity, gender bias, and code-switching. Finally, we discuss some promising directions for future work.

Submitted 20 June, 2023; originally announced June 2023.

Comments: An expanded version of the paper accepted by IJCAI2023 survey track

arXiv:2306.10493 [pdf, other]

MOSPC: MOS Prediction Based on Pairwise Comparison

Authors: Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang

Abstract: As a subjective metric to evaluate the quality of synthesized speech, mean opinion score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. A MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech when the MOS scores are close. However, in practical applications, it is more important to correctly rank the quality of synthesis systems or sentences than simply predicting MOS scores. Meanwhile, as each annotator scores multiple audios during annotation, the score is probably a relative value based on the first or the first few speech scores given by the annotator. Motivated by the above two points, we propose a general framework for MOS prediction based on pair comparison (MOSPC), and we utilize the C-Mixup algorithm to enhance the generalization performance of MOSPC. The experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most of the correlation coefficient metrics, especially on the metric KTAU related to quality ranking. Our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework contributes to improving the ranking accuracy of speech quality.

Submitted 18 June, 2023; originally announced June 2023.

arXiv:2306.08273 [pdf, other]

Beyond potential energy surface benchmarking: a complete application of machine learning to chemical reactivity

Authors: Xingyi Guan, Joseph Heindel, Taehee Ko, Chao Yang, Teresa Head-Gordon

Abstract: We train an equivariant machine learning model to predict energies and forces for a real-world study of hydrogen combustion under conditions of finite temperature and pressure. This challenging case for reactive chemistry illustrates that ML learned potential energy surfaces (PESs) are always incomplete as they are overly reliant on chemical intuition of what data is important for training, i.e. stable or metastable energy states. Instead we show here that a negative design data acquisition strategy is necessary to create a more complete ML model of the PES, since it must also learn avoidance of unforeseen high energy intermediates or even unphysical energy configurations. Because this type of data is unintuitive to create, we introduce an active learning workflow based on metadynamics that samples a lower dimensional manifold within collective variables that efficiently creates highly variable energy configurations for further ML training. This strategy more rapidly completes the ML PES such that deviations among query by committee ML models helps to now signal occasional calls to the external ab initio data source to further molecular dynamics in time without need for retraining the ML model. With the hybrid ML-physics model we predict the change in transition state and/or reaction mechanism at finite temperature and pressure for hydrogen combustion, thereby delivering on the promise of real application work using ML trained models of an ab initio PES with two orders of magnitude reduction in cost.

Submitted 14 June, 2023; originally announced June 2023.

arXiv:2306.02982 [pdf, other]

PolyVoice: Language Models for Speech to Speech Translation

Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.

Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.17358 [pdf, other]

CTC-based Non-autoregressive Speech Translation

Authors: Chen Xu, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma, Jingbo Zhu

Abstract: Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST). In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides has obvious challenges: 1) the conditional independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve convergence of training. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$, which is comparable to the autoregressive counterpart and even outperforms the previous best result by 0.9 BLEU points.

Submitted 26 May, 2023; originally announced May 2023.

Comments: ACL 2023 Main Conference

arXiv:2305.11411 [pdf, other]

DUB: Discrete Unit Back-translation for Speech Translation

Authors: Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou

Abstract: How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, the approach of representing speech with unsupervised discrete units yields a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to answer two questions: (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, the back-translation technique can successfully be applied on direct ST and obtains an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves comparable performance to existing methods that rely on large-scale external data. Code and models are available at https://github.com/0nutation/DUB.

Submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted to Findings of ACL 2023

arXiv:2305.10692 [pdf, other]

Accurate Fourth-Generation Machine Learning Potentials by Electrostatic Embedding

Authors: Tsz Wai Ko, Jonas A. Finkler, Stefan Goedecker, Jörg Behler

Abstract: In recent years, significant progress has been made in the development of machine learning potentials (MLPs) for atomistic simulations with applications in many fields from chemistry to materials science. While most current MLPs are based on environment-dependent atomic energies, the limitations of this locality approximation can be overcome, e.g., in fourth-generation MLPs, which incorporate long-range electrostatic interactions based on an equilibrated global charge distribution. Apart from the considered interactions, the quality of MLPs crucially depends on the information available about the system, i.e., the descriptors. In this work we show that including -- in addition to structural information -- the electrostatic potential arising from the charge distribution in the atomic environments significantly improves the quality and transferability of the potentials. Moreover, the extended descriptor allows to overcome current limitations of two- and three-body based feature vectors regarding artificially degenerate atomic environments. The capabilities of such an electrostatically embedded fourth-generation high-dimensional neural network potential (ee4G-HDNNP), which is further augmented by pairwise interactions, are demonstrated for NaCl as a benchmark system. Employing a data set containing only neutral and negatively charged NaCl clusters, even small energy differences between different cluster geometries can be resolved, and the potential shows an impressive transferability to positively charged clusters as well as the melt.

Submitted 18 May, 2023; originally announced May 2023.

Comments: 41 pages, 7 figures, accepted

Journal ref: J. Chem. Theory Comput., 2023

arXiv:2305.07198 [pdf, other]

Model Predictive Control of Smart Districts Participating in Frequency Regulation Market: A Case Study of Using Heating Network Storage

Authors: Hikaru Hoshino, T. John Koo, Yun-Chung Chu, Yoshihiko Susuki

Abstract: Flexibility provided by Combined Heat and Power (CHP) units in district heating networks is an important means to cope with increasing penetration of intermittent renewable energy resources, and various methods have been proposed to exploit thermal storage tanks installed in these networks. This paper studies a novel problem motivated by an example of district heating and cooling networks in Japan, where high-temperature steam is used as the heating medium. In steam-based networks, storage tanks are usually absent, and there is a strong need to utilize thermal inertia of the pipeline network as storage. However, this type of use of a heating network directly affects the operating condition of the network, and assuring safety and supply quality at the use side is an open problem. To address this, we formulate a novel control problem to utilize CHP units in frequency regulation market while satisfying physical constraints on a steam network described by a nonlinear model capturing dynamics of heat flows and heat accumulation in the network. Furthermore, a Model Predictive Control (MPC) framework is proposed to solve this problem. By consistently combining several nonlinear control techniques, a computationally efficient MPC controller is obtained and shown to work in real-time.

Submitted 11 May, 2023; originally announced May 2023.

arXiv:2304.14669 [pdf, other]

A dynamical model for IRAS 00500+6713: the remnant of a type Iax supernova SN 1181 hosting a double degenerate merger product WD J005311

Authors: Takatoshi Ko, Hiromasa Suzuki, Kazumi Kashiyama, Hiroyuki Uchida, Takaaki Tanaka, Daichi Tsuna, Kotaro Fujisawa, Aya Bamba, Toshikazu Shigeyama

Abstract: IRAS 00500+6713 is a hypothesized remnant of a type Iax supernova SN 1181. Multi-wavelength observations have revealed its complicated morphology; a dusty infrared ring is sandwiched by the inner and outer X-ray nebulae. We analyze the archival X-ray data taken by XMM-Newton and Chandra to constrain the angular radius, mass, and metal abundance of the X-ray nebulae, and construct a theoretical model describing the dynamical evolution of IRAS 00500+6713, including the effects of the interaction between the SN ejecta and the intense wind enriched with carbon burning ashes from the central white dwarf (WD) J005311. We show that the inner X-ray nebula corresponds to the wind termination shock while the outer X-ray nebula to the shocked interface between the SN ejecta and the interstellar matter. The observed X-ray properties can be explained by our model with an ejecta kinetic energy of $E_\mathrm{ej} = (0.77 \mbox{--} 1.1)\times 10^{48}$ erg, an ejecta mass of $M_\mathrm{ej} = 0.18\mbox{--}0.53~M_\odot$, if the currently observed wind from WD J005311 started to blow $t_\mathrm{w} \gtrsim 810$ yr after the explosion, i.e., approximately after A.D. 1990. The inferred SN properties are compatible with those of Type Iax SNe and the timing of the wind launch may correspond to the Kelvin-Helmholtz contraction of the oxygen-neon core of WD J005311 that triggered a surface carbon burning. Our analysis supports that IRAS 00500+6713 is the remnant of SN Iax 1181 produced by a double degenerate merger of oxygen-neon and carbon-oxygen WDs, and WD J005311 is the surviving merger product.

Submitted 26 May, 2024; v1 submitted 28 April, 2023; originally announced April 2023.

Comments: 24 pages, 13 figures, 4 tables, accepted by ApJ

Report number: RESCEU-10/23

arXiv:2304.09296 [pdf, other]

Using Diffusion Maps to Analyze Reaction Dynamics for a Hydrogen Combustion Benchmark Dataset

Authors: Taehee Ko, Joseph Heindel, Xingyi Guan, Teresa Head-Gordon, David Williams-Young, Chao Yang

Abstract: We use local diffusion maps to assess the quality of two types of collective variables (CVs) for a recently published hydrogen combustion benchmark dataset~\cite{guan2022benchmark} that contains ab initio molecular dynamics trajectories and normal modes along minimum energy paths. This approach was recently advocated in~\cite{tlldiffmap20} for assessing CVs and analyzing reactions modeled by classical molecular dynamics simulations. We report the effectiveness of this approach to molecular systems modeled by quantum ab initio molecular dynamics. In addition to assessing the quality of CVs, we also use global diffusion maps to perform committor analysis as proposed in~\cite{tlldiffmap20}. We show that the committor function obtained from the global diffusion map allows us to identify transition regions of interest in several hydrogen combustion reaction channels.

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2303.17395 [pdf, other]

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the WavCaps dataset we have proposed to facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing ChatGPT to enhance academic research. Our dataset and codes are available at https://github.com/XinhaoMei/WavCaps.

Submitted 30 March, 2023; originally announced March 2023.

Comments: 12 pages

arXiv:2302.09755 [pdf, other]

doi 10.1145/3511808.3557324

Finding Heterophilic Neighbors via Confidence-based Subgraph Matching for Semi-supervised Node Classification

Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim

Abstract: Graph Neural Networks (GNNs) have proven to be powerful in many graph-based applications. However, they fail to generalize well under heterophilic setups, where neighbor nodes have different labels. To address this challenge, we employ a confidence ratio as a hyper-parameter, assuming that some of the edges are disassortative (heterophilic). Here, we propose a two-phased algorithm. Firstly, we determine edge coefficients through subgraph matching using a supplementary module. Then, we apply GNNs with a modified label propagation mechanism to utilize the edge coefficients effectively. Specifically, our supplementary module identifies a certain proportion of task-irrelevant edges based on a given confidence ratio. Using the remaining edges, we employ the widely used optimal transport to measure the similarity between two nodes with their subgraphs. Finally, using the coefficients as supplementary information on GNNs, we improve the label propagation mechanism which can prevent two nodes with smaller weights from being closer. The experiments on benchmark datasets show that our model alleviates over-smoothing and improves performance.

Submitted 12 April, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

Comments: Proceedings of the 31st ACM International Conference on Information & Knowledge Management

arXiv:2302.06823 [pdf]

doi 10.1016/j.xcrp.2023.101762

Proximity-induced quasi-one-dimensional superconducting quantum anomalous Hall state: a promising scalable top-down approach towards localized Majorana modes

Authors: Omargeldi Atanov, Wai Ting Tai, Ying-Ming Xie, Yat Hei Ng, Molly A. Hammond, Tin Seng Manfred Ho, Tsin Hei Koo, Hui Li, Sui Lun Ho, Jian Lyu, Sukong Chong, Peng Zhang, Lixuan Tai, Jiannong Wang, Kam Tuen Law, Kang L. Wang, Rolf Lortz

Abstract: In this work, ~100 nm wide quantum anomalous Hall insulator (QAHI) nanoribbons are etched from a two-dimensional QAHI film. One part of the nanoribbon is covered with superconducting Nb, while the other part is connected to an Au lead via two-dimensional QAHI regions. Andreev reflection spectroscopy measurements were performed, and multiple in-gap conductance peaks were observed in three different devices. In the presence of an increasing magnetic field perpendicular to the QAHI film, the multiple in-gap peak structure evolves into a single zero-bias conductance peak (ZBCP). Theoretical simulations suggest that the measurements are consistent with the scenario that the increasing magnetic field drives the nanoribbons from a multi-channel occupied regime to a single channel occupied regime, and that the ZBCP may be induced by zero energy Majorana modes as previously predicted [24]. Although further experiments are needed to clarify the nature of the ZBCP, we provide initial evidence that quasi-1D QAHI nanoribbon/superconductor heterostructures are new and promising platforms for realizing zero-energy Majorana modes.

Submitted 13 February, 2023; originally announced February 2023.

Journal ref: Cell Reports Physical Science 5, 101762 (2024)

arXiv:2301.08918 [pdf, other]

Improving Signed Propagation of Graph Neural Network Under Multiple Classes

Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim

Abstract: Message-passing Graph Neural Networks (GNNs), which collect information from adjacent nodes, achieve dismal performance on heterophilic graphs. Various schemes have been proposed to solve this problem, and propagating signed information on heterophilic edges has gained great attention. Recently, some works provided theoretical analysis that signed propagation always leads to performance improvement under a binary class scenario. However, we notice that prior analyses do not align well with multi-class benchmark datasets. This paper provides a new understanding of signed propagation for multi-class scenarios and points out two drawbacks in terms of message-passing and parameter update: (1) Message-passing: if two nodes belong to different classes but have a high similarity, signed propagation can decrease the separability. (2) Parameter update: the prediction uncertainty (e.g., conflict evidence) of signed neighbors increases during training, which can impede the stability of the algorithm. Based on the observation, we introduce two novel strategies for improving signed propagation under multi-class graphs. The proposed scheme combines calibration to secure robustness while reducing uncertainty. We show the efficacy of our theorem through extensive experiments on six benchmark graph datasets.

Submitted 18 June, 2024; v1 submitted 21 January, 2023; originally announced January 2023.

arXiv:2301.05163 [pdf, other]

Signed Directed Graph Contrastive Learning with Laplacian Augmentation

Authors: Taewook Ko, Yoonhyuk Choi, Chong-Kwon Kim

Abstract: Graph contrastive learning has become a powerful technique for several graph mining tasks. It learns discriminative representation from different perspectives of augmented graphs. Ubiquitous in our daily life, signed-directed graphs are the most complex and tricky to analyze among various graph types. That is why signed-directed graph contrastive learning has not been studied much yet, while there are many contrastive studies for unsigned and undirected graphs. Thus, this paper proposes a novel signed-directed graph contrastive learning, SDGCL. It makes two different structurally perturbed graph views and gets node representations via magnetic Laplacian perturbation. We use a node-level contrastive loss to maximize the mutual information between the two graph views. The model is jointly learned with contrastive and supervised objectives. The graph encoder of SDGCL does not depend on social theories or predefined assumptions. Therefore it does not require finding triads or selecting neighbors to aggregate. It leverages only the edge signs and directions via magnetic Laplacian. To the best of our knowledge, it is the first to introduce magnetic Laplacian perturbation and signed spectral graph contrastive learning. The superiority of the proposed model is demonstrated through exhaustive experiments on four real-world datasets. SDGCL shows better performance than other state-of-the-art on four evaluation metrics.

Submitted 12 January, 2023; originally announced January 2023.

Comments: Pre-prints

arXiv:2301.04412 [pdf, ps, other]

RobustIV and controlfunctionIV: Causal Inference for Linear and Nonlinear Models with Invalid Instrumental Variables

Authors: Taehyeon Koo, Youjin Lee, Dylan S. Small, Zijian Guo

Abstract: We present R software packages RobustIV and controlfunctionIV for causal inference with possibly invalid instrumental variables. RobustIV focuses on the linear outcome model. It implements the two-stage hard thresholding method to select valid instrumental variables from a set of candidate instrumental variables and make inferences for the causal effect in both low- and high-dimensional settings. Furthermore, RobustIV implements the high-dimensional endogeneity test and the searching and sampling method, a uniformly valid inference method robust to errors in instrumental variable selection. controlfunctionIV considers the nonlinear outcome model and makes inferences about the causal effect based on the control function method. Our packages are demonstrated using two publicly available economic data sets together with applications to the Framingham Heart Study.

Submitted 20 June, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

arXiv:2212.03657 [pdf, other]

M3ST: Mix at Three Levels for Speech Translation

Authors: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

Abstract: How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. At the second stage of fine-tuning, we take both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.

Submitted 7 December, 2022; originally announced December 2022.

Comments: Submitted to ICASSP 2023

arXiv:2211.15398 [pdf, other]

Leveraging per Image-Token Consistency for Vision-Language Pre-training

Authors: Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang

Abstract: Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable amount of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle those limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. The code is released at https://github.com/gyhdog99/epic.

Submitted 2 September, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

Comments: Accepted by CVPR 2023

arXiv:2211.15081 [pdf, other]

Perturb Initial Features: Generalization of Neural Networks Under Sparse Features for Semi-supervised Node Classification

Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim

Abstract: Graph neural networks (GNNs) are commonly used in semi-supervised settings. Previous research has primarily focused on finding appropriate graph filters (e.g. aggregation methods) to perform well on both homophilic and heterophilic graphs. While these methods are effective, they can still suffer from the sparsity of node features, where the initial data contain few non-zero elements. This can lead to overfitting in certain dimensions in the first projection matrix, as training samples may not cover the entire range of graph filters (hyperplanes). To address this, we propose a novel data augmentation strategy. Specifically, by flipping both the initial features and hyperplane, we create additional space for training, which leads to more precise updates of the learnable parameters and improved robustness for unseen features during inference. To the best of our knowledge, this is the first attempt to mitigate the overfitting caused by the initial features. Extensive experiments on real-world datasets show that our proposed technique increases node classification accuracy by up to 46.5% relatively.

Submitted 28 May, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.02853 [pdf, other]

doi 10.3847/1538-4357/aca095

Self-Similar solution of rotating eruptive outflows on its equatorial plane

Authors: Takatoshi Ko, Kotaro Fujisawa, Toshikazu Shigeyama

Abstract: We construct axisymmetric self-similar solutions of transonic outflows emanating from a point source including the effect of the rotation. The solutions are constructed exclusively on the equatorial plane. The features of solutions are determined by three parameters; the adiabatic index $γ$, the dimensionless coordinate of the transonic point, and the dimensionless azimuthal velocity at the transonic point. We classify the solutions into five groups according to the asymptotic behaviors. We find that the behaviors of the self-similar solutions change at $γ= 11/9$. In addition, some solutions show double-power-law density profiles, which are usually seen in ejecta from a binary merger or nova-like explosion. Thus, our self-similar solutions can be applied not only to the outflow blowing from the central spinning objects, but also to the ejecta erupted from the binary merger or nova-like explosion.

Submitted 5 November, 2022; originally announced November 2022.

Comments: 14 pages, 6 figures, 1 table. Accepted by ApJ

Report number: RESCEU-21/22 MSC Class: 85-10

arXiv:2210.16428 [pdf, other]

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.

Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: INTERSPEECH 2023

arXiv:2210.15088 [pdf, other]

doi 10.1609/aaai.v37i11.26518

Personalized Dialogue Generation with Persona-Adaptive Attention

Authors: Qiushi Huang, Yu Zhang, Tom Ko, Xubo Liu, Bo Wu, Wenwu Wang, Lilian Tang

Abstract: Persona-based dialogue systems aim to generate consistent responses based on historical context and predefined persona. Unlike conventional dialogue generation, the persona-based dialogue needs to consider both dialogue context and persona, posing a challenge for coherent training. Specifically, this requires a delicate weight balance between context and persona. To achieve that, in this paper, we propose an effective framework with Persona-Adaptive Attention (PAA), which adaptively integrates the weights from the persona and context information via our designed attention. In addition, a dynamic masking mechanism is applied to the PAA to not only drop redundant information in context and persona but also serve as a regularization mechanism to avoid overfitting. Experimental results demonstrate the superiority of the proposed PAA framework compared to the strong baselines in both automatic and human evaluation. Moreover, the proposed PAA approach can perform equivalently well in a low-resource regime compared to models trained in a full-data setting, which achieve a similar result with only 20% to 30% of data compared to the larger models trained in the full-data setting. To fully exploit the effectiveness of our design, we designed several variants for handling the weighted information in different ways, showing the necessity and sufficiency of our weighting and masking designs.

Submitted 9 January, 2024; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: 8 pages, 3 figures Accepted by AAAI-2023

arXiv:2210.04062 [pdf, other]

CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Authors: Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li

Abstract: Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech input. Unlike the prior self-distillation approaches of which the teacher and the student are of the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art performance on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.

Submitted 5 July, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

Comments: Accepted by Interspeech 2023

arXiv:2209.05494 [pdf]

doi 10.1002/adma.202205825

Reversibly controlled ternary polar states and ferroelectric bias promoted by boosting square-tensile-strain

Authors: Jun Han Lee, Nguyen Xuan Duong, Min-Hyoung Jung, Hyun-Jae Lee, Ahyoung Kim, Youngki Yeo, Junhyung Kim, Gye-Hyeon Kim, Byeong-Gwan Cho, Jaegyu Kim, Furqan Ul Hassan Naqvi, Jong-Seong Bae, Jeehoon Kim, Chang Won Ahn, Young-Min Kim, Tae Kwon Song, Jae-Hyeon Ko, Tae-Yeong Koo, Changhee Sohn, Kibog Park, Chan-Ho Yang, Sang Mo Yang, Jun Hee Lee, Hu Young Jeong, Tae Heon Kim, et al. (1 additional author not shown)

Abstract: Interaction between dipoles often emerges intriguing physical phenomena, such as exchange bias in the magnetic heterostructures and magnetoelectric effect in multiferroics, which lead to advances in multifunctional heterostructures. However, the defect-dipole tends to be considered the undesired to deteriorate the electronic functionality. Here, we report deterministic switching between the ferroelectric and the pinched states by exploiting a new substrate of cubic perovskite, BaZrO$_{3}$, which boosts square-tensile-strain to BaTiO$_{3}$ and promotes four-variants in-plane spontaneous polarization with oxygen vacancy creation. First-principles calculations propose a complex of an oxygen vacancy and two Ti$^{3+}$ ions coins a charge-neutral defect-dipole. Cooperative control of the defect-dipole and the spontaneous polarization reveals ternary in-plane polar states characterized by biased/pinched hysteresis loops. Furthermore, we experimentally demonstrate that three electrically controlled polar-ordering states lead to switchable and non-volatile dielectric states for application of non-destructive electro-dielectric memory. This discovery opens a new route to develop functional materials via manipulating defect-dipoles and offers a novel platform to advance heteroepitaxy beyond the prevalent perovskite substrates.

Submitted 12 September, 2022; originally announced September 2022.

Comments: According to the Copyright Policy, the submission version (before peer-review and revision)

Journal ref: Advanced Materials, 2205825 (2022)

arXiv:2208.11511 [pdf, other]

A Graph Convolution for Signed Directed Graphs

Authors: Taewook Ko, Chong-Kwon Kim

Abstract: A signed directed graph is a graph with sign and direction information on the edges. Even though signed directed graphs are more informative than unsigned or undirected graphs, they are more complicated to analyze and have received less research attention. This paper investigates a spectral graph convolution model to fully utilize the information embedded in signed directed edges. We propose a novel complex Hermitian adjacency matrix that encodes graph information via complex numbers. Compared to a simple connection-based adjacency matrix, the complex Hermitian can represent edge direction, sign, and connectivity via its phases and magnitudes. Then, we define a magnetic Laplacian of the proposed adjacency matrix and prove that it is positive semi-definite (PSD) for the analyses using spectral graph convolution. We perform extensive experiments on four real-world datasets. Our experiments show that the proposed scheme outperforms several state-of-the-art techniques.

Submitted 16 February, 2023; v1 submitted 22 August, 2022; originally announced August 2022.

Comments: Preprint version

arXiv:2208.02189 [pdf, other]

A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

Authors: Qibing Bai, Tom Ko, Yu Zhang

Abstract: In human speech, the attitude of a speaker cannot be fully expressed only by the textual content. It has to come along with the intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of semantic information. Though it has become more common to complement the systems with extra language models, their performance in modeling rising intonation is not well studied. In this paper, we propose to complement the Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach obtains the best generalization performance and feasibility.

Submitted 3 August, 2022; originally announced August 2022.

Comments: Accepted by INTERSPEECH 2022

arXiv:2205.12756 [pdf, other]

Development of a Stereo-Vision Based High-Throughput Robotic System for Mouse Tail Vein Injection

Authors: Tianyi Ko, Koichi Nishiwaki, Koji Terada, Yusuke Tanaka, Shun Mitsumata, Ryuichi Katagiri, Taketo Junko, Naoshi Horiba, Hideyoshi Igata, Kazue Mizuno

Abstract: In this paper, we present a robotic device for mouse tail vein injection. We propose a mouse holding mechanism to realize vein injection without anesthetizing the mouse, which consists of a tourniquet, vacuum port, and adaptive tail-end fixture. The position of the target vein in 3D space is reconstructed from a high-resolution stereo vision. The vein is detected by a simple but robust vein line detector. Thanks to the proposed two-staged calibration process, the total time for the injection process is limited to 1.5 minutes, despite that the position of needle and tail vein varies for each trial. We performed an injection experiment targeting 40 mice and succeeded to inject saline to 37 of them, resulting 92.5% success ratio.

Submitted 25 May, 2022; originally announced May 2022.

Comments: accepted to ICRA2022 (7 pages, 11 figures, 2 tables)

arXiv:2205.11772 [pdf]

Multi-Augmentation for Efficient Visual Representation Learning for Self-supervised Pre-training

Authors: Van-Nhiem Tran, Chi-En Huang, Shen-Hsuan Liu, Kai-Lin Yang, Timothy Ko, Yung-Hui Li

Abstract: In recent years, self-supervised learning has been studied to deal with the limitation of available labeled-dataset. Among the major components of self-supervised learning, the data augmentation pipeline is one key factor in enhancing the resulting performance. However, most researchers manually designed the augmentation pipeline, and the limited collections of transformation may cause the lack of robustness of the learned feature representation. In this work, we proposed Multi-Augmentations for Self-Supervised Representation Learning (MA-SSRL), which fully searched for various augmentation policies to build the entire pipeline to improve the robustness of the learned feature representation. MA-SSRL successfully learns the invariant feature representation and presents an efficient, effective, and adaptable data augmentation pipeline for self-supervised pre-training on different distribution and domain datasets. MA-SSRL outperforms the previous state-of-the-art methods on transfer and semi-supervised benchmarks while requiring fewer training epochs.

Submitted 24 May, 2022; originally announced May 2022.

arXiv:2205.08993 [pdf, other]

Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

Authors: Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang

Abstract: Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech mapping. In this paper, we report our recent achievements in S2ST. Firstly, we build a S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new state-of-the-art result on the Fisher English-to-Spanish test set. Indeed, we exploit the pseudo data with a combination of popular techniques which are not trivial when applied to S2ST. Moreover, we evaluate our approach on both syntactically similar (Spanish-English) and distant (English-Chinese) language pairs. Our implementation is available at https://github.com/fengpeng-yue/speech-to-speech-translation.

Submitted 18 May, 2022; originally announced May 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2204.03939 [pdf, ps, other]

GigaST: A 10,000-hour Pseudo Speech Translation Corpus

Authors: Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao

Abstract: This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set is translated by human. ST models trained with an addition of our corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set. We provide a detailed description of the translation process and verify its quality. We make the translated text data public and hope to facilitate research in speech translation. Additionally, we also release the training scripts on NeurST to make it easy to replicate our systems. GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST.

Submitted 6 June, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted at Interspeech 2023. GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST

arXiv:2203.17113 [pdf, other]

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

Authors: Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei

Abstract: This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language modeling in encoder output, like HuBERT model, while the other lets the decoder learn to reconstruct pseudo codes autoregressively instead of generating textual scripts. In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text. Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training, and also outperforms significantly the state-of-the-art wav2vec 2.0 and HuBERT on fine-tuning subsets of 10h and 100h. We release our code and model at https://github.com/microsoft/SpeechT5/tree/main/Speech2C.

Submitted 20 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted by Interspeech 2022

arXiv:2203.15610 [pdf, other]

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

Authors: Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

Abstract: Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pruning structured parameters. More precisely, we create a Transformer-based supernet that is nested with thousands of weight-sharing subnets and design a two-stage distillation strategy to leverage the contextualized latent representations from HuBERT. Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures concerning the embedding dimension, attention dimension, head number, feed-forward network ratio, and network depth. LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks with the HuBERT size, achieves comparable performance to the teacher model in most tasks with a reduction of 29% parameters, and obtains a $3.5\times$ compression ratio in three SUPERB tasks, e.g., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss. The code and pre-trained models are available at https://github.com/mechanicalsea/lighthubert.

Submitted 18 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, accepted to Interspeech 2022

arXiv:2203.10973 [pdf, ps, other]

A Local Convergence Theory for the Stochastic Gradient Descent Method in Non-Convex Optimization With Non-isolated Local Minima

Authors: Taehee Ko, Xiantao Li

Abstract: Loss functions with non-isolated minima have emerged in several machine learning problems, creating a gap between theory and practice. In this paper, we formulate a new type of local convexity condition that is suitable to describe the behavior of loss functions near non-isolated minima. We show that such condition is general enough to encompass many existing conditions. In addition we study the local convergence of the SGD under this mild condition by adopting the notion of stochastic stability. The corresponding concentration inequalities from the convergence analysis help to interpret the empirical observation from some practical training results.

Submitted 30 May, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

arXiv:2112.14909 [pdf, other]

doi 10.3847/1538-4357/ac67e1

Eruption of the Envelope of Massive Stars by Energy Injection with Finite Duration

Authors: Takatoshi Ko, Daichi Tsuna, Yuki Takei, Toshikazu Shigeyama

Abstract: A significant fraction of supernovae show signatures of dense circumstellar material (CSM). While multiple scenarios for creating a dense CSM exist, mass eruption due to injection of energy at the base of the outer envelope is a likely possibility. We carry out radiation hydrodynamical simulations of eruptive mass loss from a typical red supergiant progenitor with initial mass of $15\ M_\odot$, for the first time focusing on the timescale of the injection as well as energy. We find that not only sufficient injection energy but also sufficient rate of energy injection per unit time, $L_{\rm{min}} \sim 8\times 10^{40}$ erg s$^{-1}$ in this particular model, is required for eruption of unbound CSM. This result suggests that the energy injection rate needs to be greater than the binding energy of the envelope divided by the dynamical timescale for the eruption. The density profile of the resulting CSM, whose shape was analytically and numerically predicted in the limit of instantaneous energy injection, similarly holds for a finite injection timescale. We discuss our findings in the framework of proposed mass outburst scenarios, specifically wave-driven outbursts and common envelope ejection.

Submitted 14 April, 2022; v1 submitted 29 December, 2021; originally announced December 2021.

Comments: 9 pages, 6 figures, 2 tables, Accepted for publication in ApJ

Report number: RESCEU-25/21

arXiv:2110.12648 [pdf, other]

doi 10.1145/3511808.3557434

Review-Based Domain Disentanglement without Duplicate Users or Contexts for Cross-Domain Recommendation

Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Hyungho Byun, Chong-Kwon Kim

Abstract: A cross-domain recommendation has shown promising results in solving data-sparsity and cold-start problems. Despite such progress, existing methods focus on domain-shareable information (overlapped users or same contexts) for a knowledge transfer, and they fail to generalize well without such requirements. To deal with these problems, we suggest utilizing review texts that are general to most e-commerce systems. Our model (named SER) uses three text analysis modules, guided by a single domain discriminator for disentangled representation learning. Here, we suggest a novel optimization strategy that can enhance the quality of domain disentanglement, and also debilitates detrimental information of a source domain. Also, we extend the encoding network from a single to multiple domains, which has proven to be powerful for review-based recommender systems. Extensive experiments and ablation studies demonstrate that our method is efficient, robust, and scalable compared to the state-of-the-art single and cross-domain recommendation methods.

Submitted 12 April, 2023; v1 submitted 25 October, 2021; originally announced October 2021.

Comments: Proceedings of the 31st ACM International Conference on Information & Knowledge Management

arXiv:2110.07205 [pdf, other]

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Authors: Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

Abstract: Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.

Submitted 24 May, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: Accepted by ACL 2022 main conference

Showing 1–50 of 117 results for author: Koo, T