Showing 1–27 of 27 results for author: Hsu, P

Searching in archive cs.
  1. arXiv:2405.14259

    cs.CL cs.AI

    Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition

    Authors: Chan-Jan Hsu, Yi-Chang Chen, Feng-Ting Liao, Pei-Chen Ho, Yu-Hsiang Wang, Po-Chun Hsu, Da-shan Shiu

    Abstract: We introduce "Generative Fusion Decoding" (GFD), a novel shallow fusion framework, utilized to integrate Large Language Models (LLMs) into multi-modal text recognition systems such as automatic speech recognition (ASR) and optical character recognition (OCR). We derive the formulas necessary to enable GFD to operate across mismatched token spaces of different models by mapping text token space to…

    Submitted 2 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  2. arXiv:2404.14135

    cs.CV

    Text in the Dark: Extremely Low-Light Text Image Enhancement

    Authors: Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach

    Abstract: Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text t…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: The first two authors contributed equally to this work

  3. arXiv:2403.02712

    cs.CL

    Breeze-7B Technical Report

    Authors: Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-Shan Shiu

    Abstract: Breeze-7B is an open-source language model based on Mistral-7B, designed to address the need for improved language comprehension and chatbot-oriented capabilities in Traditional Chinese. This technical report provides an overview of the additional pretraining, finetuning, and evaluation stages for the Breeze-7B model. The Breeze-7B family of base and chat models exhibits good performance on langua…

    Submitted 3 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

  4. arXiv:2312.04257

    cs.AR

    Proxima: Near-storage Acceleration for Graph-based Approximate Nearest Neighbor Search in 3D NAND

    Authors: Weihong Xu, Junwei Chen, Po-Kai Hsu, Jaeyoung Kang, Minxuan Zhou, Sumukh Pinge, Shimeng Yu, Tajana Rosing

    Abstract: Approximate nearest neighbor search (ANNS) plays an indispensable role in a wide variety of applications, including recommendation systems, information retrieval, and semantic search. Among the cutting-edge ANNS algorithms, graph-based approaches provide superior accuracy and scalability on massive datasets. However, the best-performing graph-based ANN search solutions incur tens of hundreds of me…

    Submitted 7 December, 2023; originally announced December 2023.

  5. arXiv:2309.17020

    eess.AS cs.SD

    Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

    Authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

    Abstract: Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…

    Submitted 4 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ASRU 2023 SPARKS Workshop

  6. arXiv:2309.08448

    cs.CL

    Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite

    Authors: Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang Chen, Da-shan Shiu

    Abstract: The evaluation of large language models is an essential task in the field of language understanding and generation. As language models continue to advance, the need for effective benchmarks to assess their performance has become imperative. In the context of Traditional Chinese, there is a scarcity of comprehensive and diverse benchmarks to evaluate the capabilities of language models, despite the…

    Submitted 2 October, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

  7. arXiv:2306.03942

    cs.CR

    NFT.mine: An xDeepFM-based Recommender System for Non-fungible Token (NFT) Buyers

    Authors: Shuwei Li, Yucheng Jin, Pin-Lun Hsu, Ya-Sin Luo

    Abstract: Non-fungible token (NFT) is a tradable unit of data stored on the blockchain which can be associated with some digital asset as a certification of ownership. The past several years have witnessed the exponential growth of the NFT market. In 2021, the NFT market reached its peak with more than $40 billion trades. Despite the booming NFT market, most NFT-related studies focus on its technical aspect…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: 6 pages, 8 figures, 2 tables

  8. Federated Deep Reinforcement Learning for THz-Beam Search with Limited CSI

    Authors: Po-Chun Hsu, Li-Hsiang Shen, Chun-Hung Liu, Kai-Ten Feng

    Abstract: Terahertz (THz) communication with ultra-wide available spectrum is a promising technique that can achieve the stringent requirement of high data rate in the next-generation wireless networks, yet its severe propagation attenuation significantly hinders its implementation in practice. Finding beam directions for a large-scale antenna array to effectively overcome severe propagation attenuation of…

    Submitted 25 April, 2023; originally announced April 2023.

    Journal ref: IEEE Vehicular Technology Conference (VTC-Fall), 2022

  9. arXiv:2303.04715

    cs.CL cs.AI

    Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results

    Authors: Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yen-Chen Wu, Yin-Hsiang Liao, Chin-Tung Lin, Da-Shan Shiu, Wei-Yun Ma

    Abstract: In this paper we present the multilingual language model BLOOM-zh that features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022. Starting from released models, we extended the pre-training of BLOOM by additional 7.4 billion tokens in Traditional Chinese and English covering a variety of domains such as news articles…

    Submitted 23 June, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

  10. arXiv:2207.14568

    cs.SD cs.CL eess.AS

    Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network

    Authors: Da-rong Liu, Po-chun Hsu, Yi-chen Chen, Sung-feng Huang, Shun-po Chuang, Da-yi Wu, Hung-yi Lee

    Abstract: ASR has been shown to achieve great performance recently. However, most of them rely on massive paired data, which is not feasible for low-resource languages worldwide. This paper investigates how to learn directly from unpaired phone sequences and speech utterances. We design a two-stage iterative framework. GAN training is adopted in the first stage to find the mapping relationship between unpai…

    Submitted 29 July, 2022; originally announced July 2022.

  11. arXiv:2207.10643

    cs.CL cs.AI cs.SD eess.AS

    STOP: A dataset for Spoken Task Oriented Semantic Parsing

    Authors: Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

    Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assi…

    Submitted 18 October, 2022; v1 submitted 28 June, 2022; originally announced July 2022.

  12. arXiv:2205.09185

    physics.ins-det cs.LG hep-ex nucl-ex physics.comp-ph

    AI-assisted Optimization of the ECCE Tracking System at the Electron Ion Collider

    Authors: C. Fanelli, Z. Papandreou, K. Suresh, J. K. Adkins, Y. Akiba, A. Albataineh, M. Amaryan, I. C. Arsene, C. Ayerbe Gayoso, J. Bae, X. Bai, M. D. Baker, M. Bashkanov, R. Bellwied, F. Benmokhtar, V. Berdnikov, J. C. Bernauer, F. Bock, W. Boeglin, M. Borysova, E. Brash, P. Brindza, W. J. Briscoe, M. Brooks, S. Bueltmann , et al. (258 additional authors not shown)

    Abstract: The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory in approximately 10 years from now, with detector design and R&D currently ongoing. Notably, EIC is one of the first large-scale facilities to…

    Submitted 19 May, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

    Comments: 16 pages, 18 figures, 2 appendices, 3 tables

  13. arXiv:2205.03759

    cs.LG cs.SD eess.AS

    Silence is Sweeter Than Speech: Self-Supervised Model Using Silence to Store Speaker Information

    Authors: Chi-Luen Feng, Po-chun Hsu, Hung-yi Lee

    Abstract: Self-Supervised Learning (SSL) has made great strides recently. SSL speech models achieve decent performance on a wide range of downstream tasks, suggesting that they extract different aspects of information from speech. However, how SSL models store various information in hidden representations without interfering is still poorly understood. Taking the recently successful SSL model, HuBERT, as an…

    Submitted 7 May, 2022; originally announced May 2022.

  14. Parallel Synthesis for Autoregressive Speech Generation

    Authors: Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

    Abstract: Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to…

    Submitted 5 June, 2024; v1 submitted 25 April, 2022; originally announced April 2022.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  15. arXiv:2204.05486

    cs.CV cs.LG

    Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison

    Authors: Po-Fang Hsu, Chiching Wei

    Abstract: In this paper, we present a novel neural graph matching approach applied to document comparison. Document comparison is a common task in the legal and financial industries. In some cases, the most important differences may be the addition or omission of words, sentences, clauses, or paragraphs. However, it is a challenging task without recording or tracing whole edited process. Under many temporal…

    Submitted 2 November, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

  16. arXiv:2204.00630

    eess.IV cs.CV

    Extremely Low-light Image Enhancement with Scene Text Restoration

    Authors: Pohao Hsu, Che-Tsung Lin, Chun Chet Ng, Jie-Long Kew, Mei Yih Tan, Shang-Hong Lai, Chee Seng Chan, Christopher Zach

    Abstract: Deep learning-based methods have made impressive progress in enhancing extremely low-light images - the image quality of the reconstructed images has generally improved. However, we found out that most of these methods could not sufficiently recover the image details, for instance, the texts in the scene. In this paper, a novel image enhancement framework is proposed to precisely restore the scene…

    Submitted 1 April, 2022; originally announced April 2022.

  17. arXiv:2204.00170

    eess.AS cs.SD

    Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis

    Authors: Fan-Lin Wang, Po-chun Hsu, Da-rong Liu, Hung-yi Lee

    Abstract: Most recent speech synthesis systems are composed of a synthesizer and a vocoder. However, the existing synthesizers and vocoders can only be matched to acoustic features extracted with a specific configuration. Hence, we can't combine arbitrary synthesizers and vocoders together to form a complete system, not to mention apply to a newly developed model. In this paper, we proposed Universal Adapto…

    Submitted 29 October, 2022; v1 submitted 31 March, 2022; originally announced April 2022.

  18. arXiv:2107.00309

    cs.SD cs.LG eess.AS

    Adversarial Sample Detection for Speaker Verification by Neural Vocoders

    Authors: Haibin Wu, Po-chun Hsu, Ji Gao, Shanshan Zhang, Shen Huang, Jian Kang, Zhiyong Wu, Helen Meng, Hung-yi Lee

    Abstract: Automatic speaker verification (ASV), one of the most important technology for biometric identification, has been widely adopted in security-critical applications. However, ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV. We use the neural vocoder…

    Submitted 19 May, 2022; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: Accepted by ICASSP 2022

  19. arXiv:2103.04088

    eess.AS cs.LG cs.SD

    Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

    Authors: Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-yi Lee

    Abstract: The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples. In this work, we investigate different speaker representations and proposed to integrate pretrained and learnable speaker representations. Among different types of embeddings, the embedding pretrained by voice convers…

    Submitted 1 May, 2021; v1 submitted 6 March, 2021; originally announced March 2021.

    Comments: Accepted by ICASSP 2021, in the special session of ICASSP 2021 M2VoC Challenge

  20. arXiv:2005.07412

    eess.AS cs.SD

    WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

    Authors: Po-chun Hsu, Hung-yi Lee

    Abstract: In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions on the frequency domains. As we design a flow-based model that is heavily compressed, the proposed model r…

    Submitted 20 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: INTERSPEECH 2020

  21. arXiv:2005.03457

    cs.CV

    NTIRE 2020 Challenge on NonHomogeneous Dehazing

    Authors: Codruta O. Ancuti, Cosmin Ancuti, Florin-Alexandru Vasluianu, Radu Timofte, Jing Liu, Haiyan Wu, Yuan Xie, Yanyun Qu, Lizhuang Ma, Ziling Huang, Qili Deng, Ju-Chin Chao, Tsung-Shan Yang, Peng-Wen Chen, Po-Min Hsu, Tzu-Yi Liao, Chung-En Sun, Pei-Yuan Wu, Jeonghyeok Do, Jongmin Park, Munchurl Kim, Kareem Metwaly, Xuelu Li, Tiantong Guo, Vishal Monga, et al. (27 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2020 Challenge on NonHomogeneous Dehazing of images (restoration of rich details in hazy image). We focus on the proposed solutions and their results evaluated on NH-Haze, a novel dataset consisting of 55 pairs of real haze free and nonhomogeneous hazy images recorded outdoor. NH-Haze is the first realistic nonhomogeneous haze dataset that provides ground truth images.…

    Submitted 7 May, 2020; originally announced May 2020.

    Comments: CVPR Workshops Proceedings 2020

  22. arXiv:1912.02461

    cs.SD cs.LG eess.AS

    Towards Robust Neural Vocoding for Speech Generation: A Survey

    Authors: Po-chun Hsu, Chun-hsuan Wang, Andy T. Liu, Hung-yi Lee

    Abstract: Recently, neural vocoders have been widely used in speech synthesis tasks, including text-to-speech and voice conversion. However, when encountering data distribution mismatch between training and inference, neural vocoders trained on real data often degrade in voice quality for unseen scenarios. In this paper, we train four common neural vocoders, including WaveNet, WaveRNN, FFTNet, Parallel Wave…

    Submitted 20 August, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: Submitted to INTERSPEECH 2020

  23. arXiv:1910.12638

    eess.AS cs.CL cs.LG cs.SD

    Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders

    Authors: Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-yi Lee

    Abstract: We present Mockingjay as a new speech representation learning approach, where bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech. Previous speech representation methods learn through conditioning on past frames and predicting information about future frames. Whereas Mockingjay is designed to predict the current frame through jointly conditioning on both past a…

    Submitted 2 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Accepted by ICASSP 2020, Lecture Session

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  24. arXiv:1909.11899

    q-bio.QM cs.CE math.DS q-bio.NC

    Dynamic Parameter Estimation of Brain Mechanisms

    Authors: Po-Ya Hsu

    Abstract: Demystifying effective connectivity among neuronal populations has become the trend to understand the brain mechanisms of Parkinson's disease, schizophrenia, mild traumatic brain injury, and many other unlisted neurological diseases. Dynamic modeling is a state-of-the-art approach to explore various connectivities among neuronal populations corresponding to different electrophysiological responses…

    Submitted 26 September, 2019; originally announced September 2019.

  25. arXiv:1905.11563

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

    Authors: Andy T. Liu, Po-chun Hsu, Hung-yi Lee

    Abstract: We present an unsupervised end-to-end training scheme where we discover discrete subword units from speech without using any labels. The discrete subword units are learned under an ASR-TTS autoencoder reconstruction setting, where an ASR-Encoder is trained to discover a set of common linguistic units given a variety of speakers, and a TTS-Decoder trained to project the discovered units back to the…

    Submitted 20 June, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Accepted by Interspeech 2019, Graz, Austria

    Journal ref: Interspeech 2019

  26. arXiv:1808.03113

    cs.SD eess.AS

    Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

    Authors: Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, Lin-shan Lee

    Abstract: Speaking rate refers to the average number of phonemes within some unit time, while the rhythmic patterns refer to duration distributions for realizations of different phonemes within different phonetic structures. Both are key components of prosody in speech, which is different for different speakers. Models like cycle-consistent adversarial network (Cycle-GAN) and variational auto-encoder (VAE)…

    Submitted 9 August, 2018; originally announced August 2018.

    Comments: 8 pages, 6 figures, Submitted to SLT 2018

  27. arXiv:0710.4645

    cs.AR

    At-Speed Logic BIST for IP Cores

    Authors: B. Cheon, E. Lee, L. -T. Wang, X. Wen, P. Hsu, J. Cho, J. Park, H. Chao, S. Wu

    Abstract: This paper describes a flexible logic BIST scheme that features high fault coverage achieved by fault-simulation guided test point insertion, real at-speed test capability for multi-clock designs without clock frequency manipulation, and easy physical implementation due to the use of a low-speed SE signal. Application results of this scheme to two widely used IP cores are also reported.

    Submitted 25 October, 2007; originally announced October 2007.

    Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

    Journal ref: In Design, Automation and Test in Europe - DATE'05, Munich, Germany (2005)