Showing 1–50 of 241 results for author: Yang, C

  1. arXiv:2406.17488  [pdf, other]

    eess.SP

    Environmental Variation or Instrumental Drift? A Probabilistic Approach to Gas Sensor Drift Modeling and Evaluation

    Authors: Cheng Yang, Gustav Bohlin, Tobias Oechtering

    Abstract: Drift is a significant issue that undermines the reliability of gas sensors. This paper introduces a probabilistic model to distinguish between environmental variation and instrumental drift, using low-cost non-dispersive infrared (NDIR) CO2 sensors as a case study. Data from a long-term field experiment is analyzed to evaluate both sensor performance and environmental changes over time. Our appro…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: This conference paper has been submitted to IEEE SENSORS 2024

  2. arXiv:2406.16303  [pdf, other]

    eess.SP

    Hybrid Precoding With Low-Resolution PSs for Wideband Terahertz Communication Systems in The Face of Beam Squint

    Authors: Yang Wang, Chuang Yang, Mugen Peng

    Abstract: Terahertz (THz) communication is considered one of the most critical technologies for 6G because of its abundant bandwidth. To compensate the high propagation of THz, analog/digital hybrid precoding for THz massive multiple input multiple output (MIMO) is proposed to focus signals and extend communication range. Notably, considering hardware cost and power consumption, infinite and high-resolution…

    Submitted 24 June, 2024; originally announced June 2024.

  3. arXiv:2406.10869  [pdf, other]

    eess.IV cs.CV

    Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution

    Authors: Cuixin Yang, Rongkang Dong, Jun Xiao, Cong Zhang, Kin-Man Lam, Fei Zhou, Guoping Qiu

    Abstract: As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI sup…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages, 12 figures, journal

  4. arXiv:2406.05806  [pdf, other]

    cs.CL cs.SD eess.AS

    Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper

    Authors: Chih-Kai Yang, Kuan-Po Huang, Hung-yi Lee

    Abstract: This research explores the interaction between Whisper, a high-performing speech recognition model, and information in prompts. Our results unexpectedly show that Whisper may not fully grasp textual prompts as anticipated. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prom…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: In progress

  5. arXiv:2406.00555  [pdf]

    eess.IV cs.CV

    Length-scale study in deep learning prediction for non-small cell lung cancer brain metastasis

    Authors: Haowen Zhou, Steven Lin, Mark Watson, Cory T. Bernadt, Oumeng Zhang, Ramaswamy Govindan, Richard J. Cote, Changhuei Yang

    Abstract: Deep learning assisted digital pathology has the potential to impact clinical practice in significant ways. In recent studies, deep neural network (DNN) enabled analysis outperforms human pathologists. Increasing sizes and complexity of the DNN architecture generally improves performance at the cost of DNN's explainability. For pathology, this lack of DNN explainability is particularly problematic…

    Submitted 1 June, 2024; originally announced June 2024.

  6. arXiv:2406.00485  [pdf]

    eess.IV cs.RO

    TacShade A New 3D-printed Soft Optical Tactile Sensor Based on Light, Shadow and Greyscale for Shape Reconstruction

    Authors: Zhenyu Lu, Jialong Yang, Haoran Li, Yifan Li, Weiyong Si, Nathan Lepora, Chenguang Yang

    Abstract: In this paper, we present the TacShade a newly designed 3D-printed soft optical tactile sensor. The sensor is developed for shape reconstruction under the inspiration of sketch drawing that uses the density of sketch lines to draw light and shadow, resulting in the creation of a 3D-view effect. TacShade, building upon the strengths of the TacTip, a single-camera tactile sensor of large in-depth de…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: This paper has been accepted by ICRA 2024

  7. arXiv:2405.14161  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

    Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 23 pages, Preprint

  8. arXiv:2405.10463  [pdf, other]

    physics.optics eess.IV physics.bio-ph

    Single-shot volumetric fluorescence imaging with neural fields

    Authors: Oumeng Zhang, Haowen Zhou, Brandon Y. Feng, Elin M. Larsson, Reinaldo E. Alcalde, Siyuan Yin, Catherine Deng, Changhuei Yang

    Abstract: Single-shot volumetric fluorescence (SVF) imaging offers a significant advantage over traditional imaging methods that require scanning across multiple axial planes as it can capture biological processes with high temporal resolution across a large field of view. The key challenges in SVF imaging include requiring sparsity constraints to meet the multiplexing requirements of compressed sensing, el…

    Submitted 4 June, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

  9. arXiv:2405.06573  [pdf, other]

    cs.SD cs.AI eess.AS

    An Investigation of Incorporating Mamba for Speech Enhancement

    Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

    Abstract: This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric…

    Submitted 10 May, 2024; originally announced May 2024.

  10. arXiv:2405.00077  [pdf, other]

    cs.LG eess.SP

    BrainODE: Dynamic Brain Signal Analysis via Graph-Aided Neural Ordinary Differential Equations

    Authors: Kaiqiao Han, Yi Yang, Zijie Huang, Xuan Kan, Yang Yang, Ying Guo, Lifang He, Liang Zhan, Yizhou Sun, Wei Wang, Carl Yang

    Abstract: Brain network analysis is vital for understanding the neural interactions regarding brain structures and functions, and identifying potential biomarkers for clinical phenotypes. However, widely used brain signals such as Blood Oxygen Level Dependent (BOLD) time series generated from functional Magnetic Resonance Imaging (fMRI) often manifest three challenges: (1) missing values, (2) irregular samp…

    Submitted 30 April, 2024; originally announced May 2024.

  11. arXiv:2404.18418  [pdf, other]

    cs.NI eess.SY

    Decomposition Model Assisted Energy-Saving Design in Radio Access Network

    Authors: Xiaoxue Zhao, Yijun Yu, Yexing Li, Dong Li, Yao Wang, Chungang Yang

    Abstract: The continuous emergence of novel services and massive connections involve huge energy consumption towards ultra-dense radio access networks. Moreover, there exist much more number of controllable parameters that can be adjusted to reduce the energy consumption from a network-wide perspective. However, a network-level energy-saving intent usually contains multiple network objectives and constraint…

    Submitted 29 April, 2024; originally announced April 2024.

  12. arXiv:2404.16407  [pdf, other]

    cs.CL eess.AS

    U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF

    Authors: Xingchen Song, Di Wu, Binbin Zhang, Dinghao Zhou, Zhendong Peng, Bo Dang, Fuping Pan, Chao Yang

    Abstract: Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to only activate a subset of parameters in training and inference, Mixture-of-Experts (MoE) have been proposed as an energy efficient path to even larger and more capable language models and this shift towards a new generation of foundation models is gaining momentum, particularly within the…

    Submitted 25 April, 2024; originally announced April 2024.

    ACM Class: I.2.7

  13. arXiv:2404.14716  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS

    Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities

    Authors: Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang

    Abstract: Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayes…

    Submitted 16 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 17 pages, 6 figures

  14. arXiv:2404.13277  [pdf, other]

    eess.IV cs.CV

    Beyond Score Changes: Adversarial Attack on No-Reference Image Quality Assessment from Two Perspectives

    Authors: Chenxi Yang, Yujia Liu, Dingquan Li, Yan Zhong, Tingting Jiang

    Abstract: Deep neural networks have demonstrated impressive success in No-Reference Image Quality Assessment (NR-IQA). However, recent researches highlight the vulnerability of NR-IQA models to subtle adversarial perturbations, leading to inconsistencies between model predictions and subjective ratings. Current adversarial attacks, however, focus on perturbing predicted scores of individual images, neglecti…

    Submitted 24 April, 2024; v1 submitted 20 April, 2024; originally announced April 2024.

    Comments: Submitted to a conference

  15. arXiv:2404.09729  [pdf]

    eess.SP cs.IT cs.LG stat.ME

    Amplitude-Phase Fusion for Enhanced Electrocardiogram Morphological Analysis

    Authors: Shuaicong Hu, Yanan Wang, Jian Liu, Jingyu Lin, Shengmei Qin, Zhenning Nie, Zhifeng Yao, Wenjie Cai, Cuiwei Yang

    Abstract: Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes a novel fusion entropy metric, morphological ECG entropy (MEE) for the first time, specifically designed for ECG mor…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 16 pages, 12 figures

    ACM Class: I.5.2

  16. arXiv:2404.09500  [pdf]

    physics.optics eess.IV

    On-chip Real-time Hyperspectral Imager with Full CMOS Resolution Enabled by Massively Parallel Neural Network

    Authors: Junren Wen, Haiqi Gao, Weiming Shi, Shuaibo Feng, Lingyun Hao, Yujie Liu, Liang Xu, Yuchuan Shao, Yueguang Zhang, Weidong Shen, Chenying Yang

    Abstract: Traditional spectral imaging methods are constrained by the time-consuming scanning process, limiting the application in dynamic scenarios. One-shot spectral imaging based on reconstruction has been a hot research topic recently and the primary challenges still lie in both efficient fabrication techniques suitable for mass production and the high-speed, high-accuracy reconstruction algorithm for r…

    Submitted 15 April, 2024; originally announced April 2024.

  17. arXiv:2403.19983  [pdf, other]

    eess.IV cs.CV

    A multi-stage semi-supervised learning for ankle fracture classification on CT images

    Authors: Hongzhi Liu, Guicheng Li, Jiacheng Nie, Hui Tang, Chunfeng Yang, Qianjin Feng, Hailin Xu, Yang Chen

    Abstract: Because of the complicated mechanism of ankle injury, it is very difficult to diagnose ankle fracture in clinic. In order to simplify the process of fracture diagnosis, an automatic diagnosis model of ankle fracture was proposed. Firstly, a tibia-fibula segmentation network is proposed for the joint tibiofibular region of the ankle joint, and the corresponding segmentation dataset is established o…

    Submitted 29 March, 2024; originally announced March 2024.

  18. arXiv:2403.16797  [pdf, other]

    eess.SY

    Privacy Preservation by Intermittent Transmission in Cooperative LQG Control Systems

    Authors: Wenhao Lin, Yuqing Ni, Wen Yang, Chao Yang

    Abstract: In this paper, we study a cooperative linear quadratic Gaussian (LQG) control system with a single user and a server. In this system, the user runs a process and employs the server to meet the needs of computation. However, the user regards its state trajectories as privacy. Therefore, we propose a privacy scheme, in which the user sends data to the server intermittently. By this scheme, the serve…

    Submitted 28 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

  19. arXiv:2403.13562  [pdf, other]

    eess.SY

    Augmented Labeled Random Finite Sets and Its Application to Group Target Tracking

    Authors: Chaoqun Yang, Mengdie Xu, Xiaowei Liang, Zhiguo Shi, Heng Zhang, Xianghui Cao

    Abstract: This paper addresses the problem of group target tracking (GTT), wherein multiple closely spaced targets within a group pose a coordinated motion. To improve the tracking performance, the labeled random finite sets (LRFSs) theory is adopted, and this paper develops a new kind of LRFSs, i.e., augmented LRFSs, which introduces group information into the definition of LRFSs. Specifically, for each el…

    Submitted 16 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  20. arXiv:2403.11397  [pdf, other]

    cs.CV eess.IV

    Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization

    Authors: Yujia Liu, Chenxi Yang, Dingquan Li, Jianhao Ding, Tingting Jiang

    Abstract: The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the quality score of an input image without additional information. NR-IQA models play a crucial role in the media industry, aiding in performance evaluation and optimization guidance. However, these models are found to be vulnerable to adversarial attacks, which introduce imperceptible perturbations to input images, resulti…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: accepted by CVPR 2024

  21. arXiv:2403.06463  [pdf, other]

    eess.SY

    A prediction-based forward-looking vehicle dispatching strategy for dynamic ride-pooling

    Authors: Xiaolei Wang, Chen Yang, Yuzhen Feng, Luohan Hu, Zhengbing He

    Abstract: For on-demand dynamic ride-pooling services, e.g., Uber Pool and Didi Pinche, a well-designed vehicle dispatching strategy is crucial for platform profitability and passenger experience. Most existing dispatching strategies overlook incoming pairing opportunities, therefore suffer from short-sighted limitations. In this paper, we propose a forward-looking vehicle dispatching strategy, which first…

    Submitted 11 March, 2024; originally announced March 2024.

  22. arXiv:2403.06066  [pdf]

    eess.IV cs.CV cs.LG

    CausalCellSegmenter: Causal Inference inspired Diversified Aggregation Convolution for Pathology Image Segmentation

    Authors: Dawei Fan, Yifan Gao, Jiaming Yu, Yanping Chen, Wencheng Li, Chuancong Lin, Kaibin Li, Changcai Yang, Riqing Chen, Lifang Wei

    Abstract: Deep learning models have shown promising performance for cell nucleus segmentation in the field of pathology image analysis. However, training a robust model from multiple domains remains a great challenge for cell nucleus segmentation. Additionally, the shortcomings of background noise, highly overlapping between cell nucleus, and blurred edges often lead to poor performance. To address these ch…

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: 10 pages, 5 figures, 2 tables, MICCAI

  23. arXiv:2402.18332  [pdf, other]

    eess.SP

    Recursive GNNs for Learning Precoding Policies with Size-Generalizability

    Authors: Jia Guo, Chenyang Yang

    Abstract: Graph neural networks (GNNs) have been shown promising in optimizing power allocation and link scheduling with good size generalizability and low training complexity. These merits are important for learning wireless policies under dynamic environments, which partially come from the matched permutation equivariance (PE) properties of the GNNs to the policies to be learned. Nonetheless, it has been…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: 37 pages, 8 figures

  24. arXiv:2402.06894  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

    Abstract: Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the divers…

    Submitted 16 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

  25. arXiv:2402.05457  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

    Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

    Abstract: Recent studies have successfully shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introd…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

  26. arXiv:2401.16592  [pdf]

    physics.med-ph eess.IV

    A compact and cost-effective laser-powered speckle visibility spectroscopy (SVS) device for measuring cerebral blood flow

    Authors: Yu Xi Huang, Simon Mahler, Maya Dickson, Aidin Abedi, Julian M. Tyszka, Jack Lo Yu Tung, Jonathan Russin, Charles Liu, Changhuei Yang

    Abstract: In the realm of cerebrovascular monitoring, primary metrics typically include blood pressure, which influences cerebral blood flow (CBF) and is contingent upon vessel radius. Measuring CBF non-invasively poses a persistent challenge, primarily attributed to the difficulty of accessing and obtaining signal from the brain. This study aims to introduce a compact speckle visibility spectroscopy (SVS)…

    Submitted 8 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

  27. arXiv:2401.16446  [pdf]

    eess.SY

    Framework of Resilient Transmission Network Reconfiguration Considering Cyber-Attacks

    Authors: Chao Yang, Gaoqi Liang, Steven R. Weller, Shaoyan Li, Junhua Zhao, Zhaoyang Dong

    Abstract: Fast and reliable transmission network reconfiguration is critical in improving power grid resilience to cyber-attacks. If the network reconfiguration following cyber-attacks is imperfect, secondary incidents may delay or interrupt post-attack restoration of the power grid. This paper proposes a framework of resilient transmission network reconfiguration, taking into account the impacts of cyber-a…

    Submitted 28 January, 2024; originally announced January 2024.

  28. arXiv:2401.10447  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Tuan Dinh, Sungho Ryu, Jari Kolehmainen, Roger Ren, Denis Filimonov, Prashanth G. Shivakumar, Ankur Gandhe, Ariya Rastow, Jia Xu, Ivan Bulyko, Andreas Stolcke

    Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasing popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dat…

    Submitted 18 January, 2024; originally announced January 2024.

  29. arXiv:2401.10446  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

    Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by e…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

  30. arXiv:2401.05217  [pdf, other]

    cs.CV eess.IV

    Exploring Vulnerabilities of No-Reference Image Quality Assessment Models: A Query-Based Black-Box Method

    Authors: Chenxi Yang, Yujia Liu, Dingquan Li, Tingting Jiang

    Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack m…

    Submitted 25 April, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

  31. arXiv:2401.00393  [pdf]

    cs.CV cs.AI cs.LG cs.MM eess.IV

    Generative Model-Driven Synthetic Training Image Generation: An Approach to Cognition in Rail Defect Detection

    Authors: Rahatara Ferdousi, Chunsheng Yang, M. Anwar Hossain, Fedwa Laamarti, M. Shamim Hossain, Abdulmotaleb El Saddik

    Abstract: Recent advancements in cognitive computing, with the integration of deep learning techniques, have facilitated the development of intelligent cognitive systems (ICS). This is particularly beneficial in the context of rail defect detection, where the ICS would emulate human-like analysis of image data for defect patterns. Despite the success of Convolutional Neural Networks (CNN) in visual defect c…

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: 26 pages, 13 figures, Springer Journal

    MSC Class: 68T05; 94A08; 90B25

    ACM Class: I.2.6; I.2.10; I.5.4; I.4.10

  32. arXiv:2401.00273  [pdf, ps, other]

    eess.AS cs.CL

    Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision

    Authors: Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, Hung-yi Lee

    Abstract: This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that self-supervised models can achieve performances close to the supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that…

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: Submitted to ICASSP 2024 Self-supervision in Audio, Speech and Beyond workshop

  33. arXiv:2312.16772  [pdf, other]

    eess.IV cs.CV cs.LG

    Unsupversied feature correlation model to predict breast abnormal variation maps in longitudinal mammograms

    Authors: Jun Bai, Annie Jin, Madison Adams, Clifford Yang, Sheida Nabavi

    Abstract: Breast cancer continues to be a significant cause of mortality among women globally. Timely identification and precise diagnosis of breast abnormalities are critical for enhancing patient prognosis. In this study, we focus on improving the early detection and accurate diagnosis of breast abnormalities, which is crucial for improving patient outcomes and reducing the mortality rate of breast cancer…

    Submitted 27 December, 2023; originally announced December 2023.

  34. arXiv:2312.15316  [pdf, other]

    cs.CL eess.AS

    Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

    Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

    Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore pro…

    Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Camera-ready version

  35. arXiv:2312.15197  [pdf, other]

    cs.SD cs.CL cs.CV eess.AS

    TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

    Authors: Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Changpeng Yang, Zhou Zhao

    Abstract: Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents delays and cascading errors associated with model cascading. However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges comp…

    Submitted 23 December, 2023; originally announced December 2023.

  36. arXiv:2312.14378  [pdf, other]

    cs.LG cs.SD eess.AS

    Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

    Authors: Anirudh S. Sundar, Chao-Han Huck Yang, David M. Chan, Shalini Ghosh, Venkatesh Ravichandran, Phani Sankar Nidadavolu

    Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowle…

    Submitted 9 February, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond

  37. arXiv:2312.13620  [pdf, other]

    cs.CV eess.IV

    A Comprehensive End-to-End Computer Vision Framework for Restoration and Recognition of Low-Quality Engineering Drawings

    Authors: Lvyang Yang, Jiankang Zhang, Huaiqiang Li, Longfei Ren, Chen Yang, Jingyu Wang, Dongyuan Shi

    Abstract: The digitization of engineering drawings is crucial for efficient reuse, distribution, and archiving. Existing computer vision approaches for digitizing engineering drawings typically assume the input drawings have high quality. However, in reality, engineering drawings are often blurred and distorted due to improper scanning, storage, and transmission, which may jeopardize the effectiveness of ex…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 20 pages, 13 figures, submitted to Engineering Applications of Artificial Intelligence

  38. arXiv:2312.09580  [pdf, other]

    cs.SD cs.AR eess.AS

    A 1.6-mW Sparse Deep Learning Accelerator for Speech Separation

    Authors: Chih-Chyau Yang, Tian-Sheuan Chang

    Abstract: Low power deep learning accelerators on the speech processing enable real-time applications on edge devices. However, most of the existing accelerators suffer from high power consumption and focus on image applications only. This paper presents a low power accelerator for speech separation through algorithm and hardware optimizations. At the algorithm level, the model is compressed with structured…

    Submitted 15 December, 2023; originally announced December 2023.

    Journal ref: in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 3, pp. 310-319, March 2023

  39. arXiv:2312.06668  [pdf]

    cs.CL cs.SD eess.AS

    Evaluating Self-supervised Speech Models on a Taiwanese Hokkien Corpus

    Authors: Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, Winston Ou, Alice Wen-Hsin Bi, Carol Yang, Bryan Y. Chen, Rong-Wei Pai, Po-Yen Yeh, Jo-Peng Chiang, Iu-Tshian Phoann, Winnie Chang, Chenxuan Cui, Noel Chen, Jiatong Shi

    Abstract: Taiwanese Hokkien is declining in use and status due to a language shift towards Mandarin in Taiwan. This is partly why it is a low resource language in NLP and speech research today. To ensure that the state of the art in speech processing does not leave Taiwanese Hokkien behind, we contribute a 1.5-hour dataset of Taiwanese Hokkien to ML-SUPERB's hidden set. Evaluating ML-SUPERB's suite of self-…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted to ASRU 2023

  40. arXiv:2311.08323  [pdf, other]

    cs.CL cs.SD eess.AS

    The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language

    Authors: Jian Zhu, Changbing Yang, Farhan Samir, Jahurul Islam

    Abstract: In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpora with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multi…

    Submitted 1 April, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Main Conference

  41. arXiv:2311.08153  [pdf, other]

    eess.SY cs.AI

    When Mining Electric Locomotives Meet Reinforcement Learning

    Authors: Ying Li, Zhencai Zhu, Xiaoqiang Li, Chunyu Yang, Hao Lu

    Abstract: As the most important auxiliary transportation equipment in coal mines, mining electric locomotives are mostly operated manually at present. However, due to the complex and ever-changing coal mine environment, electric locomotive safety accidents occur frequently these years. A mining electric locomotive control method that can adapt to different complex mining environments is needed. Reinforcemen…

    Submitted 14 November, 2023; originally announced November 2023.

  42. arXiv:2311.06916  [pdf]

    eess.SY cs.AI

    TSViT: A Time Series Vision Transformer for Fault Diagnosis

    Authors: Shouhua Zhang, Jiehan Zhou, Xue Ma, Chenglin Wen, Susanna Pirttikangas, Chen Yu, Weishan Zhang, Chunsheng Yang

    Abstract: Traditional fault diagnosis methods using Convolutional Neural Networks (CNNs) face limitations in capturing temporal features (i.e., the variation of vibration signals over time). To address this issue, this paper introduces a novel model, the Time Series Vision Transformer (TSViT), specifically designed for fault diagnosis. On one hand, TSViT model integrates a convolutional layer to segment vib…

    Submitted 12 November, 2023; originally announced November 2023.

  43. A novel method of restoration path optimization for the AC-DC bulk power grid after a major blackout

    Authors: Chao Yang, Gaoshen Liang, Tianle Cheng, Yang Li, Shaoyan Li

    Abstract: The restoration control of the modern alternating current-direct current (AC-DC) hybrid power grid after a major blackout is difficult and complex. Taking into account the interaction between the line-commutated converter high-voltage direct current (LCC-HVDC) and the AC power grid, this paper proposes a novel optimization method of restoration path to reconfigure the skeleton network for the blac…

    Submitted 27 October, 2023; originally announced November 2023.

    Comments: Accepted by IET Generation, Transmission & Distribution

    Journal ref: IET Generation, Transmission & Distribution 17 (2023) 5240-5251

  44. arXiv:2310.18529  [pdf, other]

    physics.optics eess.IV

    FPM-INR: Fourier ptychographic microscopy image stack reconstruction using implicit neural representations

    Authors: Haowen Zhou, Brandon Y. Feng, Haiyun Guo, Siyu Lin, Mingshu Liang, Christopher A. Metzler, Changhuei Yang

    Abstract: Image stacks provide invaluable 3D information in various biological and pathological imaging applications. Fourier ptychographic microscopy (FPM) enables reconstructing high-resolution, wide field-of-view image stacks without z-stack scanning, thus significantly accelerating image acquisition. However, existing FPM methods take tens of minutes to reconstruct and gigabytes of memory to store a hig…

    Submitted 31 October, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: Project Page: https://hwzhou2020.github.io/FPM-INR-Web/

  45. arXiv:2310.13013  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Generative error correction for code-switching speech recognition using large language models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

    Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task ought to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP2024

  46. arXiv:2310.06434  [pdf, other]

    cs.CL cs.AI cs.MM cs.SD eess.AS

    Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition

    Authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner

    Abstract: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the exis…

    Submitted 16 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 as main paper. 10 pages. Revised math notations. GitHub: https://github.com/Srijith-rkr/Whispering-LLaMA

  47. arXiv:2310.04992  [pdf, other]

    eess.IV cs.CV

    VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

    Authors: Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, Yuyang Zhao, Xuehui Shi, Junfang Xian, Xiaoxia Qu, Sirui Zhu, Lijie Pan, Xiaoniao Chen, Xiaojia Zhang, Shuai Jiang, Kebing Wang, Chenlong Yang, Mingqiang Chen, Sujie Fan, Jianhua Hu, Aiguo Lv , et al. (17 additional authors not shown)

    Abstract: We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassifi…

    Submitted 7 October, 2023; originally announced October 2023.

  48. arXiv:2310.03018  [pdf, other]

    eess.AS cs.CL cs.SD

    Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

    Authors: Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-yi Lee

    Abstract: We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed in a zero-resource manner. Our experiments encompass a variety of well-known speech enco…

    Submitted 18 March, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted by ICASSP 2024 (v2)

  49. arXiv:2309.15701  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

    Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit…

    Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

  50. arXiv:2309.15649  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

    Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke

    Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these task without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines caus…

    Submitted 10 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023