Showing 1–19 of 19 results for author: Ju, Z

  1. arXiv:2405.05667  [pdf, other]

    eess.IV cs.CV

    VM-DDPM: Vision Mamba Diffusion for Medical Image Synthesis

    Authors: Zhihan Ju, Wanting Zhou

    Abstract: In the realm of smart healthcare, researchers enhance the scale and diversity of medical datasets through medical image synthesis. However, existing methods are limited by CNN local perception and Transformer quadratic complexity, making it difficult to balance structural texture consistency. To this end, we propose the Vision Mamba DDPM (VM-DDPM) based on State Space Model (SSM), fully combining…

    Submitted 9 May, 2024; originally announced May 2024.

  2. arXiv:2405.04902  [pdf, other]

    eess.IV cs.CV

    HAGAN: Hybrid Augmented Generative Adversarial Network for Medical Image Synthesis

    Authors: Zhihan Ju, Wanting Zhou, Longteng Kong, Yu Chen, Yi Li, Zhenan Sun, Caifeng Shan

    Abstract: Medical Image Synthesis (MIS) plays an important role in the intelligent medical field, which greatly saves the economic and time costs of medical diagnosis. However, due to the complexity of medical images and similar characteristics of different tissue cells, existing methods face great challenges in meeting their biological consistency. To this end, we propose the Hybrid Augmented Generative Ad…

    Submitted 8 May, 2024; originally announced May 2024.

  3. arXiv:2404.14700  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    FlashSpeech: Efficient Zero-Shot Speech Synthesis

    Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

    Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large…

    Submitted 24 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Efficient zero-shot speech synthesis

  4. arXiv:2404.03204  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

    Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

    Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. Th…

    Submitted 19 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  5. arXiv:2403.03100  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di…

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  6. arXiv:2402.17511  [pdf, other]

    cs.RO cs.AI

    Rethinking Mutual Information for Language Conditioned Skill Discovery on Imitation Learning

    Authors: Zhaoxun Ju, Chao Yang, Hongbo Wang, Yu Qiao, Fuchun Sun

    Abstract: Language-conditioned robot behavior plays a vital role in executing complex tasks by associating human commands or instructions with perception and actions. The ability to compose long-horizon tasks based on unconstrained language instructions necessitates the acquisition of a diverse set of general-purpose skills. However, acquiring inherent primitive skills in a coupled and long-horizon environm…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 16 pages

    ACM Class: I.2.6

  7. arXiv:2401.14718  [pdf, other]

    cs.CV

    A Survey on Video Prediction: From Deterministic to Generative Approaches

    Authors: Ruibo Ming, Zhewei Huang, Zhuoxuan Ju, Jianming Hu, Lihui Peng, Shuchang Zhou

    Abstract: Video prediction, a fundamental task in computer vision, aims to enable models to generate sequences of future frames based on existing video content. This task has garnered widespread application across various domains. In this paper, we comprehensively survey both historical and contemporary works in this field, encompassing the most widely used datasets and algorithms. Our survey scrutinizes th…

    Submitted 31 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

    Comments: under review

  8. arXiv:2310.10413  [pdf, other]

    eess.IV cs.CV

    Image super-resolution via dynamic network

    Authors: Chunwei Tian, Xuanyu Zhang, Qi Zhang, Mingming Yang, Zhaojie Ju

    Abstract: Convolutional neural networks (CNNs) depend on deep network architectures to extract accurate information for image super-resolution. However, obtained information of these CNNs cannot completely express predicted high-quality images for complex scenes. In this paper, we present a dynamic network for image super-resolution (DSRNet), which contains a residual enhancement block, wide enhancement blo…

    Submitted 22 March, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

  9. arXiv:2309.02285  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS 2: Describing and Generating Voices with Text Prompt

    Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

    Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text…

    Submitted 11 October, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Demo page: https://speechresearch.github.io/prompttts2

  10. arXiv:2308.01138  [pdf, other]

    cs.LG cs.AI eess.SP

    Can We Transfer Noise Patterns? A Multi-environment Spectrum Analysis Model Using Generated Cases

    Authors: Haiwen Du, Zheng Ju, Yu An, Honghui Du, Dongjie Zhu, Zhaoshuo Tian, Aonghus Lawlor, Ruihai Dong

    Abstract: Spectrum analysis systems in online water quality testing are designed to detect types and concentrations of pollutants and enable regulatory agencies to respond promptly to pollution incidents. However, spectral data-based testing devices suffer from complex noise patterns when deployed in non-laboratory environments. To make the analysis model applicable to more environments, we propose a noise…

    Submitted 14 August, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

  11. arXiv:2304.09116  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

    Authors: Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian

    Abstract: Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating is…

    Submitted 30 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: A large-scale text-to-speech and singing voice synthesis system with latent diffusion models. Update: NaturalSpeech 2 extension to voice conversion and speech enhancement

  12. arXiv:2304.00830  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

    Authors: Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao

    Abstract: Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have some problems: 1) they have not been train…

    Submitted 5 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

  13. arXiv:2206.14195  [pdf, other]

    cs.CV cs.RO

    Pedestrian 3D Bounding Box Prediction

    Authors: Saeed Saadatnejad, Yi Zhou Ju, Alexandre Alahi

    Abstract: Safety is still the main issue of autonomous driving, and in order to be globally deployed, they need to predict pedestrians' motions sufficiently in advance. While there is a lot of research on coarse-grained (human center prediction) and fine-grained predictions (human body keypoints prediction), we focus on 3D bounding boxes, which are reasonable estimates of humans without modeling complex mot…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: Accepted and published in hEART2022 (the 10th Symposium of the European Association for Research in Transportation): http://www.heart-web.org/

  14. arXiv:2109.09617  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    TeleMelody: Lyric-to-Melody Generation with a Template-Based Two-Stage Method

    Authors: Zeqian Ju, Peiling Lu, Xu Tan, Rui Wang, Chen Zhang, Songruoyao Wu, Kejun Zhang, Xiangyang Li, Tao Qin, Tie-Yan Liu

    Abstract: Lyric-to-melody generation is an important task in automatic songwriting. Previous lyric-to-melody generation systems usually adopt end-to-end models that directly generate melodies from lyrics, which suffer from several issues: 1) lack of paired lyric-melody training data; 2) lack of control on generated melodies. In this paper, we develop TeleMelody, a two-stage lyric-to-melody generation system…

    Submitted 13 November, 2022; v1 submitted 20 September, 2021; originally announced September 2021.

  15. arXiv:2106.05630  [pdf, other]

    cs.SD cs.CL cs.IR cs.MM eess.AS

    MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training

    Authors: Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, Tie-Yan Liu

    Abstract: Symbolic music understanding, which refers to the understanding of music from the symbolic data (e.g., MIDI format, but not audio), covers many music applications such as genre classification, emotion classification, and music pieces matching. While good music representations are beneficial for these applications, the lack of training data hinders representation learning. Inspired by the success o…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: Accepted by ACL 2021 Findings

  16. arXiv:2010.10957  [pdf, ps, other]

    cs.CV

    2nd Place Solution to Instance Segmentation of IJCAI 3D AI Challenge 2020

    Authors: Kai Jiang, Xiangyue Liu, Zheng Ju, Xiang Luo

    Abstract: Compared with MS-COCO, the dataset for the competition has a larger proportion of large objects which area is greater than 96x96 pixels. As getting fine boundaries is vitally important for large object segmentation, Mask R-CNN with PointRend is selected as the base segmentation framework to output high-quality object boundaries. Besides, a better engine that integrates ResNeSt, FPN and DCNv2, and…

    Submitted 21 October, 2020; originally announced October 2020.

  17. arXiv:2005.05442  [pdf, other]

    cs.CL cs.LG

    On the Generation of Medical Dialogues for COVID-19

    Authors: Wenmian Yang, Guangtao Zeng, Bowen Tan, Zeqian Ju, Subrato Chakravorty, Xuehai He, Shu Chen, Xingyi Yang, Qingyang Wu, Zhou Yu, Eric Xing, Pengtao Xie

    Abstract: Under the pandemic of COVID-19, people experiencing COVID19-related symptoms or exposed to risk factors have a pressing need to consult doctors. Due to hospital closure, a lot of consulting services have been moved online. Because of the shortage of medical professionals, many people cannot receive online consultations timely. To address this problem, we aim to develop a medical dialogue system th…

    Submitted 17 June, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

  18. arXiv:2004.03329  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    MedDialog: Two Large-scale Medical Dialogue Datasets

    Authors: Xuehai He, Shu Chen, Zeqian Ju, Xiangyu Dong, Hongchao Fang, Sicheng Wang, Yue Yang, Jiaqi Zeng, Ruisi Zhang, Ruoyu Zhang, Meng Zhou, Penghui Zhu, Pengtao Xie

    Abstract: Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build two large-scale medical dialogue datasets: MedDialog-EN and MedDialog-CN. MedDialog-EN is an English dataset containing 0.3 million conversations…

    Submitted 7 July, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

  19. Design and implementation of smart cooking based on amazon echo

    Authors: Lin Xiaoguang, Yang Yong, Zhang Ju

    Abstract: Smart cooking based on Amazon Echo uses the internet of things and cloud computing to assist in cooking food. People may speak to Amazon Echo during the cooking in order to get the information and situation of the cooking. Amazon Echo recognizes what people say, then transfers the information to the cloud services, and speaks to people the results that cloud services make by querying the embedded…

    Submitted 4 December, 2018; originally announced December 2018.

    Comments: 9 pages, 3 figures, 2 tables

    Journal ref: IJSCAI 4 (2018) 37-45