C3LLM: Conditional Multimodal Content Generation Using Large Language Models

Authors: Zixuan Wang, Qinkai Duan, Yu-Wing Tai, Chi-Keung Tang

Abstract: We introduce C3LLM (Conditioned-on-Three-Modalities Large Language Models), a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities, synthesizing the given conditional information, and making multimodal generation in a discrete manner. Our contributions are as follows. First, we adapt a hierarchical structure for audio generation tasks with pre-trained audio codebooks. Specifically, we train the LLM to generate audio semantic tokens from the given conditions, and further use a non-autoregressive transformer to generate different levels of acoustic tokens in layers to better enhance the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete tasks with the next-word prediction method, we use the discrete representation for audio generation and compress their semantic meanings into acoustic tokens, similar to adding "acoustic vocabulary" to LLM. Third, our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model, providing more versatility in an end-to-end fashion. Our C3LLM achieves improved results through various automated evaluation metrics, providing better semantic alignment compared to previous methods.

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2308.15144

TKwinFormer: Top k Window Attention in Vision Transformers for Feature Matching

Authors: Yun Liao, Yide Di, Hao Zhou, Kaijun Zhu, Mingyu Lu, Yijia Zhang, Qing Duan, Junhui Liu

Abstract: Local feature matching remains a challenging task, primarily due to difficulties in matching sparse keypoints and low-texture regions. The key to solving this problem lies in effectively and accurately integrating global and local information. To achieve this goal, we introduce an innovative local feature matching method called TKwinFormer. Our approach employs a multi-stage matching strategy to optimize the efficiency of information interaction. Furthermore, we propose a novel attention mechanism called Top K Window Attention, which facilitates global information interaction through window tokens prior to patch-level matching, resulting in improved matching accuracy. Additionally, we design an attention block to enhance attention between channels. Experimental results demonstrate that TKwinFormer outperforms state-of-the-art methods on various benchmarks. Code is available at: https://github.com/LiaoYun0x0/TKwinFormer.

Submitted 29 August, 2023; originally announced August 2023.

Comments: 11 pages, 7 figures

ACM Class: I.4.7

arXiv:2202.03424

Reinforcement learning for multi-item retrieval in the puzzle-based storage system

Authors: Jing He, Xinglu Liu, Qiyao Duan, Wai Kin Victor Chan, Mingyao Qi

Abstract: Nowadays, fast delivery services have created the need for high-density warehouses. The puzzle-based storage system is a practical way to enhance the storage density, however, facing difficulties in the retrieval process. In this work, a deep reinforcement learning algorithm, specifically the Double&Dueling Deep Q Network, is developed to solve the multi-item retrieval problem in the system with general settings, where multiple desired items, escorts, and I/O points are placed randomly. Additionally, we propose a general compact integer programming model to evaluate the solution quality. Extensive numerical experiments demonstrate that the reinforcement learning approach can yield high-quality solutions and outperforms three related state-of-the-art heuristic algorithms. Furthermore, a conversion algorithm and a decomposition framework are proposed to handle simultaneous movement and large-scale instances respectively, thus improving the applicability of the PBS system.

Submitted 5 February, 2022; originally announced February 2022.

Comments: 32 pages, 13 figures, 5 tables, journal

arXiv:2107.05011

doi 10.1109/TSP.2022.3150953

Dual Optimization for Kolmogorov Model Learning Using Enhanced Gradient Descent

Authors: Qiyou Duan, Hadi Ghauch, Taejoon Kim

Abstract: Data representation techniques have made a substantial contribution to advancing data processing and machine learning (ML). Improving predictive power was the focus of previous representation techniques, which unfortunately perform rather poorly on the interpretability in terms of extracting underlying insights of the data. Recently, the Kolmogorov model (KM) was studied, which is an interpretable and predictable representation approach to learning the underlying probabilistic structure of a set of random variables. The existing KM learning algorithms using semi-definite relaxation with randomization (SDRwR) or discrete monotonic optimization (DMO) have, however, limited utility to big data applications because they do not scale well computationally. In this paper, we propose a computationally scalable KM learning algorithm, based on the regularized dual optimization combined with enhanced gradient descent (GD) method. To make our method more scalable to large-dimensional problems, we propose two acceleration schemes, namely, the eigenvalue decomposition (EVD) elimination strategy and an approximate EVD algorithm. Furthermore, a thresholding technique by exploiting the error bound analysis and leveraging the normalized Minkowski $\ell_1$-norm, is provided for the selection of the number of iterations of the approximate EVD algorithm. When applied to big data applications, it is demonstrated that the proposed method can achieve compatible training/prediction performance with significantly reduced computational complexity; roughly two orders of magnitude improvement in terms of the time overhead, compared to the existing KM learning algorithms. Furthermore, it is shown that the accuracy of logical relation mining for interpretability by using the proposed KM learning algorithm exceeds $80\%$.

Submitted 20 May, 2022; v1 submitted 11 July, 2021; originally announced July 2021.

Comments: Published in the IEEE Transactions on Signal Processing (15 pages, 11 figures, and 6 tables)

arXiv:2007.13299

Enhanced Beam Alignment for Millimeter Wave MIMO Systems: A Kolmogorov Model

Authors: Qiyou Duan, Taejoon Kim, Hadi Ghauch

Abstract: We present an enhancement to the problem of beam alignment in millimeter wave (mmWave) multiple-input multiple-output (MIMO) systems, based on a modification of the machine learning-based criterion, called Kolmogorov model (KM), previously applied to the beam alignment problem. Unlike the previous KM, whose computational complexity is not scalable with the size of the problem, a new approach, centered on discrete monotonic optimization (DMO), is proposed, leading to significantly reduced complexity. We also present a Kolmogorov-Smirnov (KS) criterion for the advanced hypothesis testing, which does not require any subjective threshold setting compared to the frequency estimation (FE) method developed for the conventional KM. Simulation results that demonstrate the efficacy of the proposed KM learning for mmWave beam alignment are presented.

Submitted 26 July, 2020; originally announced July 2020.

Comments: Submitted to the 2020 IEEE Globecom

arXiv:2004.07031

SenseCare: A Research Platform for Medical Image Informatics and Interactive 3D Visualization

Authors: Qi Duan, Guotai Wang, Rui Wang, Chao Fu, Xinjun Li, Na Wang, Yechong Huang, Xiaodi Huang, Tao Song, Liang Zhao, Xinglong Liu, Qing Xia, Zhiqiang Hu, Yinan Chen, Shaoting Zhang

Abstract: Clinical research on smart health has an increasing demand for intelligent and clinic-oriented medical image computing algorithms and platforms that support various applications. To this end, we have developed SenseCare research platform, which is designed to facilitate translational research on intelligent diagnosis and treatment planning in various clinical scenarios. To enable clinical research with Artificial Intelligence (AI), SenseCare provides a range of AI toolkits for different tasks, including image segmentation, registration, lesion and landmark detection from various image modalities ranging from radiology to pathology. In addition, SenseCare is clinic-oriented and supports a wide range of clinical applications such as diagnosis and surgical planning for lung cancer, pelvic tumor, coronary artery disease, etc. SenseCare provides several appealing functions and features such as advanced 3D visualization, concurrent and efficient web-based access, fast data synchronization and high data security, multi-center deployment, support for collaborative research, etc. In this report, we present an overview of SenseCare as an efficient platform providing comprehensive toolkits and high extensibility for intelligent image analysis and clinical research in different application scenarios. We also summarize the research outcome through the collaboration with multiple hospitals.

Submitted 2 September, 2022; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: 15 pages, 16 figures

arXiv:1910.03729

Large-scale Gastric Cancer Screening and Localization Using Multi-task Deep Neural Network

Authors: Hong Yu, Xiaofan Zhang, Lingjun Song, Liren Jiang, Xiaodi Huang, Wen Chen, Chenbin Zhang, Jiahui Li, Jiji Yang, Zhiqiang Hu, Qi Duan, Wanyuan Chen, Xianglei He, Jinshuang Fan, Weihai Jiang, Li Zhang, Chengmin Qiu, Minmin Gu, Weiwei Sun, Yangqiong Zhang, Guangyin Peng, Weiwei Shen, Guohui Fu

Abstract: Gastric cancer is one of the most common cancers, which ranks third among the leading causes of cancer death. Biopsy of gastric mucosa is a standard procedure in gastric cancer screening test. However, manual pathological inspection is labor-intensive and time-consuming. Besides, it is challenging for an automated algorithm to locate the small lesion regions in the gigapixel whole-slide image and make the decision correctly. To tackle these issues, we collected large-scale whole-slide image dataset with detailed lesion region annotation and designed a whole-slide image analyzing framework consisting of 3 networks which could not only determine the screening result but also present the suspicious areas to the pathologist for reference. Experiments demonstrated that our proposed framework achieves sensitivity of 97.05% and specificity of 92.72% in screening task and Dice coefficient of 0.8331 in segmentation task. Furthermore, we tested our best model in real-world scenario on 10,315 whole-slide images collected from 4 medical centers.

Submitted 19 September, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: under minor revision

arXiv:1909.07616

doi 10.1109/LSP.2019.2942737

Coherence Statistics of Structured Random Ensembles and Support Detection Bounds for OMP

Authors: Qiyou Duan, Taejoon Kim, Lin Dai, Erik Perrins

Abstract: A structured random matrix ensemble that maintains constant modulus entries and unit-norm columns, often called a random phase-rotated (RPR) matrix, is considered in this paper. We analyze the coherence statistics of RPR measurement matrices and apply them to acquire probabilistic performance guarantees of orthogonal matching pursuit (OMP) for support detection (SD). It is revealed via numerical simulations that the SD performance guarantee provides a tight characterization, especially when the signal is sparse.

Submitted 17 September, 2019; originally announced September 2019.

Comments: Accepted for publication in the IEEE Signal Processing Letters

Showing 1–8 of 8 results for author: Duan, Q