-
BLSP-Emo: Towards Empathetic Large Speech-Language Models
Authors:
Chen Wang,
Minpeng Liao,
Zhongqiang Huang,
Junhong Wu,
Chengqing Zong,
Jiajun Zhang
Abstract:
The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.
Submitted 6 June, 2024;
originally announced June 2024.
-
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Authors:
Chen Wang,
Minpeng Liao,
Zhongqiang Huang,
Jiajun Zhang
Abstract:
Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous-integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with a comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
Submitted 29 May, 2024;
originally announced May 2024.
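
The first technique in the BLSP-KD abstract (matching next-token distributions for speech and text inputs) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the KL direction (text input as teacher, speech input as student), and the averaging over positions are assumptions made here for illustration, and it presumes the speech and text sequences are already aligned one-to-one.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(text_logits, speech_logits):
    """Mean KL(teacher || student) over aligned token positions.

    text_logits:   per-position logits when the LLM reads the transcript (teacher, assumed).
    speech_logits: per-position logits when the LLM reads the speech input (student, assumed).
    """
    if len(text_logits) != len(speech_logits):
        raise ValueError("positions must be aligned one-to-one")
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(text_logits, speech_logits)
    ]
    return sum(per_position) / len(per_position)

# Hypothetical toy logits over a 3-word vocabulary at 2 positions.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
student = [[0.0, 2.0, -1.0], [0.5, 0.5, 0.5]]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(teacher, student)
```

The loss is zero when both inputs yield the same distributions and grows as they diverge, which is the quantity a distillation objective of this kind would drive down.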
-
Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification
Authors:
Minghui Liao,
Guojia Wan,
Bo Du
Abstract:
Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, classifying neurons by their anatomical, physiological, or molecular characteristics is inefficient and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectomes consisting of high-resolution neuronal morphology and connectivity information. However, few models are built on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from the skeleton and topological information between neurons obtained from the neural circuit. Specifically, NeuNet consists of three components, namely Skeleton Encoder, Connectome Encoder, and Readout Layer. Skeleton Encoder integrates the local information of neurons in a bottom-up manner, with a one-dimensional convolution over the neural skeleton's point data; Connectome Encoder uses a graph neural network to capture the topological information of the neural circuit; finally, Readout Layer fuses the two kinds of information and outputs classification results. We reprocess and release two new datasets for the neuron classification task from volume electron microscopy (VEM) images of human brain cortex and Drosophila brain. Experiments on these two datasets demonstrate the effectiveness of our model, with accuracies of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet.
Submitted 25 March, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
Transferred Thin Film Lithium Niobate as Millimeter Wave Acoustic Filter Platforms
Authors:
Omar Barrera,
Sinwoo Cho,
Kenny Huynh,
Jack Kramer,
Michael Liao,
Vakhtang Chulukhadze,
Lezli Matto,
Mark S. Goorsky,
Ruochen Lu
Abstract:
This paper reports the first high-performance acoustic filters toward millimeter wave (mmWave) bands using transferred single-crystal thin film lithium niobate (LiNbO3). By transferring LiNbO3 on the top of silicon (Si) and sapphire (Al2O3) substrates with an intermediate amorphous Si (aSi) bonding and sacrificial layer, we demonstrate compact acoustic filters with record-breaking performance beyond 20 GHz. In the LN-aSi-Al2O3 platform, the third-order ladder filter exhibits low insertion loss (IL) of 1.62 dB and 3-dB fractional bandwidth (FBW) of 19.8% at 22.1 GHz, while in the LN-aSi-Si platform, the filter shows low IL of 2.38 dB and FBW of 18.2% at 23.5 GHz. Material analysis validates the great crystalline quality of the stacks. The high-resolution x-ray diffraction (HRXRD) shows full width half maximum (FWHM) of 53 arcsec for Al2O3 and 206 arcsec for Si, both remarkably low compared to piezoelectric thin films of similar thickness. The reported results bring the state-of-the-art (SoA) of compact acoustic filters to much higher frequencies, and highlight transferred LiNbO3 as promising platforms for mmWave filters in future wireless front ends.
Submitted 21 November, 2023;
originally announced November 2023.
-
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
Authors:
Chen Wang,
Minpeng Liao,
Zhongqiang Huang,
Jinliang Lu,
Junhong Wu,
Yuchen Liu,
Chengqing Zong,
Jiajun Zhang
Abstract:
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
Submitted 28 May, 2024; v1 submitted 2 September, 2023;
originally announced September 2023.
-
Thin-Film Lithium Niobate Acoustic Resonator with High Q of 237 and k2 of 5.1% at 50.74 GHz
Authors:
Jack Kramer,
Vakhtang Chulukhadze,
Kenny Huynh,
Omar Barrera,
Michael Liao,
Sinwoo Cho,
Lezli Matto,
Mark S. Goorsky,
Ruochen Lu
Abstract:
This work reports a 50.74 GHz lithium niobate (LiNbO3) acoustic resonator with a high quality factor (Q) of 237 and an electromechanical coupling (k2) of 5.17% resulting in a figure of merit (FoM, Q x k2) of 12.2. The LiNbO3 resonator employs a novel bilayer periodically poled piezoelectric film (P3F) 128 Y-cut LiNbO3 on amorphous silicon (a-Si) on sapphire stack to achieve low losses and high coupling at millimeter wave (mm-wave). The device also shows a Q of 159, k2 of 65.06%, and FoM of 103.4 for the 16.99 GHz tone. This result shows promising prospects of P3F LiNbO3 towards mm-wave front-end filters.
Submitted 11 July, 2023;
originally announced July 2023.
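
The abstract defines the figure of merit as FoM = Q x k2. As a quick arithmetic check of the quoted numbers, the sketch below recomputes it from the stated Q and k2 values; the function name and the choice to pass k2 as a percentage are assumptions made here for illustration.

```python
def figure_of_merit(q, k2_percent):
    """Resonator figure of merit, FoM = Q x k2, with k2 given in percent."""
    return q * k2_percent / 100.0

# Values quoted in the abstract.
fom_50ghz = figure_of_merit(237, 5.17)    # 50.74 GHz tone
fom_17ghz = figure_of_merit(159, 65.06)   # 16.99 GHz tone
```

The products come out at roughly 12.25 and 103.45, which agree with the reported 12.2 and 103.4 to within rounding of the published Q and k2 values.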
-
Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary
Authors:
Sibo Zhang,
Jiahong Yuan,
Miao Liao,
Liangjun Zhang
Abstract:
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from the text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It only needs a fraction of the training data used by an audio-driven approach; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing, training and inference time. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
Submitted 22 January, 2022; v1 submitted 29 April, 2021;
originally announced April 2021.
-
Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses
Authors:
Miao Liao,
Sibo Zhang,
Peng Wang,
Hao Zhu,
Xinxin Zuo,
Ruigang Yang
Abstract:
In this paper, we propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressive body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process in both learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps our model quickly learn meaningful body movement from a few recorded videos. To produce photo-realistic and high-resolution video with motion details, we propose to insert part attention mechanisms in the conditional GAN, where each detailed part, e.g. head and hand, is automatically zoomed in to have its own discriminator. To validate our approach, we collect a dataset with 20 high-quality videos from 1 male and 1 female model reading various documents under different topics. Compared with previous SoTA pipelines handling similar tasks, our approach achieves better results in a user study.
Submitted 8 October, 2020; v1 submitted 17 July, 2020;
originally announced July 2020.
-
DVI: Depth Guided Video Inpainting for Autonomous Driving
Authors:
Miao Liao,
Feixiang Lu,
Dingfu Zhou,
Sibo Zhang,
Wei Li,
Ruigang Yang
Abstract:
To get clear street-view and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point cloud. By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated via this common 3D map. In order to fill a target inpainting area in a frame, it is straightforward to transform pixels from other frames into the current one with correct occlusion. Furthermore, we are able to fuse multiple videos through 3D point cloud registration, making it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem where an occluded area has never been visible in the entire video. To our knowledge, we are the first to fuse multiple videos for video inpainting. To verify the effectiveness of our approach, we build a large inpainting dataset in a real urban road environment with synchronized images and Lidar data, including many challenging scenes, e.g., long-time occlusion. The experimental results show that the proposed approach outperforms the state-of-the-art approaches on all criteria; in particular, the RMSE (Root Mean Squared Error) is reduced by about 13%.
Submitted 17 July, 2020;
originally announced July 2020.
-
YOLOv4: Optimal Speed and Accuracy of Object Detection
Authors:
Alexey Bochkovskiy,
Chien-Yao Wang,
Hong-Yuan Mark Liao
Abstract:
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet
Submitted 22 April, 2020;
originally announced April 2020.
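
Of the features the YOLOv4 abstract lists, the Mish activation has a compact closed form, mish(x) = x * tanh(softplus(x)). The sketch below is a scalar, pure-Python illustration of that standard definition, not code taken from the darknet repository; the helper names are chosen here for illustration.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

# Sample the activation at a few points.
samples = {x: mish(x) for x in (-20.0, -1.0, 0.0, 1.0, 20.0)}
```

The function is smooth, passes through zero at the origin, approaches the identity for large positive inputs, and decays toward zero for large negative inputs.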
-
Optimization Algorithms for Catching Data Manipulators in Power System Estimation Loops
Authors:
Mang Liao,
Aranya Chakrabortty
Abstract:
In this paper we develop a set of algorithms that can detect the identities of malicious data-manipulators in distributed optimization loops for estimating oscillation modes in large power system models. The estimation is posed in terms of a consensus problem among multiple local estimators that jointly solve for the characteristic polynomial of the network model. If any of these local estimates are compromised by a malicious attacker, resulting in an incorrect value of the consensus variable, then the entire estimation loop can be destabilized. We present four iterative algorithms by which this instability can be quickly detected, and the identities of the compromised estimators can be revealed. The algorithms are solely based on the computed values of the estimates, and do not need any information about the model of the power system. Both large and covert attacks are considered. Results are illustrated using simulations of an IEEE 68-bus power system model.
Submitted 28 March, 2017; v1 submitted 31 July, 2016;
originally announced August 2016.