-
Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
Authors:
Ui-Hyeop Shin,
Sangyoun Lee,
Taehan Kim,
Hyung-Min Park
Abstract:
Since the success of a time-domain speech separation, further improvements have been made by expanding the length and channel of a feature sequence to increase the amount of computation. When temporally expanded to a long sequence, the feature is segmented into chunks as a dual-path model in most studies of speech separation. In particular, it is common for the process of separating features corre…
▽ More
Since the success of a time-domain speech separation, further improvements have been made by expanding the length and channel of a feature sequence to increase the amount of computation. When temporally expanded to a long sequence, the feature is segmented into chunks as a dual-path model in most studies of speech separation. In particular, it is common for the process of separating features corresponding to each speaker to be located in the final stage of the network. However, it is more advantageous and intuitive to proactively expand the feature sequence to include the number of speakers as an extra dimension. In this paper, we present an asymmetric strategy in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and the output of the encoder is split into the number of speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, as Siamese network, in addition to cross-speaker processing. By using the Siamese network in the decoder, without using speaker information, the network directly learns to discriminate the features using a separation objective. With a common split layer, intermediate encoder features for skip connections are also split for the reconstruction decoder based on the U-Net structure. In addition, instead of segmenting the feature into chunks as dual-path, we design global and local Transformer blocks to directly process long sequences. The experimental results demonstrated that this separation-and-reconstruction framework is effective and that the combination of proposed global and local Transformer can sufficiently replace the role of inter- and intra-chunk processing in dual-path structure. Finally, the presented model including both of these achieved state-of-the-art performance with less computation than before in various benchmark datasets.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Intelligent and Miniaturized Neural Interfaces: An Emerging Era in Neurotechnology
Authors:
Mahsa Shoaran,
Uisub Shin,
MohammadAli Shaeri
Abstract:
Integrating smart algorithms on neural devices presents significant opportunities for various brain disorders. In this paper, we review the latest advancements in the development of three categories of intelligent neural prostheses featuring embedded signal processing on the implantable or wearable device. These include: 1) Neural interfaces for closed-loop symptom tracking and responsive stimulat…
▽ More
Integrating smart algorithms on neural devices presents significant opportunities for various brain disorders. In this paper, we review the latest advancements in the development of three categories of intelligent neural prostheses featuring embedded signal processing on the implantable or wearable device. These include: 1) Neural interfaces for closed-loop symptom tracking and responsive stimulation; 2) Neural interfaces for emerging network-related conditions, such as psychiatric disorders; and 3) Intelligent BMI SoCs for movement recovery following paralysis.
△ Less
Submitted 31 May, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Learning to Control Camera Exposure via Reinforcement Learning
Authors:
Kyunghyun Lee,
Ukcheol Shin,
Byeong-Uk Lee
Abstract:
Adjusting camera exposure in arbitrary lighting conditions is the first step to ensure the functionality of computer vision applications. Poorly adjusted camera exposure often leads to critical failure and performance degradation. Traditional camera exposure control methods require multiple convergence steps and time-consuming processes, making them unsuitable for dynamic lighting conditions. In t…
▽ More
Adjusting camera exposure in arbitrary lighting conditions is the first step to ensure the functionality of computer vision applications. Poorly adjusted camera exposure often leads to critical failure and performance degradation. Traditional camera exposure control methods require multiple convergence steps and time-consuming processes, making them unsuitable for dynamic lighting conditions. In this paper, we propose a new camera exposure control framework that rapidly controls camera exposure while performing real-time processing by exploiting deep reinforcement learning. The proposed framework consists of four contributions: 1) a simplified training ground to simulate real-world's diverse and dynamic lighting changes, 2) flickering and image attribute-aware reward design, along with lightweight state design for real-time processing, 3) a static-to-dynamic lighting curriculum to gradually improve the agent's exposure-adjusting capability, and 4) domain randomization techniques to alleviate the limitation of the training ground and achieve seamless generalization in the wild.As a result, our proposed method rapidly reaches a desired exposure level within five steps with real-time processing (1 ms). Also, the acquired images are well-exposed and show superiority in various computer vision tasks, such as feature extraction and object detection.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification
Authors:
Hyun-Jun Heo,
Ui-Hyeop Shin,
Ran Lee,
YoungJu Cheon,
Hyung-Min Park
Abstract:
In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing one-dimensional(1D) Res2Net block and squeeze-and-excitation(SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by referring to Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in…
▽ More
In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing one-dimensional(1D) Res2Net block and squeeze-and-excitation(SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by referring to Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is constructed using two separated sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows for flexible capturing of inter-frame and intra-frame contexts. Additionally, we introduce global response normalization (GRN) for the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. Experimental results demonstrate that NeXt-TDNN, with a modernized backbone block, significantly improved performance in speaker verification tasks while reducing parameter size and inference time. We have released our code for future studies.
△ Less
Submitted 14 December, 2023; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Statistical Beamformer Exploiting Non-stationarity and Sparsity with Spatially Constrained ICA for Robust Speech Recognition
Authors:
Ui-Hyeop Shin,
Hyung-Min Park
Abstract:
In this paper, we present a statistical beamforming algorithm as a pre-processing step for robust automatic speech recognition (ASR). By modeling the target speech as a non-stationary Laplacian distribution, a mask-based statistical beamforming algorithm is proposed to exploit both its output and masked input variance for robust estimation of the beamformer. In addition, we also present a method f…
▽ More
In this paper, we present a statistical beamforming algorithm as a pre-processing step for robust automatic speech recognition (ASR). By modeling the target speech as a non-stationary Laplacian distribution, a mask-based statistical beamforming algorithm is proposed to exploit both its output and masked input variance for robust estimation of the beamformer. In addition, we also present a method for steering vector estimation (SVE) based on a noise power ratio obtained from the target and noise outputs in independent component analysis (ICA). To update the beamformer in the same ICA framework, we derive ICA with distortionless and null constraints on target speech, which yields beamformed speech at the target output and noises at the other outputs, respectively. The demixing weights for the target output result in a statistical beamformer with the weighted spatial covariance matrix (wSCM) using a weighting function characterized by a source model. To enhance the SVE, the strict null constraints imposed by the Lagrange multiplier methods are relaxed by generalized penalties with weight parameters, while the strict distortionless constraints are maintained. Furthermore, we derive an online algorithm based on an optimization technique of recursive least squares (RLS) for practical applications. Experimental results on various environments using CHiME-4 and LibriCSS datasets demonstrate the effectiveness of the presented algorithm compared to conventional beamforming and blind source extraction (BSE) based on ICA on both batch and online processing.
△ Less
Submitted 5 January, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
DRL-ISP: Multi-Objective Camera ISP with Deep Reinforcement Learning
Authors:
Ukcheol Shin,
Kyunghyun Lee,
In So Kweon
Abstract:
In this paper, we propose a multi-objective camera ISP framework that utilizes Deep Reinforcement Learning (DRL) and camera ISP toolbox that consist of network-based and conventional ISP tools. The proposed DRL-based camera ISP framework iteratively selects a proper tool from the toolbox and applies it to the image to maximize a given vision task-specific reward function. For this purpose, we impl…
▽ More
In this paper, we propose a multi-objective camera ISP framework that utilizes Deep Reinforcement Learning (DRL) and camera ISP toolbox that consist of network-based and conventional ISP tools. The proposed DRL-based camera ISP framework iteratively selects a proper tool from the toolbox and applies it to the image to maximize a given vision task-specific reward function. For this purpose, we implement total 51 ISP tools that include exposure correction, color-and-tone correction, white balance, sharpening, denoising, and the others. We also propose an efficient DRL network architecture that can extract the various aspects of an image and make a rigid map** relationship between images and a large number of actions. Our proposed DRL-based ISP framework effectively improves the image quality according to each vision task such as RAW-to-RGB image restoration, 2D object detection, and monocular depth estimation.
△ Less
Submitted 7 July, 2022;
originally announced July 2022.
-
A 16-Channel Low-Power Neural Connectivity Extraction and Phase-Locked Deep Brain Stimulation SoC
Authors:
Uisub Shin,
Cong Ding,
Virginia Woods,
Alik S. Widge,
Mahsa Shoaran
Abstract:
Growing evidence suggests that phase-locked deep brain stimulation (DBS) can effectively regulate abnormal brain connectivity in neurological and psychiatric disorders. This letter therefore presents a low-power SoC with both neural connectivity extraction and phase-locked DBS capabilities. A 16-channel low-noise analog front-end (AFE) records local field potentials (LFPs) from multiple brain regi…
▽ More
Growing evidence suggests that phase-locked deep brain stimulation (DBS) can effectively regulate abnormal brain connectivity in neurological and psychiatric disorders. This letter therefore presents a low-power SoC with both neural connectivity extraction and phase-locked DBS capabilities. A 16-channel low-noise analog front-end (AFE) records local field potentials (LFPs) from multiple brain regions with precise gain matching. A novel low-complexity phase estimator and neural connectivity processor subsequently enable energy-efficient, yet accurate measurement of the instantaneous phase and cross-regional synchrony measures. Through flexible combination of neural biomarkers such as phase synchrony and spectral energy, a four-channel charge-balanced neurostimulator is triggered to treat various pathological brain conditions. Fabricated in 65nm CMOS, the SoC occupies a silicon area of 2.24mm2 and consumes 60uW, achieving over 60% power saving in neural connectivity extraction compared to the state-of-the-art. Extensive in-vivo measurements demonstrate multi-channel LFP recording, real-time extraction of phase and neural connectivity measures, and phase-locked stimulation in rats.
△ Less
Submitted 3 July, 2022;
originally announced July 2022.
-
NeuralTree: A 256-Channel 0.227-$μ$J/Class Versatile Neural Activity Classification and Closed-Loop Neuromodulation SoC
Authors:
Uisub Shin,
Cong Ding,
Bingzhao Zhu,
Yashwanth Vyza,
Alix Trouillet,
Emilie C. M. Revol,
Stéphanie P. Lacour,
Mahsa Shoaran
Abstract:
Closed-loop neural interfaces with on-chip machine learning can detect and suppress disease symptoms in neurological disorders or restore lost functions in paralyzed patients. While high-density neural recording can provide rich neural activity information for accurate disease-state detection, existing systems have low channel counts and poor scalability, which could limit their therapeutic effica…
▽ More
Closed-loop neural interfaces with on-chip machine learning can detect and suppress disease symptoms in neurological disorders or restore lost functions in paralyzed patients. While high-density neural recording can provide rich neural activity information for accurate disease-state detection, existing systems have low channel counts and poor scalability, which could limit their therapeutic efficacy. This work presents a highly scalable and versatile closed-loop neural interface SoC that can overcome these limitations. A 256-channel time-division multiplexed (TDM) front-end with a two-step fast-settling mixed-signal DC servo loop (DSL) is proposed to record high-spatial-resolution neural activity and perform channel-selective brain-state inference. A tree-structured neural network (NeuralTree) classification processor extracts a rich set of neural biomarkers in a patient- and disease-specific manner. Trained with an energy-aware learning algorithm, the NeuralTree classifier detects the symptoms of underlying disorders (e.g., epilepsy and movement disorders) at an optimal energy-accuracy tradeoff. A 16-channel high-voltage (HV) compliant neurostimulator closes the therapeutic loop by delivering charge-balanced biphasic current pulses to the brain. The proposed SoC was fabricated in 65-nm CMOS and achieved a 0.227-$μ$J/class energy efficiency in a compact area of 0.014mm$^2$/channel. The SoC was extensively verified on human electroencephalography (EEG) and intracranial EEG (iEEG) epilepsy datasets, obtaining 95.6%/94% sensitivity and 96.8%/96.9% specificity, respectively. In vivo neural recordings using soft $μ$ECoG arrays and multi-domain biomarker extraction were further performed on a rat model of epilepsy. In addition, for the first time in literature, on-chip classification of rest-state tremor in Parkinson's disease (PD) from human local field potentials (LFPs) was demonstrated.
△ Less
Submitted 8 December, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Closed-Loop Neural Prostheses with On-Chip Intelligence: A Review and A Low-Latency Machine Learning Model for Brain State Detection
Authors:
Bingzhao Zhu,
Uisub Shin,
Mahsa Shoaran
Abstract:
The application of closed-loop approaches in systems neuroscience and therapeutic stimulation holds great promise for revolutionizing our understanding of the brain and for develo** novel neuromodulation therapies to restore lost functions. Neural prostheses capable of multi-channel neural recording, on-site signal processing, rapid symptom detection, and closed-loop stimulation are critical to…
▽ More
The application of closed-loop approaches in systems neuroscience and therapeutic stimulation holds great promise for revolutionizing our understanding of the brain and for develo** novel neuromodulation therapies to restore lost functions. Neural prostheses capable of multi-channel neural recording, on-site signal processing, rapid symptom detection, and closed-loop stimulation are critical to enabling such novel treatments. However, the existing closed-loop neuromodulation devices are too simplistic and lack sufficient on-chip processing and intelligence. In this paper, we first discuss both commercial and investigational closed-loop neuromodulation devices for brain disorders. Next, we review state-of-the-art neural prostheses with on-chip machine learning, focusing on application-specific integrated circuits (ASIC). System requirements, performance and hardware comparisons, design trade-offs, and hardware optimization techniques are discussed. To facilitate a fair comparison and guide design choices among various on-chip classifiers, we propose a new energy-area (E-A) efficiency figure of merit that evaluates hardware efficiency and multi-channel scalability. Finally, we present several techniques to improve the key design metrics of tree-based on-chip classifiers, both in the context of ensemble methods and oblique structures.
△ Less
Submitted 13 September, 2021;
originally announced September 2021.