Search | arXiv e-print repository

SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation

Authors: Shuangrui Ding, Zihan Liu, Xiaoyi Dong, Pan Zhang, Rui Qian, Conghui He, Dahua Lin, Jiaqi Wang

Abstract: We present SongComposer, an innovative LLM designed for song composition. It can understand and generate melodies and lyrics in symbolic song representations by leveraging the capabilities of LLMs. Existing music-related LLMs treat music as quantized audio signals, an implicit encoding that is inefficient and inflexible. In contrast, we resort to symbolic song representation, the mature and efficient way humans designed for music, and enable the LLM to explicitly compose songs like humans. In practice, we design a novel tuple format for the lyric and three note attributes (pitch, duration, and rest duration) in the melody, which guarantees the correct LLM understanding of musical symbols and realizes precise alignment between lyrics and melody. To impart basic music understanding to the LLM, we carefully collected SongCompose-PT, a large-scale song pretraining dataset that includes lyrics, melodies, and paired lyrics-melodies in either Chinese or English. After adequate pre-training, 10K carefully crafted QA pairs are used to empower the LLM with instruction-following capability and solve diverse tasks. With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation, outperforming advanced LLMs like GPT-4.

Submitted 27 February, 2024; originally announced February 2024.

Comments: project page: https://pjlab-songcomposer.github.io/ code: https://github.com/pjlab-songcomposer/songcomposer

arXiv:2312.15633 [pdf, other]

MuLA-GAN: Multi-Level Attention GAN for Enhanced Underwater Visibility

Authors: Ahsan Baidar Bakht, Zikai Jia, Muhayy ud Din, Waseem Akram, Lyes Saad Soud, Lakmal Seneviratne, Defu Lin, Shaoming He, Irfan Hussain

Abstract: The underwater environment presents unique challenges, including color distortions, reduced contrast, and blurriness, hindering accurate analysis. In this work, we introduce MuLA-GAN, a novel approach that leverages the synergistic power of Generative Adversarial Networks (GANs) and Multi-Level Attention mechanisms for comprehensive underwater image enhancement. The integration of Multi-Level Attention within the GAN architecture significantly enhances the model's capacity to learn discriminative features crucial for precise image restoration. By selectively focusing on relevant spatial and multi-level features, our model excels in capturing and preserving intricate details in underwater imagery, essential for various applications. Extensive qualitative and quantitative analyses on diverse datasets, including UIEB test dataset, UIEB challenge dataset, U45, and UCCS dataset, highlight the superior performance of MuLA-GAN compared to existing state-of-the-art methods. Experimental evaluations on a specialized dataset tailored for bio-fouling and aquaculture applications demonstrate the model's robustness in challenging environmental conditions. On the UIEB test dataset, MuLA-GAN achieves exceptional PSNR (25.59) and SSIM (0.893) scores, surpassing Water-Net, the second-best model, with scores of 24.36 and 0.885, respectively. This work not only addresses a significant research gap in underwater image enhancement but also underscores the pivotal role of Multi-Level Attention in enhancing GANs, providing a novel and comprehensive framework for restoring underwater image quality.

Submitted 25 December, 2023; originally announced December 2023.

arXiv:2312.09452 [pdf, other]

Efficient Multi-Pair IoT Communication with Holographically Enhanced Meta-Surfaces Leveraging OAM Beams: Bridging Theory and Prototype

Authors: Yufei Zhao, Yong Liang Guan, Afkar Mohamed Ismail, Gaohua Ju, Deyu Lin, Yilong Lu, Chau Yuen

Abstract: Meta-surfaces, also known as Reconfigurable Intelligent Surfaces (RIS), have emerged as a cost-effective, low-power, and flexible solution for enabling multiple applications in the Internet of Things (IoT). However, in the context of meta-surface-assisted multi-pair IoT communications, significant interference issues often arise among multiple channels. This issue is particularly pronounced in scenarios characterized by Line-of-Sight (LoS) conditions, where the channels exhibit low rank due to the significant correlation in propagation paths. These challenges pose a considerable threat to the quality of communication when multiplexing data streams. In this paper, we introduce a meta-surface-aided communication scheme for multi-pair interactions in IoT environments. Inspired by holographic technology, we propose a novel compensation method over the whole meta-surface, which allows independent multi-pair direct data stream transmission with low interference. To further reduce correlation under LoS channel conditions, we propose a vortex beam-based solution that leverages the low correlation property between distinct topological modes. We use different vortex beams to carry distinct data streams, thereby enabling distinct receivers to capture their intended signal with low interference, aided by holographic meta-surfaces. Moreover, a prototype has been successfully implemented to demonstrate a two-pair multi-node communication scenario operating at 10 GHz with QPSK/16-QAM modulation.

Submitted 18 November, 2023; originally announced December 2023.

Comments: Meta-surface, RIS, Internet-of-Things (IoT), Line-of-Sight (LoS), Orbital Angular Momentum (OAM), holographic communications, multi-user

arXiv:2311.05609 [pdf, other]

What Do I Hear? Generating Sounds for Visuals with ChatGPT

Authors: David Chuan-En Lin, Nikolas Martelaro

Abstract: This short paper introduces a workflow for generating realistic soundscapes for visual media. In contrast to prior work, which primarily focuses on matching sounds for on-screen visuals, our approach extends to suggesting sounds that may not be immediately visible but are essential to crafting a convincing and immersive auditory environment. Our key insight is leveraging the reasoning capabilities of language models, such as ChatGPT. In this paper, we describe our workflow, which includes creating a scene context, brainstorming sounds, and generating the sounds.

Submitted 9 November, 2023; originally announced November 2023.

Comments: Demo: http://soundify.cc

arXiv:2309.07178 [pdf]

CloudBrain-NMR: An Intelligent Cloud Computing Platform for NMR Spectroscopy Processing, Reconstruction and Analysis

Authors: Di Guo, Si** Li, Jun Liu, Zhangren Tu, Tianyu Qiu, **g**g Xu, Liubin Feng, Donghai Lin, Qing Hong, Mei** Lin, Yanqin Lin, Xiaobo Qu

Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy has served as a powerful analytical tool for studying molecular structure and dynamics in chemistry and biology. However, the processing of raw data acquired from NMR spectrometers and subsequent quantitative analysis involves various specialized tools, which necessitates comprehensive knowledge of programming and NMR. In particular, emerging deep learning tools are hard to use widely in NMR because of their sophisticated computational setup. Thus, NMR processing is not an easy task for chemists and biologists. In this work, we present CloudBrain-NMR, an intelligent online cloud computing platform designed for NMR data reading, processing, reconstruction, and quantitative analysis. The platform is conveniently accessed through a web browser, eliminating the need for any program installation on the user side. CloudBrain-NMR uses parallel computing with graphics processing units and central processing units, resulting in significantly shortened computation time. Furthermore, it incorporates state-of-the-art deep learning-based algorithms offering comprehensive functionalities that allow users to complete the entire processing procedure without relying on additional software. This platform has empowered NMR applications with advanced artificial intelligence processing. CloudBrain-NMR is openly accessible for free usage at https://csrc.xmu.edu.cn/CloudBrain.html

Submitted 12 September, 2023; originally announced September 2023.

Comments: 11 pages, 13 figures

arXiv:2306.07263 [pdf, other]

Enlarging Stability Region of Urban Networks with Imminent Supply Prediction

Authors: Dianchao Lin, Li Li

Abstract: The stability region is a key index that characterizes a dynamic processing system's ability to handle incoming demands. It is a multidimensional space when the system has multiple origin-destination (OD) pairs whose service rates interact. An urban traffic network is such a system. Traffic congestion appears when its demand approaches or exceeds the upper frontier of its stability region. In this decade, with the rapid development of traffic sensing technology, real-time traffic operations, e.g., BackPressure (BP) control, have gained considerable research attention. An urban network's mobility could be further improved with these timely demand-responding strategies. However, most studies on real-time controls continue with traditional supply assumptions and ignore an important fact: the imminent saturation flow rate (I-SFR), i.e., the system's real-time service rate under green, is neither fixed nor given, and is hard to know precisely. It is unknown how the knowledge level of I-SFR would influence the stability region. This paper proves that knowing the I-SFR more accurately can enlarge the upper frontier of the network's stability region. Furthermore, a BP policy with predicted I-SFR can stabilize the network within the enlarged stability region and relieve the congestion level of the traffic network. Therefore, improving the I-SFR's prediction accuracy is meaningful for traffic operations.

Submitted 8 April, 2024; v1 submitted 12 June, 2023; originally announced June 2023.

arXiv:2305.17865 [pdf, other]

doi 10.1109/TVT.2023.3257048

An Efficient Safety-oriented Car-following Model for Connected Automated Vehicles Considering Discrete Signals

Authors: Dianchao Lin, Li Li

Abstract: With the rapid development of Connected and Automated Vehicle (CAV) technology, a limited number of self-driving vehicles have become commercially available in certain leading intelligent transportation system countries. When formulating the car-following model for CAVs, safety is usually the basic constraint. Safety-oriented car-following models seek to specify a safe following distance that can guarantee safety if the preceding vehicle were to brake hard suddenly. The discrete signals of CAVs bring a series of phenomena, including discrete decision-making, phase difference, and discretely distributed communication delay. The influences of these phenomena on the car-following safety of CAVs are rarely considered in the literature. This paper proposes an efficient safety-oriented car-following model for CAVs considering the impact of discrete signals. The safety constraints during both normal driving and a sudden hard brake are incorporated into one integrated model to eliminate possible collisions during the whole driving process. The mechanical delay information of the preceding vehicle is used to improve car-following efficiency. Four modules are designed to enhance driving comfort and string stability in case of heavy packet losses. Simulations of a platoon with diversified vehicle types demonstrate the safety, efficiency, and string stability of the proposed model. Tests with different packet loss rates imply that the model could guarantee safety and driving comfort even in poor communication environments.

Submitted 28 May, 2023; originally announced May 2023.

arXiv:2305.17183 [pdf]

ProGroTrack: Deep Learning-Assisted Tracking of Intracellular Protein Growth Dynamics

Authors: Kai San Chan, Huimiao Chen, Chenyu **, Yuxuan Tian, Dingchang Lin

Abstract: Accurate tracking of cellular and subcellular structures, along with their dynamics, plays a pivotal role in understanding the underlying mechanisms of biological systems. This paper presents a novel approach, ProGroTrack, that combines the You Only Look Once (YOLO) and ByteTrack algorithms within the detection-based tracking (DBT) framework to track intracellular protein nanostructures. Focusing on iPAK4 protein fibers as a representative case study, we conducted a comprehensive evaluation of YOLOv5 and YOLOv8 models, revealing the superior performance of YOLOv5 on our dataset. Notably, YOLOv5x achieved an mAP50 of 0.839 and an F-score of 0.819. To further optimize detection capabilities, we incorporated semi-supervised learning for model improvement, resulting in enhanced performance in all metrics. Subsequently, we successfully applied our approach to track the growth behavior of iPAK4 protein fibers, revealing their two distinct growth phases consistent with a previously reported kinetic model. This research showcases the promising potential of our approach, extending beyond iPAK4 fibers. It also offers a significant advancement in precise tracking of dynamic processes in live cells, fostering new avenues for biomedical research.

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2304.14920 [pdf, other]

An EEG Channel Selection Framework for Driver Drowsiness Detection via Interpretability Guidance

Authors: Xinliang Zhou, Dan Lin, Ziyu Jia, Chenyu Liu, Liming Zhai, Yang Liu

Abstract: Drowsy driving has a crucial influence on driving safety, creating an urgent demand for driver drowsiness detection. Electroencephalogram (EEG) signal can accurately reflect the mental fatigue state and thus has been widely studied in drowsiness monitoring. However, the raw EEG data is inherently noisy and redundant, which is neglected by existing works that just use single-channel EEG data or full-head channel EEG data for model training, resulting in limited performance of driver drowsiness detection. In this paper, we are the first to propose an Interpretability-guided Channel Selection (ICS) framework for the driver drowsiness detection task. Specifically, we design a two-stage training strategy to progressively select the key contributing channels with the guidance of interpretability. We first train a teacher network in the first stage using full-head channel EEG data. Then we apply the class activation mapping (CAM) to the trained teacher model to highlight the high-contributing EEG channels and further propose a channel voting scheme to select the top N contributing EEG channels. Finally, we train a student network with the selected channels of EEG data in the second stage for driver drowsiness detection. Experiments are designed on a public dataset, and the results demonstrate that our method is highly applicable and can significantly improve the performance of cross-subject driver drowsiness detection.

Submitted 26 April, 2023; originally announced April 2023.

arXiv:2301.12688 [pdf, other]

Dynamic Storyboard Generation in an Engine-based Virtual Environment for Video Production

Authors: Anyi Rao, Xuekun Jiang, Yuwei Guo, Linning Xu, Lei Yang, Libiao **, Dahua Lin, Bo Dai

Abstract: Amateurs working on mini-films and short-form videos usually spend lots of time and effort on the multi-round complicated process of setting and adjusting scenes, plots, and cameras to deliver satisfying video shots. We present Virtual Dynamic Storyboard (VDS) to allow users to storyboard shots in virtual environments, where the filming staff can easily test the settings of shots before the actual filming. VDS runs in a "propose-simulate-discriminate" mode: given a formatted story script and a camera script as input, it generates several character animation and camera movement proposals following predefined story and cinematic rules to allow an off-the-shelf simulation engine to render videos. To pick the top-quality dynamic storyboard from the candidates, we equip it with a shot ranking discriminator based on shot quality criteria learned from professional manually created data. VDS is comprehensively validated via extensive experiments and user studies, demonstrating its efficiency, effectiveness, and great potential in assisting amateur video production.

Submitted 21 July, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

Comments: Project page: https://virtualfilmstudio.github.io/

arXiv:2209.08500 [pdf, other]

A Map-matching Algorithm with Extraction of Multi-group Information for Low-frequency Data

Authors: Jie Fang, Xiongwei Wu, Dianchao Lin, Mengyun Xu, Huahua Wu, Xuesong Wu, Ting Bi

Abstract: The growing use of probe vehicles generates a huge amount of GNSS data. Limited by satellite positioning technology, further improving the accuracy of map-matching is challenging work, especially for low-frequency trajectories. When matching a trajectory, the ego vehicle's spatial-temporal information of the present trip is the most useful with the least amount of data. In addition, there is a large amount of other data, e.g., other vehicles' states and past prediction results, but it is hard to extract useful information from it for matching maps and inferring paths. Most map-matching studies use only the ego vehicle's data and ignore other vehicles' data. Motivated by this, this paper designs a new map-matching method to make full use of "Big data". We first sort all data into four groups according to their spatial and temporal distance from the present matching probe, which allows us to rank their usefulness. Then we design three different methods to extract valuable information (scores) from them: a score for speed and bearing, a score for historical usage, and a score for traffic state using the spectral graph Markov neural network. Finally, we use a modified top-K shortest-path method to search the candidate paths within an ellipse region and then use the fused score to infer the path (projected location). We test the proposed method against baseline algorithms using a real-world dataset in China. The results show that all scoring methods can enhance map-matching accuracy. Furthermore, our method outperforms the others, especially when GNSS probing frequency is less than 0.01 Hz.

Submitted 18 September, 2022; originally announced September 2022.

Comments: 10 pages, 11 figures, 4 tables

arXiv:2209.04547 [pdf, other]

doi 10.1109/TDSC.2023.3289446

Defend Data Poisoning Attacks on Voice Authentication

Authors: Ke Li, Cameron Baird, Dan Lin

Abstract: With the advances in deep learning, speaker verification has achieved very high accuracy and is gaining popularity as a type of biometric authentication option in many scenes of our daily life, especially the growing market of web services. Compared to traditional passwords, "vocal passwords" are much more convenient as they relieve people from memorizing different passwords. However, new machine learning attacks are putting these voice authentication systems at risk. Without a strong security guarantee, attackers could access legitimate users' web accounts by fooling the deep neural network (DNN) based voice recognition models. In this paper, we demonstrate an easy-to-implement data poisoning attack on the voice authentication system, which can hardly be captured by existing defense mechanisms. Thus, we propose a more robust defense method, called Guardian, which is a convolutional neural network-based discriminator. The Guardian discriminator integrates a series of novel techniques including bias reduction, input augmentation, and ensemble learning. Our approach is able to distinguish about 95% of attacked accounts from normal accounts, which is much more effective than existing approaches with only 60% accuracy.

Submitted 7 July, 2023; v1 submitted 9 September, 2022; originally announced September 2022.

Journal ref: IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 14, NO. 8, AUGUST 2022

arXiv:2206.14940 [pdf, other]

Physics-Inspired Unsupervised Classification for Region of Interest in X-Ray Ptychography

Authors: Dergan Lin, Yi Jiang, Jun**g Deng, Zichao Wendy Di

Abstract: X-ray ptychography allows for large fields to be imaged at high resolution at the cost of additional computational expense due to the large volume of data. Given limited information regarding the object, the acquired data often has an excessive amount of information that is outside the region of interest (RoI). In this work we propose a physics-inspired unsupervised learning algorithm to identify the RoI of an object using only diffraction patterns from a ptychography dataset before committing computational resources to reconstruction. Obtained diffraction patterns that are automatically identified as not within the RoI are filtered out, allowing efficient reconstruction by focusing only on important data within the RoI while preserving image quality.

Submitted 29 June, 2022; originally announced June 2022.

arXiv:2204.11669 [pdf]

doi 10.1038/s41746-023-00859-y

Deep-learning-enabled Brain Hemodynamic Mapping Using Resting-state fMRI

Authors: Xirui Hou, Pengfei Guo, Puyang Wang, Peiying Liu, Doris D. M. Lin, Hongli Fan, Yang Li, Zhiliang Wei, Zixuan Lin, Dengrong Jiang, ** **, Catherine Kelly, Jay J. Pillai, Judy Huang, Marco C. Pinho, Binu P. Thomas, Babu G. Welch, Denise C. Park, Vishal M. Patel, Argye E. Hillis, Hanzhang Lu

Abstract: Cerebrovascular disease is a leading cause of death globally. Prevention and early intervention are known to be the most effective forms of its management. Non-invasive imaging methods hold great promise for early stratification, but at present lack the sensitivity for personalized prognosis. Resting-state functional magnetic resonance imaging (rs-fMRI), a powerful tool previously used for mapping neural activity, is available in most hospitals. Here we show that rs-fMRI can be used to map cerebral hemodynamic function and delineate impairment. By exploiting time variations in breathing pattern during rs-fMRI, deep learning enables reproducible mapping of cerebrovascular reactivity (CVR) and bolus arrival time (BAT) of the human brain using resting-state CO2 fluctuations as a natural 'contrast media'. The deep-learning network was trained with CVR and BAT maps obtained with a reference method of CO2-inhalation MRI, which included data from young and older healthy subjects and patients with Moyamoya disease and brain tumors. We demonstrate the performance of deep-learning cerebrovascular mapping in the detection of vascular abnormalities, evaluation of revascularization effects, and vascular alterations in normal aging. In addition, cerebrovascular maps obtained with the proposed method exhibited excellent reproducibility in both healthy volunteers and stroke patients. Deep-learning resting-state vascular imaging has the potential to become a useful tool in clinical cerebrovascular imaging.

Submitted 25 April, 2022; originally announced April 2022.

Journal ref: npj Digital Medicine (2023) 116

arXiv:2201.02366 [pdf, other]

Uncertainty-Aware Cascaded Dilation Filtering for High-Efficiency Deraining

Authors: Qing Guo, **gyang Sun, Felix Juefei-Xu, Lei Ma, Di Lin, Wei Feng, Song Wang

Abstract: Deraining is a significant and fundamental computer vision task, aiming to remove the rain streaks and accumulations in an image or video captured on a rainy day. Existing deraining methods usually make heuristic assumptions about the rain model, which compels them to employ complex optimization or iterative refinement for high recovery quality. This, however, leads to time-consuming methods and reduces their effectiveness on rain patterns that deviate from the assumptions. In this paper, we propose a simple yet efficient deraining method by formulating deraining as a predictive filtering problem without complex rain model assumptions. Specifically, we identify spatially-variant predictive filtering (SPFilt) that adaptively predicts proper kernels via a deep network to filter different individual pixels. Since the filtering can be implemented via well-accelerated convolution, our method can be significantly efficient. We further propose EfDeRain+, which contains three main contributions to address residual rain traces, multi-scale, and diverse rain patterns without harming the efficiency. First, we propose the uncertainty-aware cascaded predictive filtering (UC-PFilt) that can identify the difficulties of reconstructing clean pixels via predicted kernels and remove the residual rain traces effectively. Second, we design the weight-sharing multi-scale dilated filtering (WS-MS-DFilt) to handle multi-scale rain streaks without harming the efficiency. Third, to eliminate the gap across diverse rain patterns, we propose a novel data augmentation method (i.e., RainMix) to train our deep models. By combining all contributions with sophisticated analysis on different variants, our final method outperforms baseline methods on four single-image deraining datasets and one video deraining dataset in terms of both recovery quality and speed.

Submitted 7 January, 2022; originally announced January 2022.

Comments: 14 pages, 10 figures, 10 tables. This is the extension of our conference version https://github.com/tsingqguo/efficientderain

arXiv:2201.00317 [pdf, other]

Recurrent Feature Propagation and Edge Skip-Connections for Automatic Abdominal Organ Segmentation

Authors: Zefan Yang, Di Lin, Dong Ni, Yi Wang

Abstract: Automatic segmentation of abdominal organs in computed tomography (CT) images can support radiation therapy and image-guided surgery workflows. Developing such automatic solutions remains challenging, mainly owing to complex organ interactions and blurry boundaries in CT images. To address these issues, we focus on effective spatial context modeling and explicit edge segmentation priors. Accordingly, we propose a 3D network with four main components trained end-to-end: a shared encoder, an edge detector, a decoder with edge skip-connections (ESCs), and a recurrent feature propagation head (RFP-Head). To capture wide-range spatial dependencies, the RFP-Head propagates and harvests local features through directed acyclic graphs (DAGs) formulated with recurrent connections in an efficient slice-wise manner, with regard to spatial arrangement of image units. To leverage edge information, the edge detector learns edge prior knowledge specifically tuned for semantic segmentation by exploiting intermediate features from the encoder with the edge supervision. The ESCs then aggregate the edge knowledge with multi-level decoder features to learn a hierarchy of discriminative features explicitly modeling complementarity between organs' interiors and edges for segmentation. We conduct extensive experiments on two challenging abdominal CT datasets with eight annotated organs. Experimental results show that the proposed network outperforms several state-of-the-art models, especially for the segmentation of small and complicated structures (gallbladder, esophagus, stomach, pancreas and duodenum). The code will be publicly available.

Submitted 19 May, 2023; v1 submitted 2 January, 2022; originally announced January 2022.

arXiv:2112.09726 [pdf, other]

doi 10.1145/3586183.3606823

Soundify: Matching Sound Effects to Video

Authors: David Chuan-En Lin, Anastasis Germanidis, Cristóbal Valenzuela, Yining Shi, Nikolas Martelaro

Abstract: In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.

Submitted 25 June, 2024; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: https://soundify.cc

arXiv:2104.06162 [pdf, other]

Visually Informed Binaural Audio Generation without Binaural Audios

Authors: Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin

Abstract: Stereophonic audio, especially binaural audio, plays an essential role in immersive viewing environments. Recent research has explored generating visually guided stereophonic audios supervised by multi-channel audio collections. However, due to the requirement of professional recording devices, existing datasets are limited in scale and variety, which impedes the generalization of supervised methods in real-world scenarios. In this work, we propose PseudoBinaural, an effective pipeline that is free of binaural recordings. The key insight is to carefully build pseudo visual-stereo pairs with mono data for training. Specifically, we leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received binaural audios. Then in the visual modality, corresponding visual cues of the mono data are manually placed at sound source positions to form the pairs. Compared to fully-supervised paradigms, our binaural-recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference. Moreover, combined with binaural recordings, our method is able to further boost the performance of binaural audio generation under supervised settings.

Submitted 13 April, 2021; originally announced April 2021.

Comments: Accepted by CVPR 2021. Code, models, and demo video are available on our webpage: https://sheldontsui.github.io/projects/PseudoBinaural

arXiv:2012.14830 [pdf]

doi 10.1109/TNNLS.2022.3144580

A Sparse Model-inspired Deep Thresholding Network for Exponential Signal Reconstruction -- Application in Fast Biological Spectroscopy

Authors: Zi Wang, Di Guo, Zhangren Tu, Yihui Huang, Yirong Zhou, Jian Wang, Liubin Feng, Donghai Lin, Yongfu You, Tatiana Agback, Vladislav Orekhov, Xiaobo Qu

Abstract: The non-uniform sampling is a powerful approach to enable fast acquisition but requires sophisticated reconstruction algorithms. Faithful reconstruction from partial sampled exponentials is highly expected in general signal processing and many applications. Deep learning has shown astonishing potential in this field but many existing problems, such as lack of robustness and explainability, greatly limit its applications. In this work, by combining merits of the sparse model-based optimization method and data-driven deep learning, we propose a deep learning architecture for spectra reconstruction from undersampled data, called MoDern. It follows the iterative reconstruction in solving a sparse model to build the neural network and we elaborately design a learnable soft-thresholding to adaptively eliminate the spectrum artifacts introduced by undersampling. Extensive results on both synthetic and biological data show that MoDern enables more robust, high-fidelity, and ultra-fast reconstruction than the state-of-the-art methods. Remarkably, MoDern has a small number of network parameters and is trained on solely synthetic data while generalizing well to biological data in various scenarios. Furthermore, we extend it to an open-access and easy-to-use cloud computing platform (XCloud-MoDern), contributing a promising strategy for further development of biological applications.

Submitted 17 January, 2022; v1 submitted 29 December, 2020; originally announced December 2020.

Comments: 30 pages

arXiv:2010.10298 [pdf]

The Detection of Thoracic Abnormalities ChestX-Det10 Challenge Results

Authors: Jie Lian, Jingyu Liu, Yizhou Yu, Mengyuan Ding, Yaoci Lu, Yi Lu, Jie Cai, Deshou Lin, Miao Zhang, Zhe Wang, Kai He, Yijie Yu

Abstract: The detection of thoracic abnormalities challenge is organized by the Deepwise AI Lab. The challenge is divided into two rounds. In this paper, we present the results of 6 teams which reach the second round. The challenge adopts the ChestX-Det10 dataset proposed by the Deepwise AI Lab. ChestX-Det10 is the first chest X-Ray dataset with instance-level annotations, including 10 categories of disease/abnormality of 3,543 images. The annotations are located at https://github.com/Deepwise-AILab/ChestX-Det10-Dataset. In the challenge, we randomly split all data into 3001 images for training and 542 images for testing.

Submitted 21 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

arXiv:2010.00472 [pdf, other]

doi 10.1109/IGARSS.2018.8518855

High Quality Remote Sensing Image Super-Resolution Using Deep Memory Connected Network

Authors: Wenjia Xu, Guangluan Xu, Yang Wang, Xian Sun, Daoyu Lin, Yirong Wu

Abstract: Single image super-resolution is an effective way to enhance the spatial resolution of remote sensing image, which is crucial for many applications such as target detection and image classification. However, existing methods based on the neural network usually have small receptive fields and ignore the image detail. We propose a novel method named deep memory connected network (DMCN) based on a convolutional neural network to reconstruct high-quality super-resolution images. We build local and global memory connections to combine image detail with environmental information. To further reduce parameters and ease time-consuming, we propose downsampling units, shrinking the spatial size of feature maps. We test DMCN on three remote sensing datasets with different spatial resolution. Experimental results indicate that our method yields promising improvements in both accuracy and visual performance over the current state-of-the-art.

Submitted 1 October, 2020; originally announced October 2020.

Comments: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium

arXiv:2008.03548 [pdf, other]

A Unified Framework for Shot Type Classification Based on Subject Centric Lens

Authors: Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, Dahua Lin

Abstract: Shots are key narrative elements of various videos, e.g. movies, TV series, and user-generated videos that are thriving over the Internet. The types of shots greatly influence how the underlying ideas, emotions, and messages are expressed. The technique to analyze shot types is important to the understanding of videos, which has seen increasing demand in real-world applications in this era. Classifying shot type is challenging due to the additional information required beyond the video content, such as the spatial composition of a frame and camera movement. To address these issues, we propose a learning framework Subject Guidance Network (SGNet) for shot type recognition. SGNet separates the subject and background of a shot into two streams, serving as separate guidance maps for scale and movement type classification respectively. To facilitate shot type analysis and model evaluations, we build a large-scale dataset MovieShots, which contains 46K shots from 7K movie trailers with annotations of their scale and movement types. Experiments show that our framework is able to recognize these two attributes of shot accurately, outperforming all the previous methods.

Submitted 8 August, 2020; originally announced August 2020.

Comments: ECCV2020. Project page: https://anyirao.com/projects/ShotType.html

arXiv:2008.03546 [pdf, other]

Online Multi-modal Person Search in Videos

Authors: Jiangyue Xia, Anyi Rao, Qingqiu Huang, Linning Xu, Jiangtao Wen, Dahua Lin

Abstract: The task of searching certain people in videos has seen increasing potential in real-world applications, such as video organization and editing. Most existing approaches are devised to work in an offline manner, where identities can only be inferred after an entire video is examined. This working manner precludes such methods from being applied to online services or those applications that require real-time responses. In this paper, we propose an online person search framework, which can recognize people in a video on the fly. This framework maintains a multimodal memory bank at its heart as the basis for person recognition, and updates it dynamically with a policy obtained by reinforcement learning. Our experiments on a large movie dataset show that the proposed method is effective, not only achieving remarkable improvements over online schemes but also outperforming offline methods.

Submitted 8 August, 2020; originally announced August 2020.

Comments: ECCV2020. Project page: http://movienet.site/projects/eccv20onlineperson.html

arXiv:2007.09902 [pdf, other]

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

Authors: Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu

Abstract: Stereophonic audio is an indispensable ingredient to enhance human auditory experience. Recent research has explored the usage of visual information as guidance to generate binaural or ambisonic audio from mono ones with stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: the recording of stereophonic audio usually requires delicate devices that are expensive for wide accessibility. To overcome this challenge, we propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio. Our key observation is that the task of visually indicated audio separation also maps independent audios to their corresponding visual positions, which shares a similar objective with stereophonic audio generation. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization. Specifically, a novel associative pyramid network architecture is carefully designed for audio-visual feature fusion. Extensive experiments demonstrate that our framework can improve the stereophonic audio generation results while performing accurate sound separation with a shared backbone.

Submitted 20 July, 2020; originally announced July 2020.

Comments: To appear in Proceedings of the European Conference on Computer Vision (ECCV), 2020. Code, models, and video results are available on our webpage: https://hangz-nju-cuhk.github.io/projects/Sep-Stereo

arXiv:2006.08773 [pdf, other]

doi 10.1109/ITSC45102.2020.9294641

Comparative Analysis of Economic Instruments in Intersection Operation: A User-Based Perspective

Authors: DianChao Lin, Saif Eddin Jabari

Abstract: Focusing on different economic instruments implemented in intersection operations under a connected environment, this paper analyzes their advantages and disadvantages from the travelers' perspective. Travelers' concerns revolve around whether a new instrument is easy to learn and operate, whether it can save time or money, and whether it can reduce the rich-poor gap. After a comparative analysis, we found that both credit and free-market schemes can benefit users. Second-price auctions can only benefit high VOT vehicles. From the perspective of technology deployment and adoption, a credit scheme is not easy to learn and operate for travelers.

Submitted 15 June, 2020; originally announced June 2020.

Comments: 6 pages, 8 figures, 6 tables, IEEE-ITSC2020

Report number: 2020

Journal ref: The 23rd IEEE International Conference on Intelligent Transportation Systems, 2020

arXiv:2006.08766 [pdf, other]

doi 10.1109/ITSC45102.2020.9294680

A User-Based Charge and Subsidy Scheme for Single O-D Network Mobility Management

Authors: Li Li, Dianchao Lin, Saif Eddin Jabari

Abstract: We propose a path guidance system with a user-based charge and subsidy (UBCS) scheme for single O-D network mobility management. Users who are willing to join the scheme (subscribers) can submit travel requests along with their VOTs to the system before traveling. Those who are not willing to join (outsiders) only need to submit travel requests to the system. Our system will give all users path guidance from their origins to their destinations, and collect a \emph{path payment} from the UBCS subscribers. Subscribers will be charged or subsidized in a way that renders the UBCS strategy-proof, revenue-neutral, and Pareto-improving. A numerical example shows that the UBCS scheme is equitable and progressive.

Submitted 15 June, 2020; originally announced June 2020.

Comments: 6 pages, 3 figures, 2 tables, IEEE ITSC 2020

Report number: 2020

Journal ref: The 23rd IEEE International Conference on Intelligent Transportation Systems, 2020

arXiv:2003.13659 [pdf, other]

Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation

Authors: Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, Ping Luo

Abstract: Learning a good image prior is a long-term goal for image restoration and manipulation. While existing methods like deep image prior (DIP) capture low-level image statistics, there are still gaps toward an image prior that captures rich image semantics including color, spatial coherence, textures, and high-level concepts. This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. As shown in Fig.1, the deep generative prior (DGP) provides compelling results to restore missing semantics, e.g., color, patch, resolution, of various degraded images. It also enables diverse image manipulation including random jittering, image morphing, and category transfer. Such highly flexible restoration and manipulation are made possible through relaxing the assumption of existing GAN-inversion methods, which tend to fix the generator. Notably, we allow the generator to be fine-tuned on-the-fly in a progressive manner regularized by feature distance obtained by the discriminator in GAN. We show that these easy-to-implement and practical changes help preserve the reconstruction to remain in the manifold of nature image, and thus lead to more precise and faithful reconstruction for real images. Code is available at https://github.com/XingangPan/deep-generative-prior.

Submitted 20 July, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

Comments: Accepted to ECCV2020 as oral. 1) Precise GAN-inversion by discriminator-guided generator finetuning. 2) A versatile way for high-quality image restoration and manipulation. Code: https://github.com/XingangPan/deep-generative-prior

arXiv:2002.05512 [pdf, other]

Real or Not Real, that is the Question

Authors: Yuanbo Xiangli, Yubin Deng, Bo Dai, Chen Change Loy, Dahua Lin

Abstract: While generative adversarial networks (GAN) have been widely adopted in various topics, in this paper we generalize the standard GAN to a new perspective by treating realness as a random variable that can be estimated from multiple angles. In this generalized framework, referred to as RealnessGAN, the discriminator outputs a distribution as the measure of realness. While RealnessGAN shares similar theoretical guarantees with the standard GAN, it provides more insights on adversarial learning. Compared to multiple baselines, RealnessGAN provides stronger guidance for the generator, achieving improvements on both synthetic and real-world datasets. Moreover, it enables the basic DCGAN architecture to generate realistic images at 1024*1024 resolution when trained from scratch.

Submitted 12 February, 2020; originally announced February 2020.

Comments: ICLR2020 spotlight. 1) train GAN by maximizing kl-divergence. 2) train non-progressive GAN (DCGAN) architecture at 1024*1024 resolution

arXiv:2001.01813 [pdf, other]

doi 10.1109/TITS.2020.3048475

Pay for Intersection Priority: A Free Market Mechanism for Connected Vehicles

Authors: DianChao Lin, Saif Eddin Jabari

Abstract: The rapid development and deployment of vehicle technologies offer opportunities to re-think the way traffic is managed. This paper capitalizes on vehicle connectivity and proposes an economic instrument and corresponding cooperative framework for allocating priority at intersections. The framework is compatible with a variety of existing intersection control approaches. Similar to free markets, our framework allows vehicles to trade their time based on their (disclosed) value of time. We design the framework based on transferable utility games, where winners (time buyers) pay losers (time sellers) in each game. We conduct simulation experiments of both isolated intersections and an arterial setting. The results show that the proposed approach benefits the majority of users when compared to other mechanisms, both ones that employ an economic instrument and ones that do not. We also show that it drives travelers to estimate their value of time correctly, and it naturally dissuades travelers from attempting to cheat.

Submitted 29 December, 2020; v1 submitted 6 January, 2020; originally announced January 2020.

Journal ref: IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 6, 2022

arXiv:1912.00191 [pdf, other]

Learning a Decision Module by Imitating Driver's Control Behaviors

Authors: Junning Huang, Sirui Xie, Jiankai Sun, Qiurui Ma, Chunxiao Liu, Jianping Shi, Dahua Lin, Bolei Zhou

Abstract: Autonomous driving systems have a pipeline of perception, decision, planning, and control. The decision module processes information from the perception module and directs the execution of downstream planning and control modules. On the other hand, the recent success of deep learning suggests that this pipeline could be replaced by end-to-end neural control policies; however, safety cannot be well guaranteed for the data-driven neural networks. In this work, we propose a hybrid framework to learn neural decisions in the classical modular pipeline through end-to-end imitation learning. This hybrid framework can preserve the merits of the classical pipeline such as the strict enforcement of physical and logical constraints while learning complex driving decisions from data. To circumvent the ambiguous annotation of human driving decisions, our method learns high-level driving decisions by imitating low-level control behaviors. We show in the simulation experiments that our modular driving agent can generalize its driving decision and control to various complex scenarios where the rule-based programs fail. It can also generate smoother and safer driving trajectories than end-to-end neural policies.

Submitted 5 May, 2021; v1 submitted 30 November, 2019; originally announced December 2019.

Comments: Proceedings of the Conference on Robot Learning (CoRL) 2020

arXiv:1908.11602 [pdf, other]

Recursive Visual Sound Separation Using Minus-Plus Net

Authors: Xudong Xu, Bo Dai, Dahua Lin

Abstract: Sounds provide rich semantics, complementary to visual data, for many tasks. However, in practice, sounds from multiple sources are often mixed together. In this paper we propose a novel framework, referred to as MinusPlus Network (MP-Net), for the task of visual sound separation. MP-Net separates sounds recursively in the order of average energy, removing the separated sound from the mixture at the end of each prediction, until the mixture becomes empty or contains only noise. In this way, MP-Net could be applied to sound mixtures with arbitrary numbers and types of sounds. Moreover, while MP-Net keeps removing sounds with large energy from the mixture, sounds with small energy could emerge and become clearer, so that the separation is more accurate. Compared to previous methods, MP-Net obtains state-of-the-art results on two large scale datasets, across mixtures with different types and numbers of sounds.

Submitted 23 October, 2019; v1 submitted 30 August, 2019; originally announced August 2019.

Comments: accepted by ICCV2019

arXiv:1906.07155 [pdf, other]

MMDetection: Open MMLab Detection Toolbox and Benchmark

Authors: Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, Dahua Lin

Abstract: We present MMDetection, an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. The toolbox started from a codebase of MMDet team who won the detection track of COCO Challenge 2018. It gradually evolves into a unified platform that covers many popular detection methods and contemporary modules. It not only includes training and inference codes, but also provides weights for more than 200 network models. We believe this toolbox is by far the most complete detection toolbox. In this paper, we introduce the various features of this toolbox. In addition, we also conduct a benchmarking study on different methods, components, and their hyper-parameters. We wish that the toolbox and benchmark could serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop their own new detectors. Code and models are available at https://github.com/open-mmlab/mmdetection. The project is under active development and we will keep this document updated.

Submitted 17 June, 2019; originally announced June 2019.

Comments: Technical report of MMDetection. 11 pages

arXiv:1806.02692 [pdf, other]

doi 10.1016/j.trb.2018.07.004

Traffic state estimation using stochastic Lagrangian dynamics

Authors: Fangfang Zheng, Saif Eddin Jabari, Henry X. Liu, DianChao Lin

Abstract: This paper proposes a new stochastic model of traffic dynamics in Lagrangian coordinates. The source of uncertainty is heterogeneity in driving behavior, captured using driver-specific speed-spacing relations, i.e., parametric uncertainty. It also results in smooth vehicle trajectories in a stochastic context, which is in agreement with real-world traffic dynamics, thereby overcoming issues with aggressive oscillation typically observed in sample paths of stochastic traffic flow models. We utilize ensemble filtering techniques for data assimilation (traffic state estimation), but derive the mean and covariance dynamics as the ensemble sizes go to infinity, thereby bypassing the need to sample from the parameter distributions while estimating the traffic states. As a result, the estimation algorithm is just a standard Kalman-Bucy algorithm, which renders the proposed approach amenable to real-time applications using recursive data. Data assimilation examples are performed and our results indicate good agreement with out-of-sample data.

Submitted 31 May, 2018; originally announced June 2018.

Journal ref: Transportation Research Part B: Methodological Volume 115, September 2018, Pages 143-165

Showing 1–33 of 33 results for author: Lin, D