Search | arXiv e-print repository

MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Authors: Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu

Abstract: Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignmen… ▽ More Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over SOTA methods. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted in Efficient Deep Learning for Computer Vision CVPR Workshop 2024

arXiv:2405.17995 [pdf, other]

DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Authors: Shentong Mo, Sukmin Yun

Abstract: The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, resulting in a reduction of discriminative power and can… ▽ More The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, resulting in a reduction of discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate our effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks. Code is available at: \url{https://github.com/DMTJEPA/DMTJEPA}. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.07202 [pdf, other]

Unified Video-Language Pre-training with Synchronized Audio

Authors: Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

Abstract: Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either captured the correspondence of image-text pairs or utilized temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two m… ▽ More Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either captured the correspondence of image-text pairs or utilized temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed as VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, our VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model pre-trained on only 0.9M data achieves improving results against state-of-the-art baselines. In addition, qualitative visualizations vividly showcase the superiority of our VLSA in learning discriminative visual-textual representations. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2403.07938 [pdf, other]

Text-to-Audio Generation Synchronized with Videos

Authors: Shentong Mo, **g Shi, Yapeng Tian

Abstract: In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often… ▽ More In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency. To complement this, we also present a simple yet effective video-aligned TTA generation model, namely T2AV. Moving beyond traditional methods, T2AV refines the latent diffusion approach by integrating visual-aligned text embeddings as its conditional foundation. It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet that adeptly merges temporal visual representations with text embeddings. Further enhancing this integration, we weave in a contrastive learning objective, designed to ensure that the visual-aligned text embeddings resonate closely with the audio features. Extensive evaluations on the AudioCaps and T2AV-Bench demonstrate that our T2AV sets a new standard for video-aligned TTA generation in ensuring visual alignment and temporal consistency. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.12903

arXiv:2311.15080 [pdf, other]

Weakly-Supervised Audio-Visual Segmentation

Authors: Shentong Mo, Bhiksha Raj

Abstract: Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotat… ▽ More Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios. △ Less

Submitted 25 November, 2023; originally announced November 2023.

arXiv:2305.19458 [pdf, other]

A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Authors: Shentong Mo, Pedro Morgado

Abstract: The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as t… ▽ More The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task. Historically, these abilities were tackled separately, with several methods developed independently for each task. However, given the interconnected nature of source localization, separation, and recognition, independent models are likely to yield suboptimal performance as they fail to capture the interdependence between these tasks. To address this problem, we propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition. OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives. The first objective aligns audio and visual representations through a localized audio-visual correspondence loss. The second tackles visual source separation using a traditional mix-and-separate framework. Finally, the third objective reinforces visual feature separation and localization by mixing images in pixel space and aligning their representations with those of all corresponding sound sources. Extensive experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks, audio-visual source localization, separation, and nearest neighbor recognition, and empirically demonstrate a strong positive transfer between them. △ Less

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.01836 [pdf, other]

AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

Authors: Shentong Mo, Yapeng Tian

Abstract: Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks. However, there is less exploration concerning how SAM works on audio-visual tasks, such as visual sound localization and segmentation. In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that ca… ▽ More Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks. However, there is less exploration concerning how SAM works on audio-visual tasks, such as visual sound localization and segmentation. In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio. Specifically, our AV-SAM simply leverages pixel-wise audio-visual fusion across audio features and visual features from the pre-trained image encoder in SAM to aggregate cross-modal representations. Then, the aggregated cross-modal features are fed into the prompt encoder and mask decoder to generate the final audio-visual segmentation masks. We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets. The results demonstrate that the proposed AV-SAM can achieve competitive performance on sounding object localization and segmentation. △ Less

Submitted 2 May, 2023; originally announced May 2023.

arXiv:2205.01679 [pdf, other]

Physics to the Rescue: Deep Non-line-of-sight Reconstruction for High-speed Imaging

Authors: Fangzhou Mu, Sicheng Mo, Jiayong Peng, Xiaochun Liu, Ji Hyun Nam, Siddeshwar Raghavan, Andreas Velten, Yin Li

Abstract: Computational approach to imaging around the corner, or non-line-of-sight (NLOS) imaging, is becoming a reality thanks to major advances in imaging hardware and reconstruction algorithms. A recent development towards practical NLOS imaging, Nam et al. demonstrated a high-speed non-confocal imaging system that operates at 5Hz, 100x faster than the prior art. This enormous gain in acquisition rate,… ▽ More Computational approach to imaging around the corner, or non-line-of-sight (NLOS) imaging, is becoming a reality thanks to major advances in imaging hardware and reconstruction algorithms. A recent development towards practical NLOS imaging, Nam et al. demonstrated a high-speed non-confocal imaging system that operates at 5Hz, 100x faster than the prior art. This enormous gain in acquisition rate, however, necessitates numerous approximations in light transport, breaking many existing NLOS reconstruction methods that assume an idealized image formation model. To bridge the gap, we present a novel deep model that incorporates the complementary physics priors of wave propagation and volume rendering into a neural network for high-quality and robust NLOS reconstruction. This orchestrated design regularizes the solution space by relaxing the image formation model, resulting in a deep model that generalizes well on real captures despite being exclusively trained on synthetic data. Further, we devise a unified learning framework that enables our model to be flexibly trained using diverse supervision signals, including target intensity images or even raw NLOS transient measurements. Once trained, our model renders both intensity and depth images at inference time in a single forward pass, capable of processing more than 5 captures per second on a high-end GPU. Through extensive qualitative and quantitative experiments, we show that our method outperforms prior physics and learning based approaches on both synthetic and real measurements. We anticipate that our method along with the fast capturing system will accelerate future development of NLOS imaging for real world applications that require high-speed imaging. △ Less

Submitted 5 August, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: ICCP 2022 (TPAMI Special Issue on Computational Photography). Project page: https://pages.cs.wisc.edu/~fmu/nlos3d/

arXiv:2203.09324 [pdf, other]

Localizing Visual Sounds the Easy Way

Authors: Shentong Mo, Pedro Morgado

Abstract: Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging witho… ▽ More Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging without manual annotations. In this work, we propose a simple yet effective approach for Easy Visual Sound Localization, namely EZ-VSL, without relying on the construction of positive and/or negative regions during training. Instead, we align audio and visual spaces by seeking audio-visual representations that are aligned in, at least, one location of the associated image, while not matching other images, at any location. We also introduce a novel object guided localization scheme at inference time for improved precision. Our simple and effective framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source. In particular, we improve the CIoU of the Flickr SoundNet test set from 76.80% to 83.94%, and on the VGG-Sound Source dataset from 34.60% to 38.85%. The code is available at https://github.com/stoneMo/EZ-VSL. △ Less

Submitted 29 March, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

arXiv:2111.04146 [pdf, other]

Optimization of the Model Predictive Control Meta-Parameters Through Reinforcement Learning

Authors: Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen

Abstract: Model predictive control (MPC) is increasingly being considered for control of fast systems and embedded applications. However, the MPC has some significant challenges for such systems. Its high computational complexity results in high power consumption from the control algorithm, which could account for a significant share of the energy resources in battery-powered embedded systems. The MPC param… ▽ More Model predictive control (MPC) is increasingly being considered for control of fast systems and embedded applications. However, the MPC has some significant challenges for such systems. Its high computational complexity results in high power consumption from the control algorithm, which could account for a significant share of the energy resources in battery-powered embedded systems. The MPC parameters must be tuned, which is largely a trial-and-error process that affects the control performance, the robustness and the computational complexity of the controller to a high degree. In this paper, we propose a novel framework in which any parameter of the control algorithm can be jointly tuned using reinforcement learning(RL), with the goal of simultaneously optimizing the control performance and the power usage of the control algorithm. We propose the novel idea of optimizing the meta-parameters of MPCwith RL, i.e. parameters affecting the structure of the MPCproblem as opposed to the solution to a given problem. Our control algorithm is based on an event-triggered MPC where we learn when the MPC should be re-computed, and a dual mode MPC and linear state feedback control law applied in between MPC computations. We formulate a novel mixture-distribution policy and show that with joint optimization we achieve improvements that do not present themselves when optimizing the same parameters in isolation. We demonstrate our framework on the inverted pendulum control task, reducing the total computation time of the control system by 36% while also improving the control performance by 18.4% over the best-performing MPC baseline. △ Less

Submitted 7 November, 2021; originally announced November 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2107.08651 [pdf, other]

doi 10.1109/TAC.2022.3174032

Delay-Compensated Distributed PDE Control of Traffic with Connected/Automated Vehicles

Authors: Jie Qi, Shurong Mo, Miroslav Krstic

Abstract: We develop an input delay-compensating design for stabilization of an Aw-Rascle-Zhang (ARZ) traffic model in congested regime which is governed by a $2\times 2$ first-order hyperbolic nonlinear PDE. The traffic flow consists of both adaptive cruise control-equipped (ACC-equipped) and manually-driven vehicles. The control input is the time gap of ACC-equipped and connected vehicles, which is subjec… ▽ More We develop an input delay-compensating design for stabilization of an Aw-Rascle-Zhang (ARZ) traffic model in congested regime which is governed by a $2\times 2$ first-order hyperbolic nonlinear PDE. The traffic flow consists of both adaptive cruise control-equipped (ACC-equipped) and manually-driven vehicles. The control input is the time gap of ACC-equipped and connected vehicles, which is subject to delays resulting from communication lag. For the linearized system, a novel three-branch bakcstep** transformation with explicit kernel functions is introduced to compensate the input delay. The transformation is proved æto be bounded, continuous and invertible, with explicit inverse transformation derived. Based on the transformation, we obtain the explicit predictor-feedback controller. We prove exponential stability of the closed-loop system with the delay compensator in $L_2$ norm. The performance improvement of the closed-loop system under the proposed controller is illustrated in simulation. △ Less

Submitted 2 September, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

arXiv:2104.07667 [pdf]

Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer

Authors: Meng Zhou, Shanglin Mo

Abstract: Shoulder replacement surgery, also called total shoulder replacement, is a common and complex surgery in Orthopedics discipline. It involves replacing a dead shoulder joint with an artificial implant. In the market, there are many artificial implant manufacturers and each of them may produce different implants with different structures compares to other providers. The problem arises in the followi… ▽ More Shoulder replacement surgery, also called total shoulder replacement, is a common and complex surgery in Orthopedics discipline. It involves replacing a dead shoulder joint with an artificial implant. In the market, there are many artificial implant manufacturers and each of them may produce different implants with different structures compares to other providers. The problem arises in the following situation: a patient has some problems with the shoulder implant accessory and the manufacturer of that implant maybe unknown to either the patient or the doctor, therefore, correctly identification of the manufacturer is the key prior to the treatment. In this paper, we will demonstrate different methods for classifying the manufacturer of a shoulder implant. We will use Vision Transformer approach to this task for the first time ever △ Less

Submitted 21 April, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

Comments: 11 pages, 12 figures

arXiv:2102.11122 [pdf, other]

Reinforcement Learning of the Prediction Horizon in Model Predictive Control

Authors: Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen

Abstract: Model predictive control (MPC) is a powerful trajectory optimization control technique capable of controlling complex nonlinear systems while respecting system constraints and ensuring safe operation. The MPC's capabilities come at the cost of a high online computational complexity, the requirement of an accurate model of the system dynamics, and the necessity of tuning its parameters to the speci… ▽ More Model predictive control (MPC) is a powerful trajectory optimization control technique capable of controlling complex nonlinear systems while respecting system constraints and ensuring safe operation. The MPC's capabilities come at the cost of a high online computational complexity, the requirement of an accurate model of the system dynamics, and the necessity of tuning its parameters to the specific control application. The main tunable parameter affecting the computational complexity is the prediction horizon length, controlling how far into the future the MPC predicts the system response and thus evaluates the optimality of its computed trajectory. A longer horizon generally increases the control performance, but requires an increasingly powerful computing platform, excluding certain control applications.The performance sensitivity to the prediction horizon length varies over the state space, and this motivated the adaptive horizon model predictive control (AHMPC), which adapts the prediction horizon according to some criteria. In this paper we propose to learn the optimal prediction horizon as a function of the state using reinforcement learning (RL). We show how the RL learning problem can be formulated and test our method on two control tasks, showing clear improvements over the fixed horizon MPC scheme, while requiring only minutes of learning. △ Less

Submitted 22 February, 2021; originally announced February 2021.

Comments: This work has been submitted to IFAC NMPC 2021 for possible publication

arXiv:2012.08095 [pdf, other]

Automatic Speech Verification Spoofing Detection

Authors: Shentong Mo, Haofan Wang, Pinxu Ren, Ta-Chung Chi

Abstract: Automatic speech verification (ASV) is the technology to determine the identity of a person based on their voice. While being convenient for identity verification, we should aim for the highest system security standard given that it is the safeguard of valuable digital assets. Bearing this in mind, we follow the setup in ASVSpoof 2019 competition to develop potential countermeasures that are robus… ▽ More Automatic speech verification (ASV) is the technology to determine the identity of a person based on their voice. While being convenient for identity verification, we should aim for the highest system security standard given that it is the safeguard of valuable digital assets. Bearing this in mind, we follow the setup in ASVSpoof 2019 competition to develop potential countermeasures that are robust and efficient. Two metrics, EER and t-DCF, will be used for system evaluation. △ Less

Submitted 15 December, 2020; originally announced December 2020.

arXiv:2011.13365 [pdf, other]

Optimization of the Model Predictive Control Update Interval Using Reinforcement Learning

Authors: Eivind Bøhn, Sebastien Gros, Signe Moe, Tor Arne Johansen

Abstract: In control applications there is often a compromise that needs to be made with regards to the complexity and performance of the controller and the computational resources that are available. For instance, the typical hardware platform in embedded control applications is a microcontroller with limited memory and processing power, and for battery powered applications the control system can account f… ▽ More In control applications there is often a compromise that needs to be made with regards to the complexity and performance of the controller and the computational resources that are available. For instance, the typical hardware platform in embedded control applications is a microcontroller with limited memory and processing power, and for battery powered applications the control system can account for a significant portion of the energy consumption. We propose a controller architecture in which the computational cost is explicitly optimized along with the control objective. This is achieved by a three-part architecture where a high-level, computationally expensive controller generates plans, which a computationally simpler controller executes by compensating for prediction errors, while a recomputation policy decides when the plan should be recomputed. In this paper, we employ model predictive control (MPC) as the high-level plan-generating controller, a linear state feedback controller as the simpler compensating controller, and reinforcement learning (RL) to learn the recomputation policy. Simulation results for two examples showcase the architecture's ability to improve upon the MPC approach and find reasonable compromises weighing the performance on the control objective and the computational resources expended. △ Less

Submitted 26 November, 2020; originally announced November 2020.

Comments: Submitted to 3rd Annual Learning for Dynamics and Control Conference (L4DC 2021)

arXiv:2004.07812 [pdf, other]

doi 10.1007/978-3-030-59416-9_12

Bus Frequency Optimization: When Waiting Time Matters in User Satisfaction

Authors: Songsong Mo, Zhifeng Bao, Baihua Zheng, Zhiyong Peng

Abstract: Reorganizing bus frequency to cater for the actual travel demand can save the cost of the public transport system significantly. Many, if not all, existing studies formulate this as a bus frequency optimization problem which tries to minimize passengers' average waiting time. However, many investigations have confirmed that the user satisfaction drops faster as the waiting time increases. Conseque… ▽ More Reorganizing bus frequency to cater for the actual travel demand can save the cost of the public transport system significantly. Many, if not all, existing studies formulate this as a bus frequency optimization problem which tries to minimize passengers' average waiting time. However, many investigations have confirmed that the user satisfaction drops faster as the waiting time increases. Consequently, this paper studies the bus frequency optimization problem considering the user satisfaction. Specifically, for the first time to our best knowledge, we study how to schedule the buses such that the total number of passengers who could receive their bus services within the waiting time threshold is maximized. We prove that this problem is NP-hard, and present an index-based algorithm with $(1-1/e)$ approximation ratio. By exploiting the locality property of routes in a bus network, we propose a partition-based greedy method which achieves a $(1-ρ)(1-1/e)$ approximation ratio. Then we propose a progressive partition-based greedy method to further improve the efficiency while achieving a $(1-ρ)(1-1/e-\varepsilon)$ approximation ratio. Experiments on a real city-wide bus dataset in Singapore verify the efficiency, effectiveness, and scalability of our methods. △ Less

Submitted 23 March, 2020; originally announced April 2020.

Journal ref: International Conference on Database Systems for Advanced Applications 2020

arXiv:2003.04949 [pdf, other]

LC-GAN: Image-to-image Translation Based on Generative Adversarial Network for Endoscopic Images

Authors: Shan Lin, Fangbo Qin, Yangming Li, Randall A. Bly, Kris S. Moe, Blake Hannaford

Abstract: Intelligent vision is appealing in computer-assisted and robotic surgeries. Vision-based analysis with deep learning usually requires large labeled datasets, but manual data labeling is expensive and time-consuming in medical problems. We investigate a novel cross-domain strategy to reduce the need for manual data labeling by proposing an image-to-image translation model live-cadaver GAN (LC-GAN)… ▽ More Intelligent vision is appealing in computer-assisted and robotic surgeries. Vision-based analysis with deep learning usually requires large labeled datasets, but manual data labeling is expensive and time-consuming in medical problems. We investigate a novel cross-domain strategy to reduce the need for manual data labeling by proposing an image-to-image translation model live-cadaver GAN (LC-GAN) based on generative adversarial networks (GANs). We consider a situation when a labeled cadaveric surgery dataset is available while the task is instrument segmentation on an unlabeled live surgery dataset. We train LC-GAN to learn the map**s between the cadaveric and live images. For live image segmentation, we first translate the live images to fake-cadaveric images with LC-GAN and then perform segmentation on the fake-cadaveric images with models trained on the real cadaveric dataset. The proposed method fully makes use of the labeled cadaveric dataset for live image segmentation without the need to label the live dataset. LC-GAN has two generators with different architectures that leverage the deep feature representation learned from the cadaveric image based segmentation task. Moreover, we propose the structural similarity loss and segmentation consistency loss to improve the semantic consistency during translation. Our model achieves better image-to-image translation and leads to improved segmentation performance in the proposed cross-domain segmentation task. △ Less

Submitted 13 August, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

Comments: Accepted by 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:1911.05478 [pdf, other]

doi 10.1109/ICUAS.2019.8798254

Deep Reinforcement Learning Attitude Control of Fixed-Wing UAVs Using Proximal Policy Optimization

Authors: Eivind Bøhn, Erlend M. Coates, Signe Moe, Tor Arne Johansen

Abstract: Contemporary autopilot systems for unmanned aerial vehicles (UAVs) are far more limited in their flight envelope as compared to experienced human pilots, thereby restricting the conditions UAVs can operate in and the types of missions they can accomplish autonomously. This paper proposes a deep reinforcement learning (DRL) controller to handle the nonlinear attitude control problem, enabling exten… ▽ More Contemporary autopilot systems for unmanned aerial vehicles (UAVs) are far more limited in their flight envelope as compared to experienced human pilots, thereby restricting the conditions UAVs can operate in and the types of missions they can accomplish autonomously. This paper proposes a deep reinforcement learning (DRL) controller to handle the nonlinear attitude control problem, enabling extended flight envelopes for fixed-wing UAVs. A proof-of-concept controller using the proximal policy optimization (PPO) algorithm is developed, and is shown to be capable of stabilizing a fixed-wing UAV from a large set of initial conditions to reference roll, pitch and airspeed values. The training process is outlined and key factors for its progression rate are considered, with the most important factor found to be limiting the number of variables in the observation vector, and including values for several previous time steps for these variables. The trained reinforcement learning (RL) controller is compared to a proportional-integral-derivative (PID) controller, and is found to converge in more cases than the PID controller, with comparable performance. Furthermore, the RL controller is shown to generalize well to unseen disturbances in the form of wind and turbulence, even in severe disturbance conditions. △ Less

Submitted 13 November, 2019; originally announced November 2019.

Comments: 11 pages, 3 figures, 2019 International Conference on Unmanned Aircraft Systems (ICUAS)

Journal ref: In 2019 International Conference on Unmanned Aircraft Systems (ICUAS) (pp. 523-533). IEEE

arXiv:1612.09105 [pdf, other]

Set-based Control for Autonomous Spray Painting

Authors: Signe Moe, Jan Tommy Gravdahl, Kristin Y. Pettersen

Abstract: In this paper, a method is presented for lowering the energy consumption and/or increasing the speed of a standard manipulator spray painting a surface. The approach is based on the observation that a small angle between the spray direction and the surface normal does not affect the quality of the paint job. Recent results in set-based kinematic control are utilized to develop a switched control s… ▽ More In this paper, a method is presented for lowering the energy consumption and/or increasing the speed of a standard manipulator spray painting a surface. The approach is based on the observation that a small angle between the spray direction and the surface normal does not affect the quality of the paint job. Recent results in set-based kinematic control are utilized to develop a switched control system, where this angle is defined as a set-based task with a maximum allowed limit. Four different set-based methods are implemented and tested on a UR5 manipulator from Universal Robots. Experimental results verify the correctness of the method, and demonstrate that the set-based approaches can substantially lower the paint time and energy consumption compared to the current standard solution. △ Less

Submitted 29 December, 2016; originally announced December 2016.

Showing 1–19 of 19 results for author: Mo, S