Search | arXiv e-print repository

Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching

Authors: Hongkai Chen, Zixin Luo, Yurun Tian, Xuyang Bai, Ziyu Wang, Lei Zhou, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan

Abstract: Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformer. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cr… ▽ More Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformer. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cross-view deformations. Secondly, we present selective fusion to merge local and global messages from cross attention. Apart from network structure, we also identify the importance of enforcing spatial smoothness in loss design, which has been omitted by previous works. Based on these augmentations, our network demonstrate strong matching capacity under different settings. The full version of our network achieves state-of-the-art performance among semi-dense matching methods at a similar cost to LoFTR, while the slim version reaches LoFTR baseline's performance with only 15% computation cost and 18% parameters. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: Accepted to CVPR2024 Image Matching Workshop

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2311.15980 [pdf, other]

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Authors: Yuanxun Lu, **gyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao

Abstract: Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion fin… ▽ More Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data, while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that, our direct 2.5D generation with the specially-designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25. △ Less

Submitted 21 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: CVPR 2024 camera ready, including more evaluations and discussions. Project webpage: https://nju-3dv.github.io/projects/direct25

arXiv:2310.06347 [pdf, other]

JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

Authors: **gyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Yao Yao

Abstract: We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fin… ▽ More We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet by using RGBD diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation. △ Less

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2303.17147 [pdf, other]

NeILF++: Inter-Reflectable Light Fields for Geometry and Material Estimation

Authors: **gyang Zhang, Yao Yao, Shiwei Li, **gbo Liu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan

Abstract: We present a novel differentiable rendering framework for joint geometry, material, and lighting estimation from multi-view images. In contrast to previous methods which assume a simplified environment map or co-located flashlights, in this work, we formulate the lighting of a static scene as one neural incident light field (NeILF) and one outgoing neural radiance field (NeRF). The key insight of… ▽ More We present a novel differentiable rendering framework for joint geometry, material, and lighting estimation from multi-view images. In contrast to previous methods which assume a simplified environment map or co-located flashlights, in this work, we formulate the lighting of a static scene as one neural incident light field (NeILF) and one outgoing neural radiance field (NeRF). The key insight of the proposed method is the union of the incident and outgoing light fields through physically-based rendering and inter-reflections between surfaces, making it possible to disentangle the scene geometry, material, and lighting from image observations in a physically-based manner. The proposed incident light and inter-reflection framework can be easily applied to other NeRF systems. We show that our method can not only decompose the outgoing radiance into incident lights and surface materials, but also serve as a surface refinement module that further improves the reconstruction detail of the neural surface. We demonstrate on several datasets that the proposed method is able to achieve state-of-the-art results in terms of geometry reconstruction quality, material estimation accuracy, and the fidelity of novel view rendering. △ Less

Submitted 30 March, 2023; originally announced March 2023.

Comments: Project page: \url{https://yoyo000.github.io/NeILF_pp}

arXiv:2208.14201 [pdf, other]

ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer

Authors: Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, Long Quan

Abstract: Generating robust and reliable correspondences across images is a fundamental task for a diversity of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher that is built on hierarchical attention structure, adopting a novel attention operation which is capable of adjusting attention span in a self-adaptive manner. T… ▽ More Generating robust and reliable correspondences across images is a fundamental task for a diversity of applications. To capture context at both global and local granularity, we propose ASpanFormer, a Transformer-based detector-free matcher that is built on hierarchical attention structure, adopting a novel attention operation which is capable of adjusting attention span in a self-adaptive manner. To achieve this goal, first, flow maps are regressed in each cross attention phase to locate the center of search region. Next, a sampling grid is generated around the center, whose size, instead of being empirically configured as fixed, is adaptively computed from a pixel uncertainty estimated along with the flow map. Finally, attention is computed across two images within derived regions, referred to as attention span. By these means, we are able to not only maintain long-range dependencies, but also enable fine-grained attention among pixels of high relevance that compensates essential locality and piece-wise smoothness in matching tasks. State-of-the-art accuracy on a wide range of evaluation benchmarks validates the strong matching capability of our method. △ Less

Submitted 30 August, 2022; originally announced August 2022.

Comments: Accepted to ECCV2022, project page at https://aspanformer.github.io/

arXiv:2206.03087 [pdf, other]

Critical Regularizations for Neural Surface Reconstruction in the Wild

Authors: **gyang Zhang, Yao Yao, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan

Abstract: Neural implicit functions have recently shown promising results on surface reconstructions from multiple views. However, current methods still suffer from excessive time complexity and poor robustness when reconstructing unbounded or complex scenes. In this paper, we present RegSDF, which shows that proper point cloud supervisions and geometry regularizations are sufficient to produce high-quality… ▽ More Neural implicit functions have recently shown promising results on surface reconstructions from multiple views. However, current methods still suffer from excessive time complexity and poor robustness when reconstructing unbounded or complex scenes. In this paper, we present RegSDF, which shows that proper point cloud supervisions and geometry regularizations are sufficient to produce high-quality and robust reconstruction results. Specifically, RegSDF takes an additional oriented point cloud as input, and optimizes a signed distance field and a surface light field within a differentiable rendering framework. We also introduce the two critical regularizations for this optimization. The first one is the Hessian regularization that smoothly diffuses the signed distance values to the entire distance field given noisy and incomplete input. And the second one is the minimal surface regularization that compactly interpolates and extrapolates the missing geometry. Extensive experiments are conducted on DTU, BlendedMVS, and Tanks and Temples datasets. Compared with recent neural surface reconstruction approaches, RegSDF is able to reconstruct surfaces with fine details even for open scenes with complex topologies and unstructured camera trajectories. △ Less

Submitted 7 June, 2022; originally announced June 2022.

Comments: CVPR 2022

arXiv:2203.07182 [pdf, other]

NeILF: Neural Incident Light Field for Physically-based Material Estimation

Authors: Yao Yao, **gyang Zhang, **gbo Liu, Yihang Qu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan

Abstract: We present a differentiable rendering framework for material and lighting estimation from multi-view images and a reconstructed geometry. In the framework, we represent scene lightings as the Neural Incident Light Field (NeILF) and material properties as the surface BRDF modelled by multi-layer perceptrons. Compared with recent approaches that approximate scene lightings as the 2D environment map,… ▽ More We present a differentiable rendering framework for material and lighting estimation from multi-view images and a reconstructed geometry. In the framework, we represent scene lightings as the Neural Incident Light Field (NeILF) and material properties as the surface BRDF modelled by multi-layer perceptrons. Compared with recent approaches that approximate scene lightings as the 2D environment map, NeILF is a fully 5D light field that is capable of modelling illuminations of any static scenes. In addition, occlusions and indirect lights can be handled naturally by the NeILF representation without requiring multiple bounces of ray tracing, making it possible to estimate material properties even for scenes with complex lightings and geometries. We also propose a smoothness regularization and a Lambertian assumption to reduce the material-lighting ambiguity during the optimization. Our method strictly follows the physically-based rendering equation, and jointly optimizes material and lighting through the differentiable rendering process. We have intensively evaluated the proposed method on our in-house synthetic dataset, the DTU MVS dataset, and real-world BlendedMVS scenes. Our method is able to outperform previous methods by a significant margin in terms of novel view rendering quality, setting a new state-of-the-art for image-based material and lighting estimation. △ Less

Submitted 18 March, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

arXiv:1810.06681 [pdf, other]

Learn Fast, Forget Slow: Safe Predictive Learning Control for Systems with Unknown and Changing Dynamics Performing Repetitive Tasks

Authors: Christopher D. McKinnon, Angela P. Schoellig

Abstract: We present a control method for improved repetitive path following for a ground vehicle that is geared towards long-term operation where the operating conditions can change over time and are initially unknown. We use weighted Bayesian Linear Regression (wBLR) to model the unknown dynamics, and show how this simple model is more accurate in both its estimate of the mean behaviour and model uncertai… ▽ More We present a control method for improved repetitive path following for a ground vehicle that is geared towards long-term operation where the operating conditions can change over time and are initially unknown. We use weighted Bayesian Linear Regression (wBLR) to model the unknown dynamics, and show how this simple model is more accurate in both its estimate of the mean behaviour and model uncertainty than Gaussian Process Regression and generalizes to novel operating conditions with little or no tuning. In addition, wBLR allows us to use fast adaptation and long-term learning in one, unified framework, to adapt quickly to new operating conditions and learn repetitive model errors over time. This comes with the added benefit of lower computational cost, longer look-ahead, and easier optimization when the model is used in a stochastic Model Predictive Controller (MPC). In order to fully capitalize on the long prediction horizons that are possible with this new approach, we use Tube MPC to reduce the growth of predicted uncertainty. We demonstrate the effectiveness of our approach in experiment on a 900\,kg ground robot showing results over 3.0\,km of driving with both physical and artificial changes to the robot's dynamics. All of our experiments are conducted using a stereo camera for localization. △ Less

Submitted 9 April, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

arXiv:1803.04065 [pdf, other]

Experience Recommendation for Long Term Safe Learning-based Model Predictive Control in Changing Operating Conditions

Authors: Christopher D. McKinnon, Angela P. Schoellig

Abstract: Learning has propelled the cutting edge of performance in robotic control to new heights, allowing robots to operate with high performance in conditions that were previously unimaginable. The majority of the work, however, assumes that the unknown parts are static or slowly changing. This limits them to static or slowly changing environments. However, in the real world, a robot may experience vari… ▽ More Learning has propelled the cutting edge of performance in robotic control to new heights, allowing robots to operate with high performance in conditions that were previously unimaginable. The majority of the work, however, assumes that the unknown parts are static or slowly changing. This limits them to static or slowly changing environments. However, in the real world, a robot may experience various unknown conditions. This paper presents a method to extend an existing single mode GP-based safe learning controller to learn an increasing number of non-linear models for the robot dynamics. We show that this approach enables a robot to re-use past experience from a large number of previously visited operating conditions, and to safely adapt when a new and distinct operating condition is encountered. This allows the robot to achieve safety and high performance in an large number of operating conditions that do not have to be specified ahead of time. Our approach runs independently from the controller, imposing no additional computation time on the control loop regardless of the number of previous operating conditions considered. We demonstrate the effectiveness of our approach in experiment on a 900\,kg ground robot with both physical and artificial changes to its dynamics. All of our experiments are conducted using vision for localization. △ Less

Submitted 11 March, 2018; originally announced March 2018.

arXiv:1603.02772 [pdf, other]

Unscented External Force and Torque Estimation for Quadrotors

Authors: Christopher D. McKinnon, Angela P. Schoellig

Abstract: In this paper, we describe an algorithm, based on the well-known Unscented Quaternion Estimator, to estimate external forces and torques acting on a quadrotor. This formulation uses a non-linear model for the quadrotor dynamics, naturally incorporates process and measurement noise, requires only a few parameters to be tuned manually, and uses singularity-free unit quaternions to represent attitude… ▽ More In this paper, we describe an algorithm, based on the well-known Unscented Quaternion Estimator, to estimate external forces and torques acting on a quadrotor. This formulation uses a non-linear model for the quadrotor dynamics, naturally incorporates process and measurement noise, requires only a few parameters to be tuned manually, and uses singularity-free unit quaternions to represent attitude. We demonstrate in simulation that the proposed algorithm can outperform existing methods. We then highlight how our approach can be used to generate force and torque profiles from experimental data, and how this information can later be used for controller design. Finally, we show how the resulting controllers enable a quadrotor to stay in the wind field of a moving fan. △ Less

Submitted 2 August, 2016; v1 submitted 8 March, 2016; originally announced March 2016.

Showing 1–11 of 11 results for author: Mckinnon, D