-
SimpleFusion: A Simple Fusion Framework for Infrared and Visible Images
Authors:
Ming Chen,
Yuxuan Cheng,
Xinwei He,
Xinyue Wang,
Yan Aze,
**hai Xiang
Abstract:
Integrating visible and infrared images into one high-quality image, also known as visible and infrared image fusion, is a challenging yet critical task for many downstream vision tasks. Most existing works utilize pretrained deep neural networks or design sophisticated frameworks with strong priors for this task, which may be unsuitable or lack flexibility. This paper presents SimpleFusion, a simple yet effective framework for visible and infrared image fusion. Our framework follows the decompose-and-fusion paradigm: the visible and infrared images are decomposed into reflectance and illumination components via Retinex theory, and the corresponding components are then fused. The whole framework is built from two plain convolutional neural networks without downsampling, which perform image decomposition and fusion efficiently. Moreover, we introduce a decomposition loss and a detail-to-semantic loss to preserve the complementary information between the two modalities for fusion. We conduct extensive experiments on challenging benchmarks, verifying the superiority of our method over previous state-of-the-art methods. Code is available at https://github.com/hxwxss/SimpleFusion-A-Simple-Fusion-Framework-for-Infrared-and-Visible-Images
Submitted 27 June, 2024;
originally announced June 2024.
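The Retinex model mentioned in the abstract writes an image as the element-wise product of a reflectance map and an illumination map. The sketch below illustrates only that model, using a classical hand-crafted max-channel illumination estimate. It is an assumption for illustration, not the learned CNN decomposition the paper describes, and the function name `retinex_decompose` is hypothetical.

```python
import numpy as np

def retinex_decompose(img, eps=1e-6):
    """Split an H x W x 3 image into reflectance and illumination.

    Illumination is approximated by the per-pixel maximum over colour
    channels (a hand-crafted estimate, not a learned one), so that
    img == reflectance * illumination holds by construction.
    """
    illum = np.maximum(img.max(axis=-1, keepdims=True), eps)  # H x W x 1
    refl = img / illum                                        # H x W x 3
    return refl, illum

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=(8, 8, 3))   # synthetic stand-in image
refl, illum = retinex_decompose(img)
recon = refl * illum                          # recompose: I = R * L
print("max reconstruction error:", np.abs(recon - img).max())
```

In a fusion pipeline of the kind the abstract outlines, the reflectance and illumination maps of the two modalities would be fused component by component; how that fusion is done is specific to the paper and is not reproduced here.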
-
QQQ: Quality Quattuor-Bit Quantization for Large Language Models
Authors:
Ying Zhang,
Peng Zhang,
Mincong Huang,
**gyang Xiang,
Yujie Wang,
Chao Wang,
Yineng Zhang,
Lei Yu,
Chuan Liu,
Wei Lin
Abstract:
Quantization is a proven, effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding stages of inference. W4A8 is a promising strategy to accelerate both of them, but it usually leads to a significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speed increases of 3.67$\times$ and 3.29$\times$ over FP16 GEMM. Our extensive experiments show that QQQ achieves performance on par with existing state-of-the-art LLM quantization methods while significantly accelerating inference, achieving speed boosts of up to 2.24$\times$, 2.10$\times$, and 1.25$\times$ compared to FP16, W8A8, and W4A16, respectively.
Submitted 28 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
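To make the W4A8 notation concrete, here is a minimal NumPy sketch of plain symmetric round-to-nearest quantization with 4-bit per-channel weights and 8-bit per-token activations. It is not the QQQ method: adaptive smoothing, Hessian-based compensation and the custom GEMM kernels from the abstract are all omitted, and the helper names are hypothetical.

```python
import numpy as np

def quantize_symmetric(x, bits, axis):
    """Symmetric round-to-nearest quantization along `axis` (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1                           # 7 for 4-bit, 127 for 8-bit
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)   # guard all-zero rows
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def w4a8_matmul(x, w):
    """Approximate x @ w.T using 8-bit activations and 4-bit weights."""
    xq, xs = quantize_symmetric(x, bits=8, axis=1)   # one scale per token (row of x)
    wq, ws = quantize_symmetric(w, bits=4, axis=1)   # one scale per output channel
    acc = xq @ wq.T                                  # integer accumulation
    return acc * xs * ws.T                           # rescale back to floating point

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))     # 4 tokens, 64 input features
w = rng.standard_normal((16, 64))    # 16 output channels
ref = x @ w.T
approx = w4a8_matmul(x, w)
rel_err = np.linalg.norm(approx - ref) / np.linalg.norm(ref)
print(f"relative error: {rel_err:.3f}")
```

The integer product `xq @ wq.T` is the part a dedicated W4A8 GEMM kernel would accelerate; this sketch only shows the numerics, not the speed.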
-
Pandora: Towards General World Model with Natural Language Actions and Video States
Authors:
Jiannan Xiang,
Guangyi Liu,
Yi Gu,
Qiyue Gao,
Yuting Ning,
Yuheng Zha,
Zeyu Feng,
Tianhua Tao,
Shibo Hao,
Yemin Shi,
Zhengzhong Liu,
Eric P. Xing,
Zhiting Hu
Abstract:
World models simulate future states of the world in response to different actions. They facilitate interactive content creation and provide a foundation for grounded, long-horizon reasoning. Current foundation models do not fully meet the capabilities of general world models: large language models (LLMs) are constrained by their reliance on the language modality and their limited understanding of the physical world, while video models lack interactive action control over the world simulations. This paper takes a step towards building a general world model by introducing Pandora, a hybrid autoregressive-diffusion model that simulates world states by generating videos and allows real-time control with free-text actions. Pandora achieves domain generality, video consistency, and controllability through large-scale pretraining and instruction tuning. Crucially, Pandora bypasses the cost of training from scratch by integrating a pretrained LLM (7B) and a pretrained video model, requiring only additional lightweight finetuning. We illustrate extensive outputs by Pandora across diverse domains (indoor/outdoor, natural/urban, human/robot, 2D/3D, etc.). The results indicate great potential for building stronger general world models with larger-scale training.
Submitted 12 June, 2024;
originally announced June 2024.
-
Compressed Meta-Optical Encoder for Image Classification
Authors:
Anna Wirth-Singh,
**lin Xiang,
Minho Choi,
Johannes E. Fröch,
Luocheng Huang,
Shane Colburn,
Eli Shlizerman,
Arka Majumdar
Abstract:
Optical and hybrid convolutional neural networks (CNNs) recently have become of increasing interest to achieve low-latency, low-power image classification and computer vision tasks. However, implementing optical nonlinearity is challenging, and omitting the nonlinear layers in a standard CNN comes at a significant reduction in accuracy. In this work, we use knowledge distillation to compress modified AlexNet to a single linear convolutional layer and an electronic backend (two fully connected layers). We obtain comparable performance to a purely electronic CNN with five convolutional layers and three fully connected layers. We implement the convolution optically via engineering the point spread function of an inverse-designed meta-optic. Using this hybrid approach, we estimate a reduction in multiply-accumulate operations from 17M in a conventional electronic modified AlexNet to only 86K in the hybrid compressed network enabled by the optical frontend. This constitutes over two orders of magnitude reduction in latency and power consumption. Furthermore, we experimentally demonstrate that the classification accuracy of the system exceeds 93% on the MNIST dataset.
Submitted 14 June, 2024; v1 submitted 22 April, 2024;
originally announced June 2024.
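The "over two orders of magnitude" figure in the abstract can be checked directly from the two multiply-accumulate (MAC) counts it quotes. The short calculation below uses only those two numbers. How a MAC reduction translates into latency and power depends on the hardware, which this arithmetic does not model.

```python
import math

macs_electronic = 17e6   # conventional electronic modified AlexNet (from the abstract)
macs_hybrid = 86e3       # hybrid compressed network (from the abstract)

ratio = macs_electronic / macs_hybrid
orders = math.log10(ratio)
print(f"MAC reduction: {ratio:.0f}x ({orders:.2f} orders of magnitude)")
```

The ratio comes out at roughly 198, which is just under 2.3 orders of magnitude and consistent with the abstract's wording.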
-
An Efficient Trajectory Generation for Bi-copter Flight in Tight Space
Authors:
Xin Dong,
Yangjie Cui,
**gwu Xiang,
Daochun Li,
Zhan Tu
Abstract:
Unlike square (or similarly shaped) quadrotors, elongated bi-copters have a natural advantage in crossing tight spaces. To date, extensive works have focused on the design, modeling, and control of bi-copters. In addition, a proper motion planner utilizing bi-copters' shape characteristics is essential to efficiently and safely traverse tight spaces, yet it has rarely been studied. Current motion planning methods will significantly compromise their ability to traverse narrow spaces if the map is inflated based on the long dimension of the bi-copter. In this paper, we propose an efficient motion planning method that enables the safe navigation of bi-copters through narrow spaces. We first adapt a dynamic, feasible path-finding algorithm with whole-body collision checks to generate a collision-free path. Subsequently, we jointly optimize the position and rotation of the bi-copter to produce a trajectory that is safe, dynamically feasible, and smooth. Extensive simulations and real-world experiments have been conducted to verify the reliability and robustness of the proposed method.
Submitted 2 June, 2024;
originally announced June 2024.
-
Dual Inflation and Bounce Cosmologies Interpretation of Pulsar Timing Array Data
Authors:
Changhong Li,
Junrong Lai,
**jie Xiang,
Chaofan Wu
Abstract:
We explore a dual scenario of generalized inflation and bounce cosmologies, producing a scale-invariant curvature perturbation spectrum. Bayesian analysis with pulsar timing array data identifies, for the first time, viable regions from inflation and bounce that simultaneously explain stochastic gravitational wave background (SGWB) signals and CMB anisotropies. Bayes factor calculations strongly favor this dual scenario over conventional sources and provide initial evidence of a duality between inflation and bounce regarding SGWB, offering new insights for early universe model-building and future observations.
Submitted 6 June, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Evaluating large language models in medical applications: a survey
Authors:
Xiaolan Chen,
Jiayang Xiang,
Shanfu Lu,
Yexin Liu,
Mingguang He,
Danli Shi
Abstract:
Large language models (LLMs) have emerged as powerful tools with transformative potential across numerous domains, including healthcare and medicine. In the medical domain, LLMs hold promise for tasks ranging from clinical decision support to patient education. However, evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information. This paper provides a comprehensive overview of the landscape of medical LLM evaluation, synthesizing insights from existing studies and highlighting evaluation data sources, task scenarios, and evaluation methods. Additionally, it identifies key challenges and opportunities in medical LLM evaluation, emphasizing the need for continued research and innovation to ensure the responsible integration of LLMs into clinical practice.
Submitted 13 May, 2024;
originally announced May 2024.
-
Symmetry Strategy for Rapid Discovery of Abundant Fractional Quantum Ferroelectrics
Authors:
Guoliang Yu,
Junyi Ji,
Changsong Xu,
H. J. Xiang
Abstract:
Traditional ferroelectrics are limited by Neumann's principle, which confines exploration of ferroelectrics within polar point groups. Our recent work [Nat. Commun. 15, 135, (2024)] proposes the concept of fractional quantum ferroelectricity (FQFE) that extends the playground of ferroelectricity to non-polar point groups. Here, we apply group theory and introduce an efficient symmetry strategy to identify FQFE candidates. Integrated with a high-throughput screening scheme, we go through 171,527 materials and identify 202 potential FQFE candidates, which are already experimentally synthesized. In addition, we point out that the essence of FQFE is fractional atomic displacements with respect to lattice vectors, which can actually result in both fractional (type-I) and integer (type-II) quantized polarization. Through first-principles calculations, we verify the symmetry-predicted switchable FQFE properties in bulk AlAgS$_2$ and monolayer HgI$_2$. Notably, AlAgS$_2$ exhibits an ultra-low switching barrier of 23 meV/f.u. and interlocked in-plane/out-of-plane polarization, while HgI$_2$ demonstrates large spontaneous polarization of 42 μC/cm$^2$. Our findings not only advance the understanding of FQFE, but also offer guidance for experimental exploration and design of novel ferroelectric materials.
Submitted 29 April, 2024;
originally announced April 2024.
-
Spin Supersolid Phase and Double Magnon-Roton Excitations in a Cobalt-based Triangular Lattice
Authors:
Yuan Gao,
Chuandi Zhang,
Junsen Xiang,
Dehong Yu,
Xingye Lu,
Peijie Sun,
Wentao **,
Gang Su,
Wei Li
Abstract:
A supersolid is an exotic quantum state of matter that spontaneously hosts the features of both a solid and a superfluid, breaking both the lattice translational symmetry and the U(1) gauge symmetry. Here we conduct inelastic neutron scattering (INS) measurements and tensor-network calculations on the triangular-lattice cobaltate Na$_2$BaCo(PO$_4$)$_2$, which is proposed in [Xiang ${\it et al.}$, Nature 6… […] the helium thermodynamics, the intriguing magnetic excitations also strongly influence the low-temperature thermodynamics of the spin supersolid down to the sub-Kelvin regime, explaining the recently observed giant magnetocaloric effect in Na$_2$BaCo(PO$_4$)$_2$.
Submitted 24 April, 2024;
originally announced April 2024.
-
How Far Can We Go with Practical Function-Level Program Repair?
Authors:
Jiahong Xiang,
Xiaoyang Xu,
Fanchu Kong,
Mingyuan Wu,
Haotian Zhang,
Yuqun Zhang
Abstract:
Recently, multiple Automated Program Repair (APR) techniques based on Large Language Models (LLMs) have been proposed to enhance the repair performance. While these techniques mainly focus on single-line or hunk-level repair, they face significant challenges in real-world application due to the limited repair task scope and costly statement-level fault localization. However, the more practical function-level APR, which broadens the scope of the APR task to fix entire buggy functions and requires only cost-efficient function-level fault localization, remains underexplored. In this paper, we conduct the first comprehensive study of LLM-based function-level APR, including investigating the effect of the few-shot learning mechanism and the auxiliary repair-relevant information. Specifically, we adopt six widely-studied LLMs and construct a benchmark in both the Defects4J 1.2 and 2.0 datasets. Our study demonstrates that LLMs with zero-shot learning are already powerful function-level APR techniques, while applying the few-shot learning mechanism leads to disparate repair performance. Moreover, we find that directly applying the auxiliary repair-relevant information to LLMs significantly increases function-level repair performance. Inspired by our findings, we propose an LLM-based function-level APR technique, namely SRepair, which adopts a dual-LLM framework to leverage the power of the auxiliary repair-relevant information for advancing the repair performance. The evaluation results demonstrate that SRepair can correctly fix 300 single-function bugs in the Defects4J dataset, largely surpassing all previous APR techniques by at least 85%, without the need for the costly statement-level fault location information. Furthermore, SRepair successfully fixes 32 multi-function bugs in the Defects4J dataset, which, to the best of our knowledge, no previous APR technique has achieved.
Submitted 19 April, 2024;
originally announced April 2024.
-
Streamlined Photoacoustic Image Processing with Foundation Models: A Training-Free Solution
Authors:
Handi Deng,
Yucheng Zhou,
Jiaxuan Xiang,
Liujie Gu,
Yan Luo,
Hai Feng,
Mingyuan Liu,
Cheng Ma
Abstract:
Foundation models have rapidly evolved and have achieved significant accomplishments in computer vision tasks. Specifically, the prompt mechanism conveniently allows users to integrate image prior information into the model, making it possible to apply models without any training. Therefore, we propose a method based on foundation models and zero training to solve the tasks of photoacoustic (PA) image segmentation. We employed the segment anything model (SAM) by setting simple prompts and integrating the model's outputs with prior knowledge of the imaged objects to accomplish various tasks, including: (1) removing the skin signal in three-dimensional PA image rendering; (2) dual speed-of-sound reconstruction, and (3) segmentation of finger blood vessels. Through these demonstrations, we have concluded that deep learning can be directly applied in PA imaging without the requirement for network design and training. This potentially allows for a hands-on, convenient approach to achieving efficient and accurate segmentation of PA images. This letter serves as a comprehensive tutorial, facilitating the mastery of the technique through the provision of code and sample datasets.
Submitted 11 April, 2024;
originally announced April 2024.
-
Uniqueness to inverse acoustic and elastic medium scattering problems with hyper-singular source method
Authors:
Chun Liu,
Guanghui Hu,
Jianli Xiang,
Jiayi Zhang
Abstract:
This paper is concerned with inverse scattering problems of determining the support of an isotropic and homogeneous penetrable body from knowledge of multi-static far-field patterns in acoustics and in linear elasticity. The normal derivative of the total fields admits no jump on the interface of the scatterer in the trace sense. If the contrast function of the refractive index function or the density function has a positive lower bound near the boundary, we propose a hyper-singular source method to prove uniqueness of inverse scattering with all incoming plane waves at a fixed energy. It is based on a subtle analysis of the leading part of the scattered field when hyper-singular sources caused by the first derivative of the fundamental solution approach a boundary point. As a by-product, we show that this hyper-singular method can also be used to determine the boundary value of a Hölder continuous refractive index function in acoustics or a Hölder continuous density function in linear elasticity.
Submitted 10 April, 2024;
originally announced April 2024.
-
DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space
Authors:
Jianxiang Xiang,
Zhenhua Liu,
Haodong Liu,
Yin Bai,
Jia Cheng,
Wenliang Chen
Abstract:
In real-life conversations, the content is diverse, and there exists the one-to-many problem that requires diverse generation. Previous studies attempted to introduce discrete or Gaussian-based continuous latent variables to address the one-to-many problem, but the diversity is limited. Recently, diffusion models have made breakthroughs in computer vision, and some attempts have been made in natural language processing. In this paper, we propose DiffusionDialog, a novel approach to enhance the diversity of dialogue generation with the help of a diffusion model. In our approach, we introduce continuous latent variables into the diffusion model. The problem of using latent variables in the dialog task is how to build both an effective prior of the latent space and an inferring process to obtain the proper latent given the context. By combining the encoder and latent-based diffusion model, we encode the response's latent representation in a continuous space as the prior, instead of a fixed Gaussian distribution or simply discrete ones. We then infer the latent by denoising step by step with the diffusion model. The experimental results show that our model greatly enhances the diversity of dialog responses while maintaining coherence. Furthermore, additional analysis shows that our diffusion model achieves high inference efficiency, which is the main challenge of applying diffusion models in natural language processing.
Submitted 10 April, 2024;
originally announced April 2024.
-
C-type antiferromagnetic structure of topological semimetal CaMnSb$_2$
Authors:
Bo Li,
Xu-Tao Zeng,
Qianhui Xu,
Fan Yang,
Junsen Xiang,
Hengyang Zhong,
Sihao Deng,
Lunhua He,
Ju** Xu,
Wen Yin,
Xingye Lu,
Huiying Liu,
Xian-Lei Sheng,
Wentao **
Abstract:
Determination of the magnetic structure and confirmation of the presence or absence of inversion ($\mathcal{P}$) and time reversal ($\mathcal{T}$) symmetry is imperative for correctly understanding the topological magnetic materials. Here high-quality single crystals of the layered manganese pnictide CaMnSb$_2$ are synthesized using the self-flux method. De Haas-van Alphen oscillations indicate a nontrivial Berry phase of $\sim$ $π$ and a notably small cyclotron effective mass, supporting the Dirac semimetal nature of CaMnSb$_2$. Neutron diffraction measurements identify a C-type antiferromagnetic (AFM) structure below $T\rm_{N}$ = 303(1) K with the Mn moments aligned along the $a$ axis, which is well supported by the density functional theory (DFT) calculations. The corresponding magnetic space group is $Pn'm'a'$, preserving a $\mathcal{P}\times\mathcal{T}$ symmetry. Adopting the experimentally determined magnetic structure, band crossings near the Y point in momentum space and linear dispersions of the Sb $5p_{y,z}$ bands are revealed by the DFT calculations. Furthermore, our study predicts the possible existence of an intrinsic second-order nonlinear Hall effect in CaMnSb$_2$, offering a promising platform to study the impact of topological properties on nonlinear electrical transports in antiferromagnets.
Submitted 1 April, 2024;
originally announced April 2024.
-
Structural, magnetic and magnetocaloric properties of triangular-lattice transition-metal phosphates
Authors:
Chuandi Zhang,
Junsen Xiang,
Quanliang Zhu,
Longfei Wu,
Shanfeng Zhang,
Ju** Xu,
Wen Yin,
Peijie Sun,
Wei Li,
Gang Su,
Wentao **
Abstract:
The recent discovery of the spin supersolid candidate Na$_2$BaCo(PO$_4$)$_2$ has stimulated considerable research interest in the triangular-lattice transition-metal phosphates. Here we report a comprehensive study of the structural, magnetic and magnetocaloric properties of polycrystalline Na$_2$$A$$T$(PO$_4$)$_2$ ($A$ = Ba, Sr; $T$ = Co, Ni, Mn). X-ray and neutron diffraction measurements confirm that Na$_2$Ba$T$(PO$_4$)$_2$ (NB$T$P) crystallizes in a trigonal structure, while Na$_2$Sr$T$(PO$_4$)$_2$ (NS$T$P) forms a monoclinic structure with a slight distortion of the triangular network of $T^{2+}$ ions. The dc magnetization data show that all six compounds order antiferromagnetically below 2 K, and the Néel temperatures of NS$T$P are consistently higher than those of NB$T$P for $T$ = Co, Ni, and Mn, due to the release of geometrical frustration by monoclinic distortions. Further magnetocaloric measurements show that trigonal NB$T$P can reach a lower temperature in the quasi-adiabatic demagnetization process and thus shows better performance in magnetic refrigeration, compared with monoclinic NS$T$P. Our findings highlight the outstanding magnetocaloric performance of the trigonal transition-metal phosphates, and disclose two necessary ingredients for a superior magnetic coolant that can reach an ultra-low temperature: a perfect geometrically frustrated lattice and a small effective spin number associated with the magnetic ions.
Submitted 1 April, 2024;
originally announced April 2024.
-
Controllable and Diverse Data Augmentation with Large Language Model for Low-Resource Open-Domain Dialogue Generation
Authors:
Zhenhua Liu,
Tong Zhu,
Jianxiang Xiang,
Wenliang Chen
Abstract:
Data augmentation (DA) is crucial to mitigate model training instability and over-fitting problems in low-resource open-domain dialogue generation. However, traditional DA methods often neglect semantic data diversity, restricting the overall quality. Recently, large language models (LLMs) have been used for DA to generate diversified dialogues. However, they have limited controllability and tend to generate dialogues with a distribution shift compared to the seed dialogues. To maximize the augmentation diversity and address the controllability problem, we propose \textbf{S}ummary-based \textbf{D}ialogue \textbf{A}ugmentation with LLM (SDA). Our approach enhances the controllability of the LLM by using dialogue summaries as a planning tool. Based on summaries, SDA can generate high-quality and diverse dialogue data even with a small seed dataset. To evaluate the efficacy of data augmentation methods for open-domain dialogue, we designed a clustering-based metric to characterize the semantic diversity of the augmented dialogue data. The experimental results show that SDA can augment high-quality and semantically diverse dialogues given a small seed dataset and an LLM, and the augmented data can boost the performance of open-domain dialogue models.
Submitted 30 March, 2024;
originally announced April 2024.
-
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Authors:
Ruicheng Wang,
Jianfeng Xiang,
Jiaolong Yang,
Xin Tong
Abstract:
We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.
Submitted 18 March, 2024;
originally announced March 2024.
-
AutoDFP: Automatic Data-Free Pruning via Channel Similarity Reconstruction
Authors:
Siqi Li,
Jun Chen,
**gyang Xiang,
Chengrui Zhu,
Yong Liu
Abstract:
Structured pruning methods are developed to bridge the gap between the massive scale of neural networks and the limited hardware resources. Most current structured pruning methods rely on training datasets to fine-tune the compressed model, resulting in high computational burdens and making them inapplicable to scenarios with stringent requirements on privacy and security. As an alternative, some data-free methods have been proposed; however, these methods often require handcrafted parameter tuning and can only achieve inflexible reconstruction. In this paper, we propose the Automatic Data-Free Pruning (AutoDFP) method that achieves automatic pruning and reconstruction without fine-tuning. Our approach is based on the assumption that the loss of information can be partially compensated by retaining focused information from similar channels. Specifically, we formulate data-free pruning as an optimization problem, which can be effectively addressed through reinforcement learning. AutoDFP assesses the similarity of channels for each layer and provides this information to the reinforcement learning agent, guiding the pruning and reconstruction process of the network. We evaluate AutoDFP with multiple networks on multiple datasets, achieving impressive compression results. For instance, on the CIFAR-10 dataset, AutoDFP demonstrates a 2.87\% reduction in accuracy loss compared to the recently proposed data-free pruning method DFPC with fewer FLOPs on VGG-16. Furthermore, on the ImageNet dataset, AutoDFP achieves 43.17\% higher accuracy than the SOTA method with the same 80\% preserved ratio on MobileNet-V1.
Submitted 12 March, 2024;
originally announced March 2024.
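A note on the AutoDFP entry above: the abstract says channel similarity is assessed for each layer but does not name the metric. As a rough illustration only, the sketch below assumes cosine similarity between flattened filter weights and picks, for a pruned channel, the most similar retained channel. The helper names `cosine_similarity` and `most_similar_channel` are hypothetical and not taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two flattened filters; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_sq = sum(x * x for x in a) * sum(y * y for y in b)
    if norm_sq == 0:
        return 0.0
    return dot / math.sqrt(norm_sq)

def most_similar_channel(filters, pruned):
    """Index of the remaining channel most similar to the pruned one."""
    candidates = [i for i in range(len(filters)) if i != pruned]
    return max(candidates,
               key=lambda i: cosine_similarity(filters[i], filters[pruned]))

# Toy layer with three channels, each filter flattened to a short vector.
filters = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
best = most_similar_channel(filters, pruned=0)
```

In this toy layer, channel 1 is a scaled copy of channel 0, so it is the natural candidate to compensate for pruning channel 0. How the actual method uses similarity inside its reinforcement learning loop is not described in the abstract.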
-
Measuring Robustness in Cyber-Physical Systems under Sensor Attacks
Authors:
Jian Xiang,
Ruggero Lanotte,
Simone Tini,
Stephen Chong,
Massimo Merro
Abstract:
This paper contributes a formal framework for quantitative analysis of bounded sensor attacks on cyber-physical systems, using the formalism of differential dynamic logic. Given a precondition and postcondition of a system, we formalize two quantitative safety notions, quantitative forward and backward safety, which respectively express (1) how strong the strongest postcondition of the system is with respect to the specified postcondition, and (2) how strong the specified precondition is with respect to the weakest precondition of the system needed to ensure the specified postcondition holds. We introduce two notions, forward and backward robustness, to characterize the robustness of a system against sensor attacks as the loss of safety. To reason about robustness, we introduce two simulation distances, forward and backward simulation distances, which are defined based on the behavioral distances between the original system and the system with compromised sensors. Forward and backward distances, respectively, characterize upper bounds of the degree of forward and backward safety loss caused by the sensor attacks. We verify the two simulation distances by expressing them as modalities, i.e., formulas of differential dynamic logic, and develop an ad-hoc proof system to reason with such formulas. We showcase our formal notions and reasoning techniques on two non-trivial case studies: an autonomous vehicle that needs to avoid collision and a water tank system.
Submitted 9 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Generalized Coronal Loop Scaling Laws and Their Implication for Turbulence in Solar Active Region Loops
Authors:
Y. Dai,
J. J. Xiang,
M. D. Ding
Abstract:
Recent coronal loop modeling has emphasized the importance of combining both Coulomb collisions and turbulent scattering to characterize field-aligned thermal conduction, which invokes a hybrid loop model. In this work we generalize the hybrid model by incorporating nonuniform heating and cross section that are both formulated by a power-law function of temperature. Based on the hybrid model solutions, we construct scaling laws that relate loop-top temperature ($T_a$) and heating rate ($H_a$) to other loop parameters. It is found that the loop-top properties for turbulent loops are additionally power-law functions of turbulent mean free path ($λ_T$), with functional forms that vary with the specification of the heating and/or areal parameters. More importantly, both a sufficiently footpoint-concentrated heating and a cross-sectional expansion with height can effectively weaken (strengthen) the negative (positive) power-law dependence of $T_a$ ($H_a$) on $λ_T$. The reason lies in a notable reduction of heat flux by footpoint heating and/or cross-sectional expansion in the turbulence-dominated coronal part, where turbulent scattering introduces a much weaker dependence of the conduction coefficient on temperature. In this region, therefore, the reduction of the heat flux predominantly relies on a backward flattening of the temperature gradient. Through numerical modeling that incorporates more realistic conditions, this scenario is further consolidated. Our results have important implications for solar active region (AR) loops. With the factors of nonuniform heating and cross section taken into account, AR loops can bear relatively stronger turbulence while still keeping a physically reasonable temperature for nonflaring loops.
Submitted 4 March, 2024;
originally announced March 2024.
-
Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
Authors:
Zhenhong Zhou,
Jiuyang Xiang,
Haopeng Chen,
Quan Liu,
Zherui Li,
Sen Su
Abstract:
Large Language Models (LLMs) have been demonstrated to generate illegal or unethical responses, particularly when subjected to "jailbreak." Research on jailbreak has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans derive information from LLMs. In this paper, we argue that humans could exploit multi-turn dialogue to induce LLMs into generating harmful information. LLMs may not intend to reject cautionary or borderline unsafe queries, even if each turn is closely served for one malicious purpose in a multi-turn dialogue. Therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induced LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, indicate current inadequacies in the safety mechanisms of LLMs in multi-turn dialogue. Our findings expose vulnerabilities of LLMs in complex scenarios involving multi-turn dialogue, presenting new challenges for the safety of LLMs.
Submitted 27 February, 2024;
originally announced February 2024.
-
LuaTaint: A Static Taint Analysis System for Web Interface Framework Vulnerability of IoT Devices
Authors:
Jiahui Xiang,
Wenhai Wang,
Tong Ye,
Peiyu Liu
Abstract:
IoT devices are currently facing continuous malicious attacks due to their widespread use. Among these IoT devices, web vulnerabilities are also widely exploited because of their inherent characteristics, such as improper permission controls and insecure interfaces. Recently, the embedded system web interface framework has become highly diverse, and specific vulnerabilities can arise if developers forget to detect user input parameters or if the detection process is not strict enough. Therefore, discovering vulnerabilities in the web interfaces of IoT devices accurately and comprehensively through an automated method is a major challenge. This paper aims to address this challenge. We have developed an automated vulnerability detection system called LuaTaint for the typical web interface framework, LuCI. The system employs static taint analysis to address web security issues on mobile terminal platforms to ensure detection coverage. It integrates rules pertaining to page handler control logic within the taint detection process to improve its extensibility. We also implemented a post-processing step with the assistance of large language models to enhance accuracy and reduce the need for manual analysis. We have created a prototype of LuaTaint and tested it on 92 IoT firmware images from 8 well-known vendors. LuaTaint has discovered 68 unknown vulnerabilities.
Submitted 25 February, 2024;
originally announced February 2024.
-
Multi-Intent Attribute-Aware Text Matching in Searching
Authors:
Mingzhe Li,
Xiuying Chen,
Jing Xiang,
Qishen Zhang,
Changsheng Ma,
Chenchen Dai,
Jinxiong Chang,
Zhongyi Liu,
Guannan Zhang
Abstract:
Text matching systems have become a fundamental service in most search platforms. For instance, they are responsible for matching user queries to relevant candidate items, or rewriting the user-input query to a pre-selected high-performing one for a better search experience. In practice, both the queries and items often contain multiple attributes, such as the category of the item and the location mentioned in the query, which represent condensed key information that is helpful for matching. However, most of the existing works downplay the effectiveness of attributes by integrating them into text representations as supplementary information. Hence, in this work, we focus on exploring the relationship between the attributes from two sides. Since attributes from two ends are often not aligned in terms of number and type, we propose to exploit the benefit of attributes by multiple-intent modeling. The intents extracted from attributes summarize the diverse needs of queries and provide rich content of items, which are more refined and abstract, and can be aligned for paired inputs. Concretely, we propose a multi-intent attribute-aware matching model (MIM), which consists of three main components: attribute-aware encoder, multi-intent modeling, and intent-aware matching. In the attribute-aware encoder, the text and attributes are weighted and processed through a scaled attention mechanism with regard to the attributes' importance. Afterward, the multi-intent modeling extracts intents from two ends and aligns them. Herein, we come up with a distribution loss to ensure the learned intents are diverse but concentrated, and a Kullback-Leibler divergence loss that aligns the learned intents. Finally, in the intent-aware matching, the intents are evaluated by a self-supervised masking task, and then incorporated to output the final matching result.
Submitted 12 February, 2024;
originally announced February 2024.
-
MMToM-QA: Multimodal Theory of Mind Question Answering
Authors:
Chuanyang Jin,
Yutong Wu,
Jing Cao,
Jiannan Xiang,
Yen-Ling Kuo,
Zhiting Hu,
Tomer Ullman,
Antonio Torralba,
Joshua B. Tenenbaum,
Tianmin Shu
Abstract:
Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.
Submitted 15 June, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Make-A-Character: High Quality Text-to-3D Character Generation within Minutes
Authors:
Jianqiang Ren,
Chao He,
Lin Liu,
Jiahao Chen,
Yutong Wang,
Yafei Song,
Jianfang Li,
Tangli Xue,
Siqi Hu,
Tao Chen,
Kunkun Zheng,
Jianjing Xiang,
Liefeng Bo
Abstract:
There is a growing demand for customized and expressive 3D characters with the emergence of AI agents and Metaverse, but creating 3D characters using traditional computer graphics tools is a complex and time-consuming task. To address these challenges, we propose a user-friendly framework named Make-A-Character (Mach) to create lifelike 3D avatars from text descriptions. The framework leverages the power of large language and vision models for textual intention understanding and intermediate image generation, followed by a series of human-oriented visual perception and 3D generation modules. Our system offers an intuitive approach for users to craft controllable, realistic, fully-realized 3D characters that meet their expectations within 2 minutes, while also enabling easy integration with existing CG pipeline for dynamic expressiveness. For more information, please visit the project page at https://human3daigc.github.io/MACH/.
Submitted 24 December, 2023;
originally announced December 2023.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
CR-SFP: Learning Consistent Representation for Soft Filter Pruning
Authors:
Jingyang Xiang,
Zhuangzhi Chen,
Jianbiao Mei,
Siqi Li,
Jun Chen,
Yong Liu
Abstract:
Soft filter pruning (SFP) has emerged as an effective pruning technique that allows pruned filters to keep updating and gives them the opportunity to regrow into the network. However, this pruning strategy applies training and pruning in an alternating manner, which inevitably causes inconsistent representations between the reconstructed network (R-NN) at training and the pruned network (P-NN) at inference, resulting in performance degradation. In this paper, we propose to mitigate this gap by learning consistent representation for soft filter pruning, dubbed CR-SFP. Specifically, for each training step, CR-SFP optimizes the R-NN and P-NN simultaneously with different distorted versions of the same training data, while forcing them to be consistent by minimizing the divergence between their posterior distributions via the bidirectional KL-divergence loss. Meanwhile, the R-NN and P-NN share backbone parameters, so only additional classifier parameters are introduced. After training, we can export the P-NN for inference. CR-SFP is a simple yet effective training framework to improve the accuracy of P-NN without introducing any additional inference cost. It can also be combined with a variety of pruning criteria and loss functions. Extensive experiments demonstrate our CR-SFP achieves consistent improvements across various CNN architectures. Notably, on ImageNet, our CR-SFP reduces more than 41.8\% FLOPs on ResNet18 with 69.2\% top-1 accuracy, improving SFP by 2.1\% under the same training settings. The code will be publicly available on GitHub.
Submitted 17 December, 2023;
originally announced December 2023.
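A note on the CR-SFP entry above: the bidirectional KL-divergence loss it mentions can be sketched in a few lines. The version below is a minimal pure-Python illustration, assuming the two networks emit class logits and that the two KL directions are averaged with equal weight; the abstract does not give the exact weighting, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_kl(logits_a, logits_b):
    """Average of KL(p || q) and KL(q || p) over the two softmax outputs."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical logits from the reconstructed and pruned networks.
loss = bidirectional_kl([2.0, 0.5, -1.0], [1.5, 0.7, -0.5])
```

The loss is zero when both networks predict the same distribution and is unchanged when the two arguments are swapped, which is the property that makes it a consistency term rather than a one-sided distillation loss.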
-
MaxQ: Multi-Axis Query for N:M Sparsity Network
Authors:
Jingyang Xiang,
Siqi Li,
Junhao Chen,
Zhuangzhi Chen,
Tianxin Huang,
Linpeng Peng,
Yong Liu
Abstract:
N:M sparsity has received increasing attention due to its remarkable performance and latency trade-off compared with structured and unstructured sparsity. However, existing N:M sparsity methods do not differentiate the relative importance of weights among blocks and leave important weights underappreciated. Besides, they directly apply N:M sparsity to the whole network, which will cause severe information loss. Thus, they are still sub-optimal. In this paper, we propose an efficient and effective Multi-Axis Query methodology, dubbed MaxQ, to rectify these problems. During training, MaxQ employs a dynamic approach to generate soft N:M masks, considering the weight importance across multiple axes. This method enhances the weights with more importance and ensures more effective updates. Meanwhile, a sparsity strategy that gradually increases the percentage of N:M weight blocks is applied, which allows the network to heal from the pruning-induced damage progressively. At runtime, the N:M soft masks can be precomputed as constants and folded into weights without causing any distortion to the sparse pattern or incurring additional computational overhead. Comprehensive experiments demonstrate that MaxQ achieves consistent improvements across diverse CNN architectures in various computer vision tasks, including image classification, object detection and instance segmentation. For ResNet50 with 1:16 sparse pattern, MaxQ can achieve 74.6\% top-1 accuracy on ImageNet and improve by over 2.8\% over the state-of-the-art. Codes and checkpoints are available at \url{https://github.com/JingyangXiang/MaxQ}.
Submitted 16 March, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
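A note on the MaxQ entry above: for readers unfamiliar with the N:M pattern, the sketch below shows the plain, hard version of it, where each consecutive block of M weights keeps only its N largest-magnitude entries. This is the generic baseline pattern, not MaxQ's soft multi-axis masks; the function name `nm_mask` is hypothetical.

```python
def nm_mask(weights, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries per block of m."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        block = weights[start:start + m]
        # Rank positions by descending magnitude; ties go to the earlier position.
        ranked = sorted(range(m), key=lambda i: (-abs(block[i]), i))
        keep = set(ranked[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask

weights = [0.1, -0.9, 0.3, 0.05, 0.7, 0.2, -0.6, 0.0]
mask = nm_mask(weights, n=2, m=4)          # 2:4 sparsity
sparse = [w * k for w, k in zip(weights, mask)]
```

Every block of four ends up with exactly two nonzero weights. That regularity is the property hardware with N:M support exploits, and it is the pattern the abstract says is preserved when the soft masks are folded into the weights.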
-
FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
Authors:
Jun Xiang,
Xuan Gao,
Yudong Guo,
Juyong Zhang
Abstract:
We propose FlashAvatar, a novel and lightweight 3D animatable avatar representation that could reconstruct a digital avatar from a short monocular video sequence in minutes and render high-fidelity photo-realistic images at 300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D Gaussian field embedded in the surface of a parametric face model and learn extra spatial offset to model non-surface regions and subtle facial details. While full use of geometric priors can capture high-frequency facial details and preserve exaggerated expressions, proper initialization can help reduce the number of Gaussians, thus enabling super-fast rendering speed. Extensive experimental results demonstrate that FlashAvatar outperforms existing works regarding visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/
Submitted 29 March, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
-
Coupling shape and pairing vibrations in a collective Hamiltonian based on nuclear energy density functionals (II): low-energy excitation spectra of triaxial nuclei
Authors:
J. Xiang,
Z. P. Li,
T. Nikšić,
D. Vretenar,
W. H. Long,
X. Y. Wu
Abstract:
The triaxial quadrupole collective Hamiltonian, based on relativistic energy density functionals, is extended to include a pairing collective coordinate. In addition to triaxial shape vibrations and rotations, the model describes pairing vibrations and the coupling between triaxial shape and pairing degrees of freedom. The parameters of the collective Hamiltonian are determined by a covariant energy density functional, with constraints on the intrinsic triaxial shape and pairing deformations. The effect of coupling between triaxial shape and pairing degrees of freedom is analyzed in a study of low-lying spectra and transition rates of $^{128}$Xe. When compared to results obtained with the standard triaxial quadrupole collective Hamiltonian, the inclusion of dynamical pairing compresses the low-lying spectra and improves interband transitions, in better agreement with data. The effect of zero-point energy (ZPE) correction on low-lying excited spectra is also discussed.
Submitted 4 December, 2023;
originally announced December 2023.
-
Kitchen Artist: Precise Control of Liquid Dispensing for Gourmet Plating
Authors:
Hung-Jui Huang,
Jingyi Xiang,
Wenzhen Yuan
Abstract:
Manipulating liquid is widely required for many tasks, especially in cooking. A common way to address this is extruding viscous liquid from a squeeze bottle. In this work, our goal is to create a sauce plating robot, which requires precise control of the thickness of squeezed liquids on a surface. Different liquids demand different manipulation policies. We command the robot to tilt the container and monitor the liquid response using a force sensor to identify liquid properties. Based on the liquid properties, we predict the liquid behavior with fixed squeezing motions in a data-driven way and calculate the required drawing speed for the desired stroke size. This open-loop system works effectively even without sensor feedback. Our experiments demonstrate accurate stroke size control across different liquids and fill levels. We show that understanding liquid properties can facilitate effective liquid manipulation. More importantly, our dish garnishing robot has a wide range of applications and holds significant commercialization potential.
Submitted 20 November, 2023;
originally announced November 2023.
-
Ultrabroadband, high color purity multispectral color filter arrays
Authors:
Jiewei Xiang,
Meiting Song,
Yi Zhang,
Jennifer Kruschwitz,
Jaime Cardenas
Abstract:
Multispectral imagers that capture spatial and spectral information are of growing importance in various fields, particularly in remote sensing and metrology. To enable integrated snapshot multispectral imagers and eliminate the drawbacks of traditional systems, such as bulkiness and slow scanning mechanisms, miniature, broadband multispectral filter arrays with narrow line widths, high transmission, and CMOS compatibility are essential. However, current miniature filter arrays, primarily based on diffraction nanostructures, suffer from limitations such as small working bandwidth, low transmission, poor color purity, and sensitivity to polarization and incident angles. To address these challenges, we present a high-order Fabry-Perot Multispectral Filter Array (MSFA) with selective peak suppression, leveraging subwavelength structures for filter tuning without changing the physical thickness and employing an ultrathin metal layer to exploit high-order resonances, significantly extending the working range and spectral resolution. High color purity across a broad range (400nm-1000nm) is made possible through optical absorption from polysilicon and selective suppression from a platinum layer. The fabricated color filter arrays cover wavelengths from 622nm to 960nm, with Full Width Half Maximum (FWHM) ranging from 13nm to 31nm and average transmissions exceeding 60%. Furthermore, these filters can be downscaled to sizes compatible with modern CMOS imagers, reaching dimensions as small as 1um. The introduction of a resonance combining design further extends the working range (455nm-960nm), aligning with the capabilities of silicon photodetectors. Its adaptability across wavelength ranges and potential for tunable applications hold promise for transformative imaging and display technologies across a wide spectrum.
Submitted 23 October, 2023;
originally announced October 2023.
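The high-order resonances mentioned in the Fabry-Perot filter abstract above can be illustrated with the textbook condition for an ideal cavity at normal incidence, m·λ = 2·n·L. The sketch below is only an illustration under that idealization: it ignores the reflection phase of the mirrors and material dispersion, and the refractive index and cavity thickness are hypothetical values, not taken from the paper.

```python
def resonance_orders(n_index, thickness_nm, lo_nm, hi_nm):
    """Return {order m: wavelength in nm} for ideal-cavity resonances in [lo_nm, hi_nm].

    Uses m * wavelength = 2 * n * L (normal incidence, mirror phase ignored).
    """
    optical_path = 2.0 * n_index * thickness_nm
    peaks = {}
    m = 1
    # Wavelength shrinks as the order grows, so stop once it drops below the band.
    while optical_path / m >= lo_nm:
        wavelength = optical_path / m
        if wavelength <= hi_nm:
            peaks[m] = wavelength
        m += 1
    return peaks


# Hypothetical cavity: n = 1.5, L = 1000 nm, inspected over 400-1000 nm.
peaks = resonance_orders(1.5, 1000.0, 400.0, 1000.0)
for order, wavelength in peaks.items():
    print(f"m={order}: {wavelength:.1f} nm")
```

In this idealized picture several orders land inside the band at once, which is consistent with the abstract's emphasis on selectively suppressing unwanted peaks to keep color purity high.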
-
Simultaneous Shape Tracking of Multiple Deformable Linear Objects with Global-Local Topology Preservation
Authors:
**gyi Xiang,
Holly Dinkel
Abstract:
This work presents an algorithm for tracking the shape of multiple entangling Deformable Linear Objects (DLOs) from a sequence of RGB-D images. This algorithm runs in real-time and improves on previous single-DLO tracking approaches by enabling tracking of multiple objects. This is achieved using Global-Local Topology Preservation (GLTP). This work uses the geodesic distance in GLTP to define the distance between separate objects and the distance between different parts of the same object. Tracking multiple entangling DLOs is demonstrated experimentally. The source code is publicly released.
Submitted 23 October, 2023; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Spec-NeRF: Multi-spectral Neural Radiance Fields
Authors:
Jiabao Li,
Yuqi Li,
Ciliang Sun,
Chong Wang,
**hui Xiang
Abstract:
We propose Multi-spectral Neural Radiance Fields (Spec-NeRF) for jointly reconstructing a multispectral radiance field and spectral sensitivity functions (SSFs) of the camera from a set of color images filtered by different filters. The proposed method focuses on modeling the physical imaging process, and applies the estimated SSFs and radiance field to synthesize novel views of multispectral scenes. In this method, the data acquisition requires only a low-cost trichromatic camera and several off-the-shelf color filters, making it more practical than using specialized 3D scanning and spectral imaging equipment. Our experiments on both synthetic and real scenario datasets demonstrate that utilizing filtered RGB images with learnable NeRF and SSFs can achieve high fidelity and promising spectral reconstruction while retaining the inherent capability of NeRF to comprehend geometric structures. Code is available at https://github.com/CPREgroup/SpecNeRF-v2.
Submitted 14 September, 2023;
originally announced October 2023.
-
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach
Authors:
Feng Luo,
**xi Xiang,
Jun Zhang,
Xiao Han,
Wei Yang
Abstract:
The recent use of diffusion prior, enhanced by pre-trained text-image models, has markedly elevated the performance of image super-resolution (SR). To alleviate the huge computational cost required by pixel-based diffusion SR, latent-based methods utilize a feature encoder to transform the image and then implement the SR image generation in a compact latent space. Nevertheless, there are two major issues that limit the performance of latent-based diffusion. First, the compression of latent space usually causes reconstruction distortion. Second, huge computational cost constrains the parameter scale of the diffusion model. To counteract these issues, we first propose a frequency compensation module that enhances the frequency components from latent space to pixel space. The reconstruction distortion (especially for high-frequency information) can be significantly decreased. Then, we propose to use Sample-Space Mixture of Experts (SS-MoE) to achieve more powerful latent-based SR, which steadily improves the capacity of the model without a significant increase in inference costs. These carefully crafted designs contribute to performance improvements in largely explored 4x blind super-resolution benchmarks and extend to large magnification factors, i.e., 8x image SR benchmarks. The code is available at https://github.com/amandaluof/moe_sr.
Submitted 13 December, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Effortless Cross-Platform Video Codec: A Codebook-Based Method
Authors:
Kuan Tian,
Yonghang Guan,
**xi Xiang,
Jun Zhang,
Xiao Han,
Wei Yang
Abstract:
Under certain circumstances, advanced neural video codecs can surpass the most complex traditional codecs in their rate-distortion (RD) performance. One of the main reasons for the high performance of existing neural video codecs is the use of the entropy model, which can provide more accurate probability distribution estimations for compressing the latents. This also implies the rigorous requirement that entropy models running on different platforms should use consistent distribution estimations. However, in cross-platform scenarios, entropy models running on different platforms usually yield inconsistent probability distribution estimations due to floating point computation errors that are platform-dependent, which can cause the decoding side to fail in correctly decoding the compressed bitstream sent by the encoding side. In this paper, we propose a cross-platform video compression framework based on codebooks, which avoids autoregressive entropy modeling and achieves video compression by transmitting the index sequence of the codebooks. Moreover, instead of using optical flow for context alignment, we propose to use the conditional cross-attention module to obtain the context between frames. Due to the absence of autoregressive modeling and optical flow alignment, we can design an extremely minimalist framework that can greatly benefit computational efficiency. Importantly, our framework no longer contains any distribution estimation modules for entropy modeling, and thus computations across platforms are not necessarily consistent. Experimental results show that our method can outperform the traditional H.265 (medium) even without any entropy constraints, while achieving the cross-platform property intrinsically.
Submitted 16 October, 2023;
originally announced October 2023.
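The codebook idea in the abstract above (transmit an index sequence instead of entropy-coded latents) can be sketched with plain nearest-neighbour vector quantization. This is a generic illustration, not the paper's architecture: the codebook, the latent vectors, and the function names are hypothetical, and the cross-attention context model is omitted.

```python
def encode(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: dist2(v, codebook[k]))
        for v in vectors
    ]


def decode(indices, codebook):
    """Reconstruct vectors by looking the received indices up in the codebook."""
    return [codebook[k] for k in indices]


# Hypothetical shared codebook and encoder-side latents.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
latents = [(0.2, -0.1), (0.9, 1.2), (3.5, 0.4)]

indices = encode(latents, codebook)          # integers sent in the bitstream
reconstruction = decode(indices, codebook)   # decoder-side lookup
print(indices, reconstruction)
```

Because the decoder only performs integer lookups, small floating point differences between platforms cannot change how a received index is interpreted, which is the property the abstract relies on.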
-
SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration
Authors:
**gyang Xiang,
Siqi Li,
Jun Chen,
Shipeng Bai,
Yukai Ma,
Guang Dai,
Yong Liu
Abstract:
The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1×N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a Block Sparse Row matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1×N sparse weights based on dense pre-trained weights, leading to problems such as expensive training cost and memory access, sub-optimal model quality, and unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel Soft Uniform Block Pruning (SUBP) approach to train a uniform 1×N sparse structured network from scratch. Specifically, our approach repeatedly allows pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training and reduces both the model redundancy and the risk of permanently pruning important blocks, but also achieves a balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1×N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at https://github.com/**gyangXiang/SUBP.
Submitted 9 October, 2023;
originally announced October 2023.
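The 1×N pattern described in the SUBP abstract above, with the same number of blocks kept in every group of output channels, can be sketched as follows. This is a simplified one-shot illustration: it ranks blocks by L1 magnitude, which is a stand-in chosen here and not the paper's block angular redundancy or importance sampling, and it performs no regrowth during training. The function name and the weight values are hypothetical.

```python
def prune_1xN(weights, n, keep_per_group):
    """Zero all but the `keep_per_group` strongest 1xN blocks in each row group.

    weights: list of rows, indexed [output_channel][input_channel].
    A block is the N consecutive output-channel weights sharing one input channel.
    """
    out_ch, in_ch = len(weights), len(weights[0])
    assert out_ch % n == 0, "output channels must be divisible by N"
    pruned = [[0.0] * in_ch for _ in range(out_ch)]
    for start in range(0, out_ch, n):
        rows = range(start, start + n)
        # L1 norm of each block (one block per input channel) in this group.
        scores = [sum(abs(weights[r][c]) for r in rows) for c in range(in_ch)]
        kept = sorted(range(in_ch), key=lambda c: scores[c], reverse=True)
        for c in kept[:keep_per_group]:
            for r in rows:
                pruned[r][c] = weights[r][c]
    return pruned


# Hypothetical 4x3 weight matrix, 1x2 blocks, one block kept per group.
w = [
    [0.1, 2.0, -0.3],
    [0.2, -1.5, 0.1],
    [3.0, 0.1, 0.2],
    [-2.5, 0.3, 0.1],
]
sparse = prune_1xN(w, n=2, keep_per_group=1)
for row in sparse:
    print(row)
```

Keeping an identical block count per group is what gives every group the same amount of work, which corresponds to the balanced multithreading workload the abstract mentions.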
-
Neural Impostor: Editing Neural Radiance Fields with Explicit Shape Manipulation
Authors:
Ruiyang Liu,
**xu Xiang,
Bowen Zhao,
Ran Zhang,
**gyi Yu,
Changxi Zheng
Abstract:
Neural Radiance Fields (NeRF) have significantly advanced the generation of highly realistic and expressive 3D scenes. However, the task of editing NeRF, particularly in terms of geometry modification, poses a significant challenge. This issue has obstructed NeRF's wider adoption across various applications. To tackle the problem of efficiently editing neural implicit fields, we introduce Neural Impostor, a hybrid representation incorporating an explicit tetrahedral mesh alongside a multigrid implicit field designated for each tetrahedron within the explicit mesh. Our framework bridges the explicit shape manipulation and the geometric editing of implicit fields by utilizing multigrid barycentric coordinate encoding, thus offering a pragmatic solution to deform, composite, and generate neural implicit fields while maintaining a complex volumetric appearance. Furthermore, we propose a comprehensive pipeline for editing neural implicit fields based on a set of explicit geometric editing operations. We show the robustness and adaptability of our system through diverse examples and experiments, including the editing of both synthetic objects and real captured data. Finally, we demonstrate the authoring process of a hybrid synthetic-captured object utilizing a variety of editing operations, underlining the transformative potential of Neural Impostor in the field of 3D content creation and manipulation.
Submitted 9 October, 2023;
originally announced October 2023.
-
Towards Real-Time Neural Video Codec for Cross-Platform Application Using Calibration Information
Authors:
Kuan Tian,
Yonghang Guan,
**xi Xiang,
Jun Zhang,
Xiao Han,
Wei Yang
Abstract:
The state-of-the-art neural video codecs have outperformed the most sophisticated traditional codecs in terms of RD performance in certain cases. However, utilizing them for practical applications is still challenging for two major reasons. 1) Cross-platform computational errors resulting from floating point operations can lead to inaccurate decoding of the bitstream. 2) The high computational complexity of the encoding and decoding process poses a challenge in achieving real-time performance. In this paper, we propose a real-time cross-platform neural video codec, which is capable of efficiently decoding 720P video bitstreams from other encoding platforms on a consumer-grade GPU. First, to solve the problem of codec inconsistency caused by the uncertainty of floating point calculations across platforms, we design a calibration transmitting system to guarantee the consistent quantization of entropy parameters between the encoding and decoding stages. The parameters that may have transboundary quantization between encoding and decoding are identified in the encoding stage, and their coordinates are delivered by an auxiliary transmitted bitstream. By doing so, these inconsistent parameters can be processed properly in the decoding stage. Furthermore, to reduce the bitrate of the auxiliary bitstream, we rectify the distribution of entropy parameters using a piecewise Gaussian constraint. Second, to match the computational limitations on the decoding side for a real-time video codec, we design a lightweight model. A series of efficiency techniques enable our model to achieve 25 FPS decoding speed on an NVIDIA RTX 2080 GPU. Experimental results demonstrate that our model can achieve real-time decoding of 720P videos while encoding on another platform. Furthermore, the real-time model brings up to a 24.2% BD-rate improvement in terms of PSNR over the H.265 anchor.
Submitted 20 September, 2023;
originally announced September 2023.
-
Non-parametric Ensemble Empirical Mode Decomposition for extracting weak features to identify bearing defects
Authors:
Anil Kumar,
Yaakoub Berrouche,
Radosław Zimroz,
Govind Vashishtha,
Sumika Chauhan,
C. P. Gandhi,
Hesheng Tang,
Jiawei Xiang
Abstract:
A non-parametric complementary ensemble empirical mode decomposition (NPCEEMD) is proposed for identifying bearing defects using weak features. NPCEEMD is non-parametric because, unlike existing decomposition methods such as ensemble empirical mode decomposition, it does not require the ideal SNR of the noise and the number of ensembles to be defined each time a signal is processed. The simulation results show that mode mixing in NPCEEMD is less than in the existing decomposition methods. After conducting in-depth simulation analysis, the proposed method is applied to experimental data. The proposed NPCEEMD method works in the following steps. First, the raw signal is obtained. Second, the obtained signal is decomposed. Then, the mutual information (MI) between the raw signal and each NPCEEMD-generated intrinsic mode function (IMF) is computed. Next, IMFs with MI above 0.1 are selected and combined to form a resulting signal. Finally, the envelope spectrum of the resulting signal is computed to confirm the presence of a defect.
Submitted 2 October, 2023; v1 submitted 12 September, 2023;
originally announced September 2023.
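The IMF selection step listed in the NPCEEMD abstract above (compute the mutual information of each IMF with the raw signal, keep those above 0.1, and sum them) can be sketched as below. This is a minimal sketch under stated assumptions: the decomposition itself and the envelope spectrum are not implemented, the MI estimator is a simple equal-width histogram estimate in nats (the abstract does not say which estimator or unit the authors use), and the function names, bin count, and toy signals are hypothetical.

```python
import math
from collections import Counter


def _bin_indices(x, bins):
    """Assign each sample to one of `bins` equal-width bins over the signal range."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0  # constant signal: everything lands in bin 0
    return [min(int((v - lo) / width), bins - 1) for v in x]


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information between two signals (nats)."""
    bx, by = _bin_indices(x, bins), _bin_indices(y, bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def select_and_combine(raw, imfs, threshold=0.1):
    """Sum the IMFs whose MI with the raw signal exceeds the threshold."""
    kept = [imf for imf in imfs if mutual_information(raw, imf) > threshold]
    if not kept:
        return [0.0] * len(raw)
    return [sum(samples) for samples in zip(*kept)]


# Toy stand-ins for a raw signal and its IMFs (a real run would use NPCEEMD output).
raw = [math.sin(0.3 * t) for t in range(200)]
imfs = [list(raw), [0.0] * len(raw)]
combined = select_and_combine(raw, imfs)
print(mutual_information(raw, imfs[0]), mutual_information(raw, imfs[1]))
```

Here the flat component carries no information about the raw signal and is dropped, so the combined signal equals the informative component.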
-
AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections
Authors:
Yue Wu,
Sicheng Xu,
Jianfeng Xiang,
Fangyun Wei,
Qifeng Chen,
Jiaolong Yang,
Xin Tong
Abstract:
Previous animatable 3D-aware GANs for human generation have primarily focused on either the human head or full body. However, head-only videos are relatively uncommon in real life, and full body generation typically does not deal with facial expression control and still has challenges in generating high-quality results. Towards applicable video avatars, we present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements. It is a generative model trained on unstructured 2D image collections without using 3D or video data. For the new task, we base our method on the generative radiance manifold representation and equip it with learnable facial and head-shoulder deformations. A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces, which is critical for portrait images. A pose deformation processing network is developed to generate plausible deformations for challenging regions such as long hair. Experiments show that our method, trained on unstructured 2D images, can generate diverse and high-quality 3D portraits with desired control over different properties.
Submitted 5 September, 2023;
originally announced September 2023.
-
Quantifying and Analyzing Entity-level Memorization in Large Language Models
Authors:
Zhenhong Zhou,
Jiuyang Xiang,
Chaomeng Chen,
Sen Su
Abstract:
Large language models (LLMs) have been proven capable of memorizing their training data, which can be extracted through specifically designed prompts. As the scale of datasets continues to grow, privacy risks arising from memorization have attracted increasing attention. Quantifying language model memorization helps evaluate potential privacy risks. However, prior works on quantifying memorization require access to the precise original data or incur substantial computational overhead, making it difficult for applications in real-world language models. To this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. In addition, we also present an approach for efficiently extracting sensitive entities from autoregressive language models. We conduct extensive experiments based on the proposed, probing language models' ability to reconstruct sensitive entities under different settings. We find that language models have strong memorization at the entity level and are able to reproduce the training data even with partial leakages. The results demonstrate that LLMs not only memorize their training data but also understand associations between entities. These findings necessitate that trainers of LLMs exercise greater prudence regarding model memorization, adopting memorization mitigation techniques to preclude privacy violations.
Submitted 5 November, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Dynamic Low-Rank Instance Adaptation for Universal Neural Image Compression
Authors:
Yue Lv,
**xi Xiang,
Jun Zhang,
Wenming Yang,
Xiao Han,
Wei Yang
Abstract:
The latest advancements in neural image compression show great potential in surpassing the rate-distortion performance of conventional standard codecs. Nevertheless, there exists an indelible domain gap between the datasets utilized for training (i.e., natural images) and those utilized for inference (e.g., artistic images). Our proposal involves a low-rank adaptation approach aimed at addressing the rate-distortion drop observed in out-of-domain datasets. Specifically, we perform low-rank matrix decomposition to update certain adaptation parameters of the client's decoder. These updated parameters, along with image latents, are encoded into a bitstream and transmitted to the decoder in practical scenarios. Due to the low-rank constraint imposed on the adaptation parameters, the resulting bit rate overhead is small. Furthermore, the bit rate allocation of low-rank adaptation is non-trivial, considering the diverse inputs require varying adaptation bitstreams. We thus introduce a dynamic gating network on top of the low-rank adaptation method, in order to decide which decoder layer should employ adaptation. The dynamic adaptation network is optimized end-to-end using rate-distortion loss. Our proposed method exhibits universality across diverse image datasets. Extensive results demonstrate that this paradigm significantly mitigates the domain gap, surpassing non-adaptive methods with an average BD-rate improvement of approximately 19% across out-of-domain images. Furthermore, it outperforms the most advanced instance adaptive methods by roughly 5% BD-rate. Ablation studies confirm our method's ability to universally enhance various image compression architectures.
Submitted 15 August, 2023;
originally announced August 2023.
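The low-rank update with a per-layer gate described in the abstract above can be sketched generically as W' = W + A·B, applied only when the gate is on. This is an illustrative sketch, not the paper's implementation: the gate is a plain boolean here (the paper uses a learned gating network), and the matrices and function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adapt(weight, a, b, gate=True):
    """Return weight + a @ b when the gate is on, else a copy of the weight."""
    if not gate:
        return [row[:] for row in weight]
    delta = matmul(a, b)
    return [
        [w + d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical 3x3 decoder weight with a rank-1 update (A is 3x1, B is 1x3).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[1.0], [0.0], [2.0]]
b = [[0.5, 0.0, -1.0]]

adapted = adapt(weight, a, b)
extra_params = len(a) * len(a[0]) + len(b) * len(b[0])
print(adapted, extra_params)
```

For an m×n layer and rank r, only r·(m + n) values need to be transmitted instead of m·n, which is why the abstract can describe the bit rate overhead as small; switching the gate off for a layer removes that layer's overhead entirely.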
-
Stochastic averaging principle and stability for multi-valued McKean-Vlasov stochastic differential equations with jumps
Authors:
Guangjun Shen,
Jie Xiang,
Jiang-Lun Wu
Abstract:
In this paper, we consider the stochastic averaging principle and stability for multi-valued McKean-Vlasov stochastic differential equations with jumps. First, under certain averaging conditions, we show that the solutions of the equations concerned can be approximated by solutions of the associated averaged multi-valued McKean-Vlasov stochastic differential equations with jumps in the sense of mean square convergence. Second, we extend the classical Itô formula from stochastic differential equations to multi-valued McKean-Vlasov stochastic differential equations with jumps. Last, as an application of Itô's formula, we present the exponential stability of second moments, the exponentially 2-ultimate boundedness and the almost surely asymptotic stability for their solutions in terms of a Lyapunov function.
Submitted 4 August, 2023;
originally announced August 2023.
-
Mini-PointNetPlus: a local feature descriptor in deep learning model for 3d environment perception
Authors:
Chuanyu Luo,
Nuo Cheng,
Sikun Ma,
Jun Xiang,
Xiaohan Li,
Shengguang Lei,
Pu Li
Abstract:
Common deep learning models for 3D environment perception often use pillarization/voxelization methods to convert point cloud data into pillars/voxels and then process it with a 2D/3D convolutional neural network (CNN). The pioneering work PointNet has been widely applied as a local feature descriptor, a fundamental component in deep learning models for 3D perception, to extract features of a point cloud. This is achieved by using a symmetric max-pooling operator which provides unique pillar/voxel features. However, by ignoring most of the points, the max-pooling operator causes an information loss, which reduces the model performance. To address this issue, we propose a novel local feature descriptor, mini-PointNetPlus, as a plug-and-play alternative to PointNet. Our basic idea is to separately project the data points to the individual features considered, each leading to a permutation invariant. Thus, the proposed descriptor transforms an unordered point cloud to a stable order. The vanilla PointNet is proved to be a special case of our mini-PointNetPlus. Because the proposed descriptor fully utilizes the features, we demonstrate experimentally a considerable performance improvement for 3D perception.
Submitted 25 July, 2023;
originally announced July 2023.
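The symmetric max-pooling operator that the abstract above attributes to PointNet can be sketched in a few lines. The sketch covers only that baseline operator, not mini-PointNetPlus, whose details the abstract does not give; the feature values and the function name are hypothetical.

```python
import random


def max_pool_descriptor(point_features):
    """Channel-wise max over the per-point feature vectors of one pillar/voxel."""
    return [max(channel) for channel in zip(*point_features)]


# Hypothetical per-point features (3 points, 3 channels) inside one voxel.
features = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.5, 0.8, 0.6],
]
shuffled = features[:]
random.shuffle(shuffled)

pooled = max_pool_descriptor(features)
print(pooled, max_pool_descriptor(shuffled))
```

The result does not depend on the order of the points, which is the permutation invariance an unordered point cloud needs. It also shows the limitation the abstract points out: each output channel is determined by a single point, so the remaining values are discarded.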
-
A Fast and Map-Free Model for Trajectory Prediction in Traffics
Authors:
Junhong Xiang,
**gmin Zhang,
Zhixiong Nan
Abstract:
Existing methods have two shortcomings: (i) nearly all models rely on high-definition (HD) maps, yet map information is not always available in real traffic scenes and HD map-building is expensive and time-consuming; and (ii) existing models usually focus on improving prediction accuracy at the expense of computing efficiency, yet efficiency is crucial for various real applications. To address both, this paper proposes an efficient trajectory prediction model that does not depend on traffic maps. The core idea of our model is to encode each single agent's spatial-temporal information in the first stage and to explore multi-agent spatial-temporal interactions in the second stage. By comprehensively utilizing an attention mechanism, LSTM, a graph convolution network and a temporal transformer in the two stages, our model is able to learn rich dynamic and interaction information of all agents. Our model achieves the highest performance compared with existing map-free methods and also exceeds most map-based state-of-the-art methods on the Argoverse dataset. In addition, our model also exhibits a faster inference speed than the baseline methods.
Submitted 19 July, 2023;
originally announced July 2023.
-
Nonlinear phonon Hall effects in ferroelectrics: its existence and non-volatile electrical control
Authors:
W. Luo,
J. Y. Ji,
P. Chen,
Y. Xu,
L. F. Zhang,
H. J. Xiang,
L. Bellaiche
Abstract:
Nonlinear Hall effects have been previously investigated in non-centrosymmetric electronic systems. However, they only exist in metallic systems and are not compatible with ferroelectrics since the latter are insulators, which limits their applications. On the other hand, ferroelectrics naturally break inversion symmetry and can induce a non-zero Berry curvature. Here, we show that a non-volatile electric-field control of heat current can be realized in ferroelectrics through the nonlinear phonon Hall effects. More precisely, based on the Boltzmann equation under the relaxation-time approximation, we derive the equation for nonlinear phonon Hall effects, and further show that the behaviors of nonlinear phonon (boson) Hall effects are very different from nonlinear Hall effects for electrons (fermions). Our work provides a route for electric-field control of thermal Hall current in ferroelectrics.
Submitted 13 June, 2023;
originally announced June 2023.
-
Language Models Meet World Models: Embodied Experiences Enhance Language Models
Authors:
Jiannan Xiang,
Tianhua Tao,
Yi Gu,
Tianmin Shu,
Zirui Wang,
Zichao Yang,
Zhiting Hu
Abstract:
While large language models (LMs) have shown remarkable capabilities across numerous tasks, they often struggle with simple reasoning and planning in physical environments, such as understanding object permanence or planning household activities. The limitation arises from the fact that LMs are trained only on written text and miss essential embodied knowledge and skills. In this paper, we propose a new paradigm of enhancing LMs by finetuning them with world models, to gain diverse embodied knowledge while retaining their general language capabilities. Our approach deploys an embodied agent in a world model, particularly a simulator of the physical world (VirtualHome), and acquires a diverse set of embodied experiences through both goal-oriented planning and random exploration. These experiences are then used to finetune LMs to teach diverse abilities of reasoning and acting in the physical world, e.g., planning and completing goals, object permanence and tracking, etc. Moreover, it is desirable to preserve the generality of LMs during finetuning, which facilitates generalizing the embodied knowledge across tasks rather than being tied to specific simulations. We thus further introduce the classical elastic weight consolidation (EWC) for selective weight updates, combined with low-rank adapters (LoRA) for training efficiency. Extensive experiments show our approach substantially improves base LMs on 18 downstream tasks by 64.28% on average. In particular, the small LMs (1.3B, 6B, and 13B) enhanced by our approach match or even outperform much larger LMs (e.g., ChatGPT).
Submitted 28 October, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
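The EWC regularizer referred to in the abstract above can be sketched in its standard form, where a quadratic penalty weighted by a per-parameter importance (usually a diagonal Fisher information estimate) keeps important weights close to their pretrained values: L = L_task + (λ/2) Σ_i F_i (θ_i − θ*_i)². The sketch below shows only this generic penalty; how the paper estimates importance or combines the term with LoRA is not stated in the abstract, and all names and numbers are hypothetical.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Return (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, lam=1.0):
    """Task loss on the new (embodied) data plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)


# Hypothetical values: the first weight moved by 1.0, the second stayed put.
params = [1.0, 2.0]
anchor_params = [0.0, 2.0]   # pretrained weights
fisher = [2.0, 5.0]          # per-weight importance estimates

print(total_loss(0.25, params, anchor_params, fisher))
```

Weights with a large importance value are penalized heavily for drifting, while unimportant weights remain free to change, which is one way to read the abstract's phrase "selective weight updates".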
-
Structure Diagram Recognition in Financial Announcements
Authors:
Meixuan Qiao,
Jun Wang,
Junfu Xiang,
Qiyu Hou,
Ruixuan Li
Abstract:
Accurately extracting structured data from structure diagrams in financial announcements is of great practical importance for building financial knowledge graphs and further improving the efficiency of various financial applications. First, we proposed a new method for recognizing structure diagrams in financial announcements, which can better detect and extract different types of connecting lines, including straight lines, curves, and polylines of different orientations and angles. Second, we developed a two-stage method to efficiently generate the industry's first benchmark of structure diagrams from Chinese financial announcements, where a large number of diagrams were synthesized and annotated using an automated tool to train a preliminary recognition model with fairly good performance, and then a high-quality benchmark can be obtained by automatically annotating the real-world structure diagrams using the preliminary model and then making few manual corrections. Finally, we experimentally verified the significant performance advantage of our structure diagram recognition method over previous methods.
Submitted 1 May, 2023; v1 submitted 25 April, 2023;
originally announced April 2023.