-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Authors:
Shengding Hu,
Yuge Tu,
Xu Han,
Chaoqun He,
Ganqu Cui,
Xiang Long,
Zhi Zheng,
Yewei Fang,
Yuxiang Huang,
Weilin Zhao,
Xinrong Zhang,
Zheng Leng Thai,
Kaihuo Zhang,
Chongyi Wang,
Yuan Yao,
Chenyang Zhao,
Jie Zhou,
Jie Cai,
Zhongwu Zhai,
Ning Ding,
Chao Jia,
Guoyang Zeng,
Dahai Li,
Zhiyuan Liu,
Maosong Sun
Abstract:
The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occur with the WSD LRS. With the WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive a much higher compute-optimal data-model ratio than the Chinchilla-optimal one. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM.
Submitted 3 June, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
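The Warmup-Stable-Decay scheduler named in the MiniCPM abstract can be illustrated with a minimal sketch. The abstract only names the three phases; the linear warmup, the linear decay, and every parameter name below (`peak_lr`, `warmup_steps`, `stable_steps`, `decay_steps`, `final_lr`) are assumptions made for illustration and are not taken from the paper.

```python
def wsd_learning_rate(step, peak_lr, warmup_steps, stable_steps, decay_steps, final_lr=0.0):
    """Illustrative Warmup-Stable-Decay schedule.

    Assumption: both ramps are linear; the paper may use a different shape.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: anneal from peak_lr down to final_lr, then stay at final_lr.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_learning_rate(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Because the stable phase holds a constant rate, a checkpoint taken anywhere inside it could plausibly be resumed with more data or a new domain before a short decay is applied, which is consistent with the abstract's remark about continuous training and domain adaptation.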
-
PoCo: Point Context Cluster for RGBD Indoor Place Recognition
Authors:
Jing Liang,
Zhuo Deng,
Zheming Zhou,
Omid Ghasemalizadeh,
Dinesh Manocha,
Min Sun,
Cheng-Hao Kuo,
Arnie Sen
Abstract:
We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (CoCs) to extract global descriptors directly from the noisy point clouds through end-to-end learning. Moreover, we develop the architecture by integrating both color and geometric modalities into the point features to enhance the global descriptor representation. We conducted evaluations on the public datasets ScanNet-PR and ARKit with 807 and 5047 scenarios, respectively. PoCo achieves SOTA performance: on ScanNet-PR, we achieve an R@1 of 64.63%, a 5.7% relative improvement over the best-published result, CGis (61.12%); on ARKit, we achieve an R@1 of 45.12%, a 13.3% relative improvement over the best-published result, CGis (39.82%). In addition, PoCo shows higher efficiency than CGis in inference time (1.75x faster), and we demonstrate the effectiveness of PoCo in recognizing places within a real-world laboratory environment.
Submitted 3 April, 2024;
originally announced April 2024.
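The improvement figures quoted in the PoCo abstract match relative gains over the CGis baseline rather than absolute percentage-point differences. The short check below reproduces that arithmetic from the R@1 values in the abstract; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(new, baseline):
    """Return the gain of `new` over `baseline` as a percentage of `baseline`."""
    return (new - baseline) / baseline * 100.0


# R@1 values (in percent) as quoted in the abstract.
scannet = relative_improvement(64.63, 61.12)
arkit = relative_improvement(45.12, 39.82)

print(f"ScanNet-PR: {scannet:.1f}% relative, {64.63 - 61.12:.2f} points absolute")
print(f"ARKit: {arkit:.1f}% relative, {45.12 - 39.82:.2f} points absolute")
```

The relative gains come out at about 5.7% and 13.3%, while the absolute gaps are 3.51 and 5.30 percentage points.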
-
Advancing LLM Reasoning Generalists with Preference Trees
Authors:
Lifan Yuan,
Ganqu Cui,
Hanbin Wang,
Ning Ding,
Xingyao Wang,
Jia Deng,
Boji Shan,
Huimin Chen,
Ruobing Xie,
Yankai Lin,
Zhenghao Liu,
Bowen Zhou,
Hao Peng,
Zhiyuan Liu,
Maosong Sun
Abstract:
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
Submitted 2 April, 2024;
originally announced April 2024.
-
The state-of-the-art in Cardiac MRI Reconstruction: Results of the CMRxRecon Challenge in MICCAI 2023
Authors:
Jun Lyu,
Chen Qin,
Shuo Wang,
Fanwen Wang,
Yan Li,
Zi Wang,
Kunyuan Guo,
Cheng Ouyang,
Michael Tänzer,
Meng Liu,
Longyu Sun,
Mengting Sun,
Qin Li,
Zhang Shi,
Sha Hua,
Hao Li,
Zhensen Chen,
Zhenlin Zhang,
Bingyu Xin,
Dimitris N. Metaxas,
George Yiasemis,
Jonas Teuwen,
Liping Zhang,
Weitian Chen,
Yidong Zhao
, et al. (25 additional authors not shown)
Abstract:
Cardiac MRI, crucial for evaluating heart structure and function, faces limitations like slow imaging and motion artifacts. Undersampling reconstruction, especially data-driven algorithms, has emerged as a promising solution to accelerate scans and enhance imaging performance using highly under-sampled data. Nevertheless, the scarcity of publicly available cardiac k-space datasets and evaluation platforms hinders the development of data-driven reconstruction algorithms. To address this issue, we organized the Cardiac MRI Reconstruction Challenge (CMRxRecon) in 2023, in collaboration with the 26th International Conference on MICCAI. CMRxRecon presented an extensive k-space dataset comprising cine and mapping raw data, accompanied by detailed annotations of cardiac anatomical structures. With overwhelming participation, the challenge attracted more than 285 teams and over 600 participants. Among them, 22 teams successfully submitted Docker containers for the testing phase, with 7 teams submitting for both cine and mapping tasks. All teams used deep-learning-based approaches, indicating that deep learning has become the predominant solution for the problem. The first-place winner of both tasks utilizes the E2E-VarNet architecture as its backbone. In contrast, U-Net is still the most popular backbone for both multi-coil and single-coil reconstructions. This paper provides a comprehensive overview of the challenge design, presents a summary of the submitted results, reviews the employed methods, and offers an in-depth discussion that aims to inspire future advancements in cardiac MRI reconstruction models. The summary emphasizes the effective strategies observed in cardiac MRI reconstruction, including backbone architecture, loss function, pre-processing techniques, physical modeling, and model complexity, thereby providing valuable insights for further developments in this field.
Submitted 16 April, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
GDA: Generalized Diffusion for Robust Test-time Adaptation
Authors:
Yun-Yun Tsai,
Fu-Chen Chen,
Albert Y. C. Chen,
Junfeng Yang,
Che-Chun Su,
Min Sun,
Cheng-Hao Kuo
Abstract:
Machine learning models struggle with generalization when encountering out-of-distribution (OOD) samples with unexpected distribution shifts. For vision tasks, recent studies have shown that test-time adaptation employing diffusion models can achieve state-of-the-art accuracy improvements on OOD samples by generating new samples that align with the model's domain without the need to modify the model's weights. Unfortunately, those studies have primarily focused on pixel-level corruptions, thereby lacking the generalization to adapt to a broader range of OOD types. We introduce Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time adaptation method robust against diverse OOD types. Specifically, GDA iteratively guides the diffusion by applying a marginal entropy loss derived from the model, in conjunction with style and content preservation losses during the reverse sampling process. In other words, GDA considers the model's output behavior with the semantic information of the samples as a whole, which can reduce ambiguity in downstream tasks during the generation process. Evaluation across various popular model architectures and OOD benchmarks shows that GDA consistently outperforms prior work on diffusion-driven adaptation. Notably, it achieves the highest classification accuracy improvements, ranging from 4.4% to 5.02% on ImageNet-C and 2.5% to 7.4% on Rendition, Sketch, and Stylized benchmarks. This performance highlights GDA's generalization to a broader range of OOD benchmarks.
Submitted 2 April, 2024; v1 submitted 29 March, 2024;
originally announced April 2024.
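The GDA abstract mentions a marginal entropy loss derived from the model but does not define it. A common formulation, assumed here purely for illustration, averages the classifier's softmax outputs over several views of one sample and takes the entropy of that averaged distribution. The function names and the use of multiple views are assumptions, not details from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_entropy(logits_per_view):
    """Entropy of the class distribution averaged over all views.

    Assumption: each inner list holds the classifier logits for one view
    (for example, one augmentation) of the same input sample.
    """
    probs = [softmax(view) for view in logits_per_view]
    num_classes = len(probs[0])
    marginal = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    # Skip zero-probability classes, which contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in marginal if p > 0.0)


if __name__ == "__main__":
    agree = [[10.0, 0.0], [10.0, 0.0]]     # views agree on class 0
    disagree = [[10.0, 0.0], [0.0, 10.0]]  # views contradict each other
    print(marginal_entropy(agree), marginal_entropy(disagree))
```

Under this definition the loss is low only when the views agree on a confident prediction, so minimizing it would push a generated sample toward regions where the classifier is both confident and consistent.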
-
SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior
Authors:
Zhongrui Yu,
Haoran Wang,
Jinze Yang,
Hanzhang Wang,
Zeke Xie,
Yunfeng Cai,
Jiale Cao,
Zhong Ji,
Mingming Sun
Abstract:
Novel View Synthesis (NVS) for street scenes plays a critical role in autonomous driving simulation. The current mainstream technique to achieve it is neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although thrilling progress has been made, when handling street scenes, current methods struggle to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints. This issue stems from the sparse training views captured by a fixed camera on a moving vehicle. To tackle this problem, we propose a novel approach that enhances the capacity of 3DGS by leveraging the prior from a Diffusion Model along with complementary multi-modal data. Specifically, we first fine-tune a Diffusion Model by adding images from adjacent frames as conditions, while exploiting depth data from LiDAR point clouds to supply additional spatial information. Then we apply the Diffusion Model to regularize the 3DGS at unseen views during training. Experimental results validate the effectiveness of our method compared with current state-of-the-art models, and demonstrate its advantage in rendering images from broader views.
Submitted 29 March, 2024;
originally announced March 2024.
-
Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication
Authors:
Mingze Sun,
Chao Xu,
Xinyu Jiang,
Yang Liu,
Baigui Sun,
Ruqi Huang
Abstract:
In this paper, we introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. Central to our approach is the incorporation of factorization to decouple audio features and the combination of textual semantic information, thereby facilitating the creation of more realistic and coordinated movements. We separately train VQ-VAEs with respect to the holistic motions of both speaker and listener. We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model, specifically designed to characterize real-world communication scenarios effectively, which can generate the motions of both the speaker and the listener simultaneously. These designs ensure that the results we generate are both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce the HoCo holistic communication dataset, which is a valuable resource for future research. Our HoCo dataset and code will be released for research purposes upon acceptance.
Submitted 28 March, 2024;
originally announced March 2024.
-
Low-Complexity Estimation Algorithm and Decoupling Scheme for FRaC System
Authors:
Mengjiang Sun,
Peng Chen,
Zhenxin Cao,
Fei Shen
Abstract:
With the leaping advances in autonomous vehicles and transportation infrastructure, dual function radar-communication (DFRC) systems have become attractive due to their size, cost, and resource efficiency. A frequency modulated continuous waveform (FMCW)-based radar-communication system (FRaC) utilizing both sparse multiple-input and multiple-output (MIMO) arrays and index modulation (IM) has been proposed to form a DFRC system specifically designed for vehicular applications. In this paper, the three-dimensional (3D) parameter estimation problem in the FRaC is considered. Since the 3D-parameters including range, direction of arrival (DOA) and velocity are coupled in the estimating matrix of the FRaC system, the existing estimation algorithms cannot estimate the 3D-parameters accurately. Hence, a novel decomposed decoupled atomic norm minimization (DANM) method is proposed by splitting the 3D-parameter estimating matrix into multiple 2D matrices with sparsity constraints. Then, the 3D-parameters are estimated efficiently and separately with the optimized decoupled estimating matrix. Moreover, the Cramér-Rao lower bound (CRLB) of the 3D-parameter estimation is derived, and the computational complexity of the proposed algorithm is analyzed. Simulation results show that the proposed decomposed DANM method exploits the advantage of the virtual aperture in the presence of coupling caused by IM and the sparse MIMO array, and outperforms the co-estimation algorithm with lower computational complexity.
Submitted 27 March, 2024;
originally announced March 2024.
-
Continual Few-shot Event Detection via Hierarchical Augmentation Networks
Authors:
Chenlong Zhang,
Pengfei Cao,
Yubo Chen,
Kang Liu,
Zhiqiang Zhang,
Mengshu Sun,
Jun Zhao
Abstract:
Traditional continual event detection relies on abundant labeled data for training, which is often impractical to obtain in real-world applications. In this paper, we introduce continual few-shot event detection (CFED), a more commonly encountered scenario when a substantial number of labeled samples are not accessible. The CFED task is challenging as it involves memorizing previous event types and learning new event types with few-shot samples. To mitigate these challenges, we propose a memory-based framework: Hierarchical Augmentation Networks (HANet). To memorize previous event types with limited memory, we incorporate prototypical augmentation into the memory set. For the issue of learning new event types in few-shot scenarios, we propose a contrastive augmentation module for token representations. Besides comparing with previous state-of-the-art methods, we also conduct comparisons with ChatGPT. Experimental results demonstrate that our method significantly outperforms all of these methods in multiple continual few-shot event detection tasks.
Submitted 26 March, 2024;
originally announced March 2024.
-
Chain of Compression: A Systematic Approach to Combinationally Compress Convolutional Neural Networks
Authors:
Yingtao Shen,
Minqing Sun,
Jie Zhao,
An Zou
Abstract:
Convolutional neural networks (CNNs) have achieved significant popularity, but their computational and memory intensity poses challenges for resource-constrained computing systems, particularly with the prerequisite of real-time performance. To relieve this burden, model compression has become an important research focus. Many approaches like quantization, pruning, early exit, and knowledge distillation have demonstrated the effect of reducing redundancy in neural networks. Upon closer examination, it becomes apparent that each approach capitalizes on its unique features to compress the neural network, and they can also exhibit complementary behavior when combined. To explore the interactions and reap the benefits from the complementary features, we propose the Chain of Compression, which works on the combinational sequence to apply these common techniques to compress the neural network. Validated on image-based regression and classification networks across different data sets, our proposed Chain of Compression can significantly reduce the computation cost by 100-1000 times with negligible accuracy loss compared with the baseline model.
Submitted 26 March, 2024;
originally announced March 2024.
-
Robust and Scalable Model Editing for Large Language Models
Authors:
Yingfa Chen,
Zhengyan Zhang,
Xu Han,
Chaojun Xiao,
Zhiyuan Liu,
Chen Chen,
Kuai Li,
Tao Yang,
Maosong Sun
Abstract:
Large language models (LLMs) can make predictions using parametric knowledge--knowledge encoded in the model weights--or contextual knowledge--knowledge presented in the context. In many scenarios, a desirable behavior is that LLMs give precedence to contextual knowledge when it conflicts with the parametric knowledge, and fall back to using their parametric knowledge when the context is irrelevant. This enables updating and correcting the model's knowledge by in-context editing instead of retraining. Previous works have shown that LLMs are inclined to ignore contextual knowledge and fail to reliably fall back to parametric knowledge when presented with irrelevant context. In this work, we discover that, with proper prompting methods, instruction-finetuned LLMs can be highly controllable by contextual knowledge and robust to irrelevant context. Utilizing this feature, we propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing. To better evaluate the robustness of model editors, we collect a new dataset that contains irrelevant questions that are more challenging than the ones in existing datasets. Empirical results show that our method outperforms current state-of-the-art methods by a large margin. Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs (and vice versa). The source code can be found at https://github.com/thunlp/EREN.
Submitted 26 March, 2024;
originally announced March 2024.
-
Stellar Spin Down in Post-Mass Transfer Binary Systems
Authors:
Meng Sun,
Seth Gossage,
Emily M. Leiner,
Aaron M. Geller
Abstract:
Motivated by measurements of the rotation speed of accretor stars in post-mass-transfer (post-MT) systems, we investigate how magnetic braking affects the spin-down of individual stars during binary evolution with the MESAbinary module. Unlike the conventional assumption of tidal synchronization coupled with magnetic braking in binaries, we first calculate whether tides are strong enough to synchronize the orbit. Subsequently, this influences the spin-down of stars and the orbital separation. In this study, we apply four magnetic braking prescriptions to reduce the spin angular momentum of the two stars throughout the entire binary evolution simulation. Our findings reveal that despite magnetic braking causing continuous spin-down of the accretor, when the donor begins to transfer material onto the accretor, the accretor can rapidly spin up to its critical rotation rate. After MT, magnetic braking becomes more important in affecting the angular momentum evolution of the stars. Post-MT accretor stars thus serve as a valuable testbed for observing how the magnetic braking prescriptions operate in spinning down stars from their critical rotation, including the saturation regimes of the magnetic braking. The rotation rate of the accretor star, combined with its mass, could provide age information since the cessation of MT. By comparing the models against observation, the magnetic braking prescription by Garraffo et al. (2018b) is found to better align with the rotation data of post-MT accretors.
Submitted 21 May, 2024; v1 submitted 25 March, 2024;
originally announced March 2024.
-
Plaintext-Free Deep Learning for Privacy-Preserving Medical Image Analysis via Frequency Information Embedding
Authors:
Mengyu Sun,
Ziyuan Yang,
Maosong Ran,
Zhiwen Wang,
Hui Yu,
Yi Zhang
Abstract:
In the fast-evolving field of medical image analysis, Deep Learning (DL)-based methods have achieved tremendous success. However, these methods require plaintext data for training and inference stages, raising privacy concerns, especially in the sensitive area of medical data. To tackle these concerns, this paper proposes a novel framework that uses surrogate images for analysis, eliminating the need for plaintext images. This approach is called Frequency-domain Exchange Style Fusion (FESF). The framework includes two main components: the Image Hidden Module (IHM) and the Image Quality Enhancement Module (IQEM). The IHM operates in the frequency domain, blending the features of plaintext medical images into host medical images, and then combines this with the IQEM to improve and create surrogate images effectively. During the diagnostic model training process, only surrogate images are used, enabling anonymous analysis without any plaintext data during both training and inference stages. Extensive evaluations demonstrate that our framework effectively preserves the privacy of medical images and maintains diagnostic accuracy of DL models at a relatively high level, proving its effectiveness across various datasets and DL-based models.
Submitted 25 March, 2024;
originally announced March 2024.
-
Range-Angle Estimation for FDA-MIMO System With Frequency Offset
Authors:
Mengjiang Sun,
Peng Chen,
Zhenxin Cao
Abstract:
Frequency diverse array multiple-input multiple-output (FDA-MIMO) radar differs from the traditional phased array (PA) radar: it can form a range-angle-dependent beampattern and differentiate between closely spaced targets sharing the same angle but occupying distinct range cells. In the FDA-MIMO radar, target range estimation is achieved by employing a subtle frequency variation between adjacent array antennas, so the estimation performance is degraded severely in a practical scenario with frequency offset. In this paper, the range-angle estimation problem for FDA-MIMO radar is considered with frequency offsets in both transmitting and receiving arrays. First, we build a system model for the FDA-MIMO radar with transmitting and receiving frequency offsets. Then, the frequency offset is transferred into an equalized additional noise. The noise characteristics are analyzed in detail theoretically, together with the influence on the range-angle estimation. Moreover, since the effect of the transmitting frequency offset is similar to additional colored noise, denoising algorithms are introduced to mitigate the performance deterioration caused by the frequency offset. Finally, Cramér-Rao lower bounds (CRLBs) for the range-angle estimation are derived in the scenario with the frequency offsets. Simulation results show the analysis of frequency offset and the corresponding estimation performance using different algorithms.
Submitted 22 March, 2024;
originally announced March 2024.
-
A LiDAR-Aided Channel Model for Vehicular Intelligent Sensing-Communication Integration
Authors:
Ziwei Huang,
Lu Bai,
Mingran Sun,
Xiang Cheng
Abstract:
In this paper, a novel channel modeling approach, named light detection and ranging (LiDAR)-aided geometry-based stochastic modeling (LA-GBSM), is developed. Based on the developed LA-GBSM approach, a new millimeter wave (mmWave) channel model for sixth-generation (6G) vehicular intelligent sensing-communication integration is proposed, which can support the design of intelligent transportation systems (ITSs). The proposed LA-GBSM is accurately parameterized under high, medium, and low vehicular traffic density (VTD) conditions via a sensing-communication simulation dataset with LiDAR point clouds and scatterer information for the first time. Specifically, by detecting dynamic vehicles and static buildings/trees through LiDAR point clouds via machine learning, scatterers are divided into static and dynamic scatterers. Furthermore, statistical distributions of parameters, e.g., distance, angle, number, and power, related to static and dynamic scatterers are quantified under high, medium, and low VTD conditions. To mimic channel non-stationarity and consistency, based on the quantified statistical distributions, a new visibility region (VR)-based algorithm in consideration of newly generated static/dynamic scatterers is developed. Key channel statistics are derived and simulated. By comparing simulation results and ray-tracing (RT)-based results, the utility of the proposed LA-GBSM is verified.
Submitted 21 March, 2024;
originally announced March 2024.
-
A system capable of verifiably and privately screening global DNA synthesis
Authors:
Carsten Baum,
Jens Berlips,
Walther Chen,
Hongrui Cui,
Ivan Damgard,
Jiangbin Dong,
Kevin M. Esvelt,
Mingyu Gao,
Dana Gretton,
Leonard Foner,
Martin Kysel,
Kaiyi Zhang,
Juanru Li,
Xiang Li,
Omer Paneth,
Ronald L. Rivest,
Francesca Sage-Ling,
Adi Shamir,
Yue Shen,
Meicen Sun,
Vinod Vaikuntanathan,
Lynn Van Hauwe,
Theia Vogel,
Benjamin Weinstein-Raun,
Yun Wang
, et al. (5 additional authors not shown)
Abstract:
Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be used to ignite a pandemic. There are three complications. First, we don't need to quickly update printers to deal with newly discovered currencies, whereas we regularly learn of new viruses and other biological threats. Second, anti-counterfeiting specifications on a local printer can't be extracted and misused by malicious actors, unlike information on biological threats. Finally, any screening must keep the inspected DNA sequences private, as they may constitute valuable trade secrets. Here we describe SecureDNA, a free, privacy-preserving, and fully automated system capable of verifiably screening all DNA synthesis orders of 30+ base pairs against an up-to-date database of hazards, and its operational performance and specificity when applied to 67 million base pairs of DNA synthesized by providers in the United States, Europe, and China.
Submitted 20 March, 2024;
originally announced March 2024.
-
VL-Mamba: Exploring State Space Models for Multimodal Learning
Authors:
Yanyuan Qiao,
Zheng Yu,
Longteng Guo,
Sihan Chen,
Zijia Zhao,
Mingzhen Sun,
Qi Wu,
Jing Liu
Abstract:
Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the inherent attention mechanism in their Transformer structure requires quadratic complexity and results in expensive computational overhead. Therefore, in this work, we propose VL-Mamba, a multimodal large language model based on state space models, which have been shown to have great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the Transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model. Then, we empirically explore how to effectively apply the 2D vision selective scan mechanism for multimodal learning and the combinations of different vision encoders and variants of pre-trained Mamba language models. Extensive experiments on diverse multimodal benchmarks show competitive performance, supporting the effectiveness of our proposed VL-Mamba and demonstrating the great potential of applying state space models for multimodal learning tasks.
Submitted 20 March, 2024;
originally announced March 2024.
-
VideoBadminton: A Video Dataset for Badminton Action Recognition
Authors:
Qi Li,
Tzu-Chen Chiu,
Hsiang-Wei Huang,
Min-Te Sun,
Wei-Shinn Ku
Abstract:
In the dynamic and evolving field of computer vision, action recognition has become a key focus, especially with the advent of sophisticated methodologies like Convolutional Neural Networks (CNNs), Convolutional 3D, Transformer, and spatial-temporal feature fusion. These technologies have shown promising results on well-established benchmarks but face unique challenges in real-world applications, particularly in sports analysis, where the precise decomposition of activities and the distinction of subtly different actions are crucial. Existing datasets like UCF101, HMDB51, and Kinetics have offered a diverse range of video data for various scenarios. However, there's an increasing need for fine-grained video datasets that capture detailed categorizations and nuances within broader action categories. In this paper, we introduce the VideoBadminton dataset derived from high-quality badminton footage. Through an exhaustive evaluation of leading methodologies on this dataset, this study aims to advance the field of action recognition, particularly in badminton sports. The introduction of VideoBadminton could not only serve for badminton action recognition but also provide a dataset for recognizing fine-grained actions. The insights gained from these evaluations are expected to catalyze further research in action comprehension, especially within sports contexts.
Submitted 18 March, 2024;
originally announced March 2024.
-
Consistency Model is an Effective Posterior Sample Approximation for Diffusion Inverse Solvers
Authors:
Tongda Xu,
Ziran Zhu,
Jian Li,
Dailan He,
Yuanyuan Wang,
Ming Sun,
Ling Li,
Hongwei Qin,
Yan Wang,
Jingjing Liu,
Ya-Qin Zhang
Abstract:
Diffusion Inverse Solvers (DIS) are designed to sample from the conditional distribution $p_θ(X_0|y)$, with a predefined diffusion model $p_θ(X_0)$, an operator $f(\cdot)$, and a measurement $y=f(x'_0)$ derived from an unknown image $x'_0$. Existing DIS estimate the conditional score function by evaluating $f(\cdot)$ with an approximated posterior sample drawn from $p_θ(X_0|X_t)$. However, most prior approximations rely on the posterior means, which may not lie in the support of the image distribution and may therefore diverge from the appearance of genuine images. Such out-of-support samples may significantly degrade the performance of the operator $f(\cdot)$, particularly when it is a neural network. In this paper, we introduce a novel approach for posterior approximation that is guaranteed to generate valid samples within the support of the image distribution, and that also enhances compatibility with neural network-based operators $f(\cdot)$. We first demonstrate that the solution of the Probability Flow Ordinary Differential Equation (PF-ODE) with an initial value $x_t$ yields an effective posterior sample from $p_θ(X_0|X_t=x_t)$. Based on this observation, we adopt the Consistency Model (CM), which is distilled from the PF-ODE, for posterior sampling. Furthermore, we design a novel family of DIS using only CM. Through extensive experiments, we show that our proposed method for posterior sample approximation substantially enhances the effectiveness of DIS for neural network operators $f(\cdot)$ (e.g., in semantic segmentation). Additionally, our experiments demonstrate the effectiveness of the new CM-based inversion techniques. The source code is provided in the supplementary material.
Submitted 1 June, 2024; v1 submitted 8 February, 2024;
originally announced March 2024.
-
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Authors:
Ruyi Xu,
Yuan Yao,
Zonghao Guo,
Junbo Cui,
Zanlin Ni,
Chunjiang Ge,
Tat-Seng Chua,
Zhiyuan Liu,
Maosong Sun,
Gao Huang
Abstract:
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% of the inference computation, and achieves a 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
Submitted 18 March, 2024;
originally announced March 2024.
-
CasSR: Activating Image Power for Real-World Image Super-Resolution
Authors:
Haolan Chen,
Jinhua Hao,
Kai Zhao,
Kun Yuan,
Ming Sun,
Chao Zhou,
Wei Hu
Abstract:
The objective of image super-resolution is to generate clean and high-resolution images from degraded versions. Recent advancements in diffusion modeling have led to the emergence of various image super-resolution techniques that leverage pretrained text-to-image (T2I) models. Nevertheless, due to the prevalent severe degradation in low-resolution images and the inherent characteristics of diffusion models, achieving high-fidelity image restoration remains challenging. Existing methods often exhibit issues including semantic loss, artifacts, and the introduction of spurious content not present in the original image. To tackle this challenge, we propose Cascaded diffusion for Super-Resolution, CasSR, a novel method designed to produce highly detailed and realistic images. In particular, we develop a cascaded controllable diffusion model that aims to optimize the extraction of information from low-resolution images. This model generates a preliminary reference image to facilitate initial information extraction and degradation mitigation. Furthermore, we propose a multi-attention mechanism to enhance the T2I model's capability in maximizing the restoration of the original image content. Through a comprehensive blend of qualitative and quantitative analyses, we substantiate the efficacy and superiority of our approach.
Submitted 17 March, 2024;
originally announced March 2024.
-
Neural-network density functional theory
Authors:
Yang Li,
Zechen Tang,
Zezhou Chen,
Minghui Sun,
Boheng Zhao,
He Li,
Honggeng Tao,
Zilong Yuan,
Wenhui Duan,
Yong Xu
Abstract:
Deep-learning density functional theory (DFT) shows great promise to significantly accelerate material discovery and potentially revolutionize materials research, which demands a close combination between neural networks and DFT computation. However, current research in this field primarily relies on supervised learning, making the developments of neural networks and DFT isolated from each other. In this work, we present a theoretical framework of neural-network DFT, which unifies the optimization of neural networks with the variational computation of DFT, enabling physics-informed unsupervised learning. Moreover, we develop a differential DFT code incorporated with deep-learning DFT Hamiltonian, and introduce algorithms of automatic differentiation and backpropagation to DFT, demonstrating the concept of neural-network DFT. The advanced neural-network architecture not only surpasses conventional approaches in accuracy and efficiency, but also offers a new paradigm for developing deep-learning DFT methods.
Submitted 17 March, 2024;
originally announced March 2024.
-
CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement
Authors:
Qiang Zhu,
Jinhua Hao,
Yukang Ding,
Yu Liu,
Qiao Mo,
Ming Sun,
Chao Zhou,
Shuyuan Zhu
Abstract:
Recently, numerous approaches have achieved notable success in compressed video quality enhancement (VQE). However, these methods usually ignore the valuable coding priors inherently embedded in compressed videos, such as motion vectors and residual frames, which carry abundant temporal and spatial information. To remedy this problem, we propose the Coding Priors-Guided Aggregation (CPGA) network to utilize temporal and spatial information from coding priors. The CPGA mainly consists of an inter-frame temporal aggregation (ITA) module and a multi-scale non-local aggregation (MNA) module. Specifically, the ITA module aggregates temporal information from consecutive frames and coding priors, while the MNA module globally captures spatial information guided by residual frames. In addition, to facilitate research on the VQE task, we construct the Video Coding Priors (VCP) dataset, comprising 300 videos with various coding priors extracted from corresponding bitstreams. It remedies the lack of coding information in previous datasets. Experimental results demonstrate the superiority of our method compared to existing state-of-the-art methods. The code and dataset will be released at https://github.com/CPGA/CPGA.git.
Submitted 15 March, 2024;
originally announced March 2024.
-
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
Authors:
Ao Sun,
Weilin Zhao,
Xu Han,
Cheng Yang,
Zhiyuan Liu,
Chuan Shi,
Maosong Sun
Abstract:
Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40% and achieving a 1.37× speedup when training with a 128K sequence length on 32×A100.
Submitted 6 June, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
EventRPG: Event Data Augmentation with Relevance Propagation Guidance
Authors:
Mingyuan Sun,
Donghao Zhang,
Zongyuan Ge,
Jiaxu Wang,
Jia Li,
Zheng Fang,
Renjing Xu
Abstract:
The event camera, a novel bio-inspired vision sensor, has drawn a lot of attention for its low latency, low power consumption, and high dynamic range. Currently, overfitting remains a critical problem in event-based classification tasks for Spiking Neural Networks (SNNs) due to their relatively weak spatial representation capability. Data augmentation is a simple but efficient method to alleviate overfitting and improve the generalization ability of neural networks, and saliency-based augmentation methods have proven effective in the image processing field. However, there is no approach available for extracting saliency maps from SNNs. Therefore, for the first time, we present the Spiking Layer-Time-wise Relevance Propagation rule (SLTRP) and the Spiking Layer-wise Relevance Propagation rule (SLRP) so that SNNs can generate stable and accurate CAMs and saliency maps. Based on this, we propose EventRPG, which leverages relevance propagation on the spiking neural network for more efficient augmentation. Our proposed method has been evaluated on several SNN structures, achieving state-of-the-art performance in object recognition tasks including N-Caltech101 and CIFAR10-DVS, with accuracies of 85.62% and 85.55%, as well as in the action recognition task SL-Animals, with an accuracy of 91.59%. Our code is available at https://github.com/myuansun/EventRPG.
Submitted 14 March, 2024;
originally announced March 2024.
-
Broadband NIR photon upconversion generates NIR persistent luminescence for bioimaging
Authors:
Shuting Yang,
Bing Qi,
Mingzi Sun,
Wenjing Dai,
Ziyun Miao,
Wei Zheng,
Bolong Huang,
Jie Wang
Abstract:
Upconversion persistent luminescence (UCPL) phosphors that can be directly charged by near-infrared (NIR) light have gained considerable attention due to their promising applications ranging from photonics to biomedicine. However, current lanthanide-based UCPL phosphors show small absorption cross-sections and low upconversion charging efficiency. The development of UCPL phosphors is hindered by a lack of flexible upconversion charging pathways and by poor design flexibility. Herein, we discovered a new lattice defect-mediated broadband photon upconversion process and the accompanying NIR-to-NIR UCPL in Cr-doped zinc gallate nanoparticles. The zinc gallate nanoparticles can be directly activated by broadband NIR light in the 700-1000 nm range to produce persistent luminescence at about 700 nm, which is also readily enhanced by rationally tailoring the lattice defects in the phosphors. The proposed UCPL phosphors achieved a signal-to-background ratio of over 200 in bioimaging by efficiently avoiding interference from autofluorescence and light scattering. Our findings report lattice defect-mediated photon upconversion for the first time, which significantly expands the horizons for the flexible design of NIR-to-NIR UCPL phosphors toward broad applications.
Submitted 14 March, 2024;
originally announced March 2024.
-
Mastering Text, Code and Math Simultaneously via Fusing Highly Specialized Language Models
Authors:
Ning Ding,
Yulin Chen,
Ganqu Cui,
Xingtai Lv,
Weilin Zhao,
Ruobing Xie,
Bowen Zhou,
Zhiyuan Liu,
Maosong Sun
Abstract:
Underlying data distributions of natural language, programming code, and mathematical symbols vary vastly, presenting a complex challenge for large language models (LLMs) that strive to achieve high performance across all three domains simultaneously. Achieving a very high level of proficiency for an LLM within a specific domain often requires extensive training with relevant corpora, which is typically accompanied by a sacrifice in performance in other domains. In this paper, we propose to directly fuse models that are already highly specialized. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics. A token-level gating mechanism is introduced to blend the specialists' outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model could simultaneously achieve mastery of the three crucial domains.
Submitted 26 March, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Representing Molecules as Random Walks Over Interpretable Grammars
Authors:
Michael Sun,
Minghao Guo,
Weize Yuan,
Veronika Thost,
Crystal Elaine Owens,
Aristotle Franklin Grosz,
Sharvaa Selvan,
Katelyn Zhou,
Hassan Mohiuddin,
Benjamin J Pedretti,
Zachary P Smith,
Jie Chen,
Wojciech Matusik
Abstract:
Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples that are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space, with motifs as the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability.
Submitted 2 June, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
The intermittently-resonant coevolution of migrating planets and their pulsating stars
Authors:
Jared Bryan,
Julien de Wit,
Meng Sun,
Zoe L. de Beurs,
Richard H. D. Townsend
Abstract:
Hot Jupiters are expected to form far from their host star and move toward close-in, circular orbits via a smooth, monotonic decay due to mild and constant tidal dissipation. Yet, three systems have recently been found exhibiting planet-induced stellar pulsations suggesting unexpectedly strong tidal interactions. Here we combine stellar evolution and tide models to show that dynamical tides raised by eccentric gas giants can give rise to chains of resonance locks with multiple modes, enriching the dynamics seen in single-mode resonance locking of circularized systems. These series of resonance locks yield orders-of-magnitude larger changes in eccentricity and harmonic pulsations relative to those expected from a single episode of resonance locking or nonresonant tidal interactions. Resonances become more frequent as a star evolves off the main sequence providing an alternative explanation to the origin of some stellar pulsators and yielding the concept of "dormant migrating giants". Evolution trajectories are characterized by competing episodes of inward/outward migration and spin-up/-down of the star which are sensitive to the system parameters, revealing a new challenge in modeling migration paths and in contextualizing the observed populations of giant exoplanets and stellar binaries. This sensitivity however offers a new window to constrain the stellar properties of planetary hosts via tidal asteroseismology.
Submitted 12 March, 2024;
originally announced March 2024.
-
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Authors:
Haokun Lin,
Haoli Bai,
Zhili Liu,
Lu Hou,
Muyi Sun,
Linqi Song,
Ying Wei,
Zhenan Sun
Abstract:
Vision-language pre-trained models have achieved impressive performance on various downstream tasks. However, their large model sizes hinder their utilization on platforms with limited computational resources. We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance. Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks. In this paper, we first propose the Module-wise Pruning Error (MoPE) metric, accurately assessing CLIP module importance by performance decline on cross-modal tasks. Using the MoPE metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training, MoPE-CLIP effectively leverages knowledge from the teacher model, significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning, consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric, and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods.
Submitted 12 March, 2024;
originally announced March 2024.
-
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Authors:
Zhicheng Guo,
Sijie Cheng,
Hao Wang,
Shihao Liang,
Yujia Qin,
Peng Li,
Zhiyuan Liu,
Maosong Sun,
Yang Liu
Abstract:
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which are complementary in alleviating changes in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator system.
Submitted 19 June, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
To Be or not to Be: the role of rotation in modeling Galactic Be X-ray Binaries
Authors:
Kyle Akira Rocha,
Vicky Kalogera,
Zoheyr Doctor,
Jeff J. Andrews,
Meng Sun,
Seth Gossage,
Simone S. Bavera,
Tassos Fragos,
Konstantinos Kovlakas,
Matthias U. Kruckow,
Devina Misra,
Philipp M. Srivastava,
Zepei Xing,
Emmanouil Zapartas
Abstract:
Be X-ray binaries (Be-XRBs) are crucial in understanding high-mass X-ray binaries, featuring a rapidly rotating Be star and a neutron star companion in an eccentric orbit, intermittently accreting material from the Be star's decretion disk. Originating from binary stellar evolution, Be-XRBs are of significant interest to binary population synthesis (BPS) studies, encapsulating the physics of supernovae, common envelope, and mass transfer (MT). Using the POSYDON BPS code, employing pre-computed grids of detailed binary stellar evolution models, we investigate the Galactic Be-XRB population. POSYDON incorporates stellar rotation self-consistently during MT phases, enabling a detailed examination of the rotational distribution of Be stars. Our fiducial BPS and Be-XRB model align well with the orbital properties of Galactic Be-XRBs, emphasizing the role of rotational constraints. Our modeling reveals a bimodal rotational distribution of Be-XRB-like systems, in excellent agreement with literature values. All Be-XRBs undergo an MT phase before the first compact object forms, with over half experiencing a second MT phase from a stripped helium companion (Case BB). Computing rotationally-limited MT efficiencies and applying them to our population, we find that the majority of Be-XRBs have undergone highly non-conservative MT (beta ~ 0.15). Our study underscores the importance of detailed angular momentum modeling during MT in interpreting Be-XRB populations, emphasizing this population as a key probe for the stability and efficiency of MT in interacting binaries.
Submitted 11 March, 2024;
originally announced March 2024.
-
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Authors:
Changyue Liao,
Mo Sun,
Zihan Yang,
Kaiqi Chen,
Binhang Yuan,
Fei Wu,
Zeke Wang
Abstract:
Recent advances in large language models have brought immense value to the world, with their superior capabilities stemming from the massive number of parameters they utilize. However, even the GPUs with the highest memory capacities, currently peaking at 80GB, are far from sufficient to accommodate these vast parameters and their associated optimizer states when conducting stochastic gradient descent-based optimization. One approach to hosting such huge models is to aggregate device memory from many GPUs. However, this approach introduces prohibitive costs for most academic researchers, who always have a limited budget for many high-end GPU servers. In this paper, we focus on huge model fine-tuning on a single, even low-end, GPU in a commodity server, which is accessible to most AI researchers. In such a scenario, the state-of-the-art work ZeRO-Infinity suffers from two severe issues when running in a commodity server: 1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers. To this end, we present Fuyou, a low-cost training framework that enables efficient 100B huge model fine-tuning on a low-end server with a low-end GPU and limited CPU memory capacity. The key idea is to add the SSD-CPU communication as an optimization dimension and thus carefully co-optimize computation and data swapping from a systematic approach to maximize GPU utilization. The experimental results show that 1) Fuyou is able to fine-tune 175B GPT-3 on a consumer GPU RTX 4090 with high GPU utilization, while ZeRO-Infinity fails to fine-tune; and 2) when training a small GPT-3 13B model, Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity only achieves 45 TFLOPS.
Submitted 11 March, 2024;
originally announced March 2024.
-
FARPLS: A Feature-Augmented Robot Trajectory Preference Labeling System to Assist Human Labelers' Preference Elicitation
Authors:
Hanfang Lyu,
Yuanchen Bai,
Xin Liang,
Ujaan Das,
Chuhan Shi,
Leiliang Gong,
Yingchi Li,
Mingfei Sun,
Ming Ge,
Xiaojuan Ma
Abstract:
Preference-based learning aims to align robot task objectives with human values. One of the most common methods to infer human preferences is by pairwise comparisons of robot task trajectories. Traditional comparison-based preference labeling systems seldom help labelers digest and identify critical differences between complex trajectories recorded in videos. Our formative study (N = 12) suggests that individuals may overlook non-salient task features and establish biased preference criteria during their preference elicitation process because of partial observations. In addition, they may experience mental fatigue when given many pairs to compare, causing their label quality to deteriorate. To mitigate these issues, we propose FARPLS, a Feature-Augmented Robot trajectory Preference Labeling System. FARPLS highlights potential outliers in a wide variety of task features that matter to humans and extracts the corresponding video keyframes for easy review and comparison. It also dynamically adjusts the labeling order according to users' familiarities, difficulties of the trajectory pair, and level of disagreements. At the same time, the system monitors labelers' consistency and provides feedback on labeling progress to keep labelers engaged. A between-subjects study (N = 42, 105 pairs of robot pick-and-place trajectories per person) shows that FARPLS can help users establish preference criteria more easily and notice more relevant details in the presented trajectories than the conventional interface. FARPLS also improves labeling consistency and engagement, mitigating challenges in preference elicitation without raising cognitive loads significantly.
Submitted 10 March, 2024;
originally announced March 2024.
-
Hypothesis testing for homogenous of nodes in $β$-models
Authors:
Kang Fu,
Jianwei Hu,
Meng Sun
Abstract:
The $β$-model has been extensively utilized to model degree heterogeneity in networks, wherein each node is assigned a unique parameter. In this article, we consider the hypothesis testing problem that two nodes $i$ and $j$ of a $β$-model have the same node parameter. We prove that the null distribution of the proposed statistic converges in distribution to the standard normal distribution. Further, we investigate the homogeneous test for $β$-model by combining individual $p$-values to aggregate small effects of multiple tests. Both simulation studies and real-world data examples indicate that the proposed method works well.
Submitted 9 March, 2024;
originally announced March 2024.
-
Reply with Sticker: New Dataset and Model for Sticker Retrieval
Authors:
Bin Liang,
Bingbing Wang,
Zhixin Bai,
Qiwei Lang,
Mingwei Sun,
Kaiheng Hou,
Kam-Fai Wong,
Ruifeng Xu
Abstract:
Using stickers in online chatting is very prevalent on social media platforms, where the stickers used in the conversation can express someone's intention/emotion/attitude in a vivid, tactful, and intuitive way. Existing sticker retrieval research typically retrieves stickers based on context and the current utterance delivered by the user. That is, the stickers serve as a supplement to the current utterance. However, in the real-world scenario, using stickers to express what we want to say rather than as a supplement to our words only is also important. Therefore, in this paper, we create a new dataset for sticker retrieval in conversation, called StickerInt, where stickers are used to reply to previous conversations or supplement our words. Based on the created dataset, we present a simple yet effective framework for sticker retrieval in conversation based on the learning of intention and the cross-modal relationships between conversation context and stickers, coined as Int-RA. Specifically, we first devise a knowledge-enhanced intention predictor to introduce the intention information into the conversation representations. Subsequently, a relation-aware sticker selector is devised to retrieve the response sticker via cross-modal relationships. Extensive experiments on the created dataset show that the proposed model achieves state-of-the-art performance in sticker retrieval.
Submitted 8 March, 2024;
originally announced March 2024.
-
ChatUIE: Exploring Chat-based Unified Information Extraction using Large Language Models
Authors:
Jun Xu,
Mengshu Sun,
Zhiqiang Zhang,
Jun Zhou
Abstract:
Recent advancements in large language models have shown impressive performance in general chat. However, their domain-specific capabilities, particularly in information extraction, have certain limitations. Extracting structured information from natural language that deviates from known schemas or instructions has proven challenging for previous prompt-based methods. This motivated us to explore domain-specific modeling in chat-based language models as a solution for extracting structured information from natural language. In this paper, we present ChatUIE, an innovative unified information extraction framework built upon ChatGLM. Simultaneously, reinforcement learning is employed to improve and align various tasks that involve confusing and limited samples. Furthermore, we integrate generation constraints to address the issue of generating elements that are not present in the input. Our experimental results demonstrate that ChatUIE can significantly improve the performance of information extraction with a slight decrease in chatting ability.
Submitted 8 March, 2024;
originally announced March 2024.
-
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
Authors:
Yunpeng Qu,
Kun Yuan,
Kai Zhao,
Qizhi Xie,
Jinhua Hao,
Ming Sun,
Chao Zhou
Abstract:
Diffusion-based methods, endowed with a formidable generative prior, have received increasing attention in Image Super-Resolution (ISR) recently. However, as low-resolution (LR) images often undergo severe degradation, it is challenging for ISR models to perceive the semantic and degradation information, resulting in restoration images with incorrect content or unrealistic artifacts. To address these issues, we propose a Cross-modal Priors for Super-Resolution (XPSR) framework. Within XPSR, to acquire precise and comprehensive semantic conditions for the diffusion model, cutting-edge Multimodal Large Language Models (MLLMs) are utilized. To facilitate better fusion of cross-modal priors, a Semantic-Fusion Attention is proposed. To distill semantic-preserved information instead of undesired degradations, a Degradation-Free Constraint is attached between LR and its high-resolution (HR) counterpart. Quantitative and qualitative results show that XPSR is capable of generating high-fidelity and high-realism images across synthetic and real-world datasets. Codes will be released at https://github.com/qyp2000/XPSR.
Submitted 7 March, 2024;
originally announced March 2024.
-
Effectiveness Assessment of Recent Large Vision-Language Models
Authors:
Yao Jiang,
Xinyu Yan,
Ge-Peng Ji,
Keren Fu,
Meijun Sun,
Huan Xiong,
Deng-Ping Fan,
Fahad Shahbaz Khan
Abstract:
The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the model's effectiveness in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these novel models. To gauge their effectiveness in specialized tasks, we employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial. These six tasks include salient/camouflaged/transparent object detection, as well as polyp detection, skin lesion detection, and industrial anomaly detection. We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks. Moreover, we conduct empirical investigations utilizing the aforementioned LVLMs together with GPT-4V, assessing their multi-modal understanding capabilities in general tasks including object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these LVLMs demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deep into this inadequacy and uncover several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope that this study can provide useful insights for the future development of LVLMs, helping researchers improve LVLMs for both general and specialized applications.
Submitted 11 June, 2024; v1 submitted 7 March, 2024;
originally announced March 2024.
-
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Authors:
Xinpeng Wang,
Shitong Duan,
Xiaoyuan Yi,
Jing Yao,
Shanlin Zhou,
Zhihua Wei,
Peng Zhang,
Dongkuan Xu,
Maosong Sun,
Xing Xie
Abstract:
Big models have achieved revolutionary breakthroughs in the field of AI, but they might also pose potential concerns. Addressing such concerns, alignment technologies were introduced to make these models conform to human preferences and values. Despite considerable advancements in the past year, various challenges lie in establishing the optimal alignment strategy, such as data cost and scalable oversight, and how to align remains an open question. In this survey paper, we comprehensively investigate value alignment approaches. We first unpack the historical context of alignment tracing back to the 1920s (where it comes from), then delve into the mathematical essence of alignment (what it is), shedding light on the inherent challenges. Following this foundation, we provide a detailed examination of existing alignment methods, which fall into three categories: Reinforcement Learning, Supervised Fine-Tuning, and In-context Learning, and demonstrate their intrinsic connections, strengths, and limitations, helping readers better understand this research area. In addition, two emerging topics, personal alignment, and multimodal alignment, are also discussed as novel frontiers in this field. Looking forward, we discuss potential alignment paradigms and how they could handle remaining challenges, prospecting where future alignment will go.
Submitted 6 March, 2024;
originally announced March 2024.
-
Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts
Authors:
Zewei Tian,
Min Sun,
Alex Liu,
Shawon Sarkar,
Jing Liu
Abstract:
This paper explores the transformative potential of computer-assisted textual analysis in enhancing instructional quality through in-depth insights from educational artifacts. We integrate Richard Elmore's Instructional Core Framework to examine how artificial intelligence (AI) and machine learning (ML) methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement. Through a comprehensive review and case studies within the Instructional Core Framework, we identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development. We unveil patterns that indicate AI/ML not only streamlines administrative tasks but also introduces novel pathways for personalized learning, providing actionable feedback for educators and contributing to a richer understanding of instructional dynamics. This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings, advocating for a balanced approach that considers ethical considerations, data quality, and the integration of human expertise.
Submitted 6 March, 2024;
originally announced March 2024.
-
Zero-Shot Cross-Lingual Document-Level Event Causality Identification with Heterogeneous Graph Contrastive Transfer Learning
Authors:
Zhitao He,
Pengfei Cao,
Zhuoran Jin,
Yubo Chen,
Kang Liu,
Zhiqiang Zhang,
Mengshu Sun,
Jun Zhao
Abstract:
Event Causality Identification (ECI) refers to the detection of causal relations between events in texts. However, most existing studies focus on sentence-level ECI with high-resource languages, leaving more challenging document-level ECI (DECI) with low-resource languages under-explored. In this paper, we propose a Heterogeneous Graph Interaction Model with Multi-granularity Contrastive Transfer Learning (GIMC) for zero-shot cross-lingual document-level ECI. Specifically, we introduce a heterogeneous graph interaction network to model the long-distance dependencies between events that are scattered over a document. Then, to improve cross-lingual transferability of causal knowledge learned from the source language, we propose a multi-granularity contrastive transfer learning module to align the causal representations across languages. Extensive experiments show our framework outperforms the previous state-of-the-art model by 9.4% and 8.2% of average F1 score on monolingual and multilingual scenarios respectively. Notably, in the multilingual scenario, our zero-shot framework even exceeds GPT-3.5 with few-shot learning by 24.3% in overall performance.
Submitted 22 March, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
TTA-Nav: Test-time Adaptive Reconstruction for Point-Goal Navigation under Visual Corruptions
Authors:
Maytus Piriyajitakonkij,
Mingfei Sun,
Mengmi Zhang,
Wei Pan
Abstract:
Robot navigation under visual corruption presents a formidable challenge. To address this, we propose a Test-time Adaptation (TTA) method, named TTA-Nav, for point-goal navigation under visual corruptions. Our "plug-and-play" method adds a top-down decoder to a pre-trained navigation model. First, the pre-trained navigation model receives a corrupted image and extracts features. Second, the top-down decoder produces a reconstruction from the high-level features extracted by the pre-trained model. Then, it feeds the reconstruction of the corrupted image back to the pre-trained model. Finally, the pre-trained model performs a forward pass again to output an action. Despite being trained solely on clean images, the top-down decoder can reconstruct cleaner images from corrupted ones without the need for gradient-based adaptation. The pre-trained navigation model with our top-down decoder significantly enhances navigation performance across almost all visual corruptions in our benchmarks. Our method improves the success rate of point-goal navigation from the state-of-the-art result of 46% to 94% on the most severe corruption. This suggests its potential for broader application in robotic visual navigation. Project page: https://sites.google.com/view/tta-nav
Submitted 14 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
Authors:
Chao Xu,
Yang Liu,
Jiazheng Xing,
Weida Wang,
Mingze Sun,
Jun Dan,
Tianxin Huang,
Siyuan Li,
Zhi-Qi Cheng,
Ying Tai,
Baigui Sun
Abstract:
In this paper, we abstract the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity diverse talking faces generation from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. To tackle the issues, we first dig out the intricate relationships among facial factors and simplify the decoupling process, tailoring a Progressive Audio Disentanglement for accurate facial geometry and semantics learning, where each stage incorporates a customized training module responsible for a specific factor. Secondly, to achieve visually diverse and audio-synchronized animation solely from input audio within a single model, we introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and semantics, as well as texture and temporal coherence between frames. In this way, we inherit high-quality diverse generation from LDMs while significantly improving their controllability at a low training cost. Extensive experiments demonstrate the flexibility and effectiveness of our method in handling this paradigm. The codes will be released at https://github.com/modelscope/facechain.
Submitted 31 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Global self$-$similarity of dense granular flow in hopper: the role of hopper width
Authors:
Changhao Li,
Xin Li,
Xianggui Chen,
Zaixin Wang,
Min Sun,
Decai Huang
Abstract:
The influence of hopper width on dense granular flow in a two-dimensional hopper is investigated through experiments and simulations. Though the flow rate remains stable for larger hopper widths, a slight reduction in hopper width results in a significant increase in flow rate for smaller hopper widths. Both Beverloo's and Janda's formulas accurately capture the relationship between the flow rate and outlet size. Flow characteristics in the regions near the outlet exhibit local self-similarity, supporting Beverloo's and Janda's principles. Moreover, global self-similarity is analysed, indicated by the transition in flow state from mass flow in regions far from the outlet to funnel flow near the outlet. The earlier occurrence of this transition tends to enhance the grain velocity and consequently increases the dense flow rate. An exponential scaling law is proposed to describe the dependencies of flow rate, grain velocity, and transition height between the mass flow pattern and the funnel flow pattern on silo width.
Submitted 20 April, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
How long will the quasar UV/optical flickering be damped?
Authors:
Shuying Zhou,
Mouyuan Sun,
Zhen-Yi Cai,
Guowei Ren,
Jun-Xian Wang,
Yongquan Xue
Abstract:
The UV/optical light curves of Active Galactic Nuclei (AGNs) are commonly described by the Damped Random Walk (DRW) model. However, the physical interpretation of the damping timescale, a key parameter in the DRW model, remains unclear. Particularly, recent observations indicate a weak dependence of the damping timescale upon both wavelength and accretion rate, clearly being inconsistent with the accretion-disk theory. In this study, we investigate the damping timescale in the framework of the Corona Heated Accretion disk Reprocessing (CHAR) model, a physical model that describes AGN variability. We find that while the CHAR model can reproduce the observed power spectral densities of the 20-year light curves for 190 sources from \cite{Stone2022}, the observed damping timescale, as well as its weak dependence on wavelength, can also be well recovered through fitting the mock light curves with DRW. We further demonstrate that such weak dependence is artificial due to the effect of inadequate durations of light curves, which leads to best-fitting damping timescales lower than the intrinsic ones. After eliminating this effect, the CHAR model indeed yields a strong dependence of the intrinsic damping timescale on the bolometric luminosity and rest-frame wavelength. Our results highlight the demand for sufficiently long light curves in AGN variability studies and important applications of the CHAR model in such studies.
Submitted 3 March, 2024;
originally announced March 2024.
-
Human Robot Pacing Mismatch
Authors:
Muchen Sun,
Peter Trautman,
Todd Murphey
Abstract:
A widely accepted explanation for robots planning overcautious or overaggressive trajectories alongside humans is that the crowd density exceeds a threshold such that all feasible trajectories are considered unsafe -- the freezing robot problem. However, even with low crowd density, the robot's navigation performance could still drop drastically when in close proximity to humans. In this work, we argue that a broader cause of suboptimal navigation performance near humans is the robot's misjudgment of the human's willingness (flexibility) to share space with others, particularly when the robot assumes the human's flexibility holds constant during interaction, a phenomenon we call human robot pacing mismatch. We show that the necessary condition for solving pacing mismatch is to model the evolution of both the robot's and the human's flexibility during decision making, a strategy called distribution space modeling. We demonstrate the advantage of distribution space coupling through an anecdotal case study and discuss the future directions of solving human robot pacing mismatch.
Submitted 3 March, 2024;
originally announced March 2024.
-
Mixed Strategy Nash Equilibrium for Crowd Navigation
Authors:
Muchen Sun,
Francesca Baldini,
Katie Hughes,
Peter Trautman,
Todd Murphey
Abstract:
Robots navigating in crowded areas should negotiate free space with humans rather than fully controlling collision avoidance, as this can lead to freezing behavior. Game theory provides a framework for the robot to reason about potential cooperation from humans for collision avoidance during path planning. In particular, the mixed strategy Nash equilibrium captures the negotiation behavior under uncertainty, making it well suited for crowd navigation. However, computing the mixed strategy Nash equilibrium is often prohibitively expensive for real-time decision-making. In this paper, we propose an iterative Bayesian update scheme over probability distributions of trajectories. The algorithm simultaneously generates a stochastic plan for the robot and probabilistic predictions of other pedestrians' paths. We prove that the proposed algorithm is equivalent to solving a mixed strategy game for crowd navigation, and the algorithm guarantees the recovery of the global Nash equilibrium of the game. We name our algorithm Bayes' Rule Nash Equilibrium (BRNE) and develop a real-time model prediction crowd navigation framework. Since BRNE is not solving a general-purpose mixed strategy Nash equilibrium but a tailored formula specifically for crowd navigation, it can compute the solution in real-time on a low-power embedded computer. We evaluate BRNE in both simulated environments and real-world pedestrian datasets. BRNE consistently outperforms non-learning and learning-based methods regarding safety and navigation efficiency. It also reaches human-level crowd navigation performance in the pedestrian dataset benchmark. Lastly, we demonstrate the practicality of our algorithm with real humans on an untethered quadruped robot with fully onboard perception and computation.
Submitted 17 June, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
Fast Ergodic Search with Kernel Functions
Authors:
Muchen Sun,
Ayush Gaggar,
Peter Trautman,
Todd Murphey
Abstract:
Ergodic search enables optimal exploration of an information distribution while guaranteeing the asymptotic coverage of the search space. However, current methods typically have exponential computational complexity in the search space dimension and are restricted to Euclidean space. We introduce a computationally efficient ergodic search method. Our contributions are twofold. First, we develop a kernel-based ergodic metric and generalize it from Euclidean space to Lie groups. We formally prove the proposed metric is consistent with the standard ergodic metric while guaranteeing linear complexity in the search space dimension. Second, we derive the first-order optimality condition of the kernel ergodic metric for nonlinear systems, which enables efficient trajectory optimization. Comprehensive numerical benchmarks show that the proposed method is at least two orders of magnitude faster than the state-of-the-art algorithm. Finally, we demonstrate the proposed algorithm with a peg-in-hole insertion task. We formulate the problem as a coverage task in the space of SE(3) and use a 30-second-long human demonstration as the prior distribution for ergodic coverage. Ergodicity guarantees the asymptotic solution of the peg-in-hole problem so long as the solution resides within the prior information distribution, which is reflected in the 100% success rate.
Submitted 3 March, 2024;
originally announced March 2024.
-
Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics
Authors:
Zhu Liu,
Cunliang Kong,
Ying Liu,
Maosong Sun
Abstract:
Large language models have achieved remarkable success in general language understanding tasks. However, as a family of generative methods with the objective of next token prediction, the semantic evolution with the depth of these models is not fully explored, unlike that of their predecessors, such as BERT-like architectures. In this paper, we specifically investigate the bottom-up evolution of lexical semantics for a popular LLM, namely Llama2, by probing its hidden states at the end of each layer using a contextualized word identification task. Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction. This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics. The conclusion is further supported by the monotonic increase in performance via the hidden states for the last meaningless symbols, such as punctuation, in the prompting strategy. Our code is available at https://github.com/RyanLiut/LLM_LexSem.
Submitted 9 June, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.