-
Light-SLAM: A Robust Deep-Learning Visual SLAM System Based on LightGlue under Challenging Lighting Conditions
Authors:
Zhiqi Zhao,
Chang Wu,
Xiaotong Kong,
Zejie Lv,
Xiaoqi Du,
Qiyan Li
Abstract:
Simultaneous Localization and Mapping (SLAM) has become a critical technology for intelligent transportation systems and autonomous robots and is widely used in autonomous driving. However, traditional methods based on hand-crafted features struggle to ensure robustness and accuracy in challenging lighting environments. Some deep learning-based methods show potential but still have significant drawbacks. To address this problem, we propose a novel hybrid system for visual SLAM based on the LightGlue deep learning network. It uses deep local feature descriptors to replace traditional hand-crafted features and a more efficient and accurate deep network to achieve fast and precise feature matching. Thus, we use the robustness of deep learning to improve the whole system. We have combined traditional geometry-based approaches to introduce a complete visual SLAM system for monocular, binocular, and RGB-D sensors. We thoroughly tested the proposed system on four public datasets: KITTI, EuRoC, TUM, and 4Season, as well as on actual campus scenes. The experimental results show that the proposed method exhibits better accuracy and robustness in adapting to low-light and strongly light-varying environments than traditional manual features and deep learning-based methods. It can also run on GPU in real time.
Submitted 10 May, 2024;
originally announced July 2024.
-
UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models
Authors:
Zhanyue Qin,
Haochuan Wang,
Deyuan Liu,
Ziyang Song,
Cunhang Fan,
Zhao Lv,
**lin Wu,
Zhen Lei,
Zhiying Tu,
Dianhui Chu,
Xiaoyan Yu,
Dianbo Sui
Abstract:
Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subsequent decisions. With large language models (LLMs) demonstrating powerful capabilities across tasks, we can't help but ask: Can Current LLMs Effectively Make Sequential Decisions? In order to answer this question, we propose the UNO Arena based on the card game UNO to evaluate the sequential decision-making capability of LLMs and explain in detail why we choose UNO. In UNO Arena, we evaluate the sequential decision-making capability of LLMs dynamically with novel metrics based on Monte Carlo methods. We set up random players, DQN-based reinforcement learning players, and LLM players (e.g. GPT-4, Gemini-pro) for comparison testing. Furthermore, in order to improve the sequential decision-making capability of LLMs, we propose the TUTRI player, which has LLMs reflect on their own actions with a summary of the game history and the game strategy. Numerous experiments demonstrate that the TUTRI player achieves a notable breakthrough in the performance of sequential decision-making compared to the vanilla LLM player.
Submitted 24 June, 2024;
originally announced June 2024.
-
Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging
Authors:
Deyuan Liu,
Zhanyue Qin,
Hairu Wang,
Zhao Yang,
Zecheng Wang,
Fangying Rong,
Qingbin Liu,
Yanchao Hao,
Xi Chen,
Cunhang Fan,
Zhao Lv,
Zhiying Tu,
Dianhui Chu,
Bo Li,
Dianbo Sui
Abstract:
While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.
Submitted 24 June, 2024;
originally announced June 2024.
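The 43.75% figure in the MKA abstract can be sanity-checked with a small worked example. The abstract does not define "compression ratio", so this sketch assumes it means the fraction of transformer layers removed by merging; the function name `layer_compression_ratio` and the count of 14 merged layers are hypothetical, chosen only because 14 of Llama3-8B's 32 layers gives exactly 43.75%.

```python
def layer_compression_ratio(total_layers: int, merged_layers: int) -> float:
    """Fraction of layers removed by merging (an assumed definition,
    not one taken from the abstract)."""
    if not 0 <= merged_layers <= total_layers:
        raise ValueError("merged_layers must be between 0 and total_layers")
    return merged_layers / total_layers


# Llama3-8B has 32 transformer layers. Merging away 14 of them is a
# hypothetical count that would reproduce the reported ratio.
ratio = layer_compression_ratio(total_layers=32, merged_layers=14)
print(f"{ratio:.2%}")
```

Under this reading, each merged layer accounts for 1/32 = 3.125% of the ratio, so only multiples of 3.125% are reachable.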
-
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Authors:
Chaojie Wang,
Yanchen Deng,
Zhiyi Lv,
Zeng Liang,
Jujie He,
Shuicheng Yan,
An Bo
Abstract:
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding the LLM decoding process with deliberative planning. By learning a plug-and-play Q-value model as a heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.
Submitted 27 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
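The Q* abstract frames multi-step reasoning as heuristic search guided by a Q-value model. A minimal best-first search sketch of that general idea follows. It is not the authors' implementation: the scoring rule `utility(state) + lam * q_value(state, action)`, every function name, and the toy task (spelling a target word one letter at a time) are assumptions made for illustration, with a hand-written function standing in for the learned Q-value model.

```python
import heapq
from typing import Callable, Iterable, Optional, Tuple

State = Tuple[str, ...]  # a partial reasoning trace: the steps taken so far


def best_first_search(
    start: State,
    expand: Callable[[State], Iterable[str]],
    is_terminal: Callable[[State], bool],
    utility: Callable[[State], float],
    q_value: Callable[[State, str], float],
    lam: float = 1.0,
    max_expansions: int = 1000,
) -> Optional[State]:
    """Repeatedly pop the highest-scoring trace and extend it by one step."""
    counter = 0  # tie-breaker so equal scores pop in insertion order
    frontier = [(0.0, counter, start)]  # heapq is a min-heap: store -score
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        for action in expand(state):
            # Accumulated utility so far plus the estimated future reward.
            score = utility(state) + lam * q_value(state, action)
            counter += 1
            heapq.heappush(frontier, (-score, counter, state + (action,)))
    return None


# Toy stand-ins (hypothetical): each "reasoning step" is one letter of TARGET.
TARGET = "abc"


def toy_expand(state: State) -> Iterable[str]:
    return "abc"


def toy_is_terminal(state: State) -> bool:
    return len(state) == len(TARGET)


def toy_utility(state: State) -> float:
    return float(sum(1 for i, s in enumerate(state) if s == TARGET[i]))


def toy_q_value(state: State, action: str) -> float:
    return 1.0 if action == TARGET[len(state)] else 0.0


result = best_first_search((), toy_expand, toy_is_terminal, toy_utility, toy_q_value)
print(result)
```

In a real system the expansion function would sample candidate next steps from an LLM and the Q-value would come from a trained model; the search loop itself would stay the same.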
-
VideoLLM-online: Online Video Large Language Model for Streaming Video
Authors:
Joya Chen,
Zhaoyang Lv,
Shiwei Wu,
Kevin Qinghong Lin,
Chenan Song,
Difei Gao,
Jia-Wei Liu,
Ziteng Gao,
Dongxing Mao,
Mike Zheng Shou
Abstract:
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built the VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.
Submitted 17 June, 2024;
originally announced June 2024.
-
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Authors:
Anbai Jiang,
Bing Han,
Zhiqiang Lv,
Yufeng Deng,
Wei-Qiang Zhang,
Xie Chen,
Yanmin Qian,
Jia Liu,
**yi Fan
Abstract:
Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.
Submitted 17 June, 2024;
originally announced June 2024.
-
Frequency-mix Knowledge Distillation for Fake Speech Detection
Authors:
Cunhang Fan,
Shunbo Dong,
Jun Xue,
Yujie Chen,
Jiangyan Yi,
Zhao Lv
Abstract:
In telephony scenarios, the fake speech detection (FSD) task to combat speech spoofing attacks is challenging. Data augmentation (DA) methods are considered effective means to address the FSD task in telephony scenarios, typically divided into time domain and frequency domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce the Freqmix knowledge distillation (FKD) to enhance model information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes a time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on the ASVspoof 2021 LA dataset, showing a 31% improvement over the baseline, and performs competitively on the ASVspoof 2021 DF dataset.
Submitted 13 June, 2024;
originally announced June 2024.
-
DIET: Customized Slimming for Incompatible Networks in Sequential Recommendation
Authors:
Kairui Fu,
Shengyu Zhang,
Zheqi Lv,
**gyuan Chen,
Jiwei Li
Abstract:
Due to the continuously improving capabilities of mobile edges, recommender systems start to deploy models on edges to alleviate network congestion caused by frequent mobile requests. Several studies have leveraged the proximity of edge-side to real-time data, fine-tuning them to create edge-specific models. Despite their significant progress, these methods require substantial on-edge computational resources and frequent network transfers to keep the model up to date. The former may disrupt other processes on the edge to acquire computational resources, while the latter consumes network bandwidth, leading to a decrease in user satisfaction. In response to these challenges, we propose a customizeD slImming framework for incompatiblE neTworks (DIET). DIET deploys the same generic backbone (potentially incompatible for a specific edge) to all devices. To minimize frequent bandwidth usage and storage consumption in personalization, DIET tailors specific subnets for each edge based on its past interactions, learning to generate slimming subnets (diets) within incompatible networks for efficient transfer. It also takes the inter-layer relationships into account, empirically reducing inference time while obtaining more suitable diets. We further explore the repeated modules within networks and propose a more storage-efficient framework, DIETING, which utilizes a single layer of parameters to represent the entire network, achieving comparably excellent performance. The experiments across four state-of-the-art datasets and two widely used models demonstrate the superior accuracy in recommendation and efficiency in transmission and storage of our framework.
Submitted 15 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Anomalous Enhancement of the Electrocatalytic Hydrogen Evolution Reaction in AuPt Nanoclusters
Authors:
Jiahui Kang,
Jan Kloppenburg,
Jiali Sheng,
Zhenyu Xu,
Kristoffer Meinander,
Hua Jiang,
Zhong-Peng Lv,
Esko I. Kauppinen,
Qiang Zhang,
Xi Chen,
Olli Ikkala,
Miguel A. Caro,
Bo Peng
Abstract:
Energy- and resource-efficient electrocatalytic water splitting is of paramount importance to enable sustainable hydrogen production. The best bulk catalyst for the hydrogen evolution reaction (HER), i.e., platinum, is one of the scarcest elements on Earth. The use of raw material for HER can be dramatically reduced by utilizing nanoclusters. In addition, nanoalloying can further improve the performance of these nanoclusters. In this paper, we present results for HER on nanometer-sized ligand-free AuPt nanoclusters grafted on carbon nanotubes. These results demonstrate excellent monodispersity and a significant reduction of the overpotential for the electrocatalytic HER. We utilize atomistic machine learning techniques to elucidate the atomic-scale origin of the synergistic effect between Pt and Au. We show that the presence of surface Au atoms, known to be poor HER catalysts, in a Pt(core)/AuPt(shell) nanocluster structure, drives an anomalous enhancement of the inherently high catalytic activity of Pt atoms.
Submitted 12 June, 2024;
originally announced June 2024.
-
Backpropogation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration
Authors:
Wei Ji,
Li Li,
Zheqi Lv,
Wenqiao Zhang,
Mengze Li,
Zhen Wan,
Wenqiang Lei,
Roger Zimmermann
Abstract:
In our increasingly interconnected world, where intelligent devices continually amass copious personalized multi-modal data, a pressing need arises to deliver high-quality, personalized device-aware services. However, this endeavor presents a multifaceted challenge to prevailing artificial intelligence (AI) systems primarily rooted in the cloud. As these systems grapple with shifting data distributions between the cloud and devices, the traditional approach of fine-tuning-based adaptation (FTA) suffers from the following issues: the costly and time-consuming data annotation required by FTA and the looming risk of model overfitting. To surmount these challenges, we introduce a Universal On-Device Multi-modal Model Adaptation Framework, revolutionizing on-device model adaptation by striking a balance between efficiency and effectiveness. The framework features the Fast Domain Adaptor (FDA) hosted in the cloud, providing tailored parameters for the Lightweight Multi-modal Model on devices. To enhance adaptability across multi-modal tasks, the AnchorFrame Distribution Reasoner (ADR) minimizes communication costs. Our contributions, encapsulated in the Cloud-Device Collaboration Multi-modal Parameter Generation (CDC-MMPG) framework, represent a pioneering solution for on-Device Multi-modal Model Adaptation (DMMA). Extensive experiments validate the efficiency and effectiveness of our method, particularly in video question answering and retrieval tasks, driving forward the integration of intelligent devices into our daily lives.
Submitted 21 May, 2024;
originally announced June 2024.
-
Coherent XUV super continuum emission from atomic bound states
Authors:
**g Zhao,
Xiaowei Wang,
Li Wang,
Jiacan Wang,
Yalei Zhu,
Fan Xiao,
Wenkai Tao,
Zhigang Zheng,
Haizhong Wu,
Xu Sun,
Yue Lang,
Congsen Meng,
Dongwen Zhang,
Zhihui Lv,
**lei Liu,
Zengxiu Zhao
Abstract:
Coherent supercontinuum radiation in the extreme-ultraviolet (XUV) range is indispensable for synthesizing attosecond light pulses and for exploring transient atomic structures. Here, we report the striking observation of a coherent XUV supercontinuum (XSC) extending from below to far above the ionization threshold, which exhibits completely different temporal and spatial properties compared to the conventional rescattering-induced high harmonic generation (HHG). We demonstrate that the strong-field-created coherence among bound orbitals strongly distorts the atomic transition energies during the pulse, leading to coherent emission spanning tens of electron-volts, in contrast to the line emission via free-induction decay occurring after the pulse. The supposedly non-radiating bound dark states contribute as well by emitting dressed energy through a dark-to-bright emission mechanism. All the processes, modulated at the sub-cycle time scale, jointly form this new type of coherent XSC. This work achieves strong-field attosecond control of the exotic atomic radiation dynamics and provides the means of simultaneous generation of separated attosecond sources, i.e., XSC and HHG, with the potential to advance attosecond interferometry.
Submitted 3 May, 2024;
originally announced May 2024.
-
Single-Spin Waved-Brim Flat-Top Hat in the Band Edge of GdIH Monolayer
Authors:
Ningning Jia,
Zhao Yang,
Jiangtao Cai,
Zhiheng Lv,
Yongting Shi,
Tielei Song,
Xin Cui,
Zhifeng Liu
Abstract:
Exotic electronic bands, such as flat bands, linear crossing bands, spontaneously valley- or spin-polarized bands, in two-dimensional materials have been hot topics in condensed matter physics. Herein, we first propose a general dispersion model for possible hat-like electronic bands, and then identify an intriguing single-spin waved-brim flat-top hat in the valence band edge of a stable ferromagnetic semiconducting electrene (i.e., Janus GdIH monolayer), which can be well described by a simplified two-bands Hamiltonian model. Specifically, the hat-band has a waved brim with six valleys along the boundary of the first Brillouin zone; meanwhile it holds a flat top close to the Fermi level, resulting in the emergence of single-spin van Hove singularities divergence and Lifshitz transitions. Owing to the breaking of both time-reversal and space inversion symmetries, a sizable spontaneous valley polarization is formed between the adjacent brim valleys, which provides the opportunity to realize the high-temperature anomalous valley Hall effect. Particularly, via modest strains and carrier doping, various conductive bipolar-states (spin-up vs. spin-down, K valley vs. -K valley, and ultra-low-speed vs. ultra-high-speed) can be modulated out from the distorted waved-brim flat-top hat of GdIH ML.
Submitted 23 April, 2024;
originally announced April 2024.
-
LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation
Authors:
Haoyu Zheng,
Wenqiao Zhang,
Yaoke Wang,
Hao Zhou,
Jiang Liu,
Juncheng Li,
Zheqi Lv,
Siliang Tang,
Yueting Zhuang
Abstract:
Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, e.g., text-conditioned image editing, allowing us to edit the diverse images that convey highly complex visual concepts according to the textual guidance. Despite being promising, existing methods focus on texture- or non-rigid-based visual manipulation, which struggles to produce the fine-grained animation of smooth text-conditioned image morphing without fine-tuning, i.e., due to their highly unstructured latent space. In this paper, we introduce a tuning-free LLM-driven attention control framework, encapsulated by the progressive process of LLM planning, prompt-Aware editing, StablE animation geneRation, abbreviated as LASER. LASER employs a large language model (LLM) to refine coarse descriptions into detailed prompts, guiding pre-trained text-to-image models for subsequent image generation. We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity and enable seamless morphing directly from text prompts, eliminating the need for additional fine-tuning or annotations. Our meticulous control over spatial features and self-attention ensures structural consistency in the images. This paper presents a novel framework integrating LLMs with text-to-image models to create high-quality animations from a single text input. We also propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER. Extensive experiments demonstrate that LASER produces impressive, consistent, and efficient results in animation generation, positioning it as a powerful tool for advanced digital content creation.
Submitted 23 April, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
Terrain Point Cloud Inpainting via Signal Decomposition
Authors:
Yizhou Xie,
Xiangning Xie,
Yuran Wang,
Yanci Zhang,
Zejun Lv
Abstract:
The rapid development of 3D acquisition technology has made it possible to obtain point clouds of real-world terrains. However, due to limitations in sensor acquisition technology or specific requirements, point clouds often contain defects such as holes with missing data. Inpainting algorithms are widely used to patch these holes. However, existing traditional inpainting algorithms rely on precise hole boundaries, which limits their ability to handle cases where the boundaries are not well-defined. On the other hand, learning-based completion methods often prioritize reconstructing the entire point cloud instead of solely focusing on hole filling. Based on the fact that real-world terrain exhibits both global smoothness and rich local detail, we propose a novel representation for terrain point clouds. This representation can help to repair the holes without clear boundaries. Specifically, it decomposes terrains into low-frequency and high-frequency components, which are represented by B-spline surfaces and relative height maps respectively. In this way, the terrain point cloud inpainting problem is transformed into a B-spline surface fitting and 2D image inpainting problem. By solving the two problems, the highly complex and irregular holes on the terrain point clouds can be well-filled, which not only satisfies the global terrain undulation but also exhibits rich geometric details. The experimental results also demonstrate the effectiveness of our method.
Submitted 4 April, 2024;
originally announced April 2024.
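The decomposition described in the terrain inpainting abstract (a smooth low-frequency component plus a high-frequency relative height map) can be illustrated on a 1D height profile. This is only a sketch under simplifying assumptions: a moving average stands in for the B-spline surface fit, the data is a made-up list of heights rather than a point cloud, and the names `moving_average` and `decompose` are hypothetical.

```python
from typing import List, Tuple


def moving_average(heights: List[float], radius: int) -> List[float]:
    """Smooth a height profile; a stand-in for the low-frequency B-spline fit."""
    n = len(heights)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = heights[lo:hi]  # window shrinks near the ends of the profile
        smoothed.append(sum(window) / len(window))
    return smoothed


def decompose(heights: List[float], radius: int = 2) -> Tuple[List[float], List[float]]:
    """Split heights into a smooth base and a residual relative height map."""
    low = moving_average(heights, radius)
    high = [h - l for h, l in zip(heights, low)]
    return low, high


# Made-up terrain profile: a gentle slope with small bumps on top.
profile = [0.0, 1.2, 1.9, 3.3, 4.0, 5.1, 5.8, 7.2]
low, high = decompose(profile)

# Adding the two components back together recovers the original heights.
reconstructed = [l + h for l, h in zip(low, high)]
print(all(abs(r - p) < 1e-9 for r, p in zip(reconstructed, profile)))
```

The point of the split is that each component can be repaired with a tool suited to it: the smooth base by surface fitting and the residual by 2D image inpainting, as the abstract describes.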
-
Transforming the Synthesis of Carbon Nanotubes with Machine Learning Models and Automation
Authors:
Yue Li,
Shurui Wang,
Zhou Lv,
Zhaoji Wang,
Yunbiao Zhao,
Ying Xie,
Yang Xu,
Liu Qian,
Yaodong Yang,
Ziqiang Zhao,
** Zhang
Abstract:
Carbon-based nanomaterials (CBNs) are showing significant potential in various fields, such as electronics, energy, and mechanics. However, their practical applications face synthesis challenges stemming from the complexities of structural control, large-area uniformity, and high yield. Current research methodologies fall short in addressing the multi-variable, coupled interactions inherent to CBNs production. Machine learning methods excel at navigating such complexities. Their integration with automated synthesis platforms has demonstrated remarkable potential in accelerating chemical synthesis research, but remains underexplored in the nanomaterial domain. Here we introduce Carbon Copilot (CARCO), an artificial intelligence (AI)-driven platform that integrates transformer-based language models tailored for carbon materials, robotic chemical vapor deposition (CVD), and data-driven machine learning models, empowering accelerated research of CBNs synthesis. Employing CARCO, we demonstrate innovative catalyst discovery by predicting a superior Titanium-Platinum bimetallic catalyst for high-density horizontally aligned carbon nanotube (HACNT) array synthesis, validated through over 500 experiments. Furthermore, with the assistance of millions of virtual experiments, we achieved an unprecedented 56.25% precision in synthesizing HACNT arrays with predetermined densities in the real world. All were accomplished within just 43 days. This work not only advances the field of HACNT arrays but also exemplifies the integration of AI with human expertise to overcome the limitations of traditional experimental approaches, marking a paradigm shift in nanomaterials research and paving the way for broader applications.
Submitted 1 April, 2024;
originally announced April 2024.
-
EgoLifter: Open-world 3D Segmentation for Egocentric Perception
Authors:
Qiao Gu,
Zhaoyang Lv,
Duncan Frost,
Simon Green,
Julian Straub,
Chris Sweeney
Abstract:
In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in ego-centric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D reconstruction. The result is a fully automatic pipeline that is able to reconstruct 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates its state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We run EgoLifter on various egocentric activity datasets which shows the promise of the method for 3D egocentric perception at scale.
Submitted 26 March, 2024;
originally announced March 2024.
-
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Authors:
Wenqiao Zhang,
Tianwei Lin,
Jiang Liu,
Fangxun Shu,
Haoyuan Li,
Lei Zhang,
He Wanggui,
Hao Zhou,
Zheqi Lv,
Hao Jiang,
Juncheng Li,
Siliang Tang,
Yueting Zhuang
Abstract:
Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model with static parameters), which shares the same parameters across tasks, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training.
Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
Submitted 20 March, 2024;
originally announced March 2024.
-
Treewidth of generalized Hamming graph, bipartite Kneser graph and generalized Petersen graph
Authors:
Yichen Wang,
Mengyu Cao,
Zequn Lv,
Mei Lu
Abstract:
Let $t$, $q$ and $n$ be positive integers. Write $[q] = \{1,2,\ldots,q\}$. The generalized Hamming graph $H(t,q,n)$ is the graph whose vertex set is the Cartesian product of $n$ copies of $[q]$ ($q\ge 2$), where two vertices are adjacent if their Hamming distance is at most $t$. In particular, $H(1,q,n)$ is the well-known Hamming graph and $H(1,2,n)$ is the hypercube. In 2006, Chandran and Kavitha described the asymptotic value of $tw(H(1,q,n))$, where $tw(G)$ denotes the treewidth of $G$. In this paper, we give the exact pathwidth of $H(t,2,n)$ and show that $tw(H(t,q,n)) = \Theta(tq^n/\sqrt{n})$ as $n$ goes to infinity. Based on these results, we show that the treewidth of the bipartite Kneser graph $BK(n,k)$ is $\binom{n}{k} - 1$ when $n$ is sufficiently large relative to $k$, and we give bounds on $tw(BK(2k+1,k))$. Moreover, we present bounds on the treewidth of the generalized Petersen graph.
Submitted 19 March, 2024;
originally announced March 2024.
-
AuG-KD: Anchor-Based Mixup Generation for Out-of-Domain Knowledge Distillation
Authors:
Zihao Tang,
Zheqi Lv,
Shengyu Zhang,
Yifan Zhou,
Xinyu Duan,
Fei Wu,
Kun Kuang
Abstract:
Due to privacy or patent concerns, a growing number of large models are released without granting access to their training data, making transferring their knowledge inefficient and problematic. In response, Data-Free Knowledge Distillation (DFKD) methods have emerged as direct solutions. However, simply adopting models derived from DFKD for real-world applications leads to significant performance degradation, due to the discrepancy between teachers' training data and real-world scenarios (student domain). The degradation stems from the portions of teachers' knowledge that are not applicable to the student domain. They are specific to the teacher domain and would undermine students' performance. Hence, selectively transferring teachers' appropriate knowledge becomes the primary challenge in DFKD. In this work, we propose a simple but effective method, AuG-KD. It utilizes an uncertainty-guided and sample-specific anchor to align student-domain data with the teacher domain and leverages a generative method to progressively trade off the learning process between OOD knowledge distillation and domain-specific information learning via mixup learning. Extensive experiments on 3 datasets and 8 settings demonstrate the stability and superiority of our approach. Code available at https://github.com/IshiKura-a/AuG-KD.
Submitted 17 March, 2024; v1 submitted 10 March, 2024;
originally announced March 2024.
-
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
Authors:
Zhengyao Lv,
Yuxiang Wei,
Wangmeng Zuo,
Kwan-Yee K. Wong
Abstract:
Recent advancements in large-scale pre-trained text-to-image models have led to remarkable progress in semantic image synthesis. Nevertheless, synthesizing high-quality images with consistent semantics and layout remains a challenge. In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Specifically, we first employ the layout control map to faithfully represent layouts in the feature space. Subsequently, we combine the layout and semantic features in a timestep-adaptive manner to synthesize images with realistic details. During fine-tuning, we propose the Semantic Alignment (SA) loss to further enhance layout alignment. Additionally, we introduce the Layout-Free Prior Preservation (LFP) loss, which leverages unlabeled data to maintain the priors of pre-trained models, thereby improving the visual quality and semantic consistency of synthesized images. Extensive experiments demonstrate that our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment. The source code and model are available at https://github.com/cszy98/PLACE/tree/main.
Submitted 4 March, 2024;
originally announced March 2024.
-
A Simple Baseline for Efficient Hand Mesh Reconstruction
Authors:
Zhishan Zhou,
Shihao Zhou,
Zhi Lv,
Minqiang Zou,
Yao Tang,
Jiajun Liang
Abstract:
3D hand pose estimation has found broad application in areas such as gesture recognition and human-machine interaction tasks. As performance improves, the complexity of the systems also increases, which can limit the comparative analysis and practical implementation of these methods. In this paper, we propose a simple yet effective baseline that not only surpasses state-of-the-art (SOTA) methods but also demonstrates computational efficiency. To establish this baseline, we abstract existing work into two components: a token generator and a mesh regressor, and then examine their core structures. A core structure, in this context, is one that fulfills intrinsic functions, brings about significant improvements, and achieves excellent performance without unnecessary complexities. Our proposed approach is decoupled from any modifications to the backbone, making it adaptable to any modern model. Our method outperforms existing solutions, achieving SOTA results across multiple datasets. On the FreiHAND dataset, our approach produced a PA-MPJPE of 5.7 mm and a PA-MPVPE of 6.0 mm. Similarly, on the DexYCB dataset, we observed a PA-MPJPE of 5.5 mm and a PA-MPVPE of 5.0 mm. In terms of speed, our method reached up to 33 frames per second (fps) when using HRNet and up to 70 fps when employing FastViT-MA36.
Submitted 4 March, 2024;
originally announced March 2024.
-
Aria Everyday Activities Dataset
Authors:
Zhaoyang Lv,
Nicholas Charron,
Pierre Moulon,
Alexander Gamino,
Cheng Peng,
Chris Sweeney,
Edward Miller,
Huixuan Tang,
Jeff Meissner,
Jing Dong,
Kiran Somasundaram,
Luis Pesqueira,
Mark Schwesinger,
Omkar Parkhi,
Qiao Gu,
Renzo De Nardi,
Shangyi Cheng,
Steve Saarinen,
Vijay Baiyya,
Yuyang Zou,
Richard Newcombe,
Jakob Julian Engel,
Xiaqing Pan,
Carl Ren
Abstract:
We present the Aria Everyday Activities (AEA) Dataset, an egocentric multimodal open dataset recorded using Project Aria glasses. AEA contains 143 daily activity sequences recorded by multiple wearers in five geographically diverse indoor locations. Each recording contains multimodal sensor data recorded through the Project Aria glasses. In addition, AEA provides machine perception data including high-frequency globally aligned 3D trajectories, scene point clouds, per-frame 3D eye gaze vectors, and time-aligned speech transcription. In this paper, we demonstrate a few exemplar research applications enabled by this dataset, including neural scene reconstruction and prompted segmentation. AEA is an open-source dataset that can be downloaded from https://www.projectaria.com/datasets/aea/. We are also providing open-source implementations and examples of how to use the dataset in Project Aria Tools https://github.com/facebookresearch/projectaria_tools.
Submitted 21 February, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation
Authors:
Zihao Tang,
Zheqi Lv,
Shengyu Zhang,
Fei Wu,
Kun Kuang
Abstract:
The rapid advancement of Large Language Models (LLMs) has revolutionized various sectors by automating routine tasks, marking a step toward the realization of Artificial General Intelligence (AGI). However, they still struggle to accommodate the diverse and specific needs of users and to simplify the utilization of AI models for the average user. In response, we propose ModelGPT, a novel framework designed to determine and generate AI models specifically tailored to the data or task descriptions provided by the user, leveraging the capabilities of LLMs. Given user requirements, ModelGPT is able to provide tailored models up to 270x faster than the previous paradigms (e.g., all-parameter or LoRA finetuning). Comprehensive experiments on NLP, CV, and tabular datasets attest to the effectiveness of our framework in making AI models more accessible and user-friendly. Our code is available at https://github.com/IshiKura-a/ModelGPT.
Submitted 18 February, 2024;
originally announced February 2024.
-
LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing
Authors:
Bryan Wang,
Yuliang Li,
Zhaoyang Lv,
Haijun Xia,
Yan Xu,
Raj Sodhi
Abstract:
Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of large language models (LLMs) into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user's footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users' creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing.
Submitted 15 February, 2024;
originally announced February 2024.
-
Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion
Authors:
Cunhang Fan,
Yujie Chen,
Jun Xue,
Yonghui Kong,
Jianhua Tao,
Zhao Lv
Abstract:
In recent years, knowledge graph completion (KGC) models based on pre-trained language models (PLMs) have shown promising results. However, the large number of parameters and high computational cost of PLMs pose challenges for their application in downstream tasks. This paper proposes a progressive distillation method based on masked generation features for the KGC task, aiming to significantly reduce the complexity of pre-trained models. Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. However, traditional feature distillation suffers from the limitation of having a single representation of information in teacher models. To solve this problem, we propose masked generation of teacher-student features, which contain richer representation information. Furthermore, there is a significant gap in representation ability between teacher and student. Therefore, we design a progressive distillation method to distill student models at each grade level, enabling efficient knowledge transfer from teachers to students. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods. Furthermore, in the progressive distillation stage, the model significantly reduces the number of parameters while maintaining a certain level of performance. Specifically, the parameters of the lower-grade student model are reduced by 56.7% compared to the baseline.
Submitted 10 June, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
A First Step Towards Runtime Analysis of Evolutionary Neural Architecture Search
Authors:
Zeqiong Lv,
Chao Qian,
Yanan Sun
Abstract:
Evolutionary neural architecture search (ENAS) employs evolutionary algorithms to find high-performing neural architectures automatically, and has achieved great success. However, in contrast to its empirical success, rigorous theoretical analysis of ENAS is still lacking. This work takes preliminary steps toward the mathematical runtime analysis of ENAS. In particular, we define a binary classification problem $\textsc{UNIFORM}$, and formulate an explicit fitness function to represent the relationship between neural architecture and classification accuracy. Furthermore, we consider the (1+1)-ENAS algorithm with mutation to optimize the neural architecture, and obtain the following runtime bounds: both the local and global mutations find the optimum in an expected runtime of $\Theta(n)$, where $n$ is the problem size. The theoretical results show that the local and global mutations achieve nearly the same performance on $\textsc{UNIFORM}$. Empirical results also verify the equivalence of these two mutation operators.
Submitted 8 April, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
An ontology alignment method with user intervention using compact differential evolution with adaptive parameter control
Authors:
Zhaoming Lv
Abstract:
User interaction is one of the most effective ways to improve ontology alignment quality. However, this approach faces the challenge of how users can participate effectively in the matching process. To address this challenge, this paper proposes an interactive ontology alignment approach using a compact differential evolution algorithm with adaptive parameter control (IOACDE). In this method, the ontology alignment process is modeled as an interactive optimization problem and users are allowed to intervene in matching in two ways. One is that the mapping suggestions generated by IOACDE as a complete candidate alignment are evaluated by the user during the optimization process. The other is that the user improves the alignment results by evaluating single mappings after the automatic matching process. To demonstrate the effectiveness of the proposed algorithm, a neural embedding model and K nearest neighbor (KNN) are employed to simulate the user on real-world ontologies. The experimental results show that the proposed interactive approach can improve the alignment quality compared to the non-interactive one. Compared with the state-of-the-art methods from OAEI, the results show that the proposed algorithm performs better under the same error rate.
Submitted 18 January, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
-
Understanding the Universal Dust Attenuation Scaling Relation of Star-Forming Galaxies
Authors:
J. Qin,
X. Z. Zheng,
S. Wuyts,
Z. Lv,
M. Qiao,
J. -S. Huang,
F. S. Liu,
A. Katsianis,
V. Gonzalez,
F. Bian,
H. Xu,
Z. Pan,
W. Liu,
Q. -H. Tan,
F. X. An,
D. D. Shi,
Y. Zhang,
R. Wen,
S. Liu,
C. Yang
Abstract:
Star-forming galaxies (SFGs) adhere to a surprisingly tight scaling relation of dust attenuation parameterized by the infrared excess (IRX=$L_{\rm IR}/L_{\rm UV}$), being jointly determined by the star formation rate (SFR), galaxy size ($R_{\rm e}$), metallicity ($Z$/Z$_\odot$) and axial ratio ($b/a$). We examine how these galaxy parameters determine the effective dust attenuation and give rise to the universal IRX relation, utilizing a simple two-component star-dust geometry model in which dust in the dense and diffuse interstellar medium (ISM) follows exponential mass density profiles, connected with but not necessarily identical to the stellar mass profiles. Meanwhile, empirical relations are adopted to link galaxy properties, including the gas--star formation relation, the dust-to-stellar size relation, as well as the dust-to-gas ratio versus metallicity relation. By fitting a large sample of local SFGs with the model, we obtain the best-fitting model parameters as a function of metallicity, showing that the two-component geometry model is able to successfully reproduce the dependence of IRX on SFR, $R_{\rm e}$, $b/a$ at given $Z$/Z$_\odot$, as well as the dependence of power-law indices on metallicity. Moreover, we also retrieve constraints on the model geometry parameters, including the optical depth of birth clouds (BCs), BC-to-total dust mass fraction, BC covering factor of UV-emitting stars, and star-to-total dust disc radius ratio, which all evolve with galaxy metallicity. Finally, a consistent picture of how the star-dust geometry in SFGs evolves with galaxy metallicity is discussed.
Submitted 30 January, 2024; v1 submitted 27 December, 2023;
originally announced December 2023.
-
Learning to Reweight for Graph Neural Network
Authors:
Zhengyu Chen,
Teng Xiao,
Kun Kuang,
Zheqi Lv,
Min Zhang,
Jinluan Yang,
Chengqiang Lu,
Hongxia Yang,
Fei Wu
Abstract:
Graph Neural Networks (GNNs) show promising results for graph tasks. However, the generalization ability of existing GNNs degrades when there are distribution shifts between testing and training graph data. The main cause of this severe degradation is that GNNs are designed under the I.I.D. assumption. In such a setting, GNNs tend to exploit subtle statistical correlations in the training set for prediction, even when those correlations are spurious. In this paper, we study the generalization ability of GNNs in Out-Of-Distribution (OOD) settings. To solve this problem, we propose the Learning to Reweight for Generalizable Graph Neural Network (L2R-GNN) to enhance generalization and achieve satisfactory performance on unseen testing graphs whose distributions differ from those of the training graphs. We propose a novel nonlinear graph decorrelation method, which can substantially improve the out-of-distribution generalization ability and compares favorably to previous methods in mitigating the over-reduced sample size problem. The variables of the graph representation are clustered based on the stability of the correlation, and the graph decorrelation method learns weights to remove correlations between the variables of different clusters rather than between any two variables. In addition, we introduce an effective stochastic algorithm based on bi-level optimization for the L2R-GNN framework, which facilitates simultaneously learning the optimal weights and GNN parameters, and avoids the overfitting problem. Experimental results show that L2R-GNN greatly outperforms baselines on various graph prediction benchmarks under distribution shifts.
Submitted 19 December, 2023;
originally announced December 2023.
-
Basic Survey Scheduling for the Wide Field Survey Telescope (WFST)
Authors:
Yan-Peng Chen,
Ji-an Jiang,
Wen-Tao Luo,
Xian Zhong Zheng,
Min Fang,
Chao Yang,
Yuan-Yu Hong,
Zong-Fei Lv
Abstract:
To improve the survey efficiency of the Wide Field Survey Telescope, we have developed a basic scheduling strategy that takes into account the telescope characteristics, observing conditions, and weather conditions at the Lenghu site. The sky area is divided into rectangular regions, referred to as 'tiles', with a size of 2.577 deg × 2.634 deg, slightly smaller than the focal area of the mosaic CCDs. These tiles are continuously filled in annuli parallel to the equator. The brightness of the sky background, which varies with the moon phase and distance from the moon, plays a significant role in determining the accessible survey fields. Approximately 50 connected tiles are grouped into one block for observation. To optimize the survey schedule, we perform simulations by taking into account the length of exposures, data readout, telescope slewing, and all relevant observing conditions. We use a greedy algorithm for scheduling optimization. Additionally, we propose a dedicated dithering pattern to cover the gaps between CCDs and the four corners of the mosaic CCD array, which are located outside of the 3 deg field of view. This dithering pattern helps to achieve relatively uniform exposure maps for the final survey outputs.
Submitted 6 December, 2023;
originally announced December 2023.
-
Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer
Authors:
Wenqiao Zhang,
Zheqi Lv,
Hao Zhou,
Jia-Wei Liu,
Juncheng Li,
Mengze Li,
Siliang Tang,
Yueting Zhuang
Abstract:
Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a new target domain by actively selecting a limited number of target data to annotate. This setting neglects the more practical scenario where training data are collected from multiple sources. This motivates us to target a new and challenging setting of knowledge transfer that extends ADA from a single source domain to multiple source domains, termed Multi-source Active Domain Adaptation (MADA). Not surprisingly, we find that most traditional ADA methods cannot work directly in such a setting, mainly due to the excessive domain gap introduced by all the source domains; as a result, their uncertainty-aware sample selection can easily become miscalibrated under the multi-domain shifts. Considering this, we propose a Dynamic integrated uncertainty valuation framework (Detective) that comprehensively considers the domain shift between the multi-source domains and the target domain to detect the informative target samples. Specifically, Detective leverages a dynamic Domain Adaptation (DA) model that learns how to adapt the model's parameters to fit the union of multi-source domains. This enables an approximate single-source domain modeling by the dynamic model. We then comprehensively measure both domain uncertainty and predictive uncertainty in the target domain to detect informative target samples using evidential deep learning, thereby mitigating uncertainty miscalibration. Furthermore, we introduce a contextual diversity-aware calculator to enhance the diversity of the selected samples. Experiments demonstrate that our solution outperforms existing methods by a considerable margin on three domain adaptation benchmarks.
Submitted 21 November, 2023;
originally announced November 2023.
-
Self-suppressed quantum diffusion and fundamental noise limit of soliton microcombs
Authors:
Xing Jin,
Zhe Lv,
Qihuang Gong,
Qi-Fan Yang
Abstract:
Quantum diffusion of soliton microcombs has long been recognized as their fundamental noise limit. Here we surpass this limit by utilizing dispersive wave dynamics in multimode microresonators. Through the recoil force provided by these dispersive waves, the quantum diffusion can be suppressed to a much lower level that forms the ultimate fundamental noise limit of soliton microcombs. Our findings enable coherence engineering of soliton microcombs in the quantum-limited regime, providing critical guidelines for using soliton microcombs to synthesize ultralow-noise microwave and optical signals.
Submitted 10 November, 2023;
originally announced November 2023.
-
Unpaired MRI Super Resolution with Contrastive Learning
Authors:
Hao Li,
Quanwei Liu,
Jianan Liu,
Xiling Liu,
Yanni Dong,
Tao Huang,
Zhihan Lv
Abstract:
Magnetic resonance imaging (MRI) is crucial for enhancing diagnostic accuracy in clinical settings. However, the inherently long scan time of MRI restricts its widespread applicability. Deep learning-based image super-resolution (SR) methods show promise in improving MRI resolution without additional cost. Due to the lack of aligned high-resolution (HR) and low-resolution (LR) MRI image pairs, unsupervised approaches are widely adopted for SR reconstruction with unpaired MRI images. However, these methods still require a substantial number of HR MRI images for training, which can be difficult to acquire. To this end, we propose an unpaired MRI SR approach that employs contrastive learning to enhance SR performance with limited HR training data. Empirical results presented in this study show significant improvements in the peak signal-to-noise ratio and structural similarity index, even when only a small number of HR images is available. These findings highlight the potential of our approach in addressing the challenge of limited HR training data, thereby contributing to the advancement of MRI in clinical applications.
△ Less
Submitted 16 February, 2024; v1 submitted 24 October, 2023;
originally announced October 2023.
-
How do the resting EEG preprocessing states affect the outcomes of postprocessing?
Authors:
Shiang Hu,
Jie Ruan,
Juan Hou,
Pedro Antonio Valdes-Sosa,
Zhao Lv
Abstract:
Many artifact removal tools and pipelines have been developed to correct EEG recordings and uncover the information underlying the waveforms. Without visual inspection by experts, preprocessing is prone to end in improper states, such as insufficiently preprocessed EEG (IPE) and excessively preprocessed EEG (EPE). However, little is known about the impact of IPE or EPE on postprocessing in the frequency, spatial, and temporal domains, particularly on spectral and functional connectivity (FC) analysis. Here, clean EEG (CE) was synthesized as the ground truth based on the New York head model and a multivariate autoregressive model. The IPE and the EPE were then simulated by injecting Gaussian noise and by removing brain activity, respectively. The impact on postprocessing was quantified as the deviation of the IPE or EPE from the CE in four temporal statistics, the multichannel power, the cross-spectra, the dispersion of source imaging, and the properties of the scalp EEG network. Lastly, an association analysis was performed between the PaLOSi metric and the trends in postprocessing as the preprocessing state evolved. This study sheds light on how postprocessing outcomes are affected by preprocessing states and suggests that PaLOSi may be an effective quality metric.
Submitted 12 December, 2023; v1 submitted 22 October, 2023;
originally announced October 2023.
-
Spectral homogeneity cross frequencies can be a quality metric for the large-scale resting EEG preprocessing
Authors:
Shiang Hu,
Jie Ruan,
Nicolas Langer,
Jorge Bosch-Bayard,
Zhao Lv,
Dezhong Yao,
Pedro Antonio Valdes-Sosa
Abstract:
Brain projects require the collection of massive electrophysiological data for longitudinal, cross-sectional, or population-level neuroscience studies. Quality metrics automatically label the data after centralized preprocessing. However, although waveform-based metrics are partially useful, they may be unreliable because they neglect the spectral profiles. Here, we detected the phenomenon of parallel log spectra (PaLOS), in which the scalp EEG power spectra on the log scale are parallel to each other, in 10% of 2549 HBN EEG recordings. This phenomenon was reproduced in 8% of 412 PMDT EEG recordings from 4 databases. We designed the PaLOS index (PaLOSi) to indicate this phenomenon by decomposing the cross-spectra at different frequencies into common principal component spaces. We found that PaLOS biophysically implies a prominently dominant dipole in the source space, which is implausible for resting EEG, and that in practice it may result from excessive preprocessing. Compared with the 1966 normative EEG cross-spectra, the HBN and PMDT EEG with PaLOS generally showed much higher pairwise electrode coherences and higher similarity of coherence-based network patterns, which contradicts the known frequency-dependent characteristic of coherence networks. We suggest that the PaLOSi should lie in the range of 0.4-0.7 for quality assurance of large resting EEG datasets.
Submitted 4 December, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection
Authors:
Cunhang Fan,
Mingming Ding,
Jianhua Tao,
Ruibo Fu,
Jiangyan Yi,
Zhengqi Wen,
Zhao Lv
Abstract:
Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. However, in actual situations, noise interference is usually present, causing significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel data flow of the clean teacher branch and the noisy student branch is designed, and interactive fusion module and response-based teacher-student paradigms are proposed to guide the training of noisy data from both the data distribution and decision-making perspectives. In the noisy student branch, speech enhancement is introduced initially for denoising, aiming to reduce the interference of strong noise. The proposed interactive fusion combines denoised features and noisy features to mitigate the impact of speech distortion and ensure consistency with the data distribution of the clean branch. The teacher-student paradigm maps the student's decision space to the teacher's decision space, enabling noisy speech to behave similarly to clean speech. Additionally, a joint training method is employed to optimize both branches for achieving global optimality. Experimental results based on multiple datasets demonstrate that the proposed method performs effectively in noisy environments and maintains its performance in cross-dataset experiments. Source code is available at https://github.com/fchest/DKDSSD.
Submitted 16 April, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Design of JiuTian Intelligent Network Simulation Platform
Authors:
Lei Zhao,
Miaomiao Zhang,
Guangyu Li,
Zhuowen Guan,
Sijia Liu,
Zhaobin Xiao,
Yuting Cao,
Zhe Lv,
Yan** Liang
Abstract:
This paper introduces the JiuTian Intelligent Network Simulation Platform, which provides wireless communication simulation data services for the Open Innovation Platform. The platform contains a series of scalable simulator functionalities, offering open services that enable users to apply reinforcement learning algorithms for model training and inference based on simulation environments and data. Additionally, it allows users to address optimization tasks in different scenarios by uploading and updating parameter configurations. The platform and its open services are presented from the perspectives of background, overall architecture, simulator, business scenarios, and future directions.
Submitted 28 September, 2023;
originally announced October 2023.
-
1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report: A Concise Pipeline for Egocentric Hand Pose Reconstruction
Authors:
Zhishan Zhou,
Zhi Lv,
Shihao Zhou,
Minqiang Zou,
Tong Wu,
Mochen Yu,
Yao Tang,
Jiajun Liang
Abstract:
This report introduces our work for the Egocentric 3D Hand Pose Estimation workshop. Using AssemblyHands, this challenge focuses on egocentric 3D hand pose estimation from a single-view image. In the competition, we adopt ViT-based backbones and a simple regressor for 3D keypoint prediction, which provide strong model baselines. We noticed that hand-object occlusions and self-occlusions degrade performance, and therefore propose a non-model method to merge multi-view results in the post-processing stage. Moreover, we use test-time augmentation and model ensembling for further improvement. We also found that public datasets and sensible preprocessing are beneficial. Our method achieved 12.21 mm MPJPE on the test dataset, taking first place in the Egocentric 3D Hand Pose Estimation challenge.
Submitted 9 October, 2023; v1 submitted 7 October, 2023;
originally announced October 2023.
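The MPJPE figure quoted in the abstract above is the mean per-joint position error: the Euclidean distance between each predicted and ground-truth 3D keypoint, averaged over joints. A minimal sketch of the metric, assuming keypoints are given as (x, y, z) tuples in millimetres; the function name `mpjpe` and the sample coordinates are illustrative and are not taken from the report:

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error between two lists of 3D keypoints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    # Euclidean distance for each joint, then the mean over all joints.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Hypothetical two-joint example (coordinates in mm).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # 2.5
```

In practice the per-joint errors would also be averaged over all test frames; the sketch covers a single hand only.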
-
Granularity at Scale: Estimating Neighborhood Socioeconomic Indicators from High-Resolution Orthographic Imagery and Hybrid Learning
Authors:
Ethan Brewer,
Giovani Valdrighi,
Parikshit Solunke,
Joao Rulff,
Yurii Piadyk,
Zhonghui Lv,
Jorge Poco,
Claudio Silva
Abstract:
Many areas of the world are without basic information on the socioeconomic well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help "fill in the gaps" where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches, a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words, estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density (R$^2$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semi-supervised approach provides a foundation for future work seeking to estimate fine-scale information from aerial imagery without the need for label data.
Submitted 18 February, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives
Authors:
Zhengyang Lv,
Mingyu Yan,
Xin Liu,
Mengyao Dong,
Xiaochun Ye,
Dongrui Fan,
Ninghui Sun
Abstract:
Graph-related applications have experienced significant growth in academia and industry, driven by the powerful representation capabilities of graphs. However, efficiently executing these applications faces various challenges, such as load imbalance and random memory access. To address these challenges, researchers have proposed various acceleration systems, including software frameworks and hardware accelerators, all of which incorporate graph pre-processing (GPP). GPP serves as a preparatory step before the formal execution of applications, involving techniques such as sampling and reordering. However, GPP execution often remains overlooked, as the primary focus is directed towards enhancing graph applications themselves. This oversight is concerning, especially considering the explosive growth of real-world graph data, where GPP becomes essential and even dominates system running overhead. Furthermore, GPP methods exhibit significant variations across devices and applications due to high customization. Unfortunately, no comprehensive work systematically summarizes GPP. To address this gap and foster a better understanding of GPP, we present a comprehensive survey dedicated to this area. We propose a double-level taxonomy of GPP, considering both algorithmic and hardware perspectives. By listing relevant works, we illustrate our taxonomy and conduct a thorough analysis and summary of diverse GPP techniques. Lastly, we discuss challenges in GPP and potential future directions.
Submitted 14 September, 2023;
originally announced September 2023.
-
DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection
Authors:
Cunhang Fan,
Hongyu Zhang,
Wei Huang,
Jun Xue,
Jianhua Tao,
Jiangyan Yi,
Zhao Lv,
Xiaopei Wu
Abstract:
Auditory Attention Detection (AAD) aims to detect the target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural networks designed for processing Euclidean data such as images. This makes it challenging to handle EEG signals, which possess non-Euclidean characteristics. To address this problem, this paper proposes a dynamical graph self-distillation (DGSD) approach for AAD, which does not require speech stimuli as input. Specifically, to effectively represent the non-Euclidean properties of EEG signals, dynamical graph convolutional networks are applied to represent the graph structure of EEG signals, which can also extract crucial features related to auditory spatial attention. In addition, to further improve AAD performance, self-distillation, consisting of feature distillation and hierarchical distillation strategies at each layer, is integrated. These strategies leverage features and classification results from the deepest network layers to guide the learning of shallow layers. Our experiments are conducted on two publicly available datasets, KUL and DTU. Under a 1-second time window, we achieve 90.0% and 79.6% accuracy on KUL and DTU, respectively. We compare our DGSD method with competitive baselines, and the experimental results indicate that DGSD not only outperforms the best reproducible baseline in detection performance but also reduces the number of trainable parameters by a factor of approximately 100.
Submitted 7 September, 2023;
originally announced September 2023.
-
Implicit Neural Representation for MRI Parallel Imaging Reconstruction
Authors:
Hao Li,
Yusheng Zhou,
Jianan Liu,
Xiling Liu,
Tao Huang,
Zhihan Lv,
Weidong Cai
Abstract:
Magnetic resonance imaging (MRI) usually faces lengthy acquisition times, prompting the exploration of strategies such as parallel imaging (PI) to alleviate this problem by periodically skipping specific K-space lines and subsequently reconstructing high-quality images from the undersampled K-space. Implicit neural representation (INR) has recently emerged as a promising deep learning technique, characterizing objects as continuous functions of spatial coordinates typically parameterized by a multilayer perceptron (MLP). In this study, we propose a novel MRI PI reconstruction method that uses INR. Our approach represents reconstructed fully-sampled images as functions of voxel coordinates and prior feature vectors from undersampled images, addressing the generalization challenges of INR. Specifically, we introduce a scale-embedded encoder to generate scale-independent, voxel-specific features from MR images across various undersampling scales. These features are then concatenated with coordinate vectors to reconstruct fully-sampled MR images, facilitating multiple-scale reconstructions. To evaluate our method's performance, we conducted experiments using publicly available MRI datasets, comparing it with alternative reconstruction techniques. Our quantitative assessment demonstrates the superiority of our proposed method.
Submitted 10 April, 2024; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Probing the Galactic halo with RR Lyrae stars -- IV. On the Oosterhoff dichotomy of RR Lyrae stars
Authors:
Shan Zhang,
Gaochao Liu,
Yang Huang,
Zongfei Lv,
Sarah Ann Bird,
Bingqiu Chen,
Huawei Zhang,
Timothy C. Beers,
Xinyi Li,
Haijun Tian,
Peng Zhang
Abstract:
We use 3653 (2661 RRab, 992 RRc) RR Lyrae stars (RRLs) with 7D (3D position, 3D velocity, and metallicity) information selected from SDSS, LAMOST, and Gaia EDR3, and divide the sample into two Oosterhoff groups (Oo I and Oo II) according to their amplitude-period behaviour in the Bailey Diagram. We present a comparative study of these two groups based on chemistry, kinematics, and dynamics. We find that Oo I RRLs are relatively more metal rich, with predominantly radially dominated orbits and large eccentricities, while Oo II RRLs are relatively more metal poor, and have mildly radially dominated orbits. The Oosterhoff dichotomy of the Milky Way's halo is more apparent for the inner-halo region than for the outer-halo region. We also search for this phenomenon in the halos of the two largest satellite galaxies, the Large and Small Magellanic Clouds (LMC, SMC), and compare over different bins in metallicity. We find that the Oosterhoff dichotomy is not immutable, and varies based on position in the Galaxy and from galaxy to galaxy. We conclude that the Oosterhoff dichotomy is the result of a combination of stellar and galactic evolution, and that it is much more complex than the dichotomy originally identified in Galactic globular clusters.
Submitted 12 September, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Hawkeye: Change-targeted Testing for Android Apps based on Deep Reinforcement Learning
Authors:
Chao Peng,
Zhengwei Lv,
Jiarong Fu,
Jiayuan Liang,
Zhao Zhang,
Ajitha Rajan,
** Yang
Abstract:
Android Apps are frequently updated to keep up with changing user, hardware, and business demands. Ensuring the correctness of App updates through extensive testing is crucial to avoid potential bugs reaching the end user. Existing Android testing tools generate GUI events focussing on improving the test coverage of the entire App rather than prioritising updates and their impacted elements. Recent research has proposed change-focused testing but relies on random exploration to exercise the updates and impacted GUI elements, which is ineffective and slow for large complex Apps with a huge input exploration space. We propose directed testing of App updates with Hawkeye, which prioritises executing GUI actions associated with code changes based on deep reinforcement learning from historical exploration data. Our empirical evaluation compares Hawkeye with the state-of-the-art model-based and reinforcement learning-based testing tools FastBot2 and ARES using 10 popular open-source Apps and 1 commercial App. We find that Hawkeye generates GUI event sequences targeting changed functions more reliably than FastBot2 and ARES for the open-source Apps and the large commercial App. Hawkeye achieves comparable performance on smaller open-source Apps with a more tractable exploration space. The industrial deployment of Hawkeye in the development pipeline also shows that Hawkeye is well suited to smoke testing for merge requests of a complicated commercial App.
Submitted 4 September, 2023;
originally announced September 2023.
-
Project Aria: A New Tool for Egocentric Multi-Modal AI Research
Authors:
Jakob Engel,
Kiran Somasundaram,
Michael Goesele,
Albert Sun,
Alexander Gamino,
Andrew Turner,
Arjang Talattof,
Arnie Yuan,
Bilal Souti,
Brighid Meredith,
Cheng Peng,
Chris Sweeney,
Cole Wilson,
Dan Barnes,
Daniel DeTone,
David Caruso,
Derek Valleroy,
Dinesh Ginjupalli,
Duncan Frost,
Edward Miller,
Elias Mueggler,
Evgeniy Oleinik,
Fan Zhang,
Guruprasad Somasundaram,
Gustavo Solaira
, et al. (49 additional authors not shown)
Abstract:
Egocentric, multi-modal data as available on future augmented reality (AR) devices provides unique challenges and opportunities for machine perception. These future devices will need to be all-day wearable in a socially acceptable form-factor to support always available, context-aware and personalized AI applications. Our team at Meta Reality Labs Research built the Aria device, an egocentric, multi-modal data recording and streaming device with the goal to foster and accelerate research in this area. In this paper, we describe the Aria device hardware including its sensor configuration and the corresponding software tools that enable recording and processing of such data.
Submitted 1 October, 2023; v1 submitted 24 August, 2023;
originally announced August 2023.
-
Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection
Authors:
Cunhang Fan,
Jun Xue,
Jianhua Tao,
Jiangyan Yi,
Chenglong Wang,
Chengshi Zheng,
Zhao Lv
Abstract:
The rhythm of synthetic speech is usually too smooth, which causes the fundamental frequency (F0) of synthetic speech to differ significantly from that of real speech. The F0 feature is therefore expected to contain discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 subband and thereby improve FSD performance, the spatial reconstructed local attention Res2Net (SR-LA Res2Net) is proposed. Specifically, Res2Net is used as a backbone network to obtain multiscale information, and is enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel group is constantly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that our proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among all single systems.
Submitted 19 August, 2023;
originally announced August 2023.
-
Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization
Authors:
Zezhong Lv,
Bing Su,
Ji-Rong Wen
Abstract:
Video moment localization aims to retrieve the target segment of an untrimmed video according to a natural language query. Weakly supervised methods have gained attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges for weakly supervised methods lies in the mismatch between the video and language induced by the coarse temporal annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities driven by reconstructing masked queries between positive and negative video proposals. However, the reconstruction may be influenced by the latent spurious correlation between the unmasked and the masked parts, which distorts the restoring process and further degrades the efficacy of contrastive learning since the masked words are not completely reconstructed from the cross-modality knowledge. In this paper, we discover and mitigate this spurious correlation through a novel counterfactual cross-modality reasoning method. Specifically, we first formulate query reconstruction as an aggregated causal effect of cross-modality and query knowledge. Then, by introducing counterfactual cross-modality knowledge into this aggregation, the spurious impact of the unmasked part on the reconstruction is explicitly modeled. Finally, by suppressing the unimodal effect of the masked query, we can rectify the reconstructions of video proposals to perform reasonable contrastive learning. Extensive experimental evaluations demonstrate the effectiveness of our proposed method. The code is available at https://github.com/sLdZ0306/CCR.
Submitted 14 October, 2023; v1 submitted 10 August, 2023;
originally announced August 2023.
-
Hilton-Milner theorem for bounded multisets
Authors:
Jiaqi Liao,
Zequn Lv,
Mengyu Cao,
Mei Lu
Abstract:
Let $ k, n \in \mathbb{N}^+ $ and $ m \in \mathbb{N}^+ \cup \{\infty \} $. A $ k $-multiset in $ [n]_m $ is a $ k $-set whose elements are integers from $ \{1, 2, \ldots, n\} $, and each element is allowed to have at most $ m $ repetitions. A family of $ k $-multisets in $ [n]_m $ is said to be intersecting if every pair of $ k $-multisets from the family have non-empty intersection. In this paper, we give the size and structure of the largest non-trivial intersecting family of $ k $-multisets in $ [n]_m $.
Submitted 22 May, 2024; v1 submitted 7 August, 2023;
originally announced August 2023.
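The definitions in the abstract above can be checked by brute force for small parameters. The sketch below is only an illustration and is not part of the paper: it enumerates all $k$-multisets in $[n]_m$ and tests whether a family is intersecting. The helper names `k_multisets` and `is_intersecting` are hypothetical, the case $m = \infty$ can be modelled by passing `m = k`, and `k` is assumed to be at least 1.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement


def k_multisets(n, k, m):
    """All k-multisets over {1..n} in which no element repeats more than m times."""
    return [
        Counter(c)
        for c in combinations_with_replacement(range(1, n + 1), k)
        if max(Counter(c).values()) <= m
    ]


def is_intersecting(family):
    """True if every pair of multisets in the family shares at least one element."""
    # Counter & Counter keeps the minimum multiplicity of each element,
    # so the result is empty (falsy) exactly when the two multisets are disjoint.
    return all(a & b for a, b in combinations(family, 2))


# The "star" of all 2-multisets in [3]_2 that contain the element 1.
star = [s for s in k_multisets(3, 2, 2) if s[1] > 0]
print(len(star), is_intersecting(star))  # 3 True
```

Here the star is a trivial intersecting family, since every member contains a common element; the paper concerns the largest non-trivial ones.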
-
Graph Embedding Dynamic Feature-based Supervised Contrastive Learning of Transient Stability for Changing Power Grid Topologies
Authors:
Zijian Lv,
Xin Chen,
Zijian Feng
Abstract:
Accurate online transient stability prediction is critical for ensuring power system stability when facing disturbances. However, traditional transient stability analysis relies on time-domain simulations and cannot be quickly adapted to changes in power grid topology. To vectorize high-dimensional power grid topological structure information into low-dimensional node-based graph embedding streaming data, the graph embedding dynamic feature (GEDF) has been proposed. The transient stability GEDF-based supervised contrastive learning (GEDF-SCL) model uses supervised contrastive learning to predict transient stability with GEDFs, taking power grid topology information into account. To evaluate the performance of the proposed GEDF-SCL model, power grids of varying topologies were generated based on the IEEE 39-bus system model. Transient operational data was obtained by simulating N-1 and N-$\bm{m}$-1 contingencies on these generated power system topologies. Test results demonstrate that the GEDF-SCL model can achieve high accuracy in transient stability prediction and adapt well to changing power grid topologies.
Submitted 1 August, 2023;
originally announced August 2023.
-
Multi-perspective Information Fusion Res2Net with RandomSpecmix for Fake Speech Detection
Authors:
Shunbo Dong,
Jun Xue,
Cunhang Fan,
Kang Zhu,
Yujie Chen,
Zhao Lv
Abstract:
In this paper, we propose the multi-perspective information fusion (MPIF) Res2Net with random Specmix for fake speech detection (FSD). The main purpose of this system is to improve the model's ability to learn precise forgery information for the FSD task in low-quality scenarios. Random Specmix, a data augmentation method, is intended to improve the generalization ability of the model and enhance its ability to locate discriminative information. Specmix cuts and pastes the frequency dimension information of the spectrogram within the same batch of samples without introducing other data, which helps the model to locate the truly useful information. At the same time, we randomly select samples for augmentation to reduce the impact of data augmentation directly changing all the data. Once the purpose of helping the model to locate information is achieved, it is also important to reduce unnecessary information. The role of MPIF-Res2Net is to reduce redundant interference information. Deceptive information from a single perspective is always similar, so a model learning this similar information will produce redundant spoofing clues and interfere with truly discriminative information. The proposed MPIF-Res2Net fuses information from different perspectives, making the information learned by the model more diverse, thereby reducing the redundancy caused by similar information and avoiding interference with the learning of discriminative information. The results on the ASVspoof 2021 LA dataset demonstrate the effectiveness of our proposed method, achieving an EER of 3.29% and a min-tDCF of 0.2557.
Submitted 27 June, 2023;
originally announced June 2023.