Strategies to Improve Real-World Applicability of Laparoscopic Anatomy Segmentation Models

Authors: Fiona R. Kolbinger, Jiangpeng He, Jinge Ma, Fengqing Zhu

Abstract: Accurate identification and localization of anatomical structures of varying size and appearance in laparoscopic imaging are necessary to leverage the potential of computer vision techniques for surgical decision support. Segmentation performance of such models is traditionally reported using metrics of overlap such as IoU. However, imbalanced and unrealistic representation of classes in the training data and suboptimal selection of reported metrics have the potential to skew nominal segmentation performance and thereby ultimately limit clinical translation. In this work, we systematically analyze the impact of class characteristics (i.e., organ size differences), training and test data composition (i.e., representation of positive and negative examples), and modeling parameters (i.e., foreground-to-background class weight) on eight segmentation metrics: accuracy, precision, recall, IoU, F1 score (Dice Similarity Coefficient), specificity, Hausdorff Distance, and Average Symmetric Surface Distance. Our findings support two adjustments to account for data biases in surgical data science: First, training on datasets that are similar to the clinical real-world scenarios in terms of class distribution, and second, class weight adjustments to optimize segmentation model performance with regard to metrics of particular relevance in the respective clinical setting.

Submitted 15 April, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: 14 pages, 5 figures, 4 tables; accepted for the workshop "Data Curation and Augmentation in Medical Imaging" at CVPR 2024 (archival track)
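
Illustration (not from the paper): the abstract above argues that class imbalance can skew nominal segmentation performance. The minimal sketch below computes the six count-based metrics it lists (accuracy, precision, recall, IoU, F1/Dice, specificity) for a binary mask. The function names and the 100-pixel toy masks are assumptions made for this example; the two boundary-based metrics (Hausdorff Distance, Average Symmetric Surface Distance) are not covered.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two flat binary masks of equal length."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(num, den):
    # Undefined ratios (e.g. precision with no predicted foreground) become NaN.
    return num / den if den else float("nan")


def overlap_metrics(pred, truth):
    """Return the count-based metrics named in the abstract."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "f1_dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Toy case (hypothetical): a small organ covering 4 of 100 pixels.
# The prediction finds 2 of the 4 pixels and adds 2 false positives.
truth = [1] * 4 + [0] * 96
pred = [1, 1, 0, 0] + [1, 1] + [0] * 94

metrics = overlap_metrics(pred, truth)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy and specificity stay above 0.95 because background pixels dominate, while IoU is about 0.33 and Dice is 0.5, which is the kind of gap between metrics that the abstract warns about for small structures.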

arXiv:2403.12171 [pdf, other]

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models

Authors: Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract: Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of Large Language Models (LLMs). They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11530 [pdf, other]

Continual Forgetting for Pre-trained Vision Models

Authors: Hongbo Zhao, Bolin Ni, Haochen Wang, Junsong Fan, Fei Zhu, Yuxi Wang, Yuntao Chen, Gaofeng Meng, Zhaoxiang Zhang

Abstract: For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. To address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective, parameter-efficient, data-efficient, and easy to implement. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Codes will be released on https://github.com/bjzhb666/GS-LoRA.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024
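
Illustration (not from the paper): the abstract above mentions "a simple group sparse regularization" that selects some LoRA groups and zeroes out the others, without giving its form. The sketch below shows the standard group-lasso penalty and its group soft-thresholding step as one common way such behaviour arises. Treating each LoRA group as a flat list of weights, and the names `group_sparse_penalty`, `group_soft_threshold`, and `alpha`, are assumptions for this example and may differ from the GS-LoRA implementation.

```python
import math


def group_norm(group):
    """Euclidean (L2) norm of one parameter group."""
    return math.sqrt(sum(w * w for w in group))


def group_sparse_penalty(groups, alpha):
    """Group-lasso penalty: alpha times the sum of per-group L2 norms."""
    return alpha * sum(group_norm(g) for g in groups)


def group_soft_threshold(group, threshold):
    """Proximal step for the group-lasso penalty.

    A group whose norm is at most `threshold` is set to zero as a whole;
    any other group is shrunk toward zero but keeps its direction.
    """
    norm = group_norm(group)
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]


# Three hypothetical parameter groups (stand-ins for per-layer LoRA weights).
groups = [[3.0, 4.0], [0.1, -0.1], [0.0, 2.0]]

penalty = group_sparse_penalty(groups, alpha=0.5)
shrunk = [group_soft_threshold(g, threshold=0.5) for g in groups]

print("penalty:", round(penalty, 4))
print("after thresholding:", shrunk)
```

The second group has a small norm and is removed entirely, while the other two are only scaled down. Penalizing the norm of a whole group, rather than each weight separately, is what makes entire groups switch off together.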

arXiv:2403.11518 [pdf, other]

doi 10.1088/1674-1056/ad0d9d

Optical manipulation of the topological phase in ZrTe5 revealed by time- and angle-resolved photoemission

Authors: Chaozhi Huang, Chengyang Xu, Fengfeng Zhu, Shaofeng Duan, Jianzhe Liu, Lingxiao Gu, Shichong Wang, Haoran Liu, Dong Qian, Weidong Luo, Wentao Zhang

Abstract: High-resolution time- and angle-resolved photoemission measurements were conducted on the topological insulator ZrTe5. With strong femtosecond photoexcitation, a possible ultrafast phase transition from a weak to a strong topological insulating phase was experimentally realized by recovering the energy gap inversion in a time scale that was shorter than 0.15 ps. This photoinduced transient strong topological phase can last longer than 2 ps at the highest excitation fluence studied, and it cannot be attributed to the photoinduced heating of electrons or modification of the conduction band filling. Additionally, the measured unoccupied electronic states are consistent with the first-principles calculation based on experimental crystal lattice constants, which favor a strong topological insulating phase. These findings provide new insights into the longstanding controversy about the strong and weak topological properties in ZrTe5, and they suggest that many-body effects including electron-electron interactions must be taken into account to understand the equilibrium weak topological insulating phase in ZrTe5.

Submitted 18 March, 2024; originally announced March 2024.

Journal ref: Chinese Physics B 33, 017901 (2024)

arXiv:2403.10010 [pdf, other]

doi 10.1103/PhysRevLett.132.131002

Measurements of All-Particle Energy Spectrum and Mean Logarithmic Mass of Cosmic Rays from 0.3 to 30 PeV with LHAASO-KM2A

Authors: The LHAASO Collaboration, Zhen Cao, F. Aharonian, Q. An, A. Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, et al. (256 additional authors not shown)

Abstract: We present the measurements of all-particle energy spectrum and mean logarithmic mass of cosmic rays in the energy range of 0.3-30 PeV using data collected from LHAASO-KM2A between September 2021 and December 2022, which is based on a nearly composition-independent energy reconstruction method, achieving unprecedented accuracy. Our analysis reveals the position of the knee at $3.67 \pm 0.05 \pm 0.15$ PeV. Below the knee, the spectral index is found to be -$2.7413 \pm 0.0004 \pm 0.0050$, while above the knee, it is -$3.128 \pm 0.005 \pm 0.027$, with the sharpness of the transition measured with a statistical error of 2%. The mean logarithmic mass of cosmic rays is almost heavier than helium in the whole measured energy range. It decreases from 1.7 at 0.3 PeV to 1.3 at 3 PeV, representing a 24% decline following a power law with an index of -$0.1200 \pm 0.0003 \pm 0.0341$. This is equivalent to an increase in abundance of light components. Above the knee, the mean logarithmic mass exhibits a power law trend towards heavier components, which is reversal to the behavior observed in the all-particle energy spectrum. Additionally, the knee position and the change in power-law index are approximately the same. These findings suggest that the knee observed in the all-particle spectrum corresponds to the knee of the light component, rather than the medium-heavy components.

Submitted 26 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: 8 pages, 3 figures

Journal ref: Physical Review Letters 132, 131002 (2024)

arXiv:2403.09972 [pdf, other]

Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection

Authors: Moxin Li, Wenjie Wang, Fuli Feng, Fengbin Zhu, Qifan Wang, Tat-Seng Chua

Abstract: Self-detection for Large Language Model (LLM) seeks to evaluate the LLM output trustability by leveraging LLM's own capabilities, alleviating the output hallucination issue. However, existing self-detection approaches only retrospectively evaluate answers generated by LLM, typically leading to the over-trust in incorrectly generated answers. To tackle this limitation, we propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers. It thoroughly compares the trustability of multiple candidate answers to mitigate the over-trust in LLM-generated incorrect answers. Building upon this paradigm, we introduce a two-step framework, which firstly instructs LLM to reflect and provide justifications for each candidate answer, and then aggregates the justifications for comprehensive target answer evaluation. This framework can be seamlessly integrated with existing approaches for superior self-detection. Extensive experiments on six datasets spanning three tasks demonstrate the effectiveness of the proposed framework.

Submitted 4 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: Under review

arXiv:2403.09665 [pdf, ps, other]

Characterizations of quasi-homogeneous aggregation functions

Authors: Feng-qing Zhu, Xue-ping Wang

Abstract: In this article, we first give the characterizations of quasi-homogeneous aggregation functions, which show us that quasi-homogeneous aggregation functions are classified into three classes. We then introduce the concept of triple generator of quasi-homogeneous aggregation function, which is applied to construct a quasi-homogeneous aggregation function.

Submitted 12 May, 2024; v1 submitted 11 January, 2024; originally announced March 2024.

Comments: 15

arXiv:2403.09283 [pdf]

Observation of quantum oscillations near the Mott-Ioffe-Regel limit in CaAs3

Authors: Yuxiang Wang, Minhao Zhao, Jinglei Zhang, Wenbin Wu, Shichao Li, Yong Zhang, Wenxiang Jiang, Nesta Benno Joseph, Liangcai Xu, Yicheng Mou, Yunkun Yang, Pengliang Leng, Yong Zhang, Li Pi, Alexey Suslov, Mykhaylo Ozerov, Jan Wyzula, Milan Orlita, Fengfeng Zhu, Yi Zhang, Xufeng Kou, Zengwei Zhu, Awadhesh Narayan, Dong Qian, Jinsheng Wen, et al. (3 additional authors not shown)

Abstract: The Mott-Ioffe-Regel limit sets the lower bound of carrier mean free path for coherent quasiparticle transport. Metallicity beyond this limit is of great interest because it is often closely related to quantum criticality and unconventional superconductivity. Progress along this direction mainly focuses on the strange-metal behaviors originating from the evolution of quasiparticle scattering rate such as linear-in-temperature resistivity, while the quasiparticle coherence phenomena in this regime are much less explored due to the short mean free path at the diffusive bound. Here we report the observation of quantum oscillations from Landau quantization near the Mott-Ioffe-Regel limit in CaAs3. Despite the insulator-like temperature dependence of resistivity, CaAs3 presents giant magnetoresistance and prominent Shubnikov-de Haas oscillations from Fermi surfaces, indicating highly coherent band transport. In contrast, the quantum oscillation is absent in the magnetic torque. The quasiparticle effective mass increases systematically with magnetic fields, manifesting a much larger value than the expectation given by magneto-infrared spectroscopy. It suggests a strong many-body renormalization effect near Fermi surface. We find that these unconventional behaviors may be explained by the interplay between the mobility edge and the van Hove singularity, which results in the formation of coherent cyclotron orbits emerging at the diffusive bound. Our results call for further study on the electron correlation effect of the van Hove singularity.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 18 pages, 5 figures

arXiv:2403.08377 [pdf, other]

doi 10.18653/v1/2023.emnlp-main.918

Learning to Describe for Predicting Zero-shot Drug-Drug Interactions

Authors: Fangqi Zhu, Yongqi Zhang, Lei Chen, Bing Qin, Ruifeng Xu

Abstract: Adverse drug-drug interactions (DDIs) can compromise the effectiveness of concurrent drug administration, posing a significant challenge in healthcare. As the development of new drugs continues, the potential for unknown adverse effects resulting from DDIs becomes a growing concern. Traditional computational methods for DDI prediction may fail to capture interactions for new drugs due to the lack of knowledge. In this paper, we introduce a new problem setup as zero-shot DDI prediction that deals with the case of new drugs. Leveraging textual information from online databases like DrugBank and PubChem, we propose an innovative approach TextDDI with a language model-based DDI predictor and a reinforcement learning (RL)-based information selector, enabling the selection of concise and pertinent text for accurate DDI prediction on new drugs. Empirical results show the benefits of the proposed approach on several settings including zero-shot and few-shot DDI prediction, and the selected texts are semantically relevant. Our code and data are available at https://github.com/zhufq00/DDIs-Prediction.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.07167 [pdf, other]

Stationary phase analysis of ambient noise cross-correlations: Focusing on non-ballistic arrivals

Authors: Yunyue Elita Li, Feng Zhu, Jizhong Yang

Abstract: Stacked cross-correlation functions have become ubiquitous in the ambient seismic imaging and monitoring community as approximations to the Green's function between two receivers. While theoretical understanding of this approximation to the ballistic arrivals is well established, the equivalent analysis for the non-ballistic arrivals is alarmingly inadequate compared to the exponential growth of its applications. To provide a fundamental understanding of the cross-correlation functions beyond the ballistic arrivals, we derive analytical stationary phase solutions for ambient noise cross-correlations with a focus on non-ballistic arrivals. We establish the mathematical and corresponding physical conditions that drastically differentiate the non-ballistic arrivals in the stacked cross-correlation and the actual Green's functions. In ambient noise environments, the coda waves due to random medium scatterings of an impulsive source cannot be distinguished from the cross-talk artifacts due to overlapping random noise sources. Therefore, changes in the non-ballistic arrivals cannot be uniquely attributed to changes in the medium or changes in the noise source environment without additional constraints. The theoretical results demand that interpreting large-elapse-time arrivals in the stacked cross-correlation functions as coda waves for deterministic information about the propagation medium should be conducted only after the source influence is sufficiently ruled out. Once the source influence is eliminated, the stationary phase solutions for scattering waves provide a solid basis for extracting reliable scattering information from the noise correlation functions for higher-resolution imaging and monitoring.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 22 pages, 11 figures, 1 table

arXiv:2403.06288 [pdf, other]

Probing Image Compression For Class-Incremental Learning

Authors: Justin Yang, Zhihao Duan, Andrew Peng, Yuning Huang, Jiangpeng He, Fengqing Zhu

Abstract: Image compression emerges as a pivotal tool in the efficient handling and transmission of digital images. Its ability to substantially reduce file size not only facilitates enhanced data storage capacity but also potentially brings advantages to the development of continual machine learning (ML) systems, which learn new knowledge incrementally from sequential data. Continual ML systems often rely on storing representative samples, also known as exemplars, within a limited memory constraint to maintain the performance on previously learned data. These methods are known as memory replay-based algorithms and have proven effective at mitigating the detrimental effects of catastrophic forgetting. Nonetheless, the limited memory buffer size often falls short of adequately representing the entire data distribution. In this paper, we explore the use of image compression as a strategy to enhance the buffer's capacity, thereby increasing exemplar diversity. However, directly using compressed exemplars introduces domain shift during continual ML, marked by a discrepancy between compressed training data and uncompressed testing data. Additionally, it is essential to determine the appropriate compression algorithm and select the most effective rate for continual ML systems to balance the trade-off between exemplar quality and quantity. To this end, we introduce a new framework to incorporate image compression for continual ML including a pre-processing data compression step and an efficient compression rate/algorithm selection method. We conduct extensive experiments on CIFAR-100 and ImageNet datasets and show that our method significantly improves image classification accuracy in continual ML settings.

Submitted 10 March, 2024; originally announced March 2024.

Comments: Picture Coding Symposium (PCS) 2024

arXiv:2403.05770 [pdf, other]

doi 10.1109/TPAMI.2023.3273594

Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning

Authors: Bingqian Lin, Yanxin Long, Yi Zhu, Fengda Zhu, Xiaodan Liang, Qixiang Ye, Liang Lin

Abstract: Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment. Despite significant advances, conventional VLN agents are trained typically under disturbance-free environments and may easily fail in real-world scenarios, since they are unaware of how to deal with various possible disturbances, such as sudden obstacles or human interruptions, which widely exist and may usually cause an unexpected route deviation. In this paper, we present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER) to enhance the generalization ability of existing VLN agents, by requiring them to learn towards deviation-robust navigation. Specifically, a simple yet effective path perturbation scheme is introduced to implement the route deviation, with which the agent is required to still navigate successfully following the original instruction. Since directly enforcing the agent to learn perturbed trajectories may lead to inefficient training, a progressively perturbed trajectory augmentation strategy is designed, where the agent can self-adaptively learn to navigate under perturbation with the improvement of its navigation performance for each specific trajectory. For encouraging the agent to well capture the difference brought by perturbation, a perturbation-aware contrastive learning mechanism is further developed by contrasting perturbation-free trajectory encodings and perturbation-based counterparts. Extensive experiments on R2R show that PROPER can benefit multiple VLN baselines in perturbation-free scenarios. We further collect the perturbed path data to construct an introspection subset based on the R2R, called Path-Perturbed R2R (PP-R2R). The results on PP-R2R show unsatisfying robustness of popular VLN agents and the capability of PROPER in improving the navigation robustness.

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted by TPAMI 2023

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI,2023)

arXiv:2403.04272 [pdf, other]

Active Generalized Category Discovery

Authors: Shijie Ma, Fei Zhu, Zhun Zhong, Xu-Yao Zhang, Cheng-Lin Liu

Abstract: Generalized Category Discovery (GCD) is a pragmatic and challenging open-world task, which endeavors to cluster unlabeled samples from both novel and old classes, leveraging some labeled data of old classes. Given that knowledge learned from old classes is not fully transferable to new classes, and that novel categories are fully unlabeled, GCD inherently faces intractable problems, including imbalanced classification performance and inconsistent confidence between old and new classes, especially in the low-labeling regime. Hence, some annotations of new classes are deemed necessary. However, labeling new classes is extremely costly. To address this issue, we take the spirit of active learning and propose a new setting called Active Generalized Category Discovery (AGCD). The goal is to improve the performance of GCD by actively selecting a limited amount of valuable samples for labeling from the oracle. To solve this problem, we devise an adaptive sampling strategy, which jointly considers novelty, informativeness and diversity to adaptively select novel samples with proper uncertainty. However, owing to the varied orderings of label indices caused by the clustering of novel classes, the queried labels are not directly applicable to subsequent training. To overcome this issue, we further propose a stable label mapping algorithm that transforms ground truth labels to the label space of the classifier, thereby ensuring consistent training across different active selection stages. Our method achieves state-of-the-art performance on both generic and fine-grained datasets. Our code is available at https://github.com/mashijie1028/ActiveGCD

Submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.03822 [pdf, other]

HoLens: A Visual Analytics Design for Higher-order Movement Modeling and Visualization

Authors: Zezheng Feng, Fang Zhu, Hongjun Wang, Jianing Hao, ShuangHua Yang, Wei Zeng, Huamin Qu

Abstract: Higher-order patterns reveal sequential multistep state transitions, which are usually superior to origin-destination analysis, which depicts only first-order geospatial movement patterns. Conventional methods for higher-order movement modeling first construct a directed acyclic graph (DAG) of movements, then extract higher-order patterns from the DAG. However, DAG-based methods heavily rely on the identification of movement keypoints that are challenging for sparse movements and fail to consider the temporal variants that are critical for movements in urban environments. To overcome the limitations, we propose HoLens, a novel approach for modeling and visualizing higher-order movement patterns in the context of an urban environment. HoLens mainly makes twofold contributions: first, we design an auto-adaptive movement aggregation algorithm that self-organizes movements hierarchically by considering spatial proximity, contextual information, and temporal variability; second, we develop an interactive visual analytics interface consisting of well-established visualization techniques, including the H-Flow for visualizing the higher-order patterns on the map and the higher-order state sequence chart for representing the higher-order state transitions. Two real-world case studies manifest that the method can adaptively aggregate the data and exhibit the process of how to explore the higher-order patterns by HoLens. We also demonstrate our approach's feasibility, usability, and effectiveness through an expert interview with three domain experts.

Submitted 6 March, 2024; originally announced March 2024.

Comments: 20 pages, 18 figures; accepted by the Computational Visual Media journal

arXiv:2403.03172 [pdf, other]

Reaching Consensus in Cooperative Multi-Agent Reinforcement Learning with Goal Imagination

Authors: Liangzhou Wang, Kaiwen Zhu, Fengming Zhu, Xinghu Yao, Shujie Zhang, Deheng Ye, Haobo Fu, Qiang Fu, Wei Yang

Abstract: Reaching consensus is key to multi-agent coordination. To accomplish a cooperative task, agents need to coherently select optimal joint actions to maximize the team reward. However, current cooperative multi-agent reinforcement learning (MARL) methods usually do not explicitly take consensus into consideration, which may cause miscoordination problems. In this paper, we propose a model-based consensus mechanism to explicitly coordinate multiple agents. The proposed Multi-agent Goal Imagination (MAGI) framework guides agents to reach consensus with an imagined common goal. The common goal is an achievable state with high value, which is obtained by sampling from the distribution of future states. We directly model this distribution with a self-supervised generative model, thus alleviating the "curse of dimensionality" problem induced by multi-agent multi-step policy rollout commonly used in model-based methods. We show that such an efficient consensus mechanism can guide all agents to cooperatively reach valuable future states. Results on Multi-agent Particle-Environments and Google Research Football environment demonstrate the superiority of MAGI in both sample efficiency and performance.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.02886 [pdf, other]

Revisiting Confidence Estimation: Towards Reliable Failure Prediction

Authors: Fei Zhu, Xu-Yao Zhang, Zhen Cheng, Cheng-Lin Liu

Abstract: Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction. The code is available at https://github.com/Impression2805/FMFP.

Submitted 5 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE TPAMI. arXiv admin note: text overlap with arXiv:2303.02970; text overlap with arXiv:2007.01458 by other authors

arXiv:2403.01759 [pdf, other]

Open-world Machine Learning: A Review and New Outlooks

Authors: Fei Zhu, Shijie Ma, Zhen Cheng, Xu-Yao Zhang, Zhaoxiang Zhang, Cheng-Lin Liu

Abstract: Machine learning has achieved remarkable success in many applications. However, existing studies are largely based on the closed-world assumption, which assumes that the environment is stationary, and the model is fixed once deployed. In many real-world applications, this fundamental and rather naive assumption may not hold because an open environment is complex, dynamic, and full of unknowns. In such cases, rejecting unknowns, discovering novelties, and then incrementally learning them, could enable models to be safe and evolve continually as biological systems do. This paper provides a holistic view of open-world machine learning by investigating unknown rejection, novel class discovery, and class-incremental learning in a unified paradigm. The challenges, principles, and limitations of current methodologies are discussed in detail. Finally, we discuss several potential directions for future research. This paper aims to provide a comprehensive introduction to the emerging open-world machine learning paradigm, to help researchers build more powerful AI systems in their respective fields, and to promote the development of artificial general intelligence.

Submitted 14 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2403.00810 [pdf, other]

Bootstrapping Cognitive Agents with a Large Language Model

Authors: Feiyu Zhu, Reid Simmons

Abstract: Large language models contain noisy general knowledge of the world, yet are hard to train or fine-tune. On the other hand, cognitive architectures have excellent interpretability and are flexible to update but require a lot of manual work to instantiate. In this work, we combine the best of both worlds: bootstrapping a cognitive-based model with the noisy knowledge encoded in large language models. Through an embodied agent doing kitchen tasks, we show that our proposed framework yields better efficiency compared to an agent based entirely on large language models. Our experiments indicate that large language models are a good source of information for cognitive architectures, and the cognitive architecture in turn can verify and update the knowledge of large language models to a specific domain.

Submitted 24 February, 2024; originally announced March 2024.

arXiv:2403.00224 [pdf, other]

Tobit models for count time series

Authors: Christian H. Weiß, Fukang Zhu

Abstract: Several models for count time series have been developed during the last decades, often inspired by traditional autoregressive moving average (ARMA) models for real-valued time series, including integer-valued ARMA (INARMA) and integer-valued generalized autoregressive conditional heteroscedasticity (INGARCH) models. Both INARMA and INGARCH models exhibit an ARMA-like autocorrelation function (ACF). To achieve negative ACF values within the class of INGARCH models, log and softplus link functions are suggested in the literature, where the softplus approach leads to conditional linearity in good approximation. However, the softplus approach is limited to the INGARCH family for unbounded counts, i.e. it can neither be used for bounded counts, nor for count processes from the INARMA family. In this paper, we present an alternative solution, named the Tobit approach, for achieving approximate linearity together with negative ACF values, which is more generally applicable than the softplus approach. A Skellam–Tobit INGARCH model for unbounded counts is studied in detail, including stationarity, approximate computation of moments, maximum likelihood and censored least absolute deviations estimation for unknown parameters and corresponding simulations. Extensions of the Tobit approach to other situations are also discussed, including underlying discrete distributions, INAR models, and bounded counts. Three real-data examples are considered to illustrate the usefulness of the new approach.

Submitted 29 February, 2024; originally announced March 2024.
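
Illustrative note on the entry above: the censoring idea behind a Tobit-type count can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's model: it assumes the Tobit step is left-censoring at zero, X = max(0, Y), with Y Skellam-distributed (the difference of two independent Poisson variables), and it omits the INGARCH dynamics entirely. The function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_skellam(lam1, lam2, rng):
    """Skellam variate: difference of two independent Poisson variates."""
    return sample_poisson(lam1, rng) - sample_poisson(lam2, rng)


def tobit_censor(y):
    """Assumed Tobit step: left-censor the latent value at zero."""
    return max(0, y)


rng = random.Random(0)
# Latent values may be negative; the censored values are valid counts.
latent = [sample_skellam(2.0, 1.5, rng) for _ in range(10_000)]
counts = [tobit_censor(y) for y in latent]
```

Censoring moves all negative latent mass to zero, so the observed counts are non-negative while the latent variable remains free to take negative values.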

arXiv:2402.18873 [pdf, other]

Reducing Hallucinations in Entity Abstract Summarization with Facts-Template Decomposition

Authors: Fangwei Zhu, Peiyi Wang, Zhifang Sui

Abstract: Entity abstract summarization aims to generate a coherent description of a given entity based on a set of relevant Internet documents. Pretrained language models (PLMs) have achieved significant success in this task, but they may suffer from hallucinations, i.e. generating non-factual information about the entity. To address this issue, we decompose the summary into two components: Facts that represent the factual information about the given entity, which PLMs are prone to fabricate; and Template that comprises generic content with designated slots for facts, which PLMs can generate competently. Based on the facts-template decomposition, we propose SlotSum, an explainable framework for entity abstract summarization. SlotSum first creates the template and then predicts the fact for each template slot based on the input documents. Benefiting from our facts-template decomposition, SlotSum can easily locate errors and further rectify hallucinated predictions with external knowledge. We construct a new dataset WikiFactSum to evaluate the performance of SlotSum. Experimental results demonstrate that SlotSum could generate summaries that are significantly more factual with credible external knowledge.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.18862 [pdf, other]

Towards Backward-Compatible Continual Learning of Image Compression

Authors: Zhihao Duan, Ming Lu, Justin Yang, Jiangpeng He, Zhan Ma, Fengqing Zhu

Abstract: This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at https://gitlab.com/viper-purdue/continual-compression

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR 2024

arXiv:2402.18528 [pdf, other]

Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

Authors: Jiangpeng He, Fengqing Zhu

Abstract: Class-Incremental Learning (CIL) trains a model to continually recognize new classes from non-stationary data while retaining learned knowledge. A major challenge of CIL arises when applying to real-world data characterized by non-uniform distribution, which introduces a dual imbalance problem involving (i) disparities between stored exemplars of old tasks and new class data (inter-phase imbalance), and (ii) severe class imbalances within each individual task (intra-phase imbalance). We show that this dual imbalance issue causes skewed gradient updates with biased weights in FC layers, thus inducing over/under-fitting and catastrophic forgetting in CIL. Our method addresses it by reweighting the gradients towards balanced optimization and unbiased classifier learning. Additionally, we observe imbalanced forgetting where paradoxically the instance-rich classes suffer higher performance degradation during CIL due to a larger amount of training data becoming unavailable in subsequent learning phases. To tackle this, we further introduce a distribution-aware knowledge distillation loss to mitigate forgetting by aligning output logits proportionally with the distribution of lost training data. We validate our method on CIFAR-100, ImageNetSubset, and Food101 across various evaluation protocols and demonstrate consistent improvements compared to existing works, showing great potential to apply CIL in real-world scenarios with enhanced robustness and effectiveness.

Submitted 29 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR 2024

arXiv:2402.15772 [pdf, other]

Mean-preserving rounding integer-valued ARMA models

Authors: Christian H. Weiß, Fukang Zhu

Abstract: In the past four decades, research on count time series has made significant progress, but research on $\mathbb{Z}$-valued time series is relatively rare. Existing $\mathbb{Z}$-valued models are mainly of autoregressive structure, where the use of the rounding operator is very natural. Because of the discontinuity of the rounding operator, the formulation of the corresponding model identifiability conditions and the computation of parameter estimators need special attention. It is also difficult to derive closed-form formulae for crucial stochastic properties. We rediscover a stochastic rounding operator, referred to as mean-preserving rounding, which overcomes the above drawbacks. Then, a novel class of $\mathbb{Z}$-valued ARMA models based on the new operator is proposed, and the existence of stationary solutions of the models is established. Stochastic properties including closed-form formulae for (conditional) moments, autocorrelation function, and conditional distributions are obtained. The advantages of our novel model class compared to existing ones are demonstrated. In particular, our model construction avoids identifiability issues such that maximum likelihood estimation is possible. A simulation study is provided, and the appealing performance of the new models is shown by several real-world data sets.

Submitted 24 February, 2024; originally announced February 2024.
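
Illustrative note on the entry above: a stochastic rounding operator that preserves the mean can be sketched as follows. This is a minimal sketch assuming the standard construction (round a real x up with probability equal to its fractional part, down otherwise), so that the expected output equals x; whether the paper's operator is defined exactly this way is an assumption, and the function name is hypothetical.

```python
import math
import random


def mean_preserving_round(x, rng):
    """Round x to floor(x) or floor(x) + 1 so that the expected result is x.

    Assumed construction: round up with probability equal to the
    fractional part of x. Integer inputs are returned unchanged.
    """
    lower = math.floor(x)
    frac = x - lower  # in [0, 1)
    return lower + (1 if rng.random() < frac else 0)


rng = random.Random(42)
x = -1.3  # works for negative reals, giving integer (Z-valued) output
draws = [mean_preserving_round(x, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)  # should be close to x
```

Unlike deterministic rounding, this operator has no discontinuity in its mean: the expected output varies linearly with x between consecutive integers.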

arXiv:2402.11425 [pdf, other]

Online Local False Discovery Rate Control: A Resource Allocation Approach

Authors: Ruicheng Ao, Hongyu Chen, David Simchi-Levi, Feng Zhu

Abstract: We consider the problem of sequentially conducting multiple experiments where each experiment corresponds to a hypothesis testing task. At each time point, the experimenter must make an irrevocable decision of whether to reject the null hypothesis (or equivalently claim a discovery) before the next experimental result arrives. The goal is to maximize the number of discoveries while maintaining a low error rate at all time points measured by local False Discovery Rate (FDR). We formulate the problem as an online knapsack problem with exogenous random budget replenishment. We start with general arrival distributions and show that a simple policy achieves an $O(\sqrt{T})$ regret. We complement the result by showing that such a regret rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, albeit achieving bounded loss in canonical settings, may incur an $\Omega(\sqrt{T})$ or even an $\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over-claim discoveries, we propose a novel policy that incorporates budget safety buffers. It turns out that a little more safety can greatly enhance efficiency: small additional logarithmic buffers suffice to reduce the regret from $\Omega(\sqrt{T})$ or even $\Omega(T)$ to $O(\ln^2 T)$. From a practical perspective, we extend the policy to the scenario with continuous arrival distributions as well as time-dependent information structures. We conduct both synthetic experiments and empirical applications on time series data from New York City taxi passengers to validate the performance of our proposed policies. Our results emphasize how effective policies should be designed in online resource allocation problems with exogenous budget replenishment.

Submitted 1 April, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

arXiv:2402.10626 [pdf, other]

Robust Beamforming for RIS-aided Communications: Gradient-based Manifold Meta Learning

Authors: Fenghao Zhu, Xinquan Wang, Chongwen Huang, Zhaohui Yang, Xiaoming Chen, Ahmed Alhammadi, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: Reconfigurable intelligent surface (RIS) has become a promising technology to realize the programmable wireless environment via steering the incident signal in fully customizable ways. However, a major challenge in RIS-aided communication systems is the simultaneous design of the precoding matrix at the base station (BS) and the phase shifting matrix of the RIS elements. This is mainly attributed to the highly non-convex optimization space of variables at both the BS and the RIS, and the diversity of communication environments. Generally, traditional optimization methods for this problem suffer from high complexity, while existing deep learning based methods lack robustness in various scenarios. To address these issues, we introduce a gradient-based manifold meta learning method (GMML), which works without pre-training and has strong robustness for RIS-aided communications. Specifically, the proposed method fuses meta learning and manifold learning to improve the overall spectral efficiency, and reduce the overhead of the high-dimensional signal process. Unlike traditional deep learning based methods which directly take channel state information as input, GMML feeds the gradients of the precoding matrix and phase shifting matrix into neural networks. Coherently, we design a differential regulator to constrain the phase shifting matrix of the RIS. Numerical results show that the proposed GMML can improve the spectral efficiency by up to 7.31% and speed up convergence by a factor of 23 compared to traditional approaches. Moreover, they also demonstrate remarkable robustness and adaptability in dynamic settings.

Submitted 16 February, 2024; originally announced February 2024.

Comments: journal

arXiv:2402.06292 [pdf]

Towards full control of molecular exciton energy transfer via FRET in DNA origami assemblies

Authors: Aleksandra K. Adamczyk, Teun A. P. M. Huijben, Karol Kolataj, Fangjia Zhu, Rodolphe Marie, Fernando D. Stefani, Guillermo P. Acuna

Abstract: Controlling the flow of excitons between organic molecules holds immense promise for various applications, including energy conversion, spectroscopy, photocatalysis, sensing, and microscopy. DNA nanotechnology has shown promise in achieving this control by using synthetic DNA as a platform for positioning and, very recently, for also orienting organic dyes. In this study, the orientation of doubly-linked dyes in DNA origami structures was manipulated to control energy transfer. By controlling independently the orientation of single donor and acceptor molecules, the average energy transfer efficiency was doubled. This work demonstrates the potential of DNA nanotechnology for precise control of the excitonic energy transfer with implications for artificial light-harvesting antennas.

Submitted 9 February, 2024; originally announced February 2024.

Comments: 19 pages, 4 figures

arXiv:2402.03628 [pdf, other]

Professional Agents -- Evolving Large Language Models into Autonomous Experts with Human-Level Competencies

Authors: Zhixuan Chu, Yan Wang, Feng Zhu, Lu Yu, Longfei Li, Jinjie Gu

Abstract: The advent of large language models (LLMs) such as ChatGPT, PaLM, and GPT-4 has catalyzed remarkable advances in natural language processing, demonstrating human-like language fluency and reasoning capacities. This position paper introduces the concept of Professional Agents (PAgents), an application framework harnessing LLM capabilities to create autonomous agents with controllable, specialized, interactive, and professional-level competencies. We posit that PAgents can reshape professional services through continuously developed expertise. Our proposed PAgents framework entails a tri-layered architecture for genesis, evolution, and synergy: a base tool layer, a middle agent layer, and a top synergy layer. This paper aims to spur discourse on promising real-world applications of LLMs. We argue the increasing sophistication and integration of PAgents could lead to AI systems exhibiting professional mastery over complex domains, serving critical needs, and potentially achieving artificial general intelligence.

Submitted 5 February, 2024; originally announced February 2024.

Comments: 14 pages, 1 figure

arXiv:2402.02349 [pdf]

Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images

Authors: Huan Huang, Liheng Qiu, Shenmiao Yang, Longxi Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Chen Zhao, Weihua Zhou

Abstract: Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Methods: Our lymphoma segmentation approach combines a vision transformer with dual encoders, adeptly fusing PET and CT data via a multimodal cross-attention fusion (MMCAF) module. In this study, PET and CT data from 165 DLBCL patients were analyzed. A 5-fold cross-validation was employed to evaluate the performance and generalization ability of our method. Ground truths were annotated by experienced nuclear medicine experts. We calculated the total metabolic tumor volume (TMTV) and performed a statistical analysis on our results. Results: The proposed method exhibited accurate performance in DLBCL lesion segmentation, achieving a Dice similarity coefficient of 0.9173$\pm$0.0071, a Hausdorff distance of 2.71$\pm$0.25mm, a sensitivity of 0.9462$\pm$0.0223, and a specificity of 0.9986$\pm$0.0008. Additionally, a Pearson correlation coefficient of 0.9030$\pm$0.0179 and an R-square of 0.8586$\pm$0.0173 were observed in TMTV when measured on manual annotation compared to our segmentation results. Conclusion: This study highlights the advantages of MMCAF and vision transformer for lymphoma segmentation using PET and CT, offering great promise for computer-aided lymphoma diagnosis and treatment.

Submitted 4 February, 2024; originally announced February 2024.

Comments: 14 pages, 6 figures; reference added
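
Illustrative note on the entry above: three of the reported overlap metrics (Dice similarity coefficient, sensitivity, specificity) follow directly from the confusion counts of a binary mask. The sketch below uses the standard textbook definitions on flattened toy masks; the function names and data are hypothetical, and it does not reproduce the paper's evaluation pipeline (the Hausdorff distance, for instance, is not covered).

```python
def confusion_counts(pred, truth):
    """Count true/false positives/negatives over two flattened binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def dice(tp, fp, fn):
    """Dice similarity coefficient: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


def sensitivity(tp, fn):
    """Fraction of true lesion voxels that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of background voxels left unlabeled: TN / (TN + FP)."""
    return tn / (tn + fp)


# Toy flattened masks (1 = lesion, 0 = background); empty masks would
# cause a division by zero and are not handled in this sketch.
truth = [0, 0, 1, 1, 1, 0, 0, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(pred, truth)
scores = (dice(tp, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

Because lesions usually occupy a small share of the volume, specificity tends to sit near 1 even for mediocre segmentations, which is why it is normally read alongside Dice and sensitivity.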

arXiv:2401.13223 [pdf, other]

TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data

Authors: Fengbin Zhu, Ziyang Liu, Fuli Feng, Chao Wang, Moxin Li, Tat-Seng Chua

Abstract: In this work, we address question answering (QA) over a hybrid of tabular and textual data that are very common content on the Web (e.g. SEC filings), where discrete reasoning capabilities are often required. Recently, large language models (LLMs) like GPT-4 have demonstrated strong multi-step reasoning capabilities. We then consider harnessing the amazing power of LLMs to solve our task. We abstract a Step-wise Pipeline for tabular and textual QA, which consists of three key steps, including Extractor, Reasoner and Executor, and initially design an instruction to instantiate the pipeline and validate that GPT-4 outperforms all existing methods. However, utilizing an online LLM like GPT-4 holds various challenges in terms of cost, latency, and data security risk, which motivates us to specialize smaller LLMs in this task. We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets following the Step-wise Pipeline. The experimental results have verified that our TAT-LLM model can outperform all baseline models, including the previous best fine-tuned models and very large-scale LLMs like GPT-4 on FinQA, TAT-QA and TAT-DQA benchmarks.

Submitted 22 February, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: ACL 2024 (Under Review)

arXiv:2401.11615 [pdf, other]

Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding

Authors: Yichi Zhang, Zhihao Duan, Ming Lu, Dandan Ding, Fengqing Zhu, Zhan Ma

Abstract: While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development.

Submitted 21 January, 2024; originally announced January 2024.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2401.05960 [pdf, other]

Machine Learning Insides OptVerse AI Solver: Design Principles and Applications

Authors: Xijun Li, Fangzhou Zhu, Hui-Ling Zhen, Weilin Luo, Meng Lu, Yimin Huang, Zhenan Fan, Zirui Zhou, Yufei Kuang, Zhihai Wang, Zijie Geng, Yang Li, Haoyang Liu, Zhiwu An, Muming Yang, Jianshu Li, Jie Wang, Junchi Yan, Defeng Sun, Tao Zhong, Yong Zhang, Jia Zeng, Mingxuan Yuan, Jianye Hao, Jun Yao, et al. (1 additional author not shown)

Abstract: In an era of digital ubiquity, efficient resource management and decision-making are paramount across numerous industries. To this end, we present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI Solver, which aims to mitigate the scarcity of real-world mathematical programming instances, and to surpass the capabilities of traditional optimization techniques. We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems. Furthermore, we introduce a training framework leveraging augmentation policies to maintain solvers' utility in dynamic environments. Besides the data generation and augmentation, our proposed approaches also include novel ML-driven policies for personalized solver strategies, with an emphasis on applications like graph convolutional networks for initial basis selection and reinforcement learning for advanced presolving and cut selection. Additionally, we detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance. Compared with traditional solvers such as Cplex and SCIP, our ML-augmented OptVerse AI Solver demonstrates superior speed and precision across both established benchmarks and real-world scenarios, reinforcing the practical imperative and effectiveness of machine learning techniques in mathematical programming solvers.

Submitted 17 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.05836 [pdf]

On State Estimation in Multi-Sensor Fusion Navigation: Optimization and Filtering

Authors: Feng Zhu, Zhuo Xu, Xveqing Zhang, Yuantai Zhang, Weijie Chen, Xiaohong Zhang

Abstract: The essence of navigation, perception, and decision-making, which are basic tasks for intelligent robots, is to estimate the necessary system states. Among them, navigation is fundamental for other upper-level applications, providing precise position and orientation by integrating measurements from multiple sensors. With the observations of each sensor appropriately modelled, multi-sensor fusion tasks for navigation are reduced to a state estimation problem which can be solved by two approaches: optimization and filtering. Recent research has shown that optimization-based frameworks outperform filtering-based ones in terms of accuracy. However, both methods are based on maximum likelihood estimation (MLE) and should be theoretically equivalent given the same linearization points, observation model, measurements, and Gaussian noise assumption. In this paper, we examine in depth the theories and existing strategies utilized in both optimization-based and filtering-based approaches. It is demonstrated that the two methods are theoretically equal, but that this equivalence breaks down because of the different strategies applied in real-time operation. By adjusting existing strategies of the filtering-based approaches, Monte-Carlo simulation and vehicular ablation experiments based on visual odometry (VO) indicate that the strategy-adjusted filtering is strictly equal to optimization. Therefore, future research on sensor-fusion problems should concentrate on their own algorithms and strategies rather than on state estimation approaches.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.03828 [pdf]

A multimodal gesture recognition dataset for desktop human-computer interaction

Authors: Qi Wang, Fengchao Zhu, Guangming Zhu, Liang Zhang, Ning Li, Eryang Gao

Abstract: Gesture recognition is an indispensable component of natural and efficient human-computer interaction technology, particularly in desktop-level applications, where it can significantly enhance people's productivity. However, the current gesture recognition community lacks a suitable desktop-level (top-view perspective) dataset for lightweight gesture capture devices. In this study, we have established a dataset named GR4DHCI. What distinguishes this dataset is its inherent naturalness, intuitive characteristics, and diversity. Its primary purpose is to serve as a valuable resource for the development of desktop-level portable applications. GR4DHCI comprises over 7,000 gesture samples and a total of 382,447 frames for both Stereo IR and skeletal modalities. We also address the variances in hand positioning during desktop interactions by incorporating 27 different hand positions into the dataset. Building upon the GR4DHCI dataset, we conducted a series of experimental studies, the results of which demonstrate that the fine-grained classification blocks proposed in this paper can enhance the model's recognition accuracy. Our dataset and experimental findings presented in this paper are anticipated to propel advancements in desktop-level gesture recognition research.

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2401.03735 [pdf, other]

Language Models Know the Value of Numbers

Authors: Fangwei Zhu, Damai Dai, Zhifang Sui

Abstract: Large language models (LLMs) have exhibited impressive competence in various tasks, but their internal mechanisms on mathematical problems are still under-explored. In this paper, we study a fundamental question: whether language models know the value of numbers, a basic element in math. To study the question, we construct a synthetic dataset comprising addition problems and utilize linear probes to read out input numbers from the hidden states. Experimental results support the existence of encoded number values in LLMs on different layers, and these values can be extracted via linear probes. Further experiments show that LLMs store their calculation results in a similar manner, and we can intervene on the output via simple vector additions, proving the causal connection between encoded numbers and language model outputs. Our research provides evidence that LLMs know the value of numbers, thus offering insights for better exploring, designing, and utilizing numeric information in LLMs.

Submitted 9 June, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2401.03050 [pdf, ps, other]

Topological restrictions on relatively Anosov representations

Authors: Konstantinos Tsouvalas, Feng Zhu

Abstract: We obtain restrictions on which groups can admit relatively Anosov representations into specified target Lie groups, by examining the topology of possible Bowditch boundaries and how they interact with the Anosov limit maps. For instance, we prove that, up to finite index, any group admitting a relatively Anosov representation into SL(3,R) is a free group or surface group, and any group admitting a relatively k-Anosov representation into Sp(2m,R), where k is an odd integer between 1 and m, is a surface group or a free product of nilpotent groups. We also obtain a characterization of groups admitting relatively 1-Anosov representations into SL(4,R), general bounds on the dimension of the Bowditch boundary of groups admitting relatively Anosov representations into SL(d,R), statements relating spheres in the Bowditch boundary to the (non-)existence of relatively Anosov representations, and a characterization of groups of cohomological dimension at least d-1 admitting relatively 1-Anosov representations into SL(d,R).

Submitted 5 January, 2024; originally announced January 2024.

Comments: 21 pages. Comments welcome!

MSC Class: 22E40 (Primary) 20F67; 20F65; 57M07 (Secondary)

arXiv:2401.02717 [pdf, other]

Complementary Information Mutual Learning for Multimodality Medical Image Segmentation

Authors: Chuyun Shen, Wenhao Li, Haoqing Chen, Xiaoling Wang, Feng** Zhu, Yuxin Li, Xiangfeng Wang, Bo **

Abstract: Radiologists must utilize multiple modal images for tumor segmentation and diagnosis due to the limitations of medical imaging and the diversity of tumor signals. This leads to the development of multimodal learning in segmentation. However, the redundancy among modalities creates challenges for existing subtraction-based joint learning methods, such as misjudging the importance of modalities, ignoring specific modal information, and increasing cognitive load. These thorny issues ultimately decrease segmentation accuracy and increase the risk of overfitting. This paper presents the complementary information mutual learning (CIML) framework, which can mathematically model and address the negative impact of inter-modal redundant information. CIML adopts the idea of addition and removes inter-modal redundant information through inductive bias-driven task decomposition and message passing-based redundancy filtering. CIML first decomposes the multimodal segmentation task into multiple subtasks based on expert prior knowledge, minimizing the information dependence between modalities. Furthermore, CIML introduces a scheme in which each modality can extract information from other modalities additively through message passing. To achieve non-redundancy of extracted information, the redundant filtering is transformed into complementary information learning inspired by the variational information bottleneck. The complementary information learning procedure can be efficiently solved by variational inference and cross-modal spatial attention. Numerical results from the verification task and standard benchmarks indicate that CIML efficiently removes redundant information between modalities, outperforming SOTA methods regarding validation accuracy and segmentation effect.

Submitted 5 January, 2024; originally announced January 2024.

Comments: 35 pages, 18 figures

arXiv:2401.02094 [pdf, other]

Federated Class-Incremental Learning with Prototype Guided Transformer

Authors: Haiyang Guo, Fei Zhu, Wenzhuo Liu, Xu-Yao Zhang, Cheng-Lin Liu

Abstract: Existing federated learning methods have effectively addressed decentralized learning in scenarios involving data privacy and non-IID data. However, in real-world situations, each client dynamically learns new classes, requiring the global model to maintain discriminative capabilities for both new and old classes. To effectively mitigate the effects of catastrophic forgetting and data heterogeneity under low communication costs, we designed a simple and effective method named PLoRA. On the one hand, we adopt prototype learning to learn better feature representations and leverage the heuristic information between prototypes and class features to design a prototype re-weight module that resolves the classifier bias caused by data heterogeneity without retraining the classification layer. On the other hand, our approach utilizes a pre-trained model as the backbone and utilizes LoRA to fine-tune with a small number of parameters when learning new classes. Moreover, PLoRA does not rely on similarity-based module selection strategies, thereby further reducing communication overhead. Experimental results on standard datasets indicate that our method outperforms the state-of-the-art approaches significantly. More importantly, our method exhibits strong robustness and superiority in various scenarios and degrees of data heterogeneity. Our code will be publicly available.

Submitted 4 January, 2024; originally announced January 2024.

Comments: 11 pages, 4 figures, conference

arXiv:2312.07126 [pdf, other]

Deep Hierarchical Video Compression

Authors: Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma

Abstract: Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding that is favored in networked video applications with packet loss.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.06428 [pdf, other]

VisionTraj: A Noise-Robust Trajectory Recovery Framework based on Large-scale Camera Network

Authors: Zhishuai Li, Ziyue Li, Xiaoru Hu, Guoqing Du, Yunhao Nie, Feng Zhu, Lei Bai, Rui Zhao

Abstract: Trajectory recovery based on the snapshots from the city-wide multi-camera network facilitates urban mobility sensing and driveway optimization. The state-of-the-art solutions devoted to such a vision-based scheme typically incorporate predefined rules or unsupervised iterative feedback, struggling with multi-fold challenges such as the lack of open-source datasets for training the whole pipeline and the vulnerability to noise from visual inputs. In response to the dilemma, this paper proposes VisionTraj, the first learning-based model that reconstructs vehicle trajectories from snapshots recorded by road network cameras. Coupled with it, we elaborate on two rational vision-trajectory datasets, which produce extensive trajectory data along with corresponding visual snapshots, enabling supervised vision-trajectory interplay extraction. Following the data creation, based on the results from the off-the-shelf multi-modal vehicle clustering, we first re-formulate the trajectory recovery problem as a generative task and introduce the canonical Transformer as the autoregressive backbone. Then, to identify clustering noise (e.g., false positives) with the bound on the snapshots' spatiotemporal dependencies, a GCN-based soft-denoising module is conducted based on the fine- and coarse-grained Re-ID clusters. Additionally, we harness strong semantic information extracted from the tracklet to provide detailed insights into the vehicle's entry and exit actions during trajectory recovery. The denoising and tracklet components can also act as plug-and-play modules to boost baselines. Experimental results on the two hand-crafted datasets show that the proposed VisionTraj achieves a maximum +11.5% improvement against the sub-best model.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.03667 [pdf, other]

WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on

Authors: Xujie Zhang, Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang

Abstract: Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person. While existing methods focus on warping the garment to fit the body pose, they often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments. These limitations greatly reduce the realism of the generated results and hinder the practical application of VITON techniques. Leveraging the notable success of diffusion-based models in cross-modal image synthesis, some recent diffusion-based methods have ventured to tackle this issue. However, they tend to either consume a significant amount of training resources or struggle to achieve realistic try-on effects and retain garment details. For efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism. Specifically, WarpDiffusion incorporates local texture attention to reduce resource consumption and uses a novel auto-mask module that effectively retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions. Notably, WarpDiffusion can be integrated as a plug-and-play component into existing VITON methodologies, elevating their synthesis quality. Extensive experiments on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, surpassing state-of-the-art methods both qualitatively and quantitatively.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2312.03408 [pdf, other]

Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

Authors: Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, **gdong Wang, Futang Zhu, Chun**g Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao

Abstract: With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities and smaller data scale, and are limited to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.

Submitted 22 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: This article is a simplified English translation of the corresponding Chinese article. Please refer to the Chinese version for the complete content

arXiv:2312.01697 [pdf, other]

Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Authors: Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang

Abstract: Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks. The code is available on https://github.com/OpenGVLab/Hulk.

Submitted 21 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: 24 pages, 5 figures

arXiv:2311.17048 [pdf, other]

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Authors: Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang

Abstract: Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships. Experiments demonstrate a visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.

Submitted 9 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

Comments: CVPR 2024, Code available at https://github.com/Show-han/Zeroshot_REC

arXiv:2311.14069 [pdf]

Massive topological edge channels in three-dimensional topological materials induced by extreme surface anisotropy

Authors: Fengfeng Zhu, Chenqiang Hua, Xiao Wang, Lin Miao, Yixi Su, Makoto Hashimoto, Donghui Lu, Zhi-Xun Shen, **-Feng Jia, Yunhao Lu, Dandan Guan, Dong Qian

Abstract: A two-dimensional quantum spin Hall insulator exhibits one-dimensional gapless spin-filtered edge channels allowing for dissipationless transport of charge and spin. However, the sophisticated fabrication requirement of two-dimensional materials and the low capacity of one-dimensional channels hinder broader applications. We introduce a method to manipulate a three-dimensional topological material to host a large number of one-dimensional topological edge channels utilizing surface anisotropy. Taking ZrTe5 as a model system, we realize a highly anisotropic surface due to the synergistic effect of the lattice geometry and Coulomb interaction, and achieve massive one-dimensional topological edge channels -- confirmed by electronic characterization using angle-resolved photoemission spectroscopy, in combination with first-principles calculations. Our work provides a new avenue to engineer the topological properties of three-dimensional materials through nanoscale tuning of surface morphology and opens up a promising prospect for the development of low-power-consumption electronic nano devices based on one-dimensional topological edge channels.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.12276 [pdf, other]

The first Ka-band (26.1-35 GHz) blind line survey towards Orion KL

Authors: Xunchuan Liu, Tie Liu, Zhiqiang Shen, Sheng-Li Qin, Qiuyi Luo, Yan Gong, Yu Cheng, Christian Henkel, Qilao Gu, Fengyao Zhu, Tianwei Zhang, Rongbing Zhao, Yajun Wu, Bin Li, Juan Li, Zhang Zhao, **qing Wang, Weiye Zhong, Qinghui Liu, Bo Xia, Li Fu, Zhen Yan, Chao Zhang, Lingling Wang, Qian Ye, et al. (9 additional authors not shown)

Abstract: We conducted a Ka-band (26.1--35 GHz) line survey towards Orion KL using the TianMa 65-m Radio Telescope (TMRT). It is the first blind line survey in the Ka band, and achieves a sensitivity of mK level (1--3 mK at a spectral resolution of $\sim$1 km s$^{-1}$). In total, 592 Gaussian features are extracted. Among them, 257 radio recombination lines (RRLs) are identified. The maximum $Δn$ of RRLs of H, He and C are 20, 15, and 5, respectively. Through stacking, we have detected the $β$ lines of ion RRLs (RRLs of C$^+$ with possible contribution of other ions like O$^+$) for the first time, and a tentative signal of the $γ$ lines of ion RRLs can also be seen on the stacked spectrum. Besides, 318 other line features were assigned to 37 molecular species, and ten of these species were not detected in the Q-band survey of TMRT. The vibrationally excited states of nine species were also detected. Emission of most species can be modeled under LTE. A number of transitions of E-CH3OH ($J_2-J_1$) display maser effects, which are confirmed by our modeling, and besides the bumping peak at $J\sim 6$ there is another peak at $J\sim 13$. Methylcyanoacetylene (CH$_3$C$_3$N) is detected in Orion KL for the first time. This work emphasizes that the Ka band, which was long-ignored for spectral line surveys, is very useful for surveying RRLs and molecular lines simultaneously.

Submitted 20 November, 2023; originally announced November 2023.

Comments: accepted by ApJS

arXiv:2311.09416 [pdf]

Low-level radiofrequency system upgrade for the Dalian Coherent Light Source

Authors: H. L. Ding, J. F. Zhu, H. K. Li, J. W. Han, X. W. Dai, J. Y. Yang, W. Q. Zhang

Abstract: DCLS (Dalian Coherent Light Source) is an FEL (Free-Electron Laser) user facility operating in the EUV (Extreme Ultraviolet) range. The primary accelerator of DCLS operates at a repetition rate of 20 Hz, and the beam is divided at the end of the linear accelerator by a kicker so that two 10 Hz beamlines work simultaneously. In the past year, we have completed the upgrade of the DCLS LLRF (Low-Level Radiofrequency) system, including setting the microwave amplitude and phase for two beamlines based on event timing, optimizing the microwave stability, and generating microwave excitation with an arbitrary shape of amplitude and phase. We added two special event codes and a repetition rate division of 10 Hz in the event timing system and set the microwave amplitude and phase by judging the event code in LLRF. The amplitude and phase stability of the microwave was improved with an intra-pulse feedforward algorithm. In addition, we have also generated microwave excitation with arbitrary amplitude and phase shapes to support dual-beam operation in the future. Detailed information on functions or algorithms will be presented in this paper.

Submitted 24 October, 2023; originally announced November 2023.

Comments: Poster presented at LLRF Workshop 2023 (LLRF2023, arXiv: 2310.03199)

Report number: LLRF2023/14

arXiv:2311.09414 [pdf]

A low-delay reference tracking algorithm for microwave measurement and control

Authors: J. F. Zhu, H. L. Ding, H. K. Li, J. W. Han, X. W. Dai, Z. C. Chen, J. Y. Yang, W. Q. Zhang

Abstract: In FEL (Free-Electron Laser) accelerators, LLRF (Low-Level Radiofrequency) systems usually deploy feedback or feedforward algorithms requiring precise microwave measurement. The slow drift of the clock allocation network of LLRF significantly impacts the measured microwave phase, thereby affecting the stability of the closed-loop operation. The reference tracking algorithm is used to eliminate the measurement drift. The conventional algorithm is to perform phase and amplitude demodulation on the synchronous reference signal from the main oscillator and subtract the reference phase in other measurement channels. The demodulation is usually based on the CORDIC, which requires approximately 16 clock cycles in FPGA (Field Programmable Gate Arrays). This paper uses the multiplication of complex numbers, which only requires four clock cycles of computational delay and achieves phase subtraction point by point. However, experiments show that it causes irrelevant amplitude noise to overlap and increase the amplitude measurement noise. Nevertheless, this reference tracking algorithm is suitable for control algorithms with low-delay requirements of microwave measurement.

Submitted 24 October, 2023; originally announced November 2023.

Comments: Poster presented at LLRF Workshop 2023 (LLRF2023, arXiv: 2310.03199)

Report number: LLRF2023/18
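
The phase subtraction by complex multiplication described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' FPGA implementation: the function name `track_reference` and the sample values are hypothetical, and the sketch assumes the measured and reference signals are already available as complex I/Q samples. Multiplying each measured sample by the complex conjugate of the reference sample removes the reference phase point by point, and it also scales the amplitude by the reference amplitude, which is one way to see why reference amplitude noise can leak into the amplitude measurement.

```python
import cmath


def track_reference(measured, reference):
    """Remove the reference phase from each measured I/Q sample.

    Multiplying by the complex conjugate of the reference subtracts its
    phase point by point, but also scales the amplitude by |reference|.
    """
    return [m * r.conjugate() for m, r in zip(measured, reference)]


# Hypothetical samples: a common 0.2 rad drift on both channels, and a
# 10 % amplitude error on the second reference sample.
drift = 0.2
reference = [cmath.rect(1.0, drift), cmath.rect(1.1, drift)]
measured = [cmath.rect(2.0, 0.5 + drift), cmath.rect(2.0, 0.5 + drift)]

corrected = track_reference(measured, reference)
phases = [cmath.phase(c) for c in corrected]   # drift removed from both
amplitudes = [abs(c) for c in corrected]       # second one is scaled up
```

In this sketch the common drift cancels in both corrected phases, while the second corrected amplitude inherits the 10 % error of its reference sample.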

arXiv:2311.08414 [pdf]

The microwave amplitude and phase setting based on event timing for the DCLS

Authors: J. F. Zhu, H. L. Ding, H. K. Li, J. W. Han, X. W. Dai, B. Xu, L. Shi, J. Y. Yang, W. Q. Zhang

Abstract: The primary accelerator of DCLS (Dalian Coherent Light Source) operates at a repetition rate of 20 Hz now, and the beam is divided at the end of the linear accelerator through Kicker to make two 10 Hz beamlines work simultaneously. For the simultaneous emission FEL of two beamlines, the beam energy of the two beamlines is required to be controlled independently, so we need to set the amplitude and phase of each beamline. This paper implements a microwave amplitude and phase setting function based on event timing. We upgraded the EVG/EVR event timing system and LLRF (Low-Level Radiofrequency) system. Two special event codes and a repetition rate division of 10 Hz are added to the event timing system, and we can set the microwave amplitude and phase by judging the event code in LLRF. We ultimately perform the microwave triggering at a repetition rate of 10 Hz for each beamline and validate this function through beam experiments.

Submitted 24 October, 2023; originally announced November 2023.

Comments: Poster presented at LLRF Workshop 2023 (LLRF2023, arXiv: 2310.03199)

Report number: LLRF2023/19

arXiv:2311.06861 [pdf, other]

Energy-efficient Beamforming for RISs-aided Communications: Gradient Based Meta Learning

Authors: Xinquan Wang, Fenghao Zhu, Qianyun Zhou, Qihao Yu, Chongwen Huang, Ahmed Alhammadi, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: Reconfigurable intelligent surfaces (RISs) have become a promising technology to meet the requirements of energy efficiency and scalability in future sixth-generation (6G) communications. However, a significant challenge in RISs-aided communications is the joint optimization of active and passive beamforming at base stations (BSs) and RISs respectively. Specifically, the main difficulty is attributed to the highly non-convex optimization space of beamforming matrices at both BSs and RISs, as well as the diversity and mobility of communication scenarios. To address this, we present a greenly gradient based meta learning beamforming (GMLB) approach. Unlike traditional deep learning based methods which take channel information directly as input, GMLB feeds the gradient of sum rate into neural networks. Coherently, we design a differential regulator to address the phase shift optimization of RISs. Moreover, we use the meta learning to iteratively optimize the beamforming matrices of BSs and RISs. These techniques allow the proposed method to work well without requiring energy-consuming pre-training. Simulations show that GMLB could achieve a higher sum rate than typical alternating optimization algorithms, with energy consumption two orders of magnitude lower.

Submitted 16 February, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: 5 pages, 8 figures. Accepted in IEEE ICC 2024 (GCSN symposium)

arXiv:2311.00567 [pdf]

A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images

Authors: Ni Yao, Hang Hu, Kaicong Chen, Chen Zhao, Yuan Guo, Boya Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Weihua Zhou, Li Tian

Abstract: Objectives To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to facilitate radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods Data from 668 consecutive patients, pathologically proven RCC, were retrospectively collected from Center 1. By using five-fold cross-validation, a deep learning model incorporating uncertainty estimation was developed to classify RCC subtypes into clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC). An external validation set of 78 patients from Center 2 further evaluated the model's performance. Results In the five-fold cross-validation, the model's area under the receiver operating characteristic curve (AUC) for the classification of ccRCC, pRCC, and chRCC was 0.868 (95% CI: 0.826-0.923), 0.846 (95% CI: 0.812-0.886), and 0.839 (95% CI: 0.802-0.88), respectively. In the external validation set, the AUCs were 0.856 (95% CI: 0.838-0.882), 0.787 (95% CI: 0.757-0.818), and 0.793 (95% CI: 0.758-0.831) for ccRCC, pRCC, and chRCC, respectively. Conclusions The developed deep learning model demonstrated robust performance in predicting the pathological subtypes of RCC, while the incorporated uncertainty emphasized the importance of understanding model confidence, which is crucial for assisting clinical decision-making for patients with renal tumors.
Clinical relevance statement Our deep learning approach, integrated with uncertainty estimation, offers clinicians a dual advantage: accurate RCC subtype predictions complemented by diagnostic confidence references, promoting informed decision-making for patients with RCC.

Submitted 12 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

Comments: 16 pages, 6 figures

Showing 51–100 of 644 results for author: Zhu, F