Search | arXiv e-print repository

Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters

Authors: Zechen Bai, Peng Chen, Xiaolan Peng, Lu Liu, Hui Chen, Mike Zheng Shou, Feng Tian

Abstract: Animating virtual characters has always been a fundamental research problem in virtual reality (VR). Facial animations play a crucial role as they effectively convey emotions and attitudes of virtual humans. However, creating such facial animations can be challenging, as current methods often involve utilization of expensive motion capture devices or significant investments of time and effort from human animators in tuning animation parameters. In this paper, we propose a holistic solution to automatically animate virtual human faces. In our solution, a deep learning model was first trained to retarget the facial expression from input face images to virtual human faces by estimating the blendshape coefficients. This method offers the flexibility of generating animations with characters of different appearances and blendshape topologies. Second, a practical toolkit was developed using Unity 3D, making it compatible with the most popular VR applications. The toolkit accepts both image and video as input to animate the target virtual human faces and enables users to manipulate the animation results. Furthermore, inspired by the spirit of Human-in-the-loop (HITL), we leveraged user feedback to further improve the performance of the model and toolkit, thereby increasing the customization properties to suit user preferences. The whole solution, for which we will make the code public, has the potential to accelerate the generation of facial animations for use in VR applications.

Submitted 21 February, 2024; originally announced February 2024.

Comments: 9 pages. To appear in IEEE-VR

arXiv:2402.13502 [pdf, other]

Statistical Analyses of Solar Prominences and Active Region Features in 304 Å Filtergrams detected via Deep Learning

Authors: T. Zhang, Q. Hao, P. F. Chen

Abstract: Solar active regions (ARs) are areas on the Sun with very strong magnetic fields where various activities take place. Prominences are one of the typical solar features in the solar atmosphere, whose eruptions often lead to solar flares and coronal mass ejections (CMEs). Therefore, studying their morphological features and their relationship with solar activity is useful in predicting eruptive events and in understanding the long-term evolution of solar activities. A huge amount of data have been collected from various ground-based telescopes and satellites. The massive data make human inspection difficult. For this purpose, we developed an automated detection method for prominences and ARs above the solar limb based on deep learning techniques. We applied it to process the 304 Å data obtained by SDO/AIA from 2010 May 13 to 2020 December 31. Besides the butterfly diagrams and latitudinal migrations of the prominences and ARs during solar cycle 24, the variations of their morphological features (such as the locations, areas, heights, and widths) with the calendar years and the latitude bands were analyzed. Most of these statistical results based on our new method are in agreement with previous studies, which also guarantees the validity of our method. The N-S asymmetry indices of the prominences and ARs show that the northern hemisphere dominates in solar cycle 24, except for 2012--2015, and 2020 for ARs. The high-latitude prominences show a much stronger N-S asymmetry: the northern hemisphere is dominant in $\sim$2011 and $\sim$2015, and the southern hemisphere is dominant during 2016--2019.

Submitted 20 February, 2024; originally announced February 2024.

Comments: 27 pages, 32 figures, Accepted for publication in ApJS

arXiv:2402.11592 [pdf, other]

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

Authors: Yihua Zhang, **zhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

Abstract: In the evolving landscape of natural language processing (NLP), fine-tuning pre-trained Large Language Models (LLMs) with first-order (FO) optimizers like SGD and Adam has become standard. Yet, as LLMs grow in size, the substantial memory overhead from back-propagation (BP) for FO gradient computation presents a significant challenge. Addressing this issue is crucial, especially for applications like on-device training where memory efficiency is paramount. This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during LLM fine-tuning, building on the initial concept introduced by MeZO. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques, through a comprehensive, first-of-its-kind benchmarking study across five LLM families (Roberta, OPT, LLaMA, Vicuna, Mistral), three task complexities, and five fine-tuning schemes. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance. We further introduce novel enhancements to ZO optimization, including block-wise descent, hybrid training, and gradient sparsity. Our study offers a promising direction for achieving further memory-efficient LLM fine-tuning. Codes to reproduce all our experiments are at https://github.com/ZO-Bench/ZO-LLM.

Submitted 27 May, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.08875 [pdf, other]

Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Authors: Yang Qian, Yinan Sun, Ali Kargarandehkordi, Onur Cezmi Mutlu, Saimourya Surabhi, **yi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington

Abstract: The increasing variety and quantity of tagged multimedia content on platforms such as TikTok provides an opportunity to advance computer vision modeling. We have curated a distinctive dataset of 283,582 unique video clips categorized under 386 hashtags relating to modern human actions. We release this dataset as a valuable resource for building domain-specific foundation models for human movement modeling tasks such as action recognition. To validate this dataset, which we name TikTokActions, we perform two sets of experiments. First, we pretrain the state-of-the-art VideoMAEv2 with a ViT-base backbone on a TikTokActions subset, and then fine-tune and evaluate on popular datasets such as UCF101 and HMDB51. We find that the performance of the model pre-trained using our TikTok dataset is comparable to models trained on larger action recognition datasets (95.3% on UCF101 and 53.24% on HMDB51). Furthermore, our investigation into the relationship between pre-training dataset size and fine-tuning performance reveals that beyond a certain threshold, the incremental benefit of larger training sets diminishes. This work introduces a useful TikTok video dataset that is available for public use and provides insights into the marginal benefit of increasing pre-training dataset sizes for video-based foundation models.

Submitted 19 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 10 pages

arXiv:2402.05956 [pdf, other]

Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting

Authors: Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, Chenjuan Guo

Abstract: Transformers for time series forecasting mainly model time series from limited or fixed scales, making it challenging to capture different characteristics spanning various scales. We propose Pathformer, a multi-scale Transformer with adaptive pathways. It integrates both temporal resolution and temporal distance for multi-scale modeling. Multi-scale division divides the time series into different temporal resolutions using patches of various sizes. Based on the division of each scale, dual attention is performed over these patches to capture global correlations and local details as temporal dependencies. We further enrich the multi-scale Transformer with adaptive pathways, which adaptively adjust the multi-scale modeling process based on the varying temporal dynamics of the input, improving the accuracy and generalization of Pathformer. Extensive experiments on eleven real-world datasets demonstrate that Pathformer not only achieves state-of-the-art performance by surpassing all current models but also exhibits stronger generalization abilities under various transfer scenarios. The code is made available at https://github.com/decisionintelligence/pathformer.

Submitted 6 March, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted by the 12th International Conference on Learning Representations (ICLR 2024)

arXiv:2402.05637 [pdf, other]

Learning pseudo-contractive denoisers for inverse problems

Authors: Deliang Wei, Peng Chen, Fang Li

Abstract: Deep denoisers have shown excellent performance in solving inverse problems in signal and image processing. In order to guarantee the convergence, the denoiser needs to satisfy some Lipschitz conditions like non-expansiveness. However, enforcing such constraints inevitably compromises recovery performance. This paper introduces a novel training strategy that enforces a weaker constraint on the deep denoiser called pseudo-contractiveness. By studying the spectrum of the Jacobian matrix, relationships between different denoiser assumptions are revealed. Effective algorithms based on gradient descent and Ishikawa process are derived, and further assumptions of strict pseudo-contractiveness yield efficient algorithms using half-quadratic splitting and forward-backward splitting. The proposed algorithms theoretically converge strongly to a fixed point. A training strategy based on holomorphic transformation and functional calculi is proposed to enforce the pseudo-contractive denoiser assumption. Extensive experiments demonstrate superior performance of the pseudo-contractive denoiser compared to related denoisers. The proposed methods are competitive in terms of visual effects and quantitative values.

Submitted 8 February, 2024; originally announced February 2024.

MSC Class: 68T07; 68U10; 47J07; 94A08; 90C25

arXiv:2402.05457 [pdf, other]

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Authors: Chen Chen, Ruizhe Li, Yuchen Hu, Sabato Marco Siniscalchi, Pin-Yu Chen, Ensiong Chng, Chao-Han Huck Yang

Abstract: Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is utilized to carry out a direct mapping from the N-best hypotheses list generated by an ASR system to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented into an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the acoustic modality. Experimental evidence collected from various ASR tasks shows that UADF surpasses existing fusion mechanisms in several ways. It yields significant improvements in word error rate (WER) while mitigating data uncertainty issues in the LLM and addressing the poor generalization that comes from relying on a single modality during fusion. We also demonstrate that UADF seamlessly adapts to audio-visual speech recognition.

Submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted to ICLR 2024, 17 pages. This work will be open sourced under MIT license

arXiv:2402.05410 [pdf, other]

SpirDet: Towards Efficient, Accurate and Lightweight Infrared Small Target Detector

Authors: Qianchen Mao, Qiang Li, Bingshu Wang, Yongjun Zhang, Tao Dai, C. L. Philip Chen

Abstract: In recent years, the detection of infrared small targets using deep learning methods has garnered substantial attention due to notable advancements. To improve the detection capability of small targets, these methods commonly maintain a pathway that preserves high-resolution features of sparse and tiny targets. However, this can result in redundant and expensive computations. To tackle this challenge, we propose SpirDet, a novel approach for efficient detection of infrared small targets. Specifically, to cope with the computational redundancy issue, we employ a new dual-branch sparse decoder to restore the feature map. Firstly, the fast branch directly predicts a sparse map indicating potential small target locations (occupying only 0.5\% area of the map). Secondly, the slow branch conducts fine-grained adjustments at the positions indicated by the sparse map. Additionally, we design a lightweight DO-RepEncoder based on reparameterization with the Downsampling Orthogonality, which can effectively reduce memory consumption and inference latency. Extensive experiments show that the proposed SpirDet significantly outperforms state-of-the-art models while achieving faster inference speed and fewer parameters. For example, on the IRSTD-1K dataset, SpirDet improves $MIoU$ by 4.7 and has a $7\times$ $FPS$ acceleration compared to the previous state-of-the-art model. The code will be open to the public.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.04699 [pdf, other]

Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models!

Authors: Shashank Kotyan, Po-Yuan Mao, Pin-Yu Chen, Danilo Vasconcellos Vargas

Abstract: Deep neural networks can be exploited using natural adversarial samples, which do not impact human perception. Current approaches often rely on deep neural networks' white-box nature to generate these adversarial samples or synthetically alter the distribution of adversarial samples compared to the training distribution. In contrast, we propose EvoSeed, a novel evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples. Our EvoSeed framework uses auxiliary Conditional Diffusion and Classifier models to operate in a black-box setting. We employ CMA-ES to optimize the search for an initial seed vector, which, when processed by the Conditional Diffusion Model, results in the natural adversarial sample misclassified by the Classifier Model. Experiments show that generated adversarial images are of high image quality, raising concerns about generating harmful content bypassing safety classifiers. Our research opens new avenues to understanding the limitations of current safety mechanisms and the risk of plausible attacks against classifier systems using image generation. Project Website can be accessed at: https://shashankkotyan.github.io/EvoSeed.

Submitted 22 May, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.02417 [pdf, other]

doi 10.1088/2053-1583/acf775

Revealing flat bands and hybridization gaps in a twisted bilayer graphene device with microARPES

Authors: Zhihao Jiang, Kimberly Hsieh, Alfred J. H. Jones, Paulina Majchrzak, Chakradhar Sahoo, Kenji Watanabe, Takashi Taniguchi, Jill A. Miwa, Yong P. Chen, Søren Ulstrup

Abstract: Controlling the electronic structure of two-dimensional materials using the combination of twist angle and electrostatic doping is an effective means to induce emergent phenomena. In bilayer graphene with an interlayer twist angle near the magic angle, the electronic dispersion is strongly modified by a manifold of hybridizing moiré Dirac cones leading to flat band segments with strong electronic correlations. Numerous technical challenges arising from spatial inhomogeneity of interlayer interactions, twist angle and device functionality have so far limited momentum-resolved electronic structure measurements of these systems to static conditions. Here, we present a detailed characterization of the electronic structure exhibiting miniband dispersions for twisted bilayer graphene, near the magic angle, integrated in a functional device architecture using micro-focused angle-resolved photoemission spectroscopy. The optimum conditions for visualizing the miniband dispersion are determined by exploiting the spatial resolution and photon energy tunability of the light source and applied to extract a hybridization gap size of $(0.14 \pm 0.03)$~eV and flat band segments extending across a moiré mini Brillouin zone. \textit{In situ} electrostatic gating of the sample enables significant electron-doping, causing the conduction band states to shift below the Fermi energy. Our work emphasizes key challenges in probing the electronic structure of magic angle bilayer graphene devices and outlines conditions for exploring the doping-dependent evolution of the dispersion that underpins the ability to control many-body interactions in the material.

Submitted 4 February, 2024; originally announced February 2024.

Comments: 21 pages, 5 figures

Journal ref: 2D Mater. 10, 045027 (2023)

arXiv:2402.02155 [pdf, ps, other]

Penalty-based Methods for Simple Bilevel Optimization under Hölderian Error Bounds

Authors: Pengyu Chen, Xu Shi, Rujun Jiang, Jiulin Wang

Abstract: This paper investigates simple bilevel optimization problems where the upper-level objective minimizes a composite convex function over the optimal solutions of a composite convex lower-level problem. Existing methods for such problems either only guarantee asymptotic convergence, have slow sublinear rates, or require strong assumptions. To address these challenges, we develop a novel penalty-based approach that employs the accelerated proximal gradient (APG) method. Under an $α$-Hölderian error bound condition on the lower-level objective, our algorithm attains an $(ε,l_F^{-β}ε^β)$-optimal solution for any $β>0$ within $\mathcal{O}\left(\sqrt{\frac{L_{f_1}}{ε}}\right)+\mathcal{O}\left(\sqrt{\frac{l_F^{\max\{α,β\}}L_{g_1}}{ε^{\max\{α,β\}}}}\right)$ iterations, where $l_F$, $L_{f_1}$ and $L_{g_1}$ denote the Lipschitz constants of the upper-level objective, the gradients of the smooth parts of the upper- and lower-level objectives, respectively. If the smooth part of the upper-level objective is strongly convex, the result improves further. We also establish the complexity results when both upper- and lower-level objectives are general convex nonsmooth functions. Numerical experiments demonstrate the effectiveness of our algorithms.

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2402.02140 [pdf, other]

Generative Visual Compression: A Review

Authors: Bolin Chen, Shanzhi Yin, Peilin Chen, Shiqi Wang, Yan Ye

Abstract: Artificial Intelligence Generated Content (AIGC) is leading a new technical revolution for the acquisition of digital content and impelling the progress of visual compression towards competitive performance gains and diverse functionalities over traditional codecs. This paper provides a thorough review on the recent advances of generative visual compression, illustrating great potentials and promising applications in ultra-low bitrate communication, user-specified reconstruction/filtering, and intelligent machine analysis. In particular, we review the visual data compression methodologies with deep generative models, and summarize how compact representation and high-fidelity reconstruction could be actualized via generative techniques. In addition, we generalize related generative compression technologies for machine vision and intelligent analytics. Finally, we discuss the fundamental challenges on generative visual compression techniques and envision their future research directions.

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2402.01911 [pdf, other]

From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers

Authors: Bharat Runwal, Tejaswini Pedapati, Pin-Yu Chen

Abstract: Pretrained Language Models (PLMs) have become the de facto starting point for fine-tuning on downstream tasks. However, as model sizes continue to increase, traditional fine-tuning of all parameters becomes challenging. To address this, parameter-efficient fine-tuning (PEFT) methods have gained popularity as a means to adapt PLMs effectively. In parallel, recent studies have revealed the presence of activation sparsity within the intermediate outputs of the multilayer perceptron (MLP) blocks in transformers. Low activation density enables efficient model inference on sparsity-aware hardware. Building upon this insight, in this work, we propose a novel density loss that encourages higher activation sparsity (equivalently, lower activation density) in the pre-trained models. We demonstrate the effectiveness of our approach by utilizing mainstream PEFT techniques including QLoRA, LoRA, Adapter, Prompt/Prefix Tuning to facilitate efficient model adaptation across diverse downstream tasks. Experiments show that our proposed method DEFT, Density-Efficient Fine-Tuning, can reduce the activation density consistently and up to $\boldsymbol{50.72\%}$ on RoBERTa$_\mathrm{Large}$, and $\boldsymbol {53.19\%}$ (encoder density) and $\boldsymbol{90.60\%}$ (decoder density) on Flan-T5$_\mathrm{XXL}$ ($\boldsymbol{11B}$) compared to PEFT using GLUE and QA (SQuAD) benchmarks respectively while maintaining competitive performance on downstream tasks. We also showcase that DEFT is complementary to quantized and pruned models.

Submitted 2 February, 2024; originally announced February 2024.

Comments: Preprint

arXiv:2402.01162 [pdf, other]

2AFC Prompting of Large Multimodal Models for Image Quality Assessment

Authors: Hanwei Zhu, Xiangjie Sui, Baoliang Chen, Xuelin Liu, Peilin Chen, Yuming Fang, Shiqi Wang

Abstract: While abundant research has been conducted on improving high-level visual understanding and reasoning capabilities of large multimodal models (LMMs), their visual quality assessment (IQA) ability has been relatively under-explored. Here we take initial steps towards this goal by employing the two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions of visual quality. Subsequently, the global quality score of each image estimated by a particular LMM can be efficiently aggregated using the maximum a posteriori estimation. Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and correlation, to provide comprehensive quantifications and deeper insights into the IQA capability of five LMMs. Extensive experiments show that existing LMMs exhibit remarkable IQA ability on coarse-grained quality comparison, but there is room for improvement on fine-grained quality discrimination. The proposed dataset sheds light on the future development of IQA models based on LMMs. The codes will be made publicly available at https://github.com/h4nwei/2AFC-LMMs.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.17203 [pdf, other]

CPR++: Object Localization via Single Coarse Point Supervision

Authors: Xuehui Yu, Pengfei Chen, Kuiran Wang, Xumeng Han, Guorong Li, Zhenjun Han, Qixiang Ye, Jianbin Jiao

Abstract: Point-based object localization (POL), which pursues high-performance object sensing under low-cost data annotation, has attracted increased attention. However, the point annotation mode inevitably introduces semantic variance due to the inconsistency of annotated points. Existing POL methods rely heavily on strict annotation rules, which are difficult to define and apply, to handle the problem. In this study, we propose coarse point refinement (CPR), which to the best of our knowledge is the first attempt to alleviate semantic variance from an algorithmic perspective. CPR reduces the semantic variance by selecting a semantic centre point in a neighbourhood region to replace the initial annotated point. Furthermore, we design a sampling region estimation module to dynamically compute a sampling region for each object and use a cascaded structure to achieve end-to-end optimization. We further integrate a variance regularization into the structure to concentrate the predicted scores, yielding CPR++. We observe that CPR++ can obtain scale information and further reduce the semantic variance in a global region, thus guaranteeing high-performance object localization. Extensive experiments on four challenging datasets validate the effectiveness of both CPR and CPR++. We hope our work can inspire more research on designing algorithms rather than annotation rules to address the semantic variance problem in POL. The dataset and code will be public at github.com/ucas-vg/PointTinyBenchmark.

Submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted by TPAMI 2024

arXiv:2401.15951 [pdf, other]

Observation of quantum strong Mpemba effect

Authors: Jie Zhang, Gang Xia, Chun-Wang Wu, Ting Chen, Qian Zhang, Yi Xie, Wen-Bo Su, Wei Wu, Cheng-Wei Qiu, **-xing Chen, Weibin Li, Hui **g, Yan-Li Zhou

Abstract: An ancient and counterintuitive phenomenon known as the Mpemba effect (water can cool faster when initially heated up) showcases the critical role of initial conditions in relaxation processes. How to realize and utilize this effect for speeding up relaxation has so far been an important but challenging task in purely quantum systems. Here, we report the first experiment, as far as we know, on the strong Mpemba effect in a single trapped ion system, in which an exponentially expedited relaxation in time is observed by preparing an optimal initial state with no excitation of the slowest decaying mode. Also, we find that the condition for realizing such an effect coincides with the Liouvillian exceptional point, featuring the coalescence of both the eigenvalues and the eigenmodes of the system. Our work provides an efficient strategy to exponentially accelerate relaxations of quantum systems to their stationary states, and suggests an as yet unexplored link between the Mpemba effect and non-Hermitian physics. It could open up the door to engineering a wide range of dissipative quantum systems by utilizing the anomalous Mpemba effect, for applications in quantum simulation and quantum information processing.

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.15148 [pdf, other]

doi 10.1051/0004-6361/202449350

Spectroscopic observations of progenitor activity 100 days before a Type Ibn supernova

Authors: S. J. Brennan, J. Sollerman, I. Irani, S. Schulze, P. Chen, K. K. Das, K. De, C. Fransson, A. Gal-Yam, A. Gkini, K. R. Hinds, R. Lunnan, D. Perley, YJ. Qin, R. Stein, J. Wise, L. Yan, E. A. Zimmerman, S. Anand, R. J. Bruch, R. Dekany, A. J. Drake, C. Fremling, B. Healy, V. Karambelkar , et al. (8 additional authors not shown)

Abstract: Obtaining spectroscopic observations of the progenitors of core-collapse supernovae is often unfeasible due to an inherent lack of knowledge as to which stars will go supernova and when they will explode. In this letter, we present photometric and spectroscopic observations of the progenitor activity of SN 2023fyq in the preceding 150 days before the He-rich progenitor exploded as a Type Ibn supernova. The progenitor of SN 2023fyq shows an exponential rise in flux prior to core-collapse. Complex He I emission line features are observed, with a P-Cygni like profile, as well as an evolving broad base with velocities on the order of 10,000 km/s, possibly due to electron scattering. The luminosity and evolution of SN 2023fyq are consistent with a faint Type Ibn, reaching a peak r-band magnitude of 18.1 mag, although there is some uncertainty in the distance to the host, NGC 4388, located in the Virgo cluster. We present additional evidence of asymmetric He-rich material being present prior to the explosion of SN 2023fyq, as well as after, suggesting this material has survived the ejecta-CSM interaction. Broad [O I] and the Ca II triplet lines are observed at late phases, confirming that SN 2023fyq was a genuine supernova rather than a non-terminal interacting transient. SN 2023fyq provides insight into the final moments of a massive star's life, highlighting that the progenitor is likely highly unstable before core-collapse.

Submitted 25 March, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

Comments: 7 Pages, 5 Figures, accepted to A&A Letters

Journal ref: A&A 684, L18 (2024)

arXiv:2401.14044 [pdf]

Electrical switching of the perpendicular Neel order in a collinear antiferromagnet

Authors: Wenqing He, Tianyi Zhang, Yongjian Zhou, Caihua Wan, Hao Wu, Baoshan Cui, Jihao Xia, Ran Zhang, Tengyu Guo, Peng Chen, Mingkun Zhao, Leina Jiang, Alexander Grutter, Purnima P. Balakrishnan, Andrew J. Caruana, Christy J. Kinane, Sean Langridge, Guoqiang Yu, Cheng Song, Xiufeng Han

Abstract: Electrical manipulation of magnetic order by current-induced spin torques lays the foundation for spintronics. One promising approach is encoding information in the Néel vector of antiferromagnetic (AFM) materials, particularly in collinear antiferromagnets with perpendicular magnetic anisotropy (PMA), as the negligible stray fields and terahertz spin dynamics can enable memory devices with higher integration density and ultrafast speed. Here we demonstrate that the Néel order information in a prototypical collinear AFM insulator with PMA, Cr2O3, can be reliably read out via the anomalous Hall effect and efficiently switched by the spin-orbit torque (SOT) effect with a low current density of 5.8 × 10^6 A/cm^2. Moreover, using Cr2O3 as a mediator, we electrically switch the magnetization of a Y3Fe5O12 film exchange-coupled to the Cr2O3 layer, unambiguously confirming the Néel order switching of the Cr2O3 layer. This work provides a significant basis for developing AFM memory devices based on collinear AFM materials with PMA.

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.14034 [pdf, other]

Unsupervised Spatial-Temporal Feature Enrichment and Fidelity Preservation Network for Skeleton based Action Recognition

Authors: Chuankun Li, Shuai Li, Yanbo Gao, ** Chen, Jian Li, Wanqing Li

Abstract: Unsupervised skeleton based action recognition has achieved remarkable progress recently. Existing unsupervised learning methods suffer from a severe overfitting problem, and thus small networks are used, significantly reducing the representation capability. To address this problem, the overfitting mechanism behind unsupervised learning for skeleton based action recognition is first investigated. It is observed that the skeleton is already a relatively high-level and low-dimension feature, but not in the same manifold as the features for action recognition. Simply applying an existing unsupervised learning method may tend to produce features that discriminate between different samples instead of action classes, resulting in the overfitting problem. To solve this problem, this paper presents an Unsupervised spatial-temporal Feature Enrichment and Fidelity Preservation framework (U-FEFP) to generate rich distributed features that contain all the information of the skeleton sequence. A spatial-temporal feature transformation subnetwork is developed using a spatial-temporal graph convolutional network and a graph convolutional gate recurrent unit network as the basic feature extraction network. Unsupervised Bootstrap Your Own Latent based learning is used to generate rich distributed features, and unsupervised pretext task based learning is used to preserve the information of the skeleton sequence. The two unsupervised learning approaches are combined as U-FEFP to produce robust and discriminative representations. Experimental results on three widely used benchmarks, namely the NTU-RGB+D-60, NTU-RGB+D-120 and PKU-MMD datasets, demonstrate that the proposed U-FEFP achieves the best performance compared with the state-of-the-art unsupervised learning methods. t-SNE illustrations further validate that U-FEFP can learn more discriminative features for unsupervised skeleton based action recognition.

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.13886 [pdf]

doi 10.1038/s41567-023-02349-0

Observation of possible excitonic charge density waves and metal-insulator transitions in atomically thin semimetals

Authors: Qiang Gao, Yang-hao Chan, Pengfei Jiao, Haiyang Chen, Shuaishuai Yin, Kanjanaporn Tangprapha, Yichen Yang, Xiaolong Li, Zhengtai Liu, Dawei Shen, Shengwei Jiang, Peng Chen

Abstract: Charge density wave (CDW) is a collective quantum phenomenon with a charge modulation in solids [1-2]. Condensation of electron and hole pairs with finite momentum will lead to such an ordered state [3-7]. However, lattice symmetry breaking manifested as the softening of phonon modes can occur simultaneously, which makes it difficult to disentangle the origin of the transition [8-14]. Here, we report a condensed phase in low dimensional HfTe2, where angle-resolved photoemission spectroscopy (ARPES) measurements show a metal-insulator transition on lowering the temperature in single triatomic layer (TL) HfTe2. A full gap opening, renormalization of the bands, and emergence of replica bands at the M point are observed at low temperatures, indicating formation of a CDW in the ground state. Raman spectroscopy shows no sign of lattice distortion within the detection limit. The results are corroborated by first-principles calculations, demonstrating the electronic origin of the CDW. By adding more layers, the phase transition is suppressed and completely destroyed at 3 TL because of the increased screening around the Fermi surface. Interestingly, a small amount of electron doping in the 1 TL film during growth significantly raises the transition temperature (TC), which is attributed to a reduced screening effect and a more balanced electron and hole carrier density. Our results indicate a CDW formation mechanism consistent with the excitonic insulator phase in low dimensional HfTe2 and open up opportunities for the realization of novel quantum states based on exciton condensation.

Submitted 24 January, 2024; originally announced January 2024.

Comments: https://www.nature.com/articles/s41567-023-02349-0 published in Nature Physics

arXiv:2401.13280 [pdf, other]

DDI-CoCo: A Dataset For Understanding The Effect Of Color Contrast In Machine-Assisted Skin Disease Detection

Authors: Ming-Chang Chiu, Yingfei Wang, Yen-Ju Kuo, Pin-Yu Chen

Abstract: Skin tone as a demographic bias and inconsistent human labeling pose challenges in dermatology AI. We take another angle to investigate color contrast's impact, beyond skin tones, on malignancy detection in skin disease datasets: We hypothesize that in addition to skin tones, the color difference between the lesion area and skin also plays a role in the malignancy detection performance of dermatology AI models. To study this, we first propose a robust labeling method to quantify color contrast scores of each image and validate our method by showing small labeling variations. More importantly, applying our method to the only diverse-skin-tone and pathologically-confirmed skin disease dataset, DDI, yields the DDI-CoCo Dataset, and we observe a performance gap between the high and low color difference groups. This disparity remains consistent across various state-of-the-art (SoTA) image classification models, which supports our hypothesis. Furthermore, we study the interaction between skin tone and color difference effects and suggest that color difference can be an additional reason behind model performance bias between skin tones. Our work provides a complementary angle to dermatology AI for improving skin disease detection.

Submitted 24 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures, 2 tables, Accepted to ICASSP 2024

arXiv:2401.12728 [pdf, other]

Filamentary Network and Magnetic Field Structures Revealed with BISTRO in the High-Mass Star-Forming Region NGC2264 : Global Properties and Local Magnetogravitational Configurations

Authors: Jia-Wei Wang, Patrick M. Koch, Seamus D. Clarke, Gary Fuller, Nicolas Peretto, Ya-Wen Tang, Hsi-Wei Yen, Shih-** Lai, Nagayoshi Ohashi, Doris Arzoumanian, Doug Johnstone, Ray Furuya, Shu-ichiro Inutsuka, Chang Won Lee, Derek Ward-Thompson, Valentin J. M. Le Gouellec, Hong-Li Liu, Lapo Fanciullo, Jihye Hwang, Kate Pattle, Frédérick Poidevin, Mehrnoosh Tahani, Takashi Onaka, Mark G. Rawlings, Eun Jung Chung , et al. (132 additional authors not shown)

Abstract: We report 850 $μ$m continuum polarization observations toward the filamentary high-mass star-forming region NGC 2264, taken as part of the B-fields In STar forming Regions Observations (BISTRO) large program on the James Clerk Maxwell Telescope (JCMT). These data reveal a well-structured non-uniform magnetic field in the NGC 2264C and 2264D regions with a prevailing orientation around 30 deg from north to east. Field strength estimates and a virial analysis for the major clumps indicate that NGC 2264C is globally dominated by gravity while in 2264D magnetic, gravitational, and kinetic energies are roughly balanced. We present an analysis scheme that utilizes the locally resolved magnetic field structures, together with the locally measured gravitational vector field and the extracted filamentary network. From this, we infer statistical trends showing that this network consists of two main groups of filaments oriented approximately perpendicular to one another. Additionally, gravity shows one dominating converging direction that is roughly perpendicular to one of the filament orientations, which is suggestive of mass accretion along this direction. Beyond these statistical trends, we identify two types of filaments. The type-I filament is perpendicular to the magnetic field with local gravity transitioning from parallel to perpendicular to the magnetic field from the outside to the filament ridge. The type-II filament is parallel to the magnetic field and local gravity. We interpret these two types of filaments as originating from the competition between radial collapsing, driven by filament self-gravity, and the longitudinal collapsing, driven by the region's global gravity.

Submitted 23 January, 2024; originally announced January 2024.

Comments: Accepted for publication in the Astrophysical Journal. 43 pages, 32 figures, and 4 tables (including Appendix)

arXiv:2401.11946 [pdf, other]

A Dynamic YOLO-Based Sequence-Matching Model for Efficient Coverless Image Steganography

Authors: Jiajun Liu, Lina Tan, Zhili Zhou, Yi Li, Peng Chen

Abstract: Many existing coverless steganography methods establish a mapping relationship between cover images and hidden data. There exists an issue that the number of images stored in the database grows exponentially as the steganographic capacity rises. The need for a high steganographic capacity makes it challenging to build an image database. To improve the image library utilization and anti-attack capability of the steganography system, we present an efficient coverless scheme based on dynamically matched substrings. YOLO is employed for selecting optimal objects, and a mapping dictionary is established between these objects and scrambling factors. With the aid of this dictionary, each image is effectively assigned to a specific scrambling factor, which is used to scramble the receiver's sequence key. To achieve sufficient steganography capability based on a limited image library, all substrings of the scrambled sequences hold the potential to hide data. After completing the secret information matching, the ideal number of stego images will be obtained from the database. According to experimental results, this technology outperforms most previous works on data load, transmission security, and hiding capacity. Under typical geometric attacks, it can recover 79.85% of secret information on average. Furthermore, only approximately 200 random images are needed to meet a capacity of 19 bits per image.

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.11436 [pdf, other]

Geometric Prior Guided Feature Representation Learning for Long-Tailed Classification

Authors: Yanbiao Ma, Licheng Jiao, Fang Liu, Shuyuan Yang, Xu Liu, Puhua Chen

Abstract: Real-world data are long-tailed, and the lack of tail samples leads to a significant limitation in the generalization ability of the model. Although numerous approaches of class re-balancing perform well for moderate class imbalance problems, additional knowledge needs to be introduced to help the tail class recover the underlying true distribution when the observed distribution from a few tail samples does not represent its true distribution properly, thus allowing the model to learn valuable information outside the observed domain. In this work, we propose to leverage the geometric information of the feature distribution of the well-represented head class to guide the model to learn the underlying distribution of the tail class. Specifically, we first systematically define the geometry of the feature distribution and the similarity measures between the geometries, and discover four phenomena regarding the relationship between the geometries of different feature distributions. Then, based on these four phenomena, feature uncertainty representation is proposed to perturb the tail features by utilizing the geometry of the head class feature distribution. It aims to make the perturbed features cover the underlying distribution of the tail class as much as possible, thus improving the model's generalization performance in the test domain. Finally, we design a three-stage training scheme enabling feature uncertainty modeling to be successfully applied. Experiments on CIFAR-10/100-LT, ImageNet-LT, and iNaturalist2018 show that our proposed approach outperforms other similar methods on most metrics. In addition, the experimental phenomena we discovered are able to provide new perspectives and theoretical foundations for subsequent studies.

Submitted 21 January, 2024; originally announced January 2024.

Comments: This work was accepted by the IJCV

arXiv:2401.10446 [pdf, other]

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng

Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm the LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate, even with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising.

Submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license

arXiv:2401.08577 [pdf, other]

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

Authors: Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang Gan

Abstract: Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into large language models, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment, as well as state tokens that represent the multisensory state observations of the agent at each time step. At inference time, MultiPLY could generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.

Submitted 16 January, 2024; originally announced January 2024.

Comments: Project page: https://vis-www.cs.umass.edu/multiply

arXiv:2401.08276 [pdf, other]

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

Authors: Yipo Huang, Quan Yuan, Xiangfei Sheng, Zhichao Yang, Haoning Wu, Pengfei Chen, Yuzhe Yang, Leida Li, Weisi Lin

Abstract: With collective endeavors, multimodal large language models (MLLMs) are undergoing a flourishing development. However, their performances on image aesthetics perception remain indeterminate, which is highly desired in real-world applications. An obvious obstacle lies in the absence of a specific benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This blind groping may impede the further development of more advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs through elaborate design across dual facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI). Extensive experimental results underscore that the current MLLMs only possess rudimentary aesthetic perception ability, and there is still a significant gap between MLLMs and humans. We hope this work can inspire the community to engage in deeper explorations on the aesthetic potentials of MLLMs. Source data will be available at https://github.com/yipoh/AesBench.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.08107 [pdf, other]

Deep Shape-Texture Statistics for Completely Blind Image Quality Evaluation

Authors: Yixuan Li, Peilin Chen, Hanwei Zhu, Keyan Ding, Leida Li, Shiqi Wang

Abstract: Opinion-Unaware Blind Image Quality Assessment (OU-BIQA) models aim to predict image quality without training on reference images and subjective quality scores. Among them, image statistical comparison is a classic paradigm, but its performance is limited by the representation ability of visual descriptors. Deep features as visual descriptors have advanced IQA in recent research, but they have been found to be highly texture-biased and to lack shape bias. On this basis, we find that image shape and texture cues respond differently to distortions, and the absence of either one results in an incomplete image representation. Therefore, to formulate a well-rounded statistical description for images, we utilize the shape-biased and texture-biased deep features produced by Deep Neural Networks (DNNs) simultaneously. More specifically, we design a Shape-Texture Adaptive Fusion (STAF) module to merge shape and texture information, based on which we formulate quality-relevant image statistics. The perceptual quality is quantified by the variant Mahalanobis Distance between the inner and outer Shape-Texture Statistics (DSTS), wherein the inner and outer statistics respectively describe the quality fingerprints of the distorted image and natural images. The proposed DSTS delicately utilizes shape-texture statistical relations between different data scales in the deep domain, and achieves state-of-the-art (SOTA) quality prediction performance on images with artificial and authentic distortions.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.05989 [pdf, other]

doi 10.1103/PhysRevD.109.094002

The role of electromagnetic interaction in the $X(3872)$ and its analogs

Authors: ** Chen, Zhan-Wei Liu, Zi-Le Zhang, Si-Qiang Luo, Fu-Lai Wang, Jun-Zhang Wang, Xiang Liu

Abstract: We investigate the role of the electromagnetic interaction in the formation and decay of the $X(3872)$. The binding properties of the $X(3872)$ are studied by assuming the molecular nature and considering the $S$-$D$ wave mixing, isospin breaking, and coupled channel effects, and in particular the correction from the electromagnetic interaction. The radiative decays can better reflect the difference between the charged and neutral $D\bar D^*$ components, since the electromagnetic interaction explicitly breaks the isospin symmetry. We further study the radiative decay widths with the obtained wave functions for different $D\bar D^*$ channels. We also explore other similar hidden-charm molecular states. The electromagnetic interaction can make the molecule tighter. Our result of the radiative decay width for $X(3872)\rightarrow γJ/ψ$ is in agreement with the experiment. The branching ratio $R_{γψ}$ is less than 1 in our framework, which supports the Belle and BESIII measurements.

Submitted 1 May, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: 17 pages, 8 figures, 7 tables

Journal ref: Phys.Rev.D 109, 094002 (2024)

arXiv:2401.05800 [pdf, other]

Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values

Authors: Yu Zheng, Huan Yee Koh, Ming **, Lianhua Chi, Haishuai Wang, Khoa T. Phan, Yi-** Phoebe Chen, Shirui Pan, Wei Xiang

Abstract: The detection of anomalies in multivariate time series data is crucial for various practical applications, including smart power grids, traffic flow forecasting, and industrial process control. However, real-world time series data is usually not well-structured, posing significant challenges to existing approaches: (1) The existence of missing values in multivariate time series data along variable and time dimensions hinders the effective modeling of interwoven spatial and temporal dependencies, resulting in important patterns being overlooked during model training; (2) Anomaly scoring with irregularly-sampled observations is less explored, making it difficult to use existing detectors for multivariate series without fully-observed values. In this work, we introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and anomaly scorer to tackle the aforementioned challenges in detecting anomalies on irregularly-sampled multivariate time series. Our approach comprises two main components. First, we propose a graph spatiotemporal process based on neural controlled differential equations. This process enables effective modeling of multivariate time series from both spatial and temporal perspectives, even when the data contains missing values. Second, we present a novel distribution-based anomaly scoring mechanism that alleviates the reliance on complete uniform observations. By analyzing the predictions of the graph spatiotemporal process, our approach allows anomalies to be easily detected. Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods, regardless of whether there are missing values present in the data. Our code is available: https://github.com/huankoh/GST-Pro.

Submitted 11 January, 2024; originally announced January 2024.

Comments: Accepted by Information Fusion

arXiv:2401.05561 [pdf, other]

TrustLLM: Trustworthiness in Large Language Models

Authors: Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, et al. (45 additional authors not shown)

Abstract: Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones.
Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

Submitted 17 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: This work is still under work and we welcome your contribution

arXiv:2401.03371 [pdf]

Advancing Noise-Resilient Twist Angle Characterization in Bilayer Graphene through Raman Spectroscopy via GAN-CNN Modeling

Authors: Dan Hu, Ting-Fung Chung, Yong P. Chen, Yaping Qi

Abstract: In this study, we introduce an innovative methodology for robust twist angle identification in bilayer graphene using Raman spectroscopy, featuring the integration of generative adversarial network and convolutional neural network (GAN-CNN). Our proposed approach showcases remarkable resistance to noise interference, particularly in ultra-low Signal-to-Noise Ratio (SNR) conditions. We demonstrate the GAN-CNN model's robust learning capability, even when SNR reaches minimal levels. The model's exceptional noise resilience negates the necessity for preprocessing steps, facilitating accurate classification, and substantially reducing computational expenses. Empirical results reveal the model's prowess, achieving heightened accuracy in twist angle identification. Specifically, our GAN-CNN model achieves a test accuracy exceeding 99.9% and a recall accuracy of 99.9%, relying on an augmented dataset containing 4209 spectra. This work not only contributes to the evolution of noise-resistant spectral analysis methodologies but also provides crucial insights into the application of advanced deep learning techniques for bilayer graphene characterization through Raman spectroscopy. The findings presented herein have broader implications for enhancing the precision and efficiency of material characterization methodologies, laying the foundation for future advancements in the field.

Submitted 6 January, 2024; originally announced January 2024.

arXiv:2401.02651 [pdf, other]

Benchmarking PathCLIP for Pathology Image Analysis

Authors: Sunyi Zheng, Xiaonan Cui, Yuxuan Sun, Jingxiong Li, Honglin Li, Yunlong Zhang, Pingyi Chen, Xueping Jing, Zhaoxiang Ye, Lin Yang

Abstract: Accurate image classification and retrieval are of importance for clinical diagnosis and treatment decision-making. The recent contrastive language-image pretraining (CLIP) model has shown remarkable proficiency in understanding natural images. Drawing inspiration from CLIP, PathCLIP is specifically designed for pathology image analysis, utilizing over 200,000 image and text pairs in training. While the performance of PathCLIP is impressive, its robustness under a wide range of image corruptions remains unknown. Therefore, we conduct an extensive evaluation to analyze the performance of PathCLIP on various corrupted images from the datasets of Osteosarcoma and WSSS4LUAD. In our experiments, we introduce seven corruption types including brightness, contrast, Gaussian blur, resolution, saturation, hue, and markup at four severity levels. Through experiments, we find that PathCLIP is relatively robust to image corruptions and surpasses OpenAI-CLIP and PLIP in zero-shot classification. Among the seven corruptions, blur and resolution can cause severe performance degradation of PathCLIP. This indicates that ensuring the quality of images is crucial before conducting a clinical test. Additionally, we assess the robustness of PathCLIP in the task of image-image retrieval, revealing that PathCLIP performs less effectively than PLIP on Osteosarcoma but performs better on WSSS4LUAD under diverse corruptions.
Overall, PathCLIP presents impressive zero-shot classification and retrieval performance for pathology images, but appropriate care needs to be taken when using it. We hope this study provides a qualitative impression of PathCLIP and helps understand its differences from other CLIP models.

Submitted 12 June, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

arXiv:2401.02611 [pdf, other]

MOODv2: Masked Image Modeling for Out-of-Distribution Detection

Authors: Jingyao Li, Pengguang Chen, Shaozuo Yu, Shu Liu, Jiaya Jia

Abstract: The crux of effective out-of-distribution (OOD) detection lies in acquiring a robust in-distribution (ID) representation, distinct from OOD samples. While previous methods predominantly leaned on recognition-based techniques for this purpose, they often resulted in shortcut learning, lacking comprehensive representations. In our study, we conducted a comprehensive analysis, exploring distinct pretraining tasks and employing various OOD score functions. The results highlight that the feature representations pre-trained through reconstruction yield a notable enhancement and narrow the performance gap among various score functions. This suggests that even simple score functions can rival complex ones when leveraging reconstruction-based pretext tasks. Reconstruction-based pretext tasks adapt well to various score functions. As such, they hold promising potential for further expansion. Our OOD detection framework, MOODv2, employs the masked image modeling pretext task. Without bells and whistles, MOODv2 improves AUROC by 14.30% to 95.68% on ImageNet and achieves 99.98% on CIFAR-10.

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2401.01921 [pdf, other]

The Cytnx Library for Tensor Networks

Authors: Kai-Hsin Wu, Chang-Teng Lin, Ke Hsu, Hao-Ti Hung, Manuel Schneider, Chia-Min Chung, Ying-Jer Kao, Pochung Chen

Abstract: We introduce a tensor network library designed for classical and quantum physics simulations called Cytnx (pronounced as sci-tens). This library provides an almost identical interface and syntax for both C++ and Python, allowing users to effortlessly switch between the two languages. Aiming at a quick learning process for new users of tensor network algorithms, the interfaces resemble the popular Python scientific libraries like NumPy, SciPy, and PyTorch. Not only can multiple global Abelian symmetries be easily defined and implemented, but Cytnx also provides a new tool called Network that allows users to store large tensor networks and perform tensor network contractions in an optimal order automatically. With the integration of cuQuantum, tensor calculations can also be executed efficiently on GPUs. We present benchmark results for tensor operations on both devices, CPU and GPU. We also discuss features and higher-level interfaces to be added in the future.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01517 [pdf, other]

The rotation curve and mass distribution of M31

Authors: Xiangwei Zhang, Bingqiu Chen, Pinjian Chen, Jiarui Sun, Zhijia Tian

Abstract: To gain a better understanding of the Andromeda galaxy M31 and its role in the Local Group, measuring its mass precisely is essential. In this work, we have constructed the rotation curve of M31 out to $\sim$125 kpc using 13,679 M31 objects obtained from various sources, including the LAMOST data release 9 (LAMOST DR9), the DESI survey, and relevant literature. We divide all objects in our sample into bulge, disk and halo components. For the sources in the M31 disk, we have measured their circular velocities by a kinematic model with asymmetric drift corrections. For the bulge and halo objects, we calculate their velocity dispersions and use the spherical and projected Jeans equation to obtain the circular velocities. Our findings indicate a nearly isotropic nature for the M31 bulge, while the halo exhibits tangential anisotropy. The results show that the rotation curve remains constant at $\sim$220 km s$^{-1}$ up to radius $\sim$25 kpc and gradually decreases to $\sim$170 km s$^{-1}$ further out. Based on the newly determined rotation curve, we have constructed a mass distribution model for M31. Our measurement of the M31 virial mass is $M_{\rm vir} = 1.14^{+0.51}_{-0.35} \times 10^{12} M_\odot$ within $r_{\rm vir} = 220 \pm 25$ kpc.

Submitted 2 January, 2024; originally announced January 2024.
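
As a rough illustration of how a rotation curve constrains mass, the sketch below applies the spherically symmetric textbook estimate M(<r) = v_c^2 r / G to the approximate figures quoted in the abstract above. This is not the mass distribution model fitted in the paper. The function name `enclosed_mass` and the unit system are assumptions made for this example, and pairing the outer velocity of about 170 km/s with the outermost radius of about 125 kpc is likewise an assumption.

```python
# Gravitational constant in kpc (km/s)^2 / Msun (standard value, rounded).
G_KPC_KMS2_PER_MSUN = 4.30091e-6


def enclosed_mass(v_circ_kms: float, radius_kpc: float) -> float:
    """Spherical estimate of the mass inside radius_kpc, in solar masses.

    Uses M(<r) = v_c^2 * r / G, which assumes a spherically symmetric
    mass distribution and a circular orbit at radius r.
    """
    if v_circ_kms < 0 or radius_kpc < 0:
        raise ValueError("velocity and radius must be non-negative")
    return v_circ_kms ** 2 * radius_kpc / G_KPC_KMS2_PER_MSUN


if __name__ == "__main__":
    # (velocity in km/s, radius in kpc); the second pairing is an assumption.
    for v, r in [(220.0, 25.0), (170.0, 125.0)]:
        mass = enclosed_mass(v, r)
        print(f"v_c = {v:.0f} km/s at r = {r:.0f} kpc -> M(<r) ~ {mass:.2e} Msun")
```

The simple estimate only gives the mass enclosed within the chosen radius, so it is expected to fall below a virial mass quoted at a larger radius.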

arXiv:2312.17677 [pdf, other]

Prompt Fuzzing for Fuzz Driver Generation

Authors: Yunlong Lyu, Yuxuan Xie, Peng Chen, Hao Chen

Abstract: Crafting high-quality fuzz drivers not only is time-consuming but also requires a deep understanding of the library. However, the state-of-the-art automatic fuzz driver generation techniques fall short of expectations. While fuzz drivers derived from consumer code can reach deep states, they have limited coverage. Conversely, interpretative fuzzing can explore most API calls but requires numerous attempts within a large search space. We propose PromptFuzz, a coverage-guided fuzzer for prompt fuzzing that iteratively generates fuzz drivers to explore undiscovered library code. To explore API usage in fuzz drivers during prompt fuzzing, we propose several key techniques: instructive program generation, erroneous program validation, coverage-guided prompt mutation, and constrained fuzzer scheduling. We implemented PromptFuzz and evaluated it on 14 real-world libraries. Compared with OSS-Fuzz and Hopper (the state-of-the-art fuzz driver generation tool), fuzz drivers generated by PromptFuzz achieved 1.61 and 1.63 times higher branch coverage than those by OSS-Fuzz and Hopper, respectively. Moreover, the fuzz drivers generated by PromptFuzz detected 33 genuine, new bugs out of a total of 49 crashes, out of which 30 bugs have been confirmed by their respective communities.

Submitted 29 May, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: To appear in the ACM CCS 2024

arXiv:2312.17611 [pdf, other]

P2M2-Net: Part-Aware Prompt-Guided Multimodal Point Cloud Completion

Authors: Linlian Jiang, Pan Chen, Ye Wang, Tieru Wu, Rui Ma

Abstract: Inferring missing regions from severely occluded point clouds is highly challenging. Especially for 3D shapes with rich geometry and structure details, inherent ambiguities exist in the unknown parts. Existing approaches either learn a one-to-one mapping in a supervised manner or train a generative model to synthesize the missing points for the completion of 3D point cloud shapes. These methods, however, lack controllability over the completion process, and the results are either deterministic or exhibit uncontrolled diversity. Inspired by prompt-driven data generation and editing, we propose a novel prompt-guided point cloud completion framework, coined P2M2-Net, to enable more controllable and more diverse shape completion. Given an input partial point cloud and a text prompt describing the part-aware information such as semantics and structure of the missing region, our Transformer-based completion network can efficiently fuse the multimodal features and generate diverse results following the prompt guidance. We train the P2M2-Net on a new large-scale PartNet-Prompt dataset and conduct extensive experiments on two challenging shape completion benchmarks. Quantitative and qualitative results show the efficacy of incorporating prompts for more controllable part-aware point cloud completion and generation. Code and data are available at https://github.com/JLU-ICL/P2M2-Net.

Submitted 29 December, 2023; originally announced December 2023.

Comments: Best Poster Award of CAD/Graphics 2023

arXiv:2312.17080 [pdf, other]

MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Authors: Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia

Abstract: In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, focusing on "reasoning about reasoning," hence termed meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Notably, while models like Deepseek-v2 and Claude3-Sonnet closely competed with GPT-4 in GSM8K, their performance disparities expanded dramatically in MR-GSM8K, with differences widening to over 20 absolute points, underscoring the significant challenge posed by our meta-reasoning approach.

Submitted 5 June, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: Code: https://github.com/dvlab-research/MR-GSM8K

arXiv:2312.16467 [pdf, other]

Transfer and Alignment Network for Generalized Category Discovery

Authors: Wenbin An, Feng Tian, Wenkai Shi, Yan Chen, Yaqiang Wu, Qianying Wang, Ping Chen

Abstract: Generalized Category Discovery is a crucial real-world task. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise.
Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. Our code and data are available at https://github.com/Lackel/TAN.

Submitted 27 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024

arXiv:2312.15960 [pdf, other]

MoTCoder: Elevating Large Language Models with Modular of Thought for Challenging Programming Tasks

Authors: Jingyao Li, Pengguang Chen, Jiaya Jia

Abstract: Large Language Models (LLMs) have showcased impressive capabilities in handling straightforward programming tasks. However, their performance tends to falter when confronted with more challenging programming problems. We observe that conventional models often generate solutions as monolithic code blocks, restricting their effectiveness in tackling intricate questions. To overcome this limitation, we present Modular-of-Thought Coder (MoTCoder). We introduce a pioneering framework for MoT instruction tuning, designed to promote the decomposition of tasks into logical sub-tasks and sub-modules. Our investigations reveal that, through the cultivation and utilization of sub-modules, MoTCoder significantly improves both the modularity and correctness of the generated solutions, leading to substantial relative pass@1 improvements of 12.9% on APPS and 9.43% on CodeContests. Our codes are available at https://github.com/dvlab-research/MoTCoder.

Submitted 5 January, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

Comments: Model: https://huggingface.co/**gyaoLi/MoTCoder-15B-v1.0. Code: https://github.com/dvlab-research/MoTCoder
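
For readers unfamiliar with the metric named in the abstract above, the sketch below shows a commonly used unbiased estimator of pass@k (given n generated samples per problem, of which c pass the tests) together with a helper for relative improvement. Whether this entry's evaluation uses exactly this estimator is an assumption, the function names are hypothetical, and the scores in the usage lines are made-up illustration values, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Illustrative numbers only.
    print(pass_at_k(n=10, c=3, k=1))
    print(relative_improvement(baseline=0.20, improved=0.25))
```

A relative improvement is measured against the baseline score, so it is a different quantity from an absolute gain in percentage points.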

arXiv:2312.15944 [pdf, other]

BAL: Balancing Diversity and Novelty for Active Learning

Authors: Jingyao Li, Pengguang Chen, Shaozuo Yu, Shu Liu, Jiaya Jia

Abstract: The objective of Active Learning is to strategically label a subset of the dataset to maximize performance within a predetermined labeling budget. In this study, we harness features acquired through self-supervised learning. We introduce a straightforward yet potent metric, Cluster Distance Difference, to identify diverse data. Subsequently, we introduce a novel framework, Balancing Active Learning (BAL), which constructs adaptive sub-pools to balance diverse and uncertain data. Our approach outperforms all established active learning methods on widely recognized benchmarks by 1.20%. Moreover, we assess the efficacy of our proposed framework under extended settings, encompassing both larger and smaller labeling budgets. Experimental results demonstrate that, when labeling 80% of the samples, the performance of the current SOTA method declines by 0.74%, whereas our proposed BAL achieves performance comparable to the full dataset. Codes are available at https://github.com/JulietLJY/BAL.

Submitted 26 December, 2023; originally announced December 2023.

Comments: Our paper is accepted by TPAMI

arXiv:2312.15895 [pdf, other]

Semantic-aware SAM for Point-Prompted Instance Segmentation

Authors: Zhaoyang Wei, Pengfei Chen, Xuehui Yu, Guorong Li, Jianbin Jiao, Zhenjun Han

Abstract: Single-point annotation in visual tasks, with the goal of minimizing labelling costs, is becoming increasingly prominent in research. Recently, visual foundation models, such as Segment Anything (SAM), have gained widespread usage due to their robust zero-shot capabilities and exceptional annotation performance. However, SAM's class-agnostic output and high confidence in local segmentation introduce 'semantic ambiguity', posing a challenge for precise category-specific segmentation. In this paper, we introduce a cost-effective category-specific segmenter using SAM. To tackle this challenge, we have devised a Semantic-Aware Instance Segmentation Network (SAPNet) that integrates Multiple Instance Learning (MIL) with matching capability and SAM with point prompts. SAPNet strategically selects the most representative mask proposals generated by SAM to supervise segmentation, with a specific focus on object category information. Moreover, we introduce the Point Distance Guidance and Box Mining Strategy to mitigate inherent challenges: 'group' and 'local' issues in weakly supervised segmentation. These strategies serve to further enhance the overall segmentation performance. The experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed SAPNet, emphasizing its semantic matching capabilities and its potential to advance point-prompted instance segmentation. The code will be made publicly available.

Submitted 26 May, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

Comments: 16 pages, 8 figures, CVPR2024

arXiv:2312.14810 [pdf, other]

Accurate, scalable, and efficient Bayesian Optimal Experimental Design with derivative-informed neural operators

Authors: Jinwoo Go, Peng Chen

Abstract: We consider optimal experimental design (OED) problems in selecting the most informative observation sensors to estimate model parameters in a Bayesian framework. Such problems are computationally prohibitive when the parameter-to-observable (PtO) map is expensive to evaluate, the parameters are high-dimensional, and the optimization for sensor selection is combinatorial and high-dimensional. To address these challenges, we develop an accurate, scalable, and efficient computational framework based on derivative-informed neural operators (DINOs). The derivative of the PtO map is essential for accurate evaluation of the optimality criteria of OED in our consideration. We take the key advantage of DINOs, a class of neural operators trained with derivative information, to achieve high approximate accuracy of not only the PtO map but also, more importantly, its derivative. Moreover, we develop scalable and efficient computation of the optimality criteria based on DINOs and propose a modified swapping greedy algorithm for its optimization. We demonstrate that the proposed method is scalable to preserve the accuracy for increasing parameter dimensions and achieves high computational efficiency, with an over 1000x speedup accounting for both offline construction and online evaluation costs, compared to high-fidelity Bayesian OED solutions for a three-dimensional nonlinear convection-diffusion-reaction example with tens of thousands of parameters.

Submitted 27 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

MSC Class: 62K05; 35Q62; 62F15; 35R30; 35Q93; 65C60; 90C27

ACM Class: G.1.8; I.5.2; I.6.4

arXiv:2312.14018 [pdf, ps, other]

Enabling Secure Wireless Communications via Movable Antennas

Authors: Zhenqiao Cheng, Nanxi Li, Jianchi Zhu, Xiaoming She, Chongjun Ouyang, Peng Chen

Abstract: A pioneering secure transmission scheme is proposed, which harnesses movable antennas (MAs) to optimize antenna positions for augmenting the physical layer security. Particularly, an MA-enabled secure wireless system is considered, where a multi-antenna transmitter communicates with a single-antenna receiver in the presence of an eavesdropper. The beamformer and antenna positions at the transmitter are jointly optimized under two criteria: power consumption minimization and secrecy rate maximization. For each scenario, a novel suboptimal algorithm is proposed to tackle the resulting nonconvex optimization problem, capitalizing on the approaches of alternating optimization and gradient descent. Numerical results demonstrate that the proposed MA systems significantly improve physical layer security compared to various benchmark schemes relying on conventional fixed-position antennas (FPAs).

Submitted 21 December, 2023; originally announced December 2023.

Comments: Accepted by IEEE ICASSP 2024

arXiv:2312.14002 [pdf, other]

Tensor Network Finite-Size Scaling for Two-Dimensional 3-state Clock Model

Authors: Debasmita Maiti, Sing-Hong Chan, Pochung Chen

Abstract: We benchmark the recently proposed tensor-network-based finite-size scaling analysis of Phys. Rev. B 107, 205123 (2023) against the two-dimensional classical 3-state clock model. Due to the higher complexity of the model, more complicated crossover behavior is observed. We advocate that the crossover behavior can be understood from the perspective of finite bond dimension inducing a relevant perturbation. This leads to a general strategy to best estimate the critical properties for a given set of control parameters. For the critical temperature $T_c$, an error of the order of $10^{-6}$ can be reached with bond dimension $D=70$. On the other hand, with bond dimension $D=60$, the error of the critical exponents $\nu, \beta, \alpha$ is of the order of $10^{-3}$. Increasing the bond dimension to $D=90$, the error of $\nu$ and $\beta$ can be reduced to the order of $10^{-4}$, but for some calculations numerical instability starts to emerge. In all cases our results indicate that the errors can be systematically reduced by increasing the bond dimension and the stacking number.

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2312.12436 [pdf, other]

A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Authors: Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun

Abstract: The surge of interest towards Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the ground up for multi-modality. In light of the superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them concerning domain generalizability. Specifically, GPT-4V tends to elaborate detailed explanations and intermediate steps, and Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V.
Our early investigation of Gemini also observes some common issues of MLLMs, indicating that there still remains a considerable distance towards artificial general intelligence. Our project for tracking the progress of MLLM is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

Submitted 20 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Total 120 pages. See our project at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

arXiv:2312.11911 [pdf, other]

EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State Estimation and 3D Dense Mapping

Authors: Weipeng Guan, Peiyu Chen, Huibin Zhao, Yu Wang, Peng Lu

Abstract: Event cameras are bio-inspired, motion-activated sensors that demonstrate substantial potential in handling challenging situations, such as motion blur and high-dynamic range. In this paper, we propose EVI-SAM to tackle the problem of 6 DoF pose tracking and 3D reconstruction using a monocular event camera. A novel event-based hybrid tracking framework is designed to estimate the pose, leveraging the robustness of feature matching and the precision of direct alignment. Specifically, we develop an event-based 2D-2D alignment to construct the photometric constraint, and tightly integrate it with the event-based reprojection constraint. The mapping module recovers the dense and colorful depth of the scene through the image-guided event-based mapping method. Subsequently, the appearance, texture, and surface mesh of the 3D scene can be reconstructed by fusing the dense depth map from multiple viewpoints using truncated signed distance function (TSDF) fusion. To the best of our knowledge, this is the first non-learning work to realize event-based dense mapping. Numerical evaluations are performed on both publicly available and self-collected datasets, which qualitatively and quantitatively demonstrate the superior performance of our method. Our EVI-SAM effectively balances accuracy and robustness while maintaining computational efficiency, showcasing superior pose tracking and dense mapping performance in challenging scenarios. Video Demo: https://youtu.be/Nn40U4e5Si8.

Submitted 23 May, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.11686 [pdf]

All-optical modulation with single-photons using electron avalanche

Authors: Demid V. Sychev, Peigang Chen, Morris Yang, Colton Fruhling, Alexei Lagutchev, Alexander V. Kildishev, Alexandra Boltasseva, Vladimir M. Shalaev

Abstract: The distinctive characteristics of light such as high-speed propagation, low-loss, low cross-talk and power consumption as well as quantum properties, make it uniquely suitable for various critical applications in communication, high-resolution imaging, optical computing, and emerging quantum information technologies. One limiting factor though is the weak optical nonlinearity of conventional media that poses challenges for the control and manipulation of light, especially with ultra-low, few-photon-level intensities. Notably, creating a photonic transistor working at single-photon intensities remains an outstanding challenge. In this work, we demonstrate all-optical modulation using a beam with single-photon intensity. Such low-energy control is enabled by the electron avalanche process in a semiconductor triggered by the impact ionization of charge carriers. This corresponds to achieving a nonlinear refractive index of $n_2 \sim 7 \times 10^{-3}\ \mathrm{m^2/W}$, which is two orders of magnitude higher than in the best nonlinear optical media (Table S1). Our approach opens up the possibility of terahertz-speed optical switching at the single-photon level, which could enable novel photonic devices and future quantum photonic information processing and computing, fast logic gates, and beyond. Importantly, this approach could lead to industry-ready CMOS-compatible and chip-integrated optical modulation platforms operating with single photons.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.11583 [pdf, other]

AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System

Authors: Chengyuan Zhu, Yiyuan Yang, Kaixiang Yang, Haifeng Zhang, Qinmin Yang, C. L. Philip Chen

Abstract: The application of artificial intelligence technology has greatly enhanced and fortified the safety of energy pipelines, particularly in safeguarding against external threats. The predominant methods involve the integration of intelligent sensors to detect external vibration, enabling the identification of event types and locations, thereby replacing manual detection methods. However, practical implementation has exposed a limitation in current methods - their constrained ability to accurately discern the spatial dimensions of external signals, which complicates the authentication of threat events. Our research endeavors to overcome the above issues by harnessing deep learning techniques to achieve a more fine-grained recognition and localization process. This refinement is crucial in effectively identifying genuine threats to pipelines, thus enhancing the safety of energy transportation. This paper proposes a radial threat estimation method for energy pipelines based on distributed optical fiber sensing technology. Specifically, we introduce a continuous multi-view and multi-domain feature fusion methodology to extract comprehensive signal features and construct a threat estimation and recognition network. The utilization of collected acoustic signal data is optimized, and the underlying principle is elucidated. Moreover, we incorporate the concept of transfer learning through a pre-trained model, enhancing both recognition accuracy and training efficiency.
Empirical evidence gathered from real-world scenarios underscores the efficacy of our method, notably in its substantial reduction of false alarms and remarkable gains in recognition accuracy. More generally, our method exhibits versatility and can be extrapolated to a broader spectrum of recognition tasks and scenarios.

Submitted 25 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

Showing 151–200 of 2,649 results for author: Chen, P