Search | arXiv e-print repository

Fidelity and criticality in the nonreciprocal Aubry-André-Harper model

Authors: Chen-Chang Zeng, Zhen Cai, Guang-Heng Wang, Gaoyong Sun

Abstract: We study the critical behaviors of the ground and first excited states in the one-dimensional nonreciprocal Aubry-André-Harper model using both the self-normal and biorthogonal fidelity susceptibilities. We demonstrate that fidelity susceptibilities serve as a probe for the phase transition in the nonreciprocal AAH model. For ground states, characterized by real eigenenergies across the entire regime, both fidelity susceptibilities near the critical points scale as $N^{2}$, akin to the Hermitian AAH model. However, for the first-excited states, where $\mathcal{PT}$ transitions occur, the fidelity susceptibilities exhibit distinct scaling laws, contingent upon whether the lattice consists of even or odd sites. For even lattices, both the self-normal and biorthogonal fidelity susceptibilities near the critical points continue to scale as $N^{2}$. In contrast, for odd lattices, the biorthogonal fidelity susceptibilities diverge, while the self-normal fidelity susceptibilities exhibit linear behavior, indicating a novel scaling law.

Submitted 30 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: 7 pages, 4 figures

arXiv:2404.16425 [pdf, other]

Soft X-ray prompt emission from a high-redshift gamma-ray burst EP240315a

Authors: Y. Liu, H. Sun, D. Xu, D. S. Svinkin, J. Delaunay, N. R. Tanvir, H. Gao, C. Zhang, Y. Chen, X. -F. Wu, B. Zhang, W. Yuan, J. An, G. Bruni, D. D. Frederiks, G. Ghirlanda, J. -W. Hu, A. Li, C. -K. Li, J. -D. Li, D. B. Malesani, L. Piro, G. Raman, R. Ricci, E. Troja, et al. (170 additional authors not shown)

Abstract: Long gamma-ray bursts (GRBs) are believed to originate from core collapse of massive stars. High-redshift GRBs can probe the star formation and reionization history of the early universe, but their detection remains rare. Here we report the detection of a GRB triggered in the 0.5--4 keV band by the Wide-field X-ray Telescope (WXT) on board the Einstein Probe (EP) mission, designated as EP240315a, whose bright peak was also detected by the Swift Burst Alert Telescope and Konus-Wind through off-line analyses. At a redshift of $z=4.859$, EP240315a showed a much longer and more complicated light curve in the soft X-ray band than in gamma-rays. Benefiting from a large field-of-view ($\sim$3600 deg$^2$) and a high sensitivity, EP-WXT captured the earlier engine activation and extended late engine activity through a continuous detection. With a peak X-ray flux at the faint end of previously known high-$z$ GRBs, the detection of EP240315a demonstrates the great potential for EP to study the early universe via GRBs.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 41 pages, 8 figures, 7 tables

arXiv:2404.15963 [pdf, other]

Cosmic Himalayas: The Highest Quasar Density Peak Identified in a 10,000 deg$^2$ Sky with Spatial Discrepancies between Galaxies, Quasars, and IGM HI

Authors: Yongming Liang, Masami Ouchi, Dongsheng Sun, Nobunari Kashikawa, Zheng Cai, Sebastiano Cantalupo, Kentaro Nagamine, Hidenobu Yajima, Takanobu Kirihara, Haibin Zhang, Mingyu Li, Rhythm Shimakawa, Xiaohui Fan, Kei Ito, Masayuki Tanaka, Yuichi Harikane, J. Xavier Prochaska, Andrea Travascio, Weichen Wang, Martin Elvis, Giuseppina Fabbiano, Junya Arita, Masafusa Onoue, John D. Silverman, Dongdong Shi, et al. (5 additional authors not shown)

Abstract: We report the identification of a quasar overdensity in the BOSSJ0210 field, dubbed Cosmic Himalayas, consisting of 11 quasars at $z=2.16-2.20$, the densest overdensity of quasars ($17σ$) in the $\sim$10,000 deg$^2$ of the Sloan Digital Sky Survey. We present the spatial distributions of galaxies and quasars and an HI absorption map of the intergalactic medium (IGM). On the map of 465 galaxies selected from the MAMMOTH-Subaru survey, we find two galaxy density peaks that do not fall on the quasar overdensity but instead exist at the northwest and southeast sides, approximately 25 $h^{-1}$ comoving-Mpc apart from the quasar overdensity. With a spatial resolution of 15 $h^{-1}$ comoving Mpc in projection, we produce a three-dimensional HI tomography map by the IGM Ly$α$ forest in the spectra of 23 SDSS/eBOSS quasars behind the quasar overdensity. Surprisingly, the quasar overdensity coincides with neither an absorption peak nor a transmission peak of IGM HI but lies near the border separating opaque and transparent volumes, with the more luminous quasars located in an environment with lesser IGM HI. Hence remarkably, the overdensity region traced by the 11 quasars, albeit all in coherently active states, has no clear coincidence with peaks of galaxies or HI absorption densities. Current physical scenarios with mixtures of HI overdensities and quasar photoionization cannot fully interpret the emergence of Cosmic Himalayas, suggesting this peculiar structure is an excellent laboratory to unveil the interplay between galaxies, quasars, and the IGM.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 19 pages, 11 figures, submitted to ApJ, comments are welcome

arXiv:2404.15506 [pdf, other]

Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

Authors: Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen

Abstract: We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. SoTA monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different-type annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2.

Submitted 21 March, 2024; originally announced April 2024.

Comments: Our project page is at https://JUGGHM.github.io/Metric3Dv2. arXiv admin note: substantial text overlap with arXiv:2307.10984

arXiv:2404.15127 [pdf, other]

MedDr: Diagnosis-Guided Bootstrapping for Large-Scale Medical Vision-Language Learning

Authors: Sunan He, Yuxiang Nie, Zhixuan Chen, Zhiyuan Cai, Hongmei Wang, Shu Yang, Hao Chen

Abstract: The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we developed MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.11551 [pdf, other]

doi 10.1093/mnras/stae976

Euclid view of dusty star forming galaxies at z>~1.5 detected in wide area submillimetre surveys

Authors: Dipanjan Mitra, Mattia Negrello, Gianfranco De Zotti, Zhen-Yi Cai

Abstract: We investigate the constraints provided by the Euclid space observatory on the physical properties of dusty star forming galaxies (DSFGs) at z>~1.5 detected in wide area submillimetre surveys with Herschel. We adopt a physical model for the high z progenitors of spheroidal galaxies, which form the bulk of the DSFGs at z>~1.5. We improve the model by combining the output of the equations of the model with a formalism for the spectral energy distribution (SED). After optimising the SED parameters to reproduce the measured infrared luminosity function and the number counts of DSFGs, we simulated a sample of DSFGs over 100 sq deg and then applied a 5 sigma detection limit of 37 mJy at 250 microns. We estimated the redshifts from the Euclid data and then fitted the Euclid and Herschel photometry with the code CIGALE to extract the physical parameters. We found that 100% of the Herschel galaxies are detected in all 4 Euclid bands above 3 sigma. For 87% of the sources the accuracy on 1+z is better than 15%. The sample comprises mostly massive, log(Mstar/Msun)~10.5-12.9, highly star forming, log(SFR/Msun/yr)~1.5-4, dusty, log(Mdust/Msun)~7.5-9.9, galaxies. The measured stellar masses have a dispersion of 0.19 dex around the true value, thus showing that Euclid will provide reliable stellar mass estimates for the majority of the bright DSFGs at z>~1.5 detected by Herschel. We also explored the effect of complementing the Euclid photometry with that from Vera C. Rubin Observatory/LSST.

Submitted 17 April, 2024; originally announced April 2024.

Comments: 24 pages, 24 figures

arXiv:2404.10573 [pdf, other]

AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation

Authors: Lijun Liu, Jiali Yang, Jianfei Song, Xinglin Yang, Lele Niu, Zeqi Cai, Hui Shi, Tingjun Hou, Chang-yu Hsieh, Weiran Shen, Yafeng Deng

Abstract: Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. We transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions VI and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications.

Submitted 17 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.08491 [pdf, other]

Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Yufeng He, Kaikai An, Baobao Chang

Abstract: Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervised fine-tuning of the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparity across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full resource to limited resource settings. The code for our approach is available at https://github.com/pkunlp-icler/ALSACE.

Submitted 12 April, 2024; originally announced April 2024.

Comments: NAACL 2024

arXiv:2404.08180 [pdf, other]

Sizes of active galactic nuclei inhomogeneous disks -- large in microlensing, small in reverberation mapping

Authors: Guowei Ren, Mouyuan Sun, Jun-Xian Wang, Zhen-Yi Cai

Abstract: Magnetohydrodynamics (MHD) turbulence can drive significant temperature fluctuations in the accretion disk of an active galactic nucleus (AGN). As a result, the disk can be highly inhomogeneous and has a half-light radius larger than the static Shakura \& Sunyaev Disk (SSD), in agreement with quasar microlensing observations. Meanwhile, the accretion-disk sizes can also be determined using continuum reverberation mappings which measure interband cross correlations and time lags. The interband time lags are often understood in the X-ray reprocessing scenario. Here we show that the interband continuum time lags of the X-ray reprocessing of an inhomogeneous disk are similar to or even smaller than those of a static SSD. Consequently, the X-ray reprocessing of an inhomogeneous disk cannot account for the recent continuum reverberation mappings of some Seyfert 1 AGNs, whose measured time lags are larger than those of a static SSD. In contrast to the tight correlation between UV/optical variations, the cross correlation between X-ray and disk emission is rather weak in this model; this behavior is consistent with recent continuum reverberation mappings. Moreover, the time lags in this model are anti-correlated with the amplitude of disk temperature fluctuations. Our results suggest that the temperature fluctuations should be properly considered when modeling interband continuum time lags.

Submitted 11 April, 2024; originally announced April 2024.

Comments: 14 pages, 7 figures, Accepted to ApJ

arXiv:2404.08045 [pdf, other]

JWST Discovery of $40+$ Microlensed Stars in a Magnified Galaxy, the "Dragon" behind Abell 370

Authors: Yoshinobu Fudamoto, Fengwu Sun, Jose M. Diego, Liang Dai, Masamune Oguri, Adi Zitrin, Erik Zackrisson, Mathilde Jauzac, David J. Lagattuta, Eiichi Egami, Edoardo Iani, Rogier A. Windhorst, Katsuya T. Abe, Franz Erik Bauer, Fuyan Bian, Rachana Bhatawdekar, Thomas J. Broadhurst, Zheng Cai, Chian-Chou Chen, Wenlei Chen, Seth H. Cohen, Christopher J. Conselice, Daniel Espada, Nicholas Foo, Brenda L. Frye, et al. (21 additional authors not shown)

Abstract: Strong gravitational magnification by massive galaxy clusters enables us to detect faint background sources, resolve their detailed internal structures, and in the most extreme cases identify and study individual stars in distant galaxies. Highly magnified individual stars allow for a wide range of applications, including studies of stellar populations in distant galaxies and constraining small-scale dark matter structures. However, these applications have been hampered by the small number of events observed, as typically one or a few stars are identified from each distant galaxy. Here, we report the discovery of 46 significant microlensed stars in a single strongly-lensed high-redshift galaxy behind the Abell 370 cluster at redshift of 0.725 when the Universe was half of its current age (dubbed the ``Dragon arc''), based on two observations separated by one year with the James Webb Space Telescope ({\it JWST}). These events are mostly found near the expected lensing critical curves, suggesting that these are magnified individual stars that appear as transients from intracluster stellar microlenses. Through multi-wavelength photometry and colors, we constrain stellar types and find that many of them are consistent with red giants/supergiants magnified by factors of thousands. This finding reveals an unprecedentedly high occurrence of microlensing events in the Dragon arc, and proves that {\it JWST}'s time-domain observations open up the possibility of conducting statistical studies of high-redshift stars and subgalactic scale perturbations in the lensing dark matter field.

Submitted 11 April, 2024; originally announced April 2024.

Comments: 15 pages, 4 figures, 1 table, submitted to Nature Astronomy

arXiv:2404.06995 [pdf, other]

Model-free Change-point Detection Using Modern Classifiers

Authors: Rohit Kanrar, Feiyu Jiang, Zhanrui Cai

Abstract: In contemporary data analysis, it is increasingly common to work with non-stationary complex datasets. These datasets typically extend beyond the classical low-dimensional Euclidean space, making it challenging to detect shifts in their distribution without relying on strong structural assumptions. This paper introduces a novel offline change-point detection method that leverages modern classifiers developed in the machine-learning community. With suitable data splitting, the test statistic is constructed through sequential computation of the Area Under the Curve (AUC) of a classifier, which is trained on data segments on both ends of the sequence. It is shown that the resulting AUC process attains its maximum at the true change-point location, which facilitates the change-point estimation. The proposed method is characterized by its complete nonparametric nature, significant versatility, considerable flexibility, and absence of stringent assumptions pertaining to the underlying data or any distributional shifts. Theoretically, we derive the limiting pivotal distribution of the proposed test statistic under the null, as well as the asymptotic behaviors under both local and fixed alternatives. The weak consistency of the change-point estimator is provided. Extensive simulation studies and the analysis of two real-world datasets illustrate the superior performance of our approach compared to existing model-free change-point detection methods.

Submitted 10 April, 2024; originally announced April 2024.
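
Illustrative note on the entry above: the abstract describes training a classifier on segments at both ends of the sequence and then tracking the AUC over candidate split points in the remaining data. The following is a minimal sketch of that idea only, not the authors' implementation. The function names, the `trim` and `min_seg` parameters, and the toy one-dimensional "mean-difference" scorer standing in for a modern classifier are all assumptions made for illustration.

```python
import random


def auc(neg, pos):
    """Mann-Whitney estimate of P(pos score > neg score); ties count 1/2."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fit_mean_difference_scorer(left, right):
    """Toy stand-in for a classifier on 1-D data (assumption, not from the paper).

    Returns a scoring function that is larger for points resembling the
    right-end segment than for points resembling the left-end segment.
    """
    mu_left = sum(left) / len(left)
    mu_right = sum(right) / len(right)
    direction = 1.0 if mu_right >= mu_left else -1.0
    return lambda x: direction * x


def estimate_change_point(series, trim=0.15, min_seg=5):
    """Return (estimated change index, AUC at that index).

    The first and last `trim` fraction of the series train the scorer;
    the held-out middle part is scanned over candidate split points.
    """
    n = len(series)
    m = max(int(trim * n), 1)
    scorer = fit_mean_difference_scorer(series[:m], series[-m:])
    scores = [scorer(x) for x in series[m:n - m]]
    best_k, best_auc = None, -1.0
    for k in range(min_seg, len(scores) - min_seg + 1):
        a = auc(scores[:k], scores[k:])  # "before" vs "after" the candidate
        if a > best_auc:
            best_k, best_auc = k, a
    return m + best_k, best_auc


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical data: a mean shift halfway through the sequence.
    data = [random.gauss(0.0, 1.0) for _ in range(60)]
    data += [random.gauss(3.0, 1.0) for _ in range(60)]
    print(estimate_change_point(data))
```

The split with the largest AUC is taken as the change-point estimate, mirroring the abstract's statement that the AUC process peaks at the true change location. The test statistic, its limiting distribution, and the data-splitting scheme analysed in the paper are not reproduced here.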

arXiv:2404.05445 [pdf, other]

Unsupervised Training of Convex Regularizers using Maximum Likelihood Estimation

Authors: Hong Ye Tan, Ziruo Cai, Marcelo Pereyra, Subhadip Mukherjee, Junqi Tang, Carola-Bibiane Schönlieb

Abstract: Unsupervised learning is a training approach in the situation where ground truth data is unavailable, such as inverse imaging problems. We present an unsupervised Bayesian training approach to learning convex neural network regularizers using a fixed noisy dataset, based on a dual Markov chain estimation method. Compared to classical supervised adversarial regularization methods, where there is access to both clean images as well as unlimited noisy copies, we demonstrate close performance on natural image Gaussian deconvolution and Poisson denoising tasks.

Submitted 8 April, 2024; originally announced April 2024.

MSC Class: 62C12; 62F15; 65C40; 65J22

arXiv:2404.05064 [pdf, other]

A Structure-Guided Gauss-Newton Method for Shallow ReLU Neural Network

Authors: Zhiqiang Cai, Tong Ding, Min Liu, Xinyu Liu, Jianlin Xia

Abstract: In this paper, we propose a structure-guided Gauss-Newton (SgGN) method for solving least squares problems using a shallow ReLU neural network. The method effectively takes advantage of both the least squares structure and the neural network structure of the objective function. By categorizing the weights and biases of the hidden and output layers of the network as nonlinear and linear parameters, respectively, the method iterates back and forth between the nonlinear and linear parameters. The nonlinear parameters are updated by a damped Gauss-Newton method and the linear ones are updated by a linear solver. Moreover, at the Gauss-Newton step, a special form of the Gauss-Newton matrix is derived for the shallow ReLU neural network and is used for efficient iterations. It is shown that the corresponding mass and Gauss-Newton matrices in the respective linear and nonlinear steps are symmetric and positive definite under reasonable assumptions. Thus, the SgGN method naturally produces an effective search direction without the need for additional techniques like shifting in the Levenberg-Marquardt method to achieve invertibility of the Gauss-Newton matrix. The convergence and accuracy of the method are demonstrated numerically for several challenging function approximation problems, especially those with discontinuities or sharp transition layers that pose significant challenges for commonly used training algorithms in machine learning.

Submitted 7 April, 2024; originally announced April 2024.

MSC Class: 65D15; 65K10

arXiv:2404.04469 [pdf, other]

Mixed-Query Transformer: A Unified Image Segmentation Architecture

Authors: Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto

Abstract: Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task. In this paper, we introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights. To enable this, we propose a mixed query strategy, which can effectively and dynamically accommodate different types of objects without heuristic designs. In addition, the unified architecture allows us to use data augmentation with synthetic masks and captions to further improve model generalization. Experiments demonstrate that MQ-Former can not only effectively handle multiple segmentation datasets and tasks compared to specialized state-of-the-art models with competitive performance, but also generalize better to open-set segmentation tasks, evidenced by over 7 points higher performance than the prior art on the open-vocabulary SeginW benchmark.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.04458 [pdf, other]

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

Authors: Simindokht Jahangard, Zhixi Cai, Shiki Wen, Hamid Rezatofighi

Abstract: Understanding human social behaviour is crucial in computer vision and robotics. Micro-level observations like individual actions fall short, necessitating a comprehensive approach that considers individual behaviour, intra-group dynamics, and social group levels for a thorough understanding. To address dataset limitations, this paper introduces JRDB-Social, an extension of JRDB. Designed to fill gaps in human understanding across diverse indoor and outdoor social contexts, JRDB-Social provides annotations at three levels: individual attributes, intra-group interactions, and social group context. This dataset aims to enhance our grasp of human social dynamics for robotic applications. Utilizing the recent cutting-edge multi-modal large language models, we evaluated our benchmark to explore their capacity to decipher social human behaviour.

Submitted 5 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024. Project page: https://jrdb.erc.monash.edu/dataset/social

arXiv:2404.01372 [pdf]

Strong interactions and isospin symmetry breaking in a supermoiré lattice

Authors: Yonglong Xie, Andrew T. Pierce, Jeong Min Park, Daniel E. Parker, Jie Wang, Patrick Ledwith, Zhuozhen Cai, Kenji Watanabe, Takashi Taniguchi, Eslam Khalaf, Ashvin Vishwanath, Pablo Jarillo-Herrero, Amir Yacoby

Abstract: In multilayer moiré heterostructures, the interference of multiple twist angles ubiquitously leads to tunable ultra-long-wavelength patterns known as supermoiré lattices. However, their impact on the system's many-body electronic phase diagram remains largely unexplored. We present local compressibility measurements revealing numerous incompressible states resulting from supermoiré-lattice-scale isospin symmetry breaking driven by strong interactions. By using the supermoiré lattice occupancy as a probe of isospin symmetry, we observe an unexpected doubling of the miniband filling near $ν=-2$, possibly indicating a hidden phase transition or normal-state pairing proximal to the superconducting phase. Our work establishes supermoiré lattices as a tunable parameter for designing novel quantum phases and an effective tool for unraveling correlated phenomena in moiré materials.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.01284 [pdf, other]

Large Motion Model for Unified Multi-Modal Motion Generation

Authors: Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu

Abstract: Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism ArtAttention that incorporates body part-aware modeling into Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves competitive performance across various standard motion generation tasks over state-of-the-art specialist models. Notably, LMM exhibits strong generalization capabilities and emerging properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research.

Submitted 1 April, 2024; originally announced April 2024.

Comments: Homepage: https://mingyuan-zhang.github.io/projects/LMM.html

arXiv:2404.00139 [pdf]

Security Risks Concerns of Generative AI in the IoT

Authors: Honghui Xu, Yingshu Li, Olusesi Balogun, Shaoen Wu, Yue Wang, Zhipeng Cai

Abstract: In an era where the Internet of Things (IoT) intersects increasingly with generative Artificial Intelligence (AI), this article scrutinizes the emergent security risks inherent in this integration. We explore how generative AI drives innovation in IoT and we analyze the potential for data breaches when using generative AI and the misuse of generative AI technologies in IoT ecosystems. These risks not only threaten the privacy and efficiency of IoT systems but also pose broader implications for trust and safety in AI-driven environments. The discussion in this article extends to strategic approaches for mitigating these risks, including the development of robust security protocols, the multi-layered security approaches, and the adoption of AI technological solutions. Through a comprehensive analysis, this article aims to shed light on the critical balance between embracing AI advancements and ensuring stringent security in IoT, providing insights into the future direction of these intertwined technologies.

Submitted 29 March, 2024; originally announced April 2024.

Comments: 6 pages, 2 figures

arXiv:2403.20188 [pdf, other]

Distributed Swarm Learning for Edge Internet of Things

Authors: Yue Wang, Zhi Tian, FXin Fan, Zhipeng Cai, Cameron Nowzari, Kai Zeng

Abstract: The rapid growth of Internet of Things (IoT) has led to the widespread deployment of smart IoT devices at wireless edge for collaborative machine learning tasks, ushering in a new era of edge learning. With a huge number of hardware-constrained IoT devices operating in resource-limited wireless networks, edge learning encounters substantial challenges, including communication and computation bottlenecks, device and data heterogeneity, security risks, privacy leakages, non-convex optimization, and complex wireless environments. To address these issues, this article explores a novel framework known as distributed swarm learning (DSL), which combines artificial intelligence and biological swarm intelligence in a holistic manner. By harnessing advanced signal processing and communications, DSL provides efficient solutions and robust tools for large-scale IoT at the edge of wireless networks.

Submitted 29 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2210.16705

arXiv:2403.17934 [pdf, other]

AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Authors: Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, Zhongang Cai

Abstract: Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves the human body, hand, and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introducing distractions, and 3) lacking inter-association among different persons and body parts, inevitably causing performance degradation, especially for crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats multi-person whole-body mesh recovery task as a progressive set prediction problem with various sequential detection. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then, we introduce a joint-related token to probe the human joint in the image and encode a fine-grained local feature, which collaborates with the global feature to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods by a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.

Submitted 26 March, 2024; originally announced March 2024.

Comments: Homepage: https://ttxskk.github.io/AiOS/

arXiv:2403.17297 [pdf, other]

InternLM2 Technical Report

Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang , et al. (75 additional authors not shown)

Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.15694 [pdf, other]

Group Benefits Instances Selection for Data Purification

Authors: Zhenhuang Cai, Chuanyi Zhang, Dan Huang, Yuanbo Chen, Xiuyun Guan, Yazhou Yao

Abstract: Manually annotating datasets for training deep models is very labor-intensive and time-consuming. To overcome such inferiority, directly leveraging web images to conduct training data becomes a natural choice. Nevertheless, the presence of label noise in web data usually degrades the model performance. Existing methods for combating label noise are typically designed and tested on synthetic noisy datasets. However, they tend to fail to achieve satisfying results on real-world noisy datasets. To this end, we propose a method named GRIP to alleviate the noisy label problem for both synthetic and real-world datasets. Specifically, GRIP utilizes a group regularization strategy that estimates class soft labels to improve noise robustness. Soft label supervision reduces overfitting on noisy labels and learns inter-class similarities to benefit classification. Furthermore, an instance purification operation globally identifies noisy labels by measuring the difference between each training sample and its class soft label. Through operations at both group and instance levels, our approach integrates the advantages of noise-robust and noise-cleaning methods and remarkably alleviates the performance degradation caused by noisy labels. Comprehensive experimental results on synthetic and real-world datasets demonstrate the superiority of GRIP over the existing state-of-the-art methods.

Submitted 22 March, 2024; originally announced March 2024.

Comments: accepted by IEEE Intelligent Systems

arXiv:2403.15407 [pdf, other]

X-AMR Annotation Tool

Authors: Shafiuddin Rehan Ahmed, Jon Z. Cai, Martha Palmer, James H. Martin

Abstract: This paper presents a novel Cross-document Abstract Meaning Representation (X-AMR) annotation tool designed for annotating key corpus-level event semantics. Leveraging machine assistance through the Prodigy Annotation Tool, we enhance the user experience, ensuring ease and efficiency in the annotation process. Through empirical analyses, we demonstrate the effectiveness of our tool in augmenting an existing event corpus, highlighting its advantages when integrated with GPT-4. Code and annotations: https://github.com/ahmeshaf/gpt_coref

Submitted 29 February, 2024; originally announced March 2024.

Comments: EACL 2024 System Demonstration

arXiv:2403.14979 [pdf, other]

Efficient aerodynamic coefficients prediction with a long sequence neural network

Authors: Zemin Cai, Zhengyuan Fan, Tianshu Liu

Abstract: Traditionally, deriving aerodynamic parameters for an airfoil via Computational Fluid Dynamics requires significant time and effort. Recent approaches employ neural networks to replace this process, but they still grapple with challenges such as a lack of end-to-end training and interpretability. A novel and more efficient neural network is proposed in this paper, called AirfoilNet. AirfoilNet seamlessly merges mathematical computations with neural networks, thereby augmenting interpretability. It encodes grey-scale airfoil images into a lower-dimensional space for computation with Reynolds number, angle of attack, and geometric coordinates of airfoils. The calculated features are then fed into prediction heads for aerodynamic coefficient predictions, and the entire process is end-to-end. Furthermore, two different prediction heads, Gated Recurrent Unit Net (GRUNet) and Residual Multi-Layer Perceptron (ResMLP), are designed to support our iteratively refined prediction scheme. Comprehensive analysis of experimental results underscores AirfoilNet's robust prediction accuracy, generalization capability, and swift inference.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.13678 [pdf, other]

AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Authors: Jun Yu, Zerui Zhang, Zhihong Wei, Gongpeng Zhao, Zhongpeng Cai, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu

Abstract: Leveraging the synergy of both audio data and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild settings. Traditional methods for integrating such multimodal information often stumble, leading to less-than-ideal outcomes in the task of facial action unit detection. To overcome these shortcomings, we propose a novel approach utilizing audio-visual multimodal data. This method enhances audio feature extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features alongside a pre-trained VGGish network. Moreover, this paper adaptively captures fusion features across modalities by modeling the temporal relationships, and utilizes a pre-trained GPT-2 model for sophisticated context-aware fusion of multimodal information. Our method notably improves the accuracy of AU detection by understanding the temporal and contextual nuances of the data, showcasing significant advancements in the comprehension of intricate scenarios. These findings underscore the potential of integrating temporal dynamics and contextual interpretation, paving the way for future research endeavors.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.13027 [pdf, other]

Towards Better Statistical Understanding of Watermarking LLMs

Authors: Zhongze Cai, Shang Liu, Hanzhao Wang, Huaiyang Zhong, Xiaocheng Li

Abstract: In this paper, we study the problem of watermarking large language models (LLMs). We consider the trade-off between model distortion and detection ability and formulate it as a constrained optimization problem based on the green-red algorithm of Kirchenbauer et al. (2023a). We show that the optimal solution to the optimization problem enjoys a nice analytical property which provides a better understanding and inspires the algorithm design for the watermarking process. We develop an online dual gradient ascent watermarking algorithm in light of this optimization formulation and prove its asymptotic Pareto optimality between model distortion and detection ability. Such a result guarantees an averaged increased green list probability and henceforth detection ability explicitly (in contrast to previous results). Moreover, we provide a systematic discussion on the choice of the model distortion metrics for the watermarking problem. We justify our choice of KL divergence and present issues with the existing criteria of "distortion-free" and perplexity. Finally, we empirically evaluate our algorithms on extensive datasets against benchmark algorithms.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.12959 [pdf, other]

WHAC: World-grounded Humans and Cameras

Authors: Wanqi Yin, Zhongang Cai, Ruisi Wang, Fanzhou Wang, Chen Wei, Haiyi Mei, Weiye Xiao, Zhitao Yang, Qingping Sun, Atsushi Yamashita, Ziwei Liu, Lei Yang

Abstract: Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Homepage: https://wqyin.github.io/projects/WHAC/

arXiv:2403.12884 [pdf, other]

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Authors: Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi

Abstract: Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12425 [pdf, other]

Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation

Authors: Jun Yu, Gongpeng Zhao, Yongqi Wang, Zhihong Wei, Yang Zheng, Zerui Zhang, Zhongpeng Cai, Guochen Xie, Jichao Zhu, Wangyuan Zhu

Abstract: This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability. Our method leverages a multimodal data fusion approach, integrating pre-trained audio and video backbones for feature extraction, followed by TCN-based spatiotemporal encoding and Transformer-based temporal information capture. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the AffWild2 dataset.

Submitted 20 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages, 3 figures

arXiv:2403.11942 [pdf, other]

Exploring Facial Expression Recognition through Semi-Supervised Pretraining and Temporal Modeling

Authors: Jun Yu, Zhihong Wei, Zhongpeng Cai, Gongpeng Zhao, Zerui Zhang, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu

Abstract: Facial Expression Recognition (FER) plays a crucial role in computer vision and finds extensive applications across various fields. This paper aims to present our approach for the upcoming 6th Affective Behavior Analysis in-the-Wild (ABAW) competition, scheduled to be held at CVPR2024. In the facial expression recognition task, the limited size of the FER dataset poses a challenge to the expression recognition model's generalization ability, resulting in subpar recognition performance. To address this problem, we employ a semi-supervised learning technique to generate expression category pseudo-labels for unlabeled face data. At the same time, we uniformly sampled the labeled facial expression samples and implemented a debiased feedback learning strategy to address the problem of category imbalance in the dataset and the possible data bias in semi-supervised learning. Moreover, to further compensate for the limitation and bias of features obtained only from static images, we introduced a Temporal Encoder to learn and capture temporal relationships between neighbouring expression image features. In the 6th ABAW competition, our method achieved outstanding results on the official validation set, a result that fully confirms the effectiveness and competitiveness of our proposed method.

Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11082 [pdf, other]

RobustSentEmbed: Robust Sentence Embeddings Using Adversarial Self-Supervised Contrastive Learning

Authors: Javad Rafiei Asl, Prajwal Panzade, Eduardo Blanco, Daniel Takabi, Zhipeng Cai

Abstract: Pre-trained language models (PLMs) have consistently demonstrated outstanding performance across a diverse spectrum of natural language processing tasks. Nevertheless, despite their success with unseen data, current PLM-based representations often exhibit poor robustness in adversarial settings. In this paper, we introduce RobustSentEmbed, a self-supervised sentence embedding framework designed to improve both generalization and robustness in diverse text representation tasks and against a diverse set of adversarial attacks. Through the generation of high-risk adversarial perturbations and their utilization in a novel objective function, RobustSentEmbed adeptly learns high-quality and robust sentence embeddings. Our experiments confirm the superiority of RobustSentEmbed over state-of-the-art representations. Specifically, our framework achieves a significant reduction in the success rate of various adversarial attacks, notably reducing the BERTAttack success rate by almost half (from 75.51% to 38.81%). The framework also yields improvements of 1.59% and 0.23% in semantic textual similarity tasks and various transfer tasks, respectively.

Submitted 17 March, 2024; originally announced March 2024.

Comments: Accepted at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL Findings) 2024. [https://openreview.net/forum?id=9dEAg4lJEA]

arXiv:2403.10719 [pdf]

X-ray Nano-imaging of a Heterogeneous Structural Phase Transition in V2O3

Authors: Ziming Shao, Aileen Luo, Eti Barazani, Tao Zhou, Zhonghou Cai, Martin V. Holt, Yoav Kalcheim, Andrej Singer

Abstract: Controlling the Mott transition through strain engineering is crucial for advancing the development and application of memristive and neuromorphic computing devices. Yet, Mott insulators are heterogeneous due to intrinsic phase boundaries and extrinsic defects, posing significant challenges to fully understanding the impact of local microscopic distortions on the local Mott transition. Addressing these challenges demands structural characterizations at the relevant length scale. Here, using a synchrotron-based scanning X-ray nanoprobe, we studied the real-space structural heterogeneity during the structural phase transition in a V2O3 thin film. Through temperature-dependent metal-insulator phase coexistence mapping, we report a variation in the local transition temperature of up to 7 K across the film and the presence of the transition hysteresis at the nanoscale. Furthermore, a detailed quantitative analysis demonstrates that the spatial heterogeneity of the transition is closely tied to the tilting of crystallographic planes in the pure insulating phase. Our work highlights the impact of local heterogeneity on the Mott transition and lays the groundwork for future innovations in harnessing strain heterogeneity within Mott systems for the next-generation computational technologies.

Submitted 30 June, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.09326 [pdf, other]

HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation

Authors: Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Xiaohang Zhan, Zeyu Wang

Abstract: We present HeadEvolver, a novel framework to generate stylized head avatars from text guidance. HeadEvolver uses locally learnable mesh deformation from a template head mesh, producing high-quality digital assets for detail-preserving editing and animation. To tackle the challenges of lacking fine-grained and semantic-aware local shape control in global deformation through Jacobians, we introduce a trainable parameter as a weighting factor for the Jacobian at each triangle to adaptively change local shapes while maintaining global correspondences and facial features. Moreover, to ensure the coherence of the resulting shape and appearance from different viewpoints, we use pretrained image diffusion models for differentiable rendering with regularization terms to refine the deformation under text guidance. Extensive experiments demonstrate that our method can generate diverse head avatars with an articulated mesh that can be edited seamlessly in 3D graphics software, facilitating downstream applications such as more efficient animation with inherited blend shapes and semantic consistency.

Submitted 10 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

Comments: 12 pages, 17 figures

ACM Class: I.2.6; I.3.8

arXiv:2403.05989 [pdf, other]

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Authors: Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen

Abstract: Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on the combination of real and synthetic data, scaling the data size up to 650k hours, leading to the zero-shot TTS model with 0.8B parameters. Specifically, our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity and also ensuring consistency in timbre. Comparative experiments (Demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate our model's superiority over VALL-E in pronunciation precision and maintaining speaking style, as well as timbre continuity.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.05425 [pdf, ps, other]

An Adaptive Dimension Reduction Estimation Method for High-dimensional Bayesian Optimization

Authors: Shouri Hu, Jiawei Li, Zhibo Cai

Abstract: Bayesian optimization (BO) has shown impressive results in a variety of applications within low-to-moderate dimensional Euclidean spaces. However, extending BO to high-dimensional settings remains a significant challenge. We address this challenge by proposing a two-step optimization framework. Initially, we identify the effective dimension reduction (EDR) subspace for the objective function using the minimum average variance estimation (MAVE) method. Subsequently, we construct a Gaussian process model within this EDR subspace and optimize it using the expected improvement criterion. Our algorithm offers the flexibility to operate these steps either concurrently or in sequence. In the sequential approach, we meticulously balance the exploration-exploitation trade-off by distributing the sampling budget between subspace estimation and function optimization, and the convergence rate of our algorithm in high-dimensional contexts has been established. Numerical experiments validate the efficacy of our method in challenging scenarios.

Submitted 8 March, 2024; originally announced March 2024.

Comments: First draft
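
Note on the entry above: the abstract names the expected improvement criterion without stating it. The following is a minimal sketch of the standard closed-form expected improvement under a Gaussian posterior, assuming a maximization problem. The function name `expected_improvement` and the exploration offset `xi` are illustrative assumptions and are not taken from the paper; the MAVE subspace step is not shown.

```python
import math


def expected_improvement(mean, std, best, xi=0.0):
    """Closed-form EI for maximisation under a Gaussian posterior N(mean, std^2).

    `best` is the largest objective value observed so far.
    `xi` is a hypothetical exploration offset (0.0 gives plain EI).
    """
    gain = mean - best - xi
    if std <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + std * pdf


if __name__ == "__main__":
    # Same posterior mean, different uncertainty: EI grows with std.
    for std in (0.1, 0.5, 1.0):
        print(std, expected_improvement(mean=0.0, std=std, best=0.0))
```

In a two-step scheme like the one the abstract describes, such a function would presumably be evaluated on points of the reduced subspace rather than the full input space.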

arXiv:2403.05265 [pdf, other]

MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts

Authors: Zinan Zeng, Sen Ye, Zijian Cai, Heng Wang, Yuhan Liu, Haokai Zhang, Minnan Luo

Abstract: Online movie review websites are valuable for information and discussion about movies. However, the massive spoiler reviews detract from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews' text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user's information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta features from the user-movie network, the review's textual content, and the review's metadata, respectively. To handle genre-specific spoilers, we then adopt a Mixture-of-Experts architecture to process information in the three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE's superiority in robustness and generalization.

Submitted 13 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.04268 [pdf]

Qubit-Wise Architecture Search Method for Variational Quantum Circuits

Authors: Jialin Chen, Zhiqiang Cai, Ke Xu, Di Wu, Wei Cao

Abstract: Considering the noise level limit, one crucial aspect for quantum machine learning is to design a high-performing variational quantum circuit architecture with a small number of quantum gates. Like classical neural architecture search (NAS), quantum architecture search (QAS) methods employ techniques such as reinforcement learning, evolutionary algorithms and supernet optimization to improve the search efficiency. In this paper, we propose a novel qubit-wise architecture search (QWAS) method, which progressively searches a one-qubit configuration per stage and combines this with the Monte Carlo Tree Search algorithm to find good quantum architectures by partitioning the search space into several good and bad subregions. The numerical experimental results indicate that our proposed method can balance the exploration and exploitation of circuit performance and size in some real-world tasks, such as MNIST, Fashion and MOSI. As far as we know, QWAS achieves state-of-the-art results on all tasks in terms of accuracy and circuit size.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.02620 [pdf]

Polarization-Encoded Lenticular Nano-Printing with Single-Layer Metasurfaces

Authors: Lin Deng, Ziqiang Cai, Yongmin Liu

Abstract: Metasurface-based nano-printing has enabled ultrahigh-resolution grayscale or color image display. However, the maximum number of independent nano-printing images allowed by one single-layer metasurface is still limited despite many multiplexing methods that have been proposed to increase the design degree of freedom. In this work, we substantially push the multiplexing limit of nano-printing by mapping the images for different observation angles to different positions in Fourier space, and simultaneously controlling the complex electric field across multiple polarization channels. Our proposed Polarization-Encoded Lenticular Nano-Printing (Pollen), aided by a modified evolutionary algorithm, allows the display of several images based on the viewing angle, similar to traditional lenticular printing but without requiring a lenticular layer. In addition, it extends the display capability to encompass multiple polarization states. Empowered by the ability to control the complex amplitude of three polarization channels, we numerically and experimentally demonstrate the generation of 13 distinguished gray-scale Chinese ink wash painting images, 49 binary patterns, and three sets of 3D nano-printing images, totaling 25 unique visuals. These results present the largest number of recorded images with ultra-high resolution to date. Our innovative Pollen technique is expected to benefit the development of modern optical applications, including but not limited to optical encryption, optical data storage, lightweight display, and augmented reality and virtual reality.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 12 pages, 5 figures

arXiv:2403.02586 [pdf, other]

Improving Event Definition Following For Zero-Shot Event Detection

Authors: Zefan Cai, Po-Nien Kung, Ashima Suvarna, Mingyu Derek Ma, Hritik Bansal, Baobao Chang, P. Jeffrey Brantingham, Wei Wang, Nanyun Peng

Abstract: Existing approaches to zero-shot event detection usually train models on datasets annotated with known event types, and prompt them with unseen event definitions. These approaches yield sporadic successes, yet generally fall short of expectations. In this work, we aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of event types and definitions is the key for models to learn to follow event definitions, whereas existing event extraction datasets focus on annotating many high-quality examples for a few event types. To verify our hypothesis, we construct an automatically generated Diverse Event Definition (DivED) dataset and conduct comparative studies. Our experiments reveal that a large number of event types (200) and diverse event definitions can significantly boost event extraction performance; on the other hand, the performance does not scale with over ten examples per event type. Beyond scaling, we incorporate event ontology information and hard-negative samples during training, further boosting the performance. Based on these findings, we fine-tuned a LLaMA-2-7B model on our DivED dataset, yielding performance that surpasses SOTA large language models like GPT-3.5 across three open benchmarks on zero-shot event detection.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2403.02399 [pdf, other]

The true number density of massive galaxies in the early Universe revealed by JWST/MIRI

Authors: Tao Wang, Hanwen Sun, Luwenjia Zhou, Ke Xu, Cheng Cheng, Zhaozhou Li, Yangyao Chen, H. J. Mo, Avishai Dekel, Xianzhong Zheng, Zheng Cai, Tiacheng Yang, Y. -S. Dai, David Elbaz, J. -S. Huang

Abstract: One of the main challenges in galaxy formation that has emerged recently is the early assembly of massive galaxies. The observed number density and the maximum stellar mass ($M_{\star}$) of massive galaxies in the early Universe appear to be higher than model predictions, which may pose a serious problem to the $Λ$CDM cosmology. A major limitation in many previous studies is the large uncertainty in estimating $M_{\star}$ due to the lack of constraints in the rest-frame near-infrared part of the spectral energy distribution, which is critical to determining $M_{\star}$ accurately. Here we use data from a large JWST/MIRI survey in the PRIMER program to carry out a systematic analysis of massive galaxies at $z \sim 3-8$, leveraging photometric constraints at rest-frame $\gtrsim 1 μ$m. We find a significant reduction in the number and mass densities of massive galaxies at $z > 5$ compared to earlier results that did not use the MIRI photometry. Within the standard $Λ$CDM cosmology, our results require a moderate increase in the baryon-to-star conversion efficiency ($ε$) towards higher redshifts and higher $M_{\star}$. For the most massive galaxies at $z\sim 8$, the required $ε$ is $\sim 0.3$, in comparison to $ε\sim 0.14$ for typical low-redshift galaxies. Our findings are consistent with models assuming suppressed stellar feedback due to the high gas density and the associated short free-fall time expected for massive halos at high redshift.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 23 pages, 10 figures, submitted

arXiv:2403.01691 [pdf, other]

How long will the quasar UV/optical flickering be damped?

Authors: Shuying Zhou, Mouyuan Sun, Zhen-Yi Cai, Guowei Ren, Jun-Xian Wang, Yongquan Xue

Abstract: The UV/optical light curves of Active Galactic Nuclei (AGNs) are commonly described by the Damped Random Walk (DRW) model. However, the physical interpretation of the damping timescale, a key parameter in the DRW model, remains unclear. Particularly, recent observations indicate a weak dependence of the damping timescale upon both wavelength and accretion rate, clearly being inconsistent with the accretion-disk theory. In this study, we investigate the damping timescale in the framework of the Corona Heated Accretion disk Reprocessing (CHAR) model, a physical model that describes AGN variability. We find that while the CHAR model can reproduce the observed power spectral densities of the 20-year light curves for 190 sources from Stone et al. (2022), the observed damping timescale, as well as its weak dependence on wavelength, can also be well recovered through fitting the mock light curves with DRW. We further demonstrate that such weak dependence is artificial due to the effect of inadequate durations of light curves, which leads to best-fitting damping timescales lower than the intrinsic ones. After eliminating this effect, the CHAR model indeed yields a strong dependence of the intrinsic damping timescale on the bolometric luminosity and rest-frame wavelength. Our results highlight the demand for sufficiently long light curves in AGN variability studies and important applications of the CHAR model in such studies.

Submitted 3 March, 2024; originally announced March 2024.

Comments: 19 pages, 16 figures, accepted to ApJ

arXiv:2403.01676 [pdf, ps, other]

Production of $X_b$ via radiative transition of $Υ(10753)$

Authors: Shi-Dong Liu, Hao-Dong Cai, Zu-Xin Cai, Hong-Shuo Gao, Gang Li, Fan Wang, Ju-Jun Xie

Abstract: We studied the radiative transitions between the $Υ(10753)$, the $S$-$D$ mixed state of the $Υ(4S)$ and $Υ_1(3\,{}^3D_1)$, and the $X_b$, the heavy quark flavor symmetry counterpart of the $X(3872)$ in the bottomonium sector. The radiative transition was assumed to occur through the intermediate bottom mesons, including $P$-wave $B_1^{(\prime)}$ mesons as well as the $S$-wave $B^{(*)}$ ones. Including the $B_1^{(\prime)}$ mesons makes the couplings $S$-wave, and hence enhances the contributions of the intermediate meson loops. The radiative decay width for the $Υ(10753)\toγX_b$ is predicted to be of the order of $10~\mathrm{keV}$, corresponding to a branching fraction of $10^{-4}$. Based on the theoretical results, we strongly suggest searching for the $X_b$ in the $e^+e^-\toγX_b$ with $X_b\toππχ_{b1}$ near $\sqrt{s}=10.754~\mathrm{GeV}$, and it is hoped that the calculations here could be tested by the future Belle II experiments.

Submitted 9 May, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 7 pages, 4 figures, accepted by PRD(20240510)

arXiv:2403.00873 [pdf, ps, other]

Blockchain-empowered Federated Learning: Benefits, Challenges, and Solutions

Authors: Zeju Cai, Jianguo Chen, Yuting Fan, Zibin Zheng, Keqin Li

Abstract: Federated learning (FL) is a distributed machine learning approach that protects user data privacy by training models locally on clients and aggregating them on a parameter server. While effective at preserving privacy, FL systems face limitations such as single points of failure, lack of incentives, and inadequate security. To address these challenges, blockchain technology is integrated into FL systems to provide stronger security, fairness, and scalability. However, blockchain-empowered FL (BC-FL) systems introduce additional demands on network, computing, and storage resources. This survey provides a comprehensive review of recent research on BC-FL systems, analyzing the benefits and challenges associated with blockchain integration. We explore why blockchain is applicable to FL, how it can be implemented, and the challenges and existing solutions for its integration. Additionally, we offer insights on future research directions for the BC-FL system.

Submitted 5 July, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2402.17502 [pdf, other]

FedLPPA: Learning Personalized Prompt and Aggregation for Federated Weakly-supervised Medical Image Segmentation

Authors: Li Lin, Yixiang Liu, Jiewei Wu, Pu** Cheng, Zhiyuan Cai, Kenneth K. Y. Wong, Xiaoying Tang

Abstract: Federated learning (FL) effectively mitigates the data silo challenge brought about by policies and privacy concerns, implicitly harnessing more data for deep model training. However, traditional centralized FL models grapple with diverse multi-center data, especially in the face of significant data heterogeneity, notably in medical contexts. In the realm of medical image segmentation, the growing imperative to curtail annotation costs has amplified the importance of weakly-supervised techniques which utilize sparse annotations such as points, scribbles, etc. A pragmatic FL paradigm should accommodate diverse annotation formats across different sites, a topic that remains under-investigated. In this context, we propose a novel personalized FL framework with learnable prompt and aggregation (FedLPPA) to uniformly leverage heterogeneous weak supervision for medical image segmentation. In FedLPPA, a learnable universal knowledge prompt is maintained, complemented by multiple learnable personalized data distribution prompts and prompts representing the supervision sparsity. Integrated with sample features through a dual-attention mechanism, those prompts empower each local task decoder to adeptly adjust to both the local distribution and the supervision form. Concurrently, a dual-decoder strategy, predicated on prompt similarity, is introduced for enhancing the generation of pseudo-labels in weakly-supervised learning, alleviating overfitting and noise accumulation inherent to local data, while an adaptable aggregation method is employed to customize the task decoder on a parameter-wise basis. Extensive experiments on four distinct medical image segmentation tasks involving different modalities underscore the superiority of FedLPPA, with its efficacy closely paralleling that of fully supervised centralized training. Our code and data will be available.

Submitted 31 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 12 pages, 10 figures

arXiv:2402.15527 [pdf, other]

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

Authors: Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang

Abstract: We present PCA-Bench, a multimodal decision-making benchmark for evaluating the integrated capabilities of Multimodal Large Language Models (MLLMs). Departing from previous benchmarks focusing on simplistic tasks and individual model capability, PCA-Bench introduces three complex scenarios: autonomous driving, domestic robotics, and open-world games. Given task instructions and diverse contexts, the model is required to seamlessly integrate multiple capabilities of Perception, Cognition, and Action in a reasoning chain to make accurate decisions. Moreover, PCA-Bench features error localization capabilities, scrutinizing model inaccuracies in areas such as perception, knowledge, or reasoning. This enhances the reliability of deploying MLLMs. To balance accuracy and efficiency in evaluation, we propose PCA-Eval, an automatic evaluation protocol, and assess 10 prevalent MLLMs. The results reveal significant performance disparities between open-source models and powerful proprietary models like GPT-4 Vision. To address this, we introduce Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in multimodal embodied environments. EIE generates 7,510 training examples in PCA-Bench and enhances the performance of open-source MLLMs, occasionally surpassing GPT-4 Vision (+3% in decision accuracy), thereby validating the effectiveness of EIE. Our findings suggest that robust MLLMs like GPT-4 Vision show promise for decision-making in embodied agents, opening new avenues for MLLM research.

Submitted 21 February, 2024; originally announced February 2024.

Comments: Code and Data released at https://github.com/pkunlp-icler/PCA-EVAL. Leaderboard at: https://docs.qq.com/sheet/DVUd4WUpGRHRqUnNV. This article supersedes its workshop version arxiv: 2310.02071. arXiv admin note: text overlap with arXiv:2310.02071

arXiv:2402.11095 [pdf, other]

GIM: Learning Generalizable Image Matcher From Internet Videos

Authors: Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, Cheng Wang

Abstract: Image matching is a fundamental computer vision problem. While learning-based methods achieve state-of-the-art performance on existing benchmarks, they generalize poorly to in-the-wild images. Such methods typically need to train separate models for different scene types and are impractical when the scene type is unknown in advance. One of the underlying problems is the limited scalability of existing data construction pipelines, which limits the diversity of standard image matching datasets. To address this problem, we propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture using internet videos, an abundant and diverse data source. Given an architecture, GIM first trains it on standard domain-specific datasets and then combines it with complementary matching methods to create dense labels on nearby frames of novel videos. These labels are filtered by robust fitting, and then enhanced by propagating them to distant frames. The final model is trained on propagated data with strong augmentations. We also propose ZEB, the first zero-shot evaluation benchmark for image matching. By mixing data from diverse domains, ZEB can thoroughly assess the cross-domain generalization performance of different methods. Applying GIM consistently improves the zero-shot performance of 3 state-of-the-art image matching architectures; with 50 hours of YouTube videos, the relative zero-shot performance improves by 8.4%-18.1%. GIM also enables generalization to extreme cross-domain data such as Bird Eye View (BEV) images of projected 3D point clouds (Fig. 1(c)). More importantly, our single zero-shot model consistently outperforms domain-specific baselines when evaluated on downstream tasks inherent to their respective domains. The video presentation is available at https://www.youtube.com/watch?v=FU_MJLD8LeY.

Submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted to ICLR 2024 for spotlight presentation

arXiv:2402.11000 [pdf, other]

ASGEA: Exploiting Logic Rules from Align-Subgraphs for Entity Alignment

Authors: Yangyifei Luo, Zhuo Chen, Lingbing Guo, Qian Li, Wenxuan Zeng, Zhixin Cai, Jianxin Li

Abstract: Entity alignment (EA) aims to identify entities across different knowledge graphs that represent the same real-world objects. Recent embedding-based EA methods have achieved state-of-the-art performance in EA yet faced interpretability challenges as they purely rely on the embedding distance and neglect the logic rules behind a pair of aligned entities. In this paper, we propose the Align-Subgraph Entity Alignment (ASGEA) framework to exploit logic rules from Align-Subgraphs. ASGEA uses anchor links as bridges to construct Align-Subgraphs and spreads along the paths across KGs, which distinguishes it from the embedding-based methods. Furthermore, we design an interpretable Path-based Graph Neural Network, ASGNN, to effectively identify and integrate the logic rules across KGs. We also introduce a node-level multi-modal attention mechanism coupled with multi-modal enriched anchors to augment the Align-Subgraph. Our experimental results demonstrate the superior performance of ASGEA over the existing embedding-based methods in both EA and Multi-Modal EA (MMEA) tasks.

Submitted 5 March, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Ongoing work; 16 pages, 9 Tables, 8 Figures; Code: https://github.com/lyyf2002/ASGEA

arXiv:2402.09511 [pdf, other]

Biased Estimator Channels for Classical Shadows

Authors: Zhenyu Cai, Adrian Chapman, Hamza Jnane, Bálint Koczor

Abstract: Extracting classical information from quantum systems is of fundamental importance, and classical shadows allow us to extract a large amount of information using relatively few measurements. Conventional shadow estimators are unbiased and thus approach the true mean in the infinite-sample limit. In this work, we consider a biased scheme: intentionally introducing a bias by rescaling the conventional classical shadow estimators can reduce the error in the finite-sample regime. The approach is straightforward to implement and requires no quantum resources. We analytically treat the average case as well as worst- and best-case scenarios, and rigorously prove that it is, in principle, always worth biasing the estimators. We illustrate our approach in a quantum simulation task of a $12$-qubit spin-ring problem and demonstrate how estimating expected values of non-local perturbations can be significantly more efficient using our biased scheme.

Submitted 14 February, 2024; originally announced February 2024.

Comments: 13 pages, 5 figures
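
Note on the entry above: the abstract's claim that rescaling an unbiased estimator can lower the finite-sample error follows from the textbook bias-variance decomposition. The sketch below illustrates that general argument for a scalar estimator only; it is not the paper's shadow-specific construction, and the function names and example numbers are illustrative assumptions.

```python
def mse_of_rescaled_estimator(scale, true_mean, variance):
    """MSE of scale * X, where X is unbiased for true_mean with the given variance.

    MSE = (variance term) + (squared bias)
        = scale^2 * variance + ((scale - 1) * true_mean)^2
    """
    bias = (scale - 1.0) * true_mean
    return scale * scale * variance + bias * bias


def optimal_scale(true_mean, variance):
    """Scale minimising the MSE above (requires true_mean != 0 or variance > 0).

    It depends on the unknown true_mean, so in practice it can only be
    bounded or estimated; here it is used purely to illustrate the trade-off.
    """
    return true_mean * true_mean / (true_mean * true_mean + variance)


if __name__ == "__main__":
    mu, var = 0.2, 0.05  # hypothetical values for illustration
    c = optimal_scale(mu, var)
    print("unbiased MSE:", mse_of_rescaled_estimator(1.0, mu, var))
    print("rescaled MSE:", mse_of_rescaled_estimator(c, mu, var), "at scale", c)
```

Whenever the variance is nonzero, the minimising scale is strictly below 1, so some shrinkage always beats the unbiased estimator in mean squared error; the benefit vanishes as the variance goes to zero with more samples.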

arXiv:2402.09059 [pdf, other]

I can't see it but I can Fine-tune it: On Encrypted Fine-tuning of Transformers using Fully Homomorphic Encryption

Authors: Prajwal Panzade, Daniel Takabi, Zhipeng Cai

Abstract: In today's machine learning landscape, fine-tuning pretrained transformer models has emerged as an essential technique, particularly in scenarios where access to task-aligned training data is limited. However, challenges surface when data sharing encounters obstacles due to stringent privacy regulations or user apprehension regarding personal information disclosure. Earlier works based on secure multiparty computation (SMC) and fully homomorphic encryption (FHE) for privacy-preserving machine learning (PPML) focused more on privacy-preserving inference than privacy-preserving training. In response, we introduce BlindTuner, a privacy-preserving fine-tuning system that enables transformer training exclusively on homomorphically encrypted data for image classification. Our extensive experimentation validates BlindTuner's effectiveness by demonstrating comparable accuracy to non-encrypted models. Notably, our findings highlight a substantial speed enhancement of 1.5x to 600x over previous work in this domain.

Submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted for the presentation at PPAI @The 38th Annual AAAI Conference on Artificial Intelligence 2024

arXiv:2402.07866 [pdf, other]

Virtual Channel Purification

Authors: Zhenhuan Liu, Xingjian Zhang, Yue-Yang Fei, Zhenyu Cai

Abstract: Quantum error mitigation is a key approach for extracting target state properties on state-of-the-art noisy machines and early fault-tolerant devices. Using the ideas from flag fault tolerance and virtual state purification, we develop the virtual channel purification (VCP) protocol, which consumes similar qubit and gate resources as virtual state purification but offers up to exponentially stronger error suppression with increased system size and more noisy operation copies. Furthermore, VCP removes most of the assumptions required in virtual state purification. Essentially, VCP is the first quantum error mitigation protocol that does not require specific knowledge about the noise models, the target quantum state, and the target problem while still offering rigorous performance guarantees for practical noise regimes. Further connections are made between VCP and quantum error correction to produce one of the first protocols that combine quantum error correction and quantum error mitigation beyond concatenation. We can remove all noise in the channel while paying only the same sampling cost as low-order purification, reaching beyond the standard bias-variance trade-off in quantum error mitigation. Our protocol can also be adapted to key tasks in quantum networks like channel capacity activation and entanglement distribution.

Submitted 12 February, 2024; originally announced February 2024.

Showing 51–100 of 986 results for author: Cai, Z