Search | arXiv e-print repository

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Authors: Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, Jimmy Lin

Abstract: Did you try out the new Bing Search? Or maybe you fiddled around with Google AI Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've taken towards making this track a reality: we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnarök, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena that allows pairwise benchmarking of RAG systems by crowdsourcing. We open-source our Ragnarök framework and baselines to achieve a unified standard for future RAG systems.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.13869 [pdf, other]

Global Human-guided Counterfactual Explanations for Molecular Properties via Reinforcement Learning

Authors: Danqing Wang, Antonis Antoniades, Kha-Dinh Luong, Edwin Zhang, Mert Kosan, Jiachen Li, Ambuj Singh, William Yang Wang, Lei Li

Abstract: Counterfactual explanations of Graph Neural Networks (GNNs) offer a powerful way to understand data that can naturally be represented by a graph structure. Furthermore, in many domains, it is highly desirable to derive data-driven global explanations or rules that can better explain the high-level properties of the models and data in question. However, evaluating global counterfactual explanations is hard in real-world datasets due to a lack of human-annotated ground truth, which limits their use in areas like molecular sciences. Additionally, the increasing scale of these datasets provides a challenge for random search-based methods. In this paper, we develop a novel global explanation model RLHEX for molecular property prediction. It aligns the counterfactual explanations with human-defined principles, making the explanations more interpretable and easy for experts to evaluate. RLHEX includes a VAE-based graph generator to generate global explanations and an adapter to adjust the latent representation space to human-defined principles. Optimized by Proximal Policy Optimization (PPO), the global explanations produced by RLHEX cover 4.12% more input graphs and reduce the distance between the counterfactual explanation set and the input set by 0.47% on average across three molecular datasets. RLHEX provides a flexible framework to incorporate different human-designed principles into the counterfactual explanation generation process, aligning these explanations with domain expertise. The code and data are released at https://github.com/dqwang122/RLHEX.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted by KDD 2024

arXiv:2406.13085 [pdf]

Ultralow thermal conductance across the [FePt/h-BN/FePt] interface

Authors: Chengchao Xu, Enbo Zhang, Bo-Yuan Yang, B. S. D. Ch. S. Varaprasad, David E. Laughlin, Jian-Gang Zhu

Abstract: Heat transfer in nanocomposite materials has attracted great interest for various applications. Multilayer structures provide an important platform to study interfacial thermal transport and to engineer materials with ultralow thermal conductivity. Here we report on the fabrication and thermal characterization of [h-BN/$L1_0$-FePt]xN multilayers, where hexagonal boron nitride (h-BN) nanosheets (2.5 nm thick) and $L1_0$-FePt layers (6.5 nm thick) alternate periodically. Differential three-omega ($3\omega$) measurements reveal an ultralow effective thermal conductivity of $0.60 \pm 0.05$ W m$^{-1}$ K$^{-1}$ across the multilayer films, and a low thermal boundary conductance (TBC) of $67.9 \pm 6.6$ MW m$^{-2}$ K$^{-1}$ for the [FePt/h-BN(2.5 nm)/FePt] interface at room temperature. We attribute the ultralow thermal conductivity to the weak van der Waals bonding at h-BN/FePt interfaces, which dominates the thermal resistance of the multilayer structure. These findings provide insights into the thermal transport in 2D-material/metal multilayer nanostructures and suggest the [h-BN/FePt] superlattice as a promising material for nanoscale thermal barrier coating. Furthermore, the obtained TBC lays the foundation for analyzing heat transfer in FePt-(h-BN) nanogranular films, a promising magnetic recording medium that can potentially provide a high thermal gradient for heat-assisted magnetic recording (HAMR). This work advances the understanding of thermal transport in 2D-material/metal nanocomposites and demonstrates interface engineering as an effective approach to achieve materials with ultralow thermal conductivity.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 22 pages, 5 figures
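
As a rough consistency check on the figures quoted in the abstract above, the sketch below compares the thermal resistance of one multilayer period with the resistance implied by the reported TBC. It assumes, as a simplification not stated in the source, that one period (2.5 nm h-BN plus 6.5 nm FePt) acts as a single series resistor with conductivity equal to the effective value, and that the reported TBC describes one full [FePt/h-BN/FePt] interface per period. All variable names are illustrative.

```python
# Values quoted in the abstract (room temperature, central values only).
k_eff = 0.60      # W m^-1 K^-1, effective cross-plane thermal conductivity
tbc = 67.9e6      # W m^-2 K^-1, thermal boundary conductance of one interface
t_hbn = 2.5e-9    # m, h-BN layer thickness
t_fept = 6.5e-9   # m, FePt layer thickness

# Assumption: one period behaves like a slab of thickness (t_hbn + t_fept)
# with conductivity k_eff, so its areal resistance is thickness / k_eff.
period = t_hbn + t_fept
r_period = period / k_eff          # m^2 K W^-1

# Assumption: each period contains exactly one [FePt/h-BN/FePt] interface,
# whose areal resistance is the inverse of the reported TBC.
r_interface = 1.0 / tbc            # m^2 K W^-1

interface_share = r_interface / r_period
print(f"resistance per period : {r_period:.3e} m^2 K/W")
print(f"interface resistance  : {r_interface:.3e} m^2 K/W")
print(f"interface share       : {interface_share:.1%}")
```

Under these assumptions the interface accounts for roughly 98% of the per-period resistance, which is in line with the abstract's statement that the h-BN/FePt interfaces dominate the thermal resistance. The check ignores the quoted uncertainties.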

arXiv:2406.11741 [pdf, other]

Transcendence: Generative Models Can Outperform The Experts That Train Them

Authors: Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach

Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.

Submitted 28 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Code, models, and data at https://transcendence.eddie.win
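
The low-temperature sampling mentioned in the abstract above can be illustrated with a temperature-scaled softmax. This is a generic sketch, not the authors' code: the function name and the example logits are made up for illustration. It only shows that lowering the temperature concentrates probability on the highest-scoring option.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate moves; index 0 is the best one.
logits = [2.0, 1.5, 0.5]

p_standard = softmax_with_temperature(logits, 1.0)
p_low_temp = softmax_with_temperature(logits, 0.1)

print("T = 1.0:", [round(p, 3) for p in p_standard])
print("T = 0.1:", [round(p, 3) for p in p_low_temp])
```

At temperature 0.1 nearly all of the mass sits on the top-scoring move, whereas at temperature 1.0 the weaker moves keep a substantial share.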

arXiv:2406.10338 [pdf, other]

Bursty Star Formation in Dwarfs is Sensitive to Numerical Choices in Supernova Feedback Models

Authors: Eric Zhang, Laura V Sales, Federico Marinacci, Paul Torrey, Mark Vogelsberger, Volker Springel, Hui Li, Rüdiger Pakmor, Thales A Gutcke

Abstract: Simulations of galaxy formation are mostly unable to resolve the energy-conserving phase of individual supernova events, having to resort to subgrid models to distribute the energy and momentum resulting from stellar feedback. However, the properties of these simulated galaxies, including the morphology, stellar mass formed and the burstiness of the star formation history, are highly sensitive to numerical choices adopted in these subgrid models. Using the SMUGGLE stellar feedback model, we compute idealized simulations of a $M_{\rm vir} \sim 10^{10} \, M_\odot$ dwarf galaxy, a regime where most simulation codes predict significant burstiness in star formation, resulting in strong gas flows that lead to the formation of dark matter cores. We find that by varying only the directional distribution of momentum imparted from supernovae to the surrounding gas, while holding the total momentum per supernova constant, bursty star formation may be amplified or completely suppressed, and the total stellar mass formed can vary by as much as a factor of $\sim 3$. In particular, when momentum is primarily directed perpendicular to the gas disk, less bursty and lower overall star formation rates result, yielding less gas turbulence, more disky morphologies and a retention of cuspy dark matter density profiles. An improved understanding of the non-linear coupling of stellar feedback into inhomogeneous gaseous media is thus needed to make robust predictions for stellar morphologies and dark matter core formation in dwarfs independent of uncertain numerical choices in the baryonic treatment.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Submitted ApJ; 15 pages, 12 figures; comments welcome

arXiv:2406.10057 [pdf, other]

First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

Authors: Enming Zhang, Ruobing Yao, Huanyong Liu, Junhui Yu, Jiale Wang

Abstract: With the development of Multimodal Large Language Models (MLLMs), their general capabilities are increasingly powerful. To evaluate the various abilities of MLLMs, numerous evaluation systems have emerged. However, there is still no comprehensive method to evaluate MLLMs on tasks related to flowcharts, which are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions for tasks related to flowcharts. It encompasses evaluating MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. We find that even the GPT4o model achieves only a score of 56.63. Among open-source models, Phi-3-Vision obtained the highest score of 49.97. We hope that FlowCE can contribute to future research on MLLMs for tasks based on flowcharts. https://github.com/360AILAB-NLP/FlowCE

Submitted 18 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09215 [pdf, other]

On Softmax Direct Preference Optimization for Recommendation

Authors: Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua

Abstract: Recommender systems aim to predict personalized rankings based on user preference data. With the rise of Language Models (LMs), LM-based recommenders have been widely explored due to their extensive world knowledge and powerful reasoning abilities. Most of the LM-based recommenders convert historical interactions into language prompts, pairing with a positive item as the target response and fine-tuning LM with a language modeling loss. However, the current objective fails to fully leverage preference data and is not optimized for personalized ranking tasks, which hinders the performance of LM-based recommenders. Inspired by the current advancement of Direct Preference Optimization (DPO) in human preference alignment and the success of softmax loss in recommendations, we propose Softmax-DPO (S-DPO) to instill ranking information into the LM to help LM-based recommenders distinguish preferred items from negatives, rather than solely focusing on positives. Specifically, we incorporate multiple negatives in user preference data and devise an alternative version of DPO loss tailored for LM-based recommenders, connected to softmax sampling strategies. Theoretically, we bridge S-DPO with the softmax loss over negative sampling and find that it has a side effect of mining hard negatives, which assures its exceptional capabilities in recommendation tasks. Empirically, extensive experiments conducted on three real-world datasets demonstrate the superiority of S-DPO to effectively model user preference and further boost recommendation performance while mitigating the data likelihood decline issue of DPO. Our codes are available at https://github.com/chenyuxin1999/S-DPO.

Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.
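
The abstract above does not state the loss itself. The sketch below shows one way a DPO-style loss can be extended from a single negative to several, assuming the pairwise margin is replaced by a log-sum-exp over the negatives. The function names are hypothetical, and each "log-ratio" input stands for log pi_theta(item | prompt) minus log pi_ref(item | prompt); the released repository is the authoritative reference for the actual objective.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pos_logratio, neg_logratio, beta):
    """Standard pairwise DPO loss for one preferred and one dispreferred item."""
    return -log_sigmoid(beta * (pos_logratio - neg_logratio))

def multi_negative_dpo_loss(pos_logratio, neg_logratios, beta):
    """Assumed multi-negative form: -log sigmoid(-log sum_d exp(beta * (neg_d - pos)))."""
    terms = [beta * (n - pos_logratio) for n in neg_logratios]
    m = max(terms)                                   # stable log-sum-exp
    lse = m + math.log(sum(math.exp(t - m) for t in terms))
    return -log_sigmoid(-lse)

# Illustrative numbers only.
pos, negs, beta = 0.8, [0.1, -0.3, 0.5], 1.0

single = multi_negative_dpo_loss(pos, negs[:1], beta)
pairwise = dpo_loss(pos, negs[0], beta)
several = multi_negative_dpo_loss(pos, negs, beta)
print(single, pairwise, several)
```

With exactly one negative the assumed form reduces to the pairwise DPO loss, and adding negatives can only increase the loss, so the positive item has to beat the whole set of negatives rather than a single one.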

arXiv:2406.05633 [pdf, other]

Heterogeneous Treatment Effects in Panel Data

Authors: Retsef Levi, Elisabeth Paulson, Georgia Perakis, Emily Zhang

Abstract: We address a core problem in causal inference: estimating heterogeneous treatment effects using panel data with general treatment patterns. Many existing methods either do not utilize the potential underlying structure in panel data or have limitations in the allowable treatment patterns. In this work, we propose and evaluate a new method that first partitions observations into disjoint clusters with similar treatment effects using a regression tree, and then leverages the (assumed) low-rank structure of the panel data to estimate the average treatment effect for each cluster. Our theoretical results establish the convergence of the resulting estimates to the true treatment effects. Computational experiments with semi-synthetic data show that our method achieves superior accuracy compared to alternative approaches, using a regression tree with no more than 40 leaves. Hence, our method provides more accurate and interpretable estimates than alternative methods.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05436 [pdf, other]

Introducing Competitive Mechanism to Differential Evolution for Numerical Optimization

Authors: Rui Zhong, Yang Cao, Enzhi Zhang, Masaharu Munetomo

Abstract: This paper introduces a novel competitive mechanism into differential evolution (DE), presenting an effective DE variant named competitive DE (CDE). CDE features a simple yet efficient mutation strategy: DE/winner-to-best/1. Essentially, the proposed DE/winner-to-best/1 strategy can be recognized as an intelligent integration of the existing mutation strategies of DE/rand-to-best/1 and DE/cur-to-best/1. The incorporation of DE/winner-to-best/1 and the competitive mechanism provide new avenues for advancing DE techniques. Moreover, in CDE, the scaling factor $F$ and mutation rate $Cr$ are determined by a random number generator following a normal distribution, as suggested by previous research. To investigate the performance of the proposed CDE, comprehensive numerical experiments are conducted on CEC2017 and engineering simulation optimization tasks, with CMA-ES, JADE, and other state-of-the-art optimizers and DE variants employed as competitor algorithms. The experimental results and statistical analyses highlight the promising potential of CDE as an alternative optimizer for addressing diverse optimization challenges.

Submitted 8 June, 2024; originally announced June 2024.

Comments: Accepted by The 30th Int'l Conf on Parallel and Distributed Processing Techniques and Applications (PDPTA'24)

arXiv:2406.00005 [pdf, other]

Disentangling Specificity for Abstractive Multi-document Summarization

Authors: Congbo Ma, Wei Emma Zhang, Hu Wang, Haojie Zhuang, Mingyu Guo

Abstract: Multi-document summarization (MDS) generates a summary from a document set. Each document in a set describes topic-relevant concepts, while each document also has its own unique content. However, document specificity receives little attention from existing MDS approaches. Neglecting specific information for each document limits the comprehensiveness of the generated summaries. To solve this problem, in this paper, we propose to disentangle the specific content from documents in one document set. The document-specific representations, which are encouraged to be distant from each other via a proposed orthogonal constraint, are learned by the specific representation learner. We provide extensive analysis and find that specific information and document set representations contribute distinctive strengths and that their combination yields a more comprehensive solution for MDS. Also, we find that the common (i.e. shared) information does not contribute much to the overall performance under the MDS settings. Implementation code is available at https://github.com/congboma/DisentangleSum.

Submitted 12 May, 2024; originally announced June 2024.

Comments: The IEEE World Congress on Computational Intelligence (WCCI 2024)

arXiv:2405.14225 [pdf, other]

ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining

Authors: Zhiyuan Liu, Yaorui Shi, An Zhang, Sihang Li, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua

Abstract: Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling, experimental procedure prediction, is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis. To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis. Our code is available at https://github.com/syr-cn/ReactXT.

Submitted 23 May, 2024; originally announced May 2024.

Comments: ACL 2024 Findings, 9 pages

arXiv:2405.12833 [pdf, other]

A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data

Authors: Xinyi Wang, Grazziela Figueredo, Ruizhe Li, Wei Emma Zhang, Weitong Chen, Xin Chen

Abstract: Automatic radiology report generation can alleviate the workload for physicians and minimize regional disparities in medical resources, therefore becoming an important topic in the medical image analysis field. It is a challenging task, as the computational model needs to mimic physicians to obtain information from multi-modal input data (i.e., medical images, clinical information, medical knowledge, etc.), and produce comprehensive and accurate reports. Recently, numerous works emerged to address this issue using deep learning-based methods, such as transformers, contrastive learning, and knowledge-base construction. This survey summarizes the key techniques developed in the most recent works and proposes a general workflow for deep learning-based report generation with five main components, including multi-modality data acquisition, data preparation, feature learning, feature fusion/interaction, and report generation. The state-of-the-art methods for each of these components are highlighted. Additionally, training strategies, public datasets, evaluation methods, current challenges, and future directions in this field are summarized. We have also conducted a quantitative comparison between different methods under the same experimental setting. This is the most up-to-date survey that focuses on multi-modality inputs and data fusion for radiology report generation. The aim is to provide comprehensive and rich information for researchers interested in automatic clinical report generation and medical image analysis, especially when using multimodal inputs, and assist them in developing new algorithms to advance the field.

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.12735 [pdf, other]

Multiple chemical tracers finally unveil the intricate NGC 1333 IRAS 4A outflow system. FAUST XVI

Authors: Layal Chahine, Cecilia Ceccarelli, Marta De Simone, Claire J. Chandler, Claudio Codella, Linda Podio, Ana López-Sepulcre, Nami Sakai, Laurent Loinard, Mathilde Bouvier, Paola Caselli, Charlotte Vastel, Eleonora Bianchi, Nicolás Cuello, Francesco Fontani, Doug Johnstone, Giovanni Sabatini, Tomoyuki Hanawa, Ziwei E. Zhang, Yuri Aikawa, Gemma Busquet, Emmanuel Caux, Aurore Durán, Eric Herbst, François Ménard, et al. (32 additional authors not shown)

Abstract: The exploration of outflows in protobinary systems presents a challenging yet crucial endeavour, offering valuable insights into the dynamic interplay between protostars and their evolution. In this study, we examine the morphology and dynamics of jets and outflows within the IRAS 4A protobinary system. This analysis is based on ALMA observations of SiO(5--4), H$_2$CO(3$_{0,3}$--2$_{0,3}$), and HDCO(4$_{1,4}$--3$_{1,3}$) with a spatial resolution of $\sim$150 au. Leveraging an astrochemical approach involving the use of diverse tracers beyond traditional ones has enabled the identification of novel features and a comprehensive understanding of the broader outflow dynamics. Our analysis reveals the presence of two jets in the redshifted emission, emanating from IRAS 4A1 and IRAS 4A2, respectively. Furthermore, we identify four distinct outflows in the region for the first time, with each protostar, 4A1 and 4A2, contributing to two of them. We characterise the morphology and orientation of each outflow, challenging previous suggestions of bends in their trajectories. The outflow cavities of IRAS 4A1 exhibit extensions of 10$''$ and 13$''$ with position angles (PA) of 0$^{\circ}$ and $-12^{\circ}$, respectively, while those of IRAS 4A2 are more extended, spanning 18$''$ and 25$''$ with PAs of 29$^{\circ}$ and 26$^{\circ}$. We propose that the misalignment of the cavities is due to a jet precession in each protostar, a notion supported by the observation that the more extended cavities of the same source exhibit lower velocities, indicating they may stem from older ejection events.

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.12564 [pdf, other]

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

Authors: Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua

Abstract: Language Models (LMs) excel in understanding textual descriptions of proteins, as evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, due to a deficit in pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. To address their limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between PLM and LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.

Submitted 21 May, 2024; originally announced May 2024.

Comments: ACL 2024, 9 pages

arXiv:2405.12380 [pdf, other]

Large scale scattering using fast solvers based on neural operators

Authors: Zongren Zou, Adar Kahana, Enrui Zhang, Eli Turkel, Rishikesh Ranade, Jay Pathak, George Em Karniadakis

Abstract: We extend a recently proposed machine-learning-based iterative solver, i.e. the hybrid iterative transferable solver (HINTS), to solve the scattering problem described by the Helmholtz equation in an exterior domain with a complex absorbing boundary condition. The HINTS method combines neural operators (NOs) with standard iterative solvers, e.g. Jacobi and Gauss-Seidel (GS), to achieve better performance by leveraging the spectral bias of neural networks. In HINTS, some iterations of the conventional iterative method are replaced by inferences of the pre-trained NO. In this work, we employ HINTS to solve the scattering problem for both 2D and 3D problems, where the standard iterative solver fails. We consider square and triangular scatterers of various sizes in 2D, and a cube and a model submarine in 3D. We explore and illustrate the extrapolation capability of HINTS in handling diverse geometries of the scatterer, which is achieved by training the NO on non-scattering scenarios and then deploying it in HINTS to solve scattering problems. The accurate results demonstrate that the NO in HINTS method remains effective without retraining or fine-tuning it whenever a new scatterer is given. Taken together, our results highlight the adaptability and versatility of the extended HINTS methodology in addressing diverse scattering problems.

Submitted 20 May, 2024; originally announced May 2024.
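
The idea described in the abstract above, replacing some sweeps of a classical iterative solver with an inference of a pre-trained operator, can be sketched as a simple loop. This is an illustration only, under several assumptions: it uses a small 1D Poisson system rather than the Helmholtz problem, the operator is assumed to map a residual to a correction, and an exact linear solve stands in for the trained neural operator. The names `jacobi_step`, `hybrid_solve`, `correction_op`, and `period` are hypothetical.

```python
import numpy as np

def jacobi_step(A, b, u, omega=2.0 / 3.0):
    """One damped Jacobi sweep for the linear system A u = b."""
    return u + omega * (b - A @ u) / np.diag(A)

def hybrid_solve(A, b, correction_op, n_iters=50, period=10):
    """Run Jacobi sweeps, but let `correction_op` act on the residual every `period`-th step."""
    u = np.zeros_like(b)
    for k in range(1, n_iters + 1):
        if k % period == 0:
            # Assumed interface: the operator maps the residual to an error correction.
            u = u + correction_op(b - A @ u)
        else:
            u = jacobi_step(A, b, u)
    return u

# Small 1D Poisson matrix as a toy problem (not the Helmholtz setting of the paper).
n = 32
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

# Stand-in for a trained neural operator: an exact solve on the residual.
u_hybrid = hybrid_solve(A, b, lambda r: np.linalg.solve(A, r))
# Baseline: the same loop with a correction that does nothing (Jacobi only).
u_jacobi = hybrid_solve(A, b, lambda r: np.zeros_like(r))

res_hybrid = np.linalg.norm(b - A @ u_hybrid)
res_jacobi = np.linalg.norm(b - A @ u_jacobi)
print(f"hybrid residual: {res_hybrid:.2e}, Jacobi-only residual: {res_jacobi:.2e}")
```

A real neural operator would give only an approximate correction, mainly for the smooth error components that Jacobi damps slowly, which is why the classical sweeps are kept in the loop.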

arXiv:2405.01461 [pdf, other]

SATO: Stable Text-to-Motion Framework

Authors: Wenshuo Chen, Hongru Xiao, Erhang Zhang, Lijie Hu, Lei Wang, Mengyuan Liu, Chen Chen

Abstract: Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we undertake an analysis to elucidate the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules, each dedicated to stable attention, stable prediction, and maintaining a balance between accuracy and robustness trade-off. We present a methodology for constructing an SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduced a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while keeping its high accuracy performance.

Submitted 3 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.16935 [pdf, other]

Disentangling spin excitation continua in classical and quantum magnets using 2D nonlinear spectroscopy

Authors: Emily Z. Zhang, Ciarán Hickey, Yong Baek Kim

Abstract: Inelastic neutron scattering (INS) has traditionally been one of the primary methods for investigating quantum magnets, particularly in identifying a continuum of excitations as a hallmark of spin fractionalization in quantum spin liquids (QSLs). However, INS faces severe limitations due to its inability to distinguish between such QSL signatures and similar excitation continua arising from highly frustrated magnetic orders with large unit cells or classical spin liquids. In contrast, two-dimensional coherent spectroscopy (2DCS) has emerged as a powerful tool to probe nonlinear excitation dynamics, offering insights into the underlying mechanisms behind these broad spectral features. In this paper, we utilize classical molecular dynamics (MD) techniques to explore the 2DCS responses of frustrated magnets with dominant Kitaev interactions. Comparing the classical and quantum versions of the pure Kitaev model, our results indicate both clear similarities, in the form of sharp line features, and clear distinctions, in the locations of these features and in selection rules. Moreover, in the extended $KΓΓ'$ model, we show that the 2DCS response of the Kitaev spin liquid is completely distinct from that of large unit cell magnetic orders, despite both generating a broad continuum in INS. Additionally, we demonstrate the extreme sensitivity of classical 2DCS to thermal fluctuations and discuss the potential significance of quantum coherence in experimental settings. Overall, our work illustrates the potential of 2DCS in resolving the complex physics underlying ambiguous spin excitation continua, thereby enhancing our understanding of the dynamics in these frustrated systems.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 10 pages, 5+1 figures

arXiv:2404.14949 [pdf, other]

Multi-Modal Prompt Learning on Blind Image Quality Assessment

Authors: Wensheng Pan, Timin Gao, Yan Zhang, Runze Hu, Xiawu Zheng, Enwei Zhang, Yuting Gao, Yutao Liu, Yunhang Shen, Ke Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

Abstract: Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Currently, leveraging semantic information to enhance IQA is a crucial research direction. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. However, the generalist nature of these pre-trained Vision-Language (VL) models often renders them suboptimal for IQA-specific tasks. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. Existing prompt-based VL models overly focus on incremental semantic information from text, neglecting the rich insights available from visual data analysis. This imbalance limits their performance improvements in IQA tasks. This paper introduces an innovative multi-modal prompt-based methodology for IQA. Our approach employs carefully crafted prompts that synergistically mine incremental semantic information from both visual and linguistic data. Specifically, in the visual branch, we introduce a multi-layer prompt structure to enhance the VL model's adaptability. In the text branch, we deploy a dual-prompt scheme that steers the model to recognize and differentiate between scene category and distortion type, thereby refining the model's capacity to assess image quality. Our experimental findings underscore the effectiveness of our method over existing Blind Image Quality Assessment (BIQA) approaches. Notably, it demonstrates competitive performance across various datasets. Our method achieves Spearman Rank Correlation Coefficient (SRCC) values of 0.961 (surpassing 0.946 in CSIQ) and 0.941 (exceeding 0.930 in KADID), illustrating its robustness and accuracy in diverse contexts.
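For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how a Spearman Rank Correlation Coefficient between predicted quality scores and subjective scores can be computed. It is not taken from the paper; the function names (`average_ranks`, `pearson`, `srcc`) and the toy inputs are illustrative assumptions, and ties are handled with average ranks.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(predicted, subjective):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(predicted), average_ranks(subjective))


# Hypothetical scores for four images (not data from the paper).
print(srcc([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```

Because only the ordering matters, any monotonic rescaling of the predicted scores leaves the value unchanged.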

Submitted 18 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.10357 [pdf, other]

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

Authors: Enming Zhang, Bingke Zhu, Yingying Chen, Qinghai Miao, Ming Tang, Jinqiao Wang

Abstract: Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which are capable of generating Multi-Knowledge Representation for an input image without requiring additional priors. Experimentally, we conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods. We will make all resources open-source: https://github.com/EMZucas/CoKnow.

Submitted 16 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.09707 [pdf, other]

Adaptive Patching for High-resolution Image Segmentation with Transformers

Authors: Enzhi Zhang, Isaac Lyngaas, Peng Chen, Xiao Wang, Jun Igarashi, Yuankai Huo, Mohamed Wahib, Masaharu Munetomo

Abstract: Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g. microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model, if we are to use smaller patch sizes that are favorable in segmentation. The solution is to either use custom complex multi-resolution models or approximate attention schemes. We take inspiration from Adaptive Mesh Refinement (AMR) methods in HPC by adaptively patching the images, as a pre-processing step, based on the image details to reduce the number of patches being fed to the model, by orders of magnitude. This method has a negligible overhead, and works seamlessly with any attention-based model, i.e. it is a pre-processing step that can be adopted by any attention-based model without friction. We demonstrate superior segmentation quality over SoTA segmentation models for real-world pathology datasets while gaining a geomean speedup of $6.9\times$ for resolutions up to $64K^2$, on up to $2,048$ GPUs.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.09235 [pdf, other]

PDRs4All IX. Sulfur elemental abundance in the Orion Bar

Authors: Asunción Fuente, Evelyne Roueff, Franck Le Petit, Jacques Le Bourlot, Emeric Bron, Mark G. Wolfire, James F. Babb, Pei-Gen Yan, Takashi Onaka, John H. Black, Ilane Schroetter, Dries Van De Putte, Ameek Sidhu, Amélie Canin, Boris Trahin, Felipe Alarcón, Ryan Chown, Olga Kannavou, Olivier Berné, Emilie Habart, Els Peeters, Javier R. Goicoechea, Marion Zannese, Raphael Meshaka, Yoko Okada, et al. (9 additional authors not shown)

Abstract: One of the main problems in astrochemistry is determining the amount of sulfur in volatiles and refractories in the interstellar medium. The detection of the main sulfur reservoirs (icy H$_2$S and atomic gas) has been challenging, and estimates are based on the reliability of models to account for the abundances of species containing less than 1% of the total sulfur. The high sensitivity of the James Webb Space Telescope provides an unprecedented opportunity to estimate the sulfur abundance through the observation of the [S I] 25.249 $μ$m line. We used the [S III] 18.7 $μ$m, [S IV] 10.5 $μ$m, and [S I] 25.249 $μ$m lines to estimate the amount of sulfur in the ionized and molecular gas along the Orion Bar. For the theoretical part, we used an upgraded version of the Meudon photodissociation region (PDR) code to model the observations. New inelastic collision rates of neutral atomic sulfur with ortho- and para- molecular hydrogen were calculated to predict the line intensities. The [S III] 18.7 $μ$m and [S IV] 10.5 $μ$m lines are detected over the imaged region with a shallow increase (by a factor of 4) toward the HII region. We estimate a moderate sulfur depletion, by a factor of $\sim$2, in the ionized gas. The corrugated interface between the molecular and atomic phases gives rise to several edge-on dissociation fronts we refer to as DF1, DF2, and DF3. The [S I] 25.249 $μ$m line is only detected toward DF2 and DF3, the dissociation fronts located farthest from the HII region. The detailed modeling of DF3 using the Meudon PDR code shows that the emission of the [S I] 25.249 $μ$m line is coming from warm ($>$ 40 K) molecular gas located at A$_{\rm V}$ $\sim$ 1$-$5 mag from the ionization front. Moreover, the intensity of the [S I] 25.249 $μ$m line is only accounted for if we assume the presence of undepleted sulfur.

Submitted 4 June, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: 16 pages, 6 figures. Accepted for publication in Astronomy and Astrophysics

arXiv:2404.08675 [pdf, other]

RecGPT: Generative Personalized Prompts for Sequential Recommendation via ChatGPT Training Paradigm

Authors: Yabin Zhang, Wenhui Yu, Erhan Zhang, Xu Chen, Lantao Hu, Peng Jiang, Kun Gai

Abstract: ChatGPT has achieved remarkable success in natural language understanding. Considering that recommendation is indeed a conversation between users and the system with items as words, which has a similar underlying pattern to ChatGPT, we design a new chat framework at the item index level for the recommendation task. Our novelty mainly contains three parts: model, training and inference. For the model part, we adopt Generative Pre-training Transformer (GPT) as the sequential recommendation model and design a user module to capture personalized information. For the training part, we adopt the two-stage paradigm of ChatGPT, including pre-training and fine-tuning. In the pre-training stage, we train the GPT model by auto-regression. In the fine-tuning stage, we train the model with prompts, which include both the newly-generated results from the model and the user's feedback. For the inference part, we predict several user interests as user representations in an autoregressive manner. For each interest vector, we recall several items with the highest similarity and merge the items recalled by all interest vectors into the final result. We conduct experiments with both offline public datasets and online A/B test to demonstrate the effectiveness of our proposed method.
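The inference step described in this abstract (recall the most similar items per interest vector, then merge) might be sketched as below. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure or the merge rule, so dot-product similarity, de-duplication by best score, and the names `recall_top_k` and `merge_recalls` are all hypothetical choices, not the authors' implementation.

```python
def recall_top_k(interest, item_vectors, k):
    """Return the k (score, item_id) pairs with the highest dot-product score."""
    scored = [
        (sum(a * b for a, b in zip(interest, vector)), item_id)
        for item_id, vector in item_vectors.items()
    ]
    # Highest score first; break ties by item id for deterministic output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def merge_recalls(interests, item_vectors, k):
    """Merge per-interest recalls, keeping each item's best score (assumed rule)."""
    best = {}
    for interest in interests:
        for score, item_id in recall_top_k(interest, item_vectors, k):
            if item_id not in best or score > best[item_id]:
                best[item_id] = score
    return sorted(best, key=lambda item_id: (-best[item_id], item_id))


# Toy 2-D embeddings (hypothetical values, for illustration only).
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [0.1, 0.9]}
interests = [[1.0, 0.0], [0.0, 1.0]]
print(merge_recalls(interests, items, k=2))
```

Using several interest vectors lets the merged list cover items from different regions of the embedding space, which a single user vector would tend to miss.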

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2404.03111 [pdf, other]

doi 10.1051/0004-6361/202449295

PDRs4All VIII: Mid-IR emission line inventory of the Orion Bar

Authors: Dries Van De Putte, Raphael Meshaka, Boris Trahin, Emilie Habart, Els Peeters, Olivier Berné, Felipe Alarcón, Amélie Canin, Ryan Chown, Ilane Schroetter, Ameek Sidhu, Christiaan Boersma, Emeric Bron, Emmanuel Dartois, Javier R. Goicoechea, Karl D. Gordon, Takashi Onaka, Alexander G. G. M. Tielens, Laurent Verstraete, Mark G. Wolfire, Alain Abergel, Edwin A. Bergin, Jeronimo Bernard-Salas, Jan Cami, Sara Cuadrado, et al. (113 additional authors not shown)

Abstract: Mid-infrared emission features probe the properties of ionized gas, and hot or warm molecular gas. The Orion Bar is a frequently studied photodissociation region (PDR) containing large amounts of gas under these conditions, and was observed with the MIRI IFU aboard JWST as part of the "PDRs4All" program. The resulting IR spectroscopic images of high angular resolution (0.2") reveal a rich observational inventory of mid-IR emission lines, and spatially resolve the substructure of the PDR, with a mosaic cutting perpendicularly across the ionization front and three dissociation fronts. We extracted five spectra that represent the ionized, atomic, and molecular gas layers, and measured the most prominent gas emission lines. An initial analysis summarizes the physical conditions of the gas and the potential of these data. We identified around 100 lines, report an additional 18 lines that remain unidentified, and measured the line intensities and central wavelengths. The H I recombination lines originating from the ionized gas layer bordering the PDR have intensity ratios that are well matched by emissivity coefficients from H recombination theory, but deviate up to 10% due to contamination by He I lines. We report the observed emission lines of various ionization stages of Ne, P, S, Cl, Ar, Fe, and Ni, and show how certain line ratios vary between the five regions. We observe the pure-rotational H$_2$ lines in the vibrational ground state from 0-0 S(1) to 0-0 S(8), and in the first vibrationally excited state from 1-1 S(5) to 1-1 S(9). We derive H$_2$ excitation diagrams, and approximate the excitation with one thermal (~700 K) component representative of an average gas temperature, and one non-thermal component (~2700 K) probing the effect of UV pumping. We compare these results to an existing model for the Orion Bar PDR and highlight the differences with the observations.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 26 pages, 12 figures, 3 tables. Submitted to A&A, under review (1st revision)

Journal ref: A&A 687, A86 (2024)

arXiv:2403.18108 [pdf, other]

FAUST XIII. Dusty cavity and molecular shock driven by IRS7B in the Corona Australis cluster

Authors: G. Sabatini, L. Podio, C. Codella, Y. Watanabe, M. De Simone, E. Bianchi, C. Ceccarelli, C. J. Chandler, N. Sakai, B. Svoboda, L. Testi, Y. Aikawa, N. Balucani, M. Bouvier, P. Caselli, E. Caux, L. Chahine, S. Charnley, N. Cuello, F. Dulieu, L. Evans, D. Fedele, S. Feng, F. Fontani, T. Hama, et al. (32 additional authors not shown)

Abstract: The origin of the chemical diversity observed around low-mass protostars probably resides in the earliest history of these systems. We aim to investigate the impact of protostellar feedback on the chemistry and grain growth in the circumstellar medium of multiple stellar systems. In the context of the ALMA Large Program FAUST, we present high-resolution (50 au) observations of CH$_3$OH, H$_2$CO, and SiO and continuum emission at 1.3 mm and 3 mm towards the Corona Australis star cluster. Methanol emission reveals an arc-like structure at $\sim$1800 au from the protostellar system IRS7B along the direction perpendicular to the major axis of the disc. The arc is located at the edge of two elongated continuum structures that define a cone emerging from IRS7B. The region inside the cone is probed by H$_2$CO, while the eastern wall of the arc shows bright emission in SiO, a typical shock tracer. Taking into account the association with a previously detected radio jet imaged with JVLA at 6 cm, the molecular arc reveals for the first time a bow shock driven by IRS7B and a two-sided dust cavity opened by the mass-loss process. For each cavity wall, we derive an average H$_2$ column density of $\sim$7$\times$10$^{21}$ cm$^{-2}$, a mass of $\sim$9$\times$10$^{-3}$ M$_\odot$, and a lower limit on the dust spectral index of $1.4$. These observations provide the first evidence of a shock and a conical dust cavity opened by the jet driven by IRS7B, with important implications for the chemical enrichment and grain growth in the envelope of Solar System analogues.

Submitted 2 April, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: 12 pages, 8 figures, 3 tables. Accepted Letter in Astronomy & Astrophysics

arXiv:2403.15315 [pdf, other]

Quantum Fluctuations Suppress the Critical Fields in BaCo$_2$(AsO$_4$)$_2$

Authors: Shiva Safari, William Bateman-Hemphill, Asimpunya Mitra, Félix Desrochers, Emily Z. Zhang, Lubuna Shafeek, Austin Ferrenti, Tyrel M. McQueen, Arkady Shekhter, Zoltán Köllö, Yong Baek Kim, B. J. Ramshaw, K. A. Modic

Abstract: Early efforts to realize exotic quantum ground states in frustrated magnets focused on frustration arising from the lattice geometry alone. Attention has shifted to bond-dependent anisotropic interactions, as well as further-neighbor interactions, on non-geometrically-frustrated lattices due to their greater versatility. The honeycomb magnet BaCo$_2$(AsO$_4$)$_2$ recently emerged as a candidate host for both bond-dependent (e.g. Kitaev) and third-neighbor ($J_3$) interactions, and has become a model experimental system due to its relatively low levels of disorder. Understanding the relative importance of different exchange interactions holds the key to achieving novel ground states, such as quantum spin liquids. Here, we use the magnetotropic susceptibility to map out the intermediate and high-field phase diagram of BaCo$_2$(AsO$_4$)$_2$ as a function of the out-of-plane magnetic field direction at $T = 1.6$ K. We show that the experimental data are qualitatively consistent with classical Monte Carlo results of the XXZ-$J_1$-$J_3$ model with small Kitaev and off-diagonal exchange couplings included. However, the calculated critical fields are systematically larger than the experimental values. Infinite-DMRG computations on the quantum model reveal that quantum corrections from a nearby ferromagnetic state are likely responsible for the suppressed critical fields. Together, our experiment and theory analyses demonstrate that, while quantum fluctuations play an important role in determining the phase diagram, most of the physics of BaCo$_2$(AsO$_4$)$_2$ can be understood in terms of the classical dynamics of long-range ordered states, leaving little room for the possibility of a quantum spin liquid.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 16 pages, 12 figures

arXiv:2403.14598 [pdf, other]

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

Authors: Zheng Zhang, Yeyao Ma, Enming Zhang, Xiang Bai

Abstract: PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. This schema includes images, task instructions, conditional prompts, and mask tokens, which enable the model to generate and classify segmentation masks effectively. The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization. PSALM achieves superior results on several benchmarks, such as RefCOCO/RefCOCO+/RefCOCOg, COCO Panoptic Segmentation, and COCO-Interactive, and further exhibits zero-shot capabilities on unseen tasks, such as open-vocabulary segmentation, generalized referring expression segmentation and video object segmentation, making a significant step towards a GPT moment in computer vision. Through extensive experiments, PSALM demonstrates its potential to transform the domain of image segmentation, leveraging the robust visual understanding capabilities of LMMs as seen in natural language processing. Code and models are available at https://github.com/zamling/PSALM.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.09142 [pdf, other]

USimAgent: Large Language Models for Simulating Search Users

Authors: Erhan Zhang, Xingzhu Wang, Peiyuan Gong, Yankai Lin, Jiaxin Mao

Abstract: Due to the advantages in the cost-efficiency and reproducibility, user simulation has become a promising solution to the user-centric evaluation of information retrieval systems. Nonetheless, accurately simulating user search behaviors has long been a challenge, because users' actions in search are highly complex and driven by intricate cognitive processes such as learning, reasoning, and planning. Recently, Large Language Models (LLMs) have demonstrated remarkable potential in simulating human-level intelligence and have been used in building autonomous agents for various tasks. However, the potential of using LLMs in simulating search behaviors has not yet been fully explored. In this paper, we introduce an LLM-based user search behavior simulator, USimAgent. The proposed simulator can simulate users' querying, clicking, and stopping behaviors during search, and thus, is capable of generating complete search sessions for specific search tasks. Empirical investigation on a real user behavior dataset shows that the proposed simulator outperforms existing methods in query generation and is comparable to traditional methods in predicting user clicks and stopping behaviors. These results not only validate the effectiveness of using LLMs for user simulation but also shed light on the development of more robust and generic user simulators.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.01259 [pdf, other]

doi 10.1103/PhysRevD.109.112018

Improved Modelling of Detector Response Effects in Phonon-based Crystal Detectors used for Dark Matter Searches

Authors: M. J. Wilson, A. Zaytsev, B. von Krosigk, I. Alkhatib, M. Buchanan, R. Chen, M. D. Diamond, E. Figueroa-Feliciano, S. A. S. Harms, Z. Hong, K. T. Kennard, N. A. Kurinsky, R. Mahapatra, N. Mirabolfathi, V. Novati, M. Platt, R. Ren, A. Sattari, B. Schmidt, Y. Wang, S. Zatschler, E. Zhang, A. Zuniga

Abstract: Various dark matter search experiments employ phonon-based crystal detectors operated at cryogenic temperatures. Some of these detectors, including certain silicon detectors used by the SuperCDMS Collaboration, are able to achieve single-charge sensitivity when a voltage bias is applied across the detector. The total amount of phonon energy measured by such a detector is proportional to the number of electron-hole pairs created by the interaction. However, crystal impurities and surface effects can cause propagating charges to either become trapped inside the crystal or create additional unpaired charges, producing non-quantized measured energy as a result. A new analytical model for describing these detector response effects in phonon-based crystal detectors is presented. This model improves upon previous versions by demonstrating how the detector response, and thus the measured energy spectrum, is expected to differ depending on the source of events. We use this model to extract detector response parameters for SuperCDMS HVeV detectors, and illustrate how this robust modelling can help statistically discriminate between sources of events in order to improve the sensitivity of dark matter search experiments.

Submitted 24 June, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

Comments: 19 pages, 7 figures

Journal ref: Phys. Rev. D 109, 112018 (2024)

arXiv:2403.00160 [pdf, other]

A far-ultraviolet-driven photoevaporation flow observed in a protoplanetary disk

Authors: Olivier Berné, Emilie Habart, Els Peeters, Ilane Schroetter, Amélie Canin, Ameek Sidhu, Ryan Chown, Emeric Bron, Thomas J. Haworth, Pamela Klaassen, Boris Trahin, Dries Van De Putte, Felipe Alarcón, Marion Zannese, Alain Abergel, Edwin A. Bergin, Jeronimo Bernard-Salas, Christiaan Boersma, Jan Cami, Sara Cuadrado, Emmanuel Dartois, Daniel Dicken, Meriem Elyajouri, Asunción Fuente, Javier R. Goicoechea, et al. (121 additional authors not shown)

Abstract: Most low-mass stars form in stellar clusters that also contain massive stars, which are sources of far-ultraviolet (FUV) radiation. Theoretical models predict that this FUV radiation produces photo-dissociation regions (PDRs) on the surfaces of protoplanetary disks around low-mass stars, impacting planet formation within the disks. We report JWST and Atacama Large Millimeter Array observations of a FUV-irradiated protoplanetary disk in the Orion Nebula. Emission lines are detected from the PDR; modelling their kinematics and excitation allows us to constrain the physical conditions within the gas. We quantify the mass-loss rate induced by the FUV irradiation, finding it is sufficient to remove gas from the disk in less than a million years. This is rapid enough to affect giant planet formation in the disk.

Submitted 29 February, 2024; originally announced March 2024.

Journal ref: Science, 383, 6686, 2024

arXiv:2402.17110 [pdf, other]

Sinkhorn Distance Minimization for Knowledge Distillation

Authors: Xiao Cui, Yulei Qin, Yuting Gao, Enwei Zhang, Zihan Xu, Tong Wu, Ke Li, Xing Sun, Wengang Zhou, Houqiang Li

Abstract: Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which degrade logits-based KD for diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by exploiting properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures.

Submitted 26 February, 2024; originally announced February 2024.

Comments: Accepted by COLING 2024

arXiv:2402.16641 [pdf, other]

Towards Open-ended Visual Quality Comparison

Authors: Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu, Guangtao Zhai, Shiqi Wang, Weisi Lin

Abstract: Comparative settings (e.g. pairwise choice, listwise ranking) have been adopted by a wide range of subjective studies for image quality assessment (IQA), as they inherently standardize the evaluation criteria across different observers and offer more clear-cut responses. In this work, we extend the edge of emerging large multi-modality models (LMMs) to further advance visual quality comparison into open-ended settings that 1) can respond to open-range questions on quality comparison; 2) can provide detailed reasoning beyond direct answers. To this end, we propose Co-Instruct. To train this first-of-its-kind open-source open-ended visual quality comparer, we collect the Co-Instruct-562K dataset from two sources: (a) LLM-merged single image quality description, (b) GPT-4V "teacher" responses on unlabeled data. Furthermore, to better evaluate this setting, we propose MICBench, the first benchmark on multi-image comparison for LMMs. We demonstrate that Co-Instruct not only achieves on average 30% higher accuracy than state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher), on both existing related benchmarks and the proposed MICBench. Our model is published at https://huggingface.co/q-future/co-instruct.

Submitted 4 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: Fix typos

arXiv:2402.14807 [pdf, other]

A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health

Authors: Nikhil Behari, Edwin Zhang, Yunfan Zhao, Aparna Taneja, Dheeraj Nagaraj, Milind Tambe

Abstract: Restless multi-armed bandits (RMAB) have demonstrated success in optimizing resource allocation for large beneficiary populations in public health settings. Unfortunately, RMAB models lack flexibility to adapt to evolving public health policy priorities. Concurrently, Large Language Models (LLMs) have emerged as adept automated planners across domains of robotic control and navigation. In this paper, we propose a Decision Language Model (DLM) for RMABs, enabling dynamic fine-tuning of RMAB policies in public health settings using human-language commands. We propose using LLMs as automated planners to (1) interpret human policy preference prompts, (2) propose reward functions as code for a multi-agent RMAB environment, and (3) iterate on the generated reward functions using feedback from grounded RMAB simulations. We illustrate the application of DLM in collaboration with ARMMAN, an India-based non-profit promoting preventative care for pregnant mothers, which currently relies on RMAB policies to optimally allocate health worker calls to low-resource populations. We conduct a technology demonstration in simulation using the Gemini Pro model, showing DLM can dynamically shape policy outcomes using only human prompts as input.

Submitted 26 May, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

arXiv:2402.14090 [pdf, other]

Social Environment Design

Authors: Edwin Zhang, Sadie Zhao, Tonghan Wang, Safwan Hossain, Henry Gasztowtt, Stephan Zheng, David C. Parkes, Milind Tambe, Yiling Chen

Abstract: Artificial Intelligence (AI) holds promise as a technology that can be used to improve government and economic policy-making. This paper proposes a new research agenda towards this end by introducing Social Environment Design, a general framework for the use of AI for automated policy-making that connects with the Reinforcement Learning, EconCS, and Computational Social Choice communities. The framework seeks to capture general economic environments, includes voting on policy objectives, and gives a direction for the systematic analysis of government and economic policy through AI simulation. We highlight key open problems for future research in AI-based policy-making. By solving these challenges, we hope to achieve various social welfare objectives, thereby promoting more ethical and responsible decision making.

Submitted 17 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

Comments: ICML 2024 Position Paper. Website at https://sed.eddie.win

arXiv:2402.10258 [pdf, other]

FAUST XII. Accretion streamers and jets in the VLA 1623--2417 protocluster

Authors: C. Codella, L. Podio, M. De Simone, C. Ceccarelli, S. Ohashi, C. J. Chandler, N. Sakai, J. E. Pineda, D. M. Segura-Cox, E. Bianchi, N. Cuello, A. López-Sepulcre, D. Fedele, P. Caselli, S. Charnley, D. Johnstone, Z. E. Zhang, M. J. Maureira, Y. Zhang, G. Sabatini, B. Svoboda, I. Jiménez-Serra, L. Loinard, S. Mercimek, N. Murillo, et al. (1 additional author not shown)

Abstract: The ALMA interferometer has played a key role in revealing a new component of the Sun-like star forming process: the molecular streamers, i.e. structures up to thousands of au long funneling material non-axisymmetrically to disks. In the context of the FAUST ALMA LP, the archetypical VLA1623-2417 protostellar cluster has been imaged at 1.3 mm in the SO(5$_6$--4$_5$), SO(6$_6$--5$_5$), and SiO(5--4) line emission at the spatial resolution of 50 au. We detect extended SO emission, peaking towards the A and B protostars. Emission blue-shifted down to 6.6 km s$^{-1}$ reveals for the first time a long ($\sim$ 2000 au) accelerating streamer plausibly feeding the VLA1623 B protostar. Using SO, we derive for the first time an estimate of the excitation temperature of an accreting streamer: 33$\pm$9 K. The SO column density is $\sim$ 10$^{14}$ cm$^{-2}$, and the SO/H$_2$ abundance ratio is $\sim$ 10$^{-8}$. The total mass of the streamer is 3 $\times$ 10$^{-3}$ $M_\odot$, while its accretion rate is 3--5 $\times$ 10$^{-7}$ $M_\odot$ yr$^{-1}$. This is close to the mass accretion rate of VLA1623 B, in the 0.6--3 $\times$ 10$^{-7}$ $M_\odot$ yr$^{-1}$ range, showing the importance of the streamer in contributing to the mass of protostellar disks. The highest blue- and red-shifted SO velocities behave as the SiO(5--4) emission, the latter species detected for the first time in VLA1623-2417: the emission is compact (100-200 au), and associated only with the B protostar. The SO excitation temperature is $\sim$ 100 K, supporting the occurrence of shocks associated with the jet, traced by SiO.

Submitted 15 February, 2024; originally announced February 2024.

Comments: accepted by MNRAS

arXiv:2402.09705 [pdf, other]

Linear Depth QFT over IBM Heavy-hex Architecture

Authors: Xiangyu Gao, Yuwei Jin, Minghao Guo, Henry Chen, Eddy Z. Zhang

Abstract: Compiling a given quantum algorithm into a target hardware architecture is a challenging optimization problem. The compiler must take into consideration the coupling graph of physical qubits and the gate operation dependencies. The existing noise in hardware architectures requires the compilation to use as few running cycles as possible. Existing approaches include using SAT solvers or heuristics to complete the mapping, but these may cause either long compilation times (e.g., timeout after hours) or suboptimal compilation results in terms of running cycles (e.g., an exponentially increasing number of total cycles). In this paper, we propose an efficient mapping approach for Quantum Fourier Transformation (QFT) circuits over the existing IBM heavy-hex architecture. The proposal first turns the architecture into a structure consisting of a straight line with dangling qubits, and then performs the mapping over this generated structure recursively. The calculation shows that there is a linear depth upper bound for the time complexity of these structures, and for a special case where there is 1 dangling qubit in every 5 qubits, the time complexity is 5N+O(1). All these results are better than state-of-the-art methods.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.07116 [pdf, other]

A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs

Authors: Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, Weisi Lin

Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in low-level visual perception and understanding remains a yet-to-explore domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: the low-level visual perception (A1) via visual question answering related to low-level attributes (e.g. clarity, lighting); and the low-level visual description (A2), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting scores, by employing a softmax-based approach to enable all MLLMs to generate quantifiable quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. Datasets will be available at https://github.com/Q-Future/Q-Bench.

Submitted 11 February, 2024; originally announced February 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2309.14181

arXiv:2402.01512 [pdf, other]

Distractor Generation for Multiple-Choice Questions: A Survey of Methods, Datasets, and Evaluation

Authors: Elaf Alhazmi, Quan Z. Sheng, Wei Emma Zhang, Munazza Zaib, Ahoud Alhazmi

Abstract: Distractors are important in learning evaluation. This paper surveys distractor generation tasks using English multiple-choice question datasets for textual and multimodal contexts. In particular, this paper presents a thorough literature review of the recent studies on distractor generation tasks, discusses multiple choice components and their characteristics, analyzes the related datasets, and summarizes the evaluation metrics of distractor generation. Our investigation reveals that more than half of the datasets are human-generated from educational sources in specific domains such as Science and English, which are largely text-based, with a lack of open domain and multimodal datasets.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.17654 [pdf, other]

All Beings Are Equal in Open Set Recognition

Authors: Chaohua Li, Enhao Zhang, Chuanxing Geng, SongCan Chen

Abstract: In open-set recognition (OSR), a promising strategy is exploiting pseudo-unknown data outside given $K$ known classes as an additional $K$+$1$-th class to explicitly model potential open space. However, treating unknown classes without distinction is unequal for them relative to known classes due to the category-agnostic and scale-agnostic nature of the unknowns. This inevitably not only disrupts the inherent distributions of unknown classes but also incurs both class-wise and instance-wise imbalances between known and unknown classes. Ideally, the OSR problem should model the whole class space as $K$+$\infty$, but enumerating all unknowns is impractical. Since the core of OSR is to effectively model the boundaries of known classes, focusing only on the unknowns near the boundaries of targeted known classes seems sufficient. Thus, as a compromise, we convert the open classes from infinite to $K$, with a novel concept, Target-Aware Universum (TAU), and propose a simple yet effective framework, Dual Contrastive Learning with Target-Aware Universum (DCTAU). In detail, guided by the targeted known classes, TAU automatically expands the unknown classes from the previous $1$ to $K$, effectively alleviating the distribution disruption and the imbalance issues mentioned above. Then, a novel Dual Contrastive (DC) loss is designed, where all instances irrespective of known or TAU are considered as positives to contrast with their respective negatives. Experimental results indicate DCTAU sets a new state-of-the-art.

Submitted 31 January, 2024; originally announced January 2024.

Comments: Accepted by the main track The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2401.12197 [pdf, other]

Empirical martingale projections via the adapted Wasserstein distance

Authors: Jose Blanchet, Johannes Wiesel, Erica Zhang, Zhenyuan Zhang

Abstract: Given a collection of multidimensional pairs $\{(X_i,Y_i):1 \leq i\leq n\}$, we study the problem of projecting the associated suitably smoothed empirical measure onto the space of martingale couplings (i.e. distributions satisfying $\mathbb{E}[Y|X]=X$) using the adapted Wasserstein distance. We call the resulting distance the smoothed empirical martingale projection distance (SE-MPD), for which we obtain an explicit characterization. We also show that the space of martingale couplings remains invariant under the smoothing operation. We study the asymptotic limit of the SE-MPD, which converges at a parametric rate as the sample size increases if the pairs are either i.i.d. or satisfy appropriate mixing assumptions. Additional finite-sample results are also investigated. Using these results, we introduce a novel consistent martingale coupling hypothesis test, which we apply to test the existence of arbitrage opportunities in recently introduced neural network-based generative models for asset pricing calibration.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 55 pages, 7 figures

arXiv:2312.17090 [pdf, other]

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Authors: Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, Qiong Yan, Xiongkuo Min, Guangtao Zhai, Weisi Lin

Abstract: The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.

Submitted 28 December, 2023; originally announced December 2023.

Comments: Technical Report

arXiv:2312.16114 [pdf, other]

Quantum Fourier Transformation Circuits Compilation

Authors: Yuwei Jin, Xiangyu Gao, Minghao Guo, Henry Chen, Fei Hua, Chi Zhang, Eddy Z. Zhang

Abstract: In this research paper, our primary focus revolves around the domain-specific hardware mapping strategy tailored for Quantum Fourier Transformation (QFT) circuits. While previous approaches have heavily relied on SAT solvers or heuristic methods to generate hardware-compatible QFT circuits by inserting SWAP gates to realign logical qubits with physical qubits at various stages, they encountered significant challenges. These challenges include extended compilation times due to the expansive search space for SAT solvers and suboptimal outcomes in terms of the number of cycles required to execute all gate operations efficiently. In our study, we adopt a novel approach that combines technical intuition, often referred to as "educated guesses," and sophisticated program synthesis tools. Our objective is to uncover QFT mapping solutions that leverage concepts such as affine loops and modular functions. The groundbreaking outcome of our research is the introduction of the first set of linear-depth transformed QFT circuits designed for Google Sycamore, IBM heavy-hex, and the conventional 2-dimensional (2D) grid configurations, accommodating an arbitrary number of qubits denoted as 'N'. Additionally, we have conducted comprehensive analyses to verify the correctness of these solutions and to develop strategies for handling potential faults within them.

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2312.15300 [pdf, other]

Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models

Authors: Zicheng Zhang, Haoning Wu, Zhongpeng Ji, Chunyi Li, Erli Zhang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Fengyu Sun, Shangling Jui, Weisi Lin, Guangtao Zhai

Abstract: Recent advancements in Multi-modality Large Language Models (MLLMs) have demonstrated remarkable capabilities in complex high-level vision tasks. However, the exploration of MLLM potential in visual quality assessment, a vital aspect of low-level vision, remains limited. To address this gap, we introduce Q-Boost, a novel strategy designed to enhance low-level MLLMs in image quality assessment (IQA) and video quality assessment (VQA) tasks, which is structured around two pivotal components: 1) Triadic-Tone Integration: Ordinary prompt design simply oscillates between the binary extremes of $positive$ and $negative$. Q-Boost innovates by incorporating a 'middle ground' approach through $neutral$ prompts, allowing for a more balanced and detailed assessment. 2) Multi-Prompt Ensemble: Multiple quality-centric prompts are used to mitigate bias and acquire more accurate evaluation. The experimental results show that the low-level MLLMs exhibit outstanding zero-shot performance on the IQA/VQA tasks when equipped with the Q-Boost strategy.

Submitted 23 December, 2023; originally announced December 2023.

arXiv:2312.09983 [pdf, other]

Toward Computationally Efficient Inverse Reinforcement Learning via Reward Shaping

Authors: Lauren H. Cooke, Harvey Klyne, Edwin Zhang, Cassidy Laidlaw, Milind Tambe, Finale Doshi-Velez

Abstract: Inverse reinforcement learning (IRL) is computationally challenging, with common approaches requiring the solution of multiple reinforcement learning (RL) sub-problems. This work motivates the use of potential-based reward shaping to reduce the computational burden of each RL sub-problem. This work serves as a proof-of-concept and, we hope, will inspire future developments towards computationally efficient IRL.

Submitted 18 December, 2023; v1 submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.08653 [pdf, other]

SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector

Authors: Shuailei Ma, Yuefeng Wang, Ying Wei, Jiaqi Fan, Enming Zhang, Xinyu Sun, Peihao Chen

Abstract: In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the combination of a simple knowledge distillation approach and the automatic pseudo-labeling mechanism in OWOD can achieve better performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures for known objects, leading to catastrophic forgetting. To alleviate these problems, we propose the down-weight loss function for knowledge distillation from vision-language to single vision modality. Meanwhile, we propose the cascade decouple decoding structure that decouples the learning of localization and recognition to reduce the impact of category interactions of known and unknown objects on the localization learning process. Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world, we propose two benchmarks, which we name "StandardSet$\heartsuit$" and "IntensiveSet$\spadesuit$" respectively, based on the complexity of their testing scenarios. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods. The code and proposed dataset are available at https://github.com/xiaomabufei/SKDF.

Submitted 30 March, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2303.11623

arXiv:2312.06363 [pdf, other]

MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

Authors: Tao Chen, Enwei Zhang, Yuting Gao, Ke Li, Xing Sun, Yan Zhang, Hui Li

Abstract: Although In-Context Learning (ICL) brings remarkable performance gains to Large Language Models (LLMs), the improvements remain lower than fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms the traditional fine-tuning strategy and the vanilla ICT method that directly takes the concatenation of all information from different modalities as input.

Submitted 12 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.02409 [pdf, other]

MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR

Authors: Yiqian Gan, Hao Xiao, Yizhe Zhao, Ethan Zhang, Zhe Huang, Xin Ye, Lingting Ge

Abstract: Motion prediction has been an essential component of autonomous driving systems since it handles highly uncertain and complex scenarios involving moving agents of different types. In this paper, we propose a Multi-Granular TRansformer (MGTR) framework, an encoder-decoder network that exploits context features in different granularities for different kinds of traffic agents. To further enhance MGTR's capabilities, we leverage LiDAR point cloud data by incorporating LiDAR semantic features from an off-the-shelf LiDAR feature extractor. We evaluate MGTR on the Waymo Open Dataset motion prediction benchmark and show that the proposed method achieved state-of-the-art performance, ranking 1st on its leaderboard (https://waymo.com/open/challenges/2023/motion-prediction/).

Submitted 5 February, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: Accepted to ICRA 2024

arXiv:2312.02189 [pdf, other]

StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D

Authors: Pengsheng Guo, Hans Hao, Adam Caccavale, Zhongzheng Ren, Edward Zhang, Qi Shan, Aditya Sankar, Alexander G. Schwing, Alex Colburn, Fangchang Ma

Abstract: In the realm of text-to-3D generation, utilizing 2D diffusion models through score distillation sampling (SDS) frequently leads to issues such as blurred appearances and multi-faced geometry, primarily due to the intrinsically noisy nature of the SDS loss. Our analysis identifies the core of these challenges as the interaction among noise levels in the 2D diffusion process, the architecture of the diffusion network, and the 3D model representation. To overcome these limitations, we present StableDreamer, a methodology incorporating three advances. First, inspired by InstructNeRF2NeRF, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. This finding provides a novel tool to debug SDS, which we use to show the impact of time-annealing noise levels on reducing multi-faced geometries. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition. Based on this observation, StableDreamer introduces a two-stage training strategy that effectively combines these aspects, resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D Gaussians representation, replacing Neural Radiance Fields (NeRFs), to enhance the overall quality, reduce memory usage during training, accelerate rendering speeds, and better capture semi-transparent objects. StableDreamer reduces multi-face geometries, generates fine details, and converges stably.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2312.00591 [pdf, other]

Less is More: Learning Reference Knowledge Using No-Reference Image Quality Assessment

Authors: Xudong Li, Jingyuan Zheng, Xiawu Zheng, Runze Hu, Enwei Zhang, Yuting Gao, Yunhang Shen, Ke Li, Yutao Liu, Pingyang Dai, Yan Zhang, Rongrong Ji

Abstract: Image Quality Assessment (IQA) with reference images has achieved great success by imitating the human vision system, in which the image quality is effectively assessed by comparing the query image with its pristine reference image. However, for images in the wild, it is quite difficult to access accurate reference images. We argue that it is possible to learn reference knowledge under the No-Reference Image Quality Assessment (NR-IQA) setting, which is effective and efficient empirically. Concretely, by introducing a novel feature distillation method in IQA, we propose a new framework to learn comparative knowledge from non-aligned reference images. Then, to achieve fast convergence and avoid overfitting, we further propose an inductive bias regularization. Such a framework not only solves the congenital defects of NR-IQA but also improves the feature extraction framework, enabling it to express more abundant quality information. Surprisingly, our method utilizes less input while obtaining a more significant improvement compared to the teacher models. Extensive experiments on eight standard NR-IQA datasets demonstrate superior performance over state-of-the-art NR-IQA methods, i.e., achieving PLCC values of 0.917 (vs. 0.884 in LIVEC) and 0.686 (vs. 0.661 in LIVEFB).

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.18259 [pdf, other]

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/

Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

arXiv:2311.17624 [pdf, other]

Combating Multi-path Interference to Improve Chirp-based Underwater Acoustic Communication

Authors: Wenjun Xie, Enqi Zhang, Lizhao You, Deqing Wang, Zhaorui Wang, Liqun Fu

Abstract: Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike the counterpart chirp technology in wireless -- LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copies of one symbol may cause inter-symbol and intra-symbol interference. In this paper, we present UWLoRa+, a system that realizes the same chirp modulation as LoRa with a higher data rate, and enhances LoRa's design to address the multi-path challenge via the following designs: a) we replace the linear chirp used by LoRa with a non-linear chirp to reduce the signal interference range and the collision probability; b) we design an algorithm that first demodulates each path and then combines the demodulation results of the detected paths; and c) we replace the Hamming codes used by LoRa with non-binary LDPC codes to mitigate the impact of the inevitable collisions. Experimental results show that the new designs improve the bit error rate (BER) by 3x, and the packet error rate (PER) significantly, compared with LoRa's naive design. Compared with a state-of-the-art system for decoding underwater LoRa chirp signals, UWLoRa+ improves the throughput by up to 50 times.

Submitted 29 November, 2023; originally announced November 2023.

Showing 1–50 of 289 results for author: Zhang, E