Search | arXiv e-print repository

arXiv:2406.20095 [pdf, other]

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.18020 [pdf, other]

MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

Authors: Muzhen Cai, Sendong Zhao, Haochun Wang, Yanrui Du, Zewen Qiang, Bing Qin, Ting Liu

Abstract: Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities only use molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose a multi-granularity fusion method, MolFusion. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations, and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 8 pages, 5 figures

arXiv:2406.15160 [pdf, other]

Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios

Authors: Ya Jiang, Qing Wang, Jun Du, Maocheng Hu, Pengfei Hu, Zeyan Liu, Shi Cheng, Zhaoxu Nian, Yuxuan Dong, Mingqi Cai, Xin Fang, Chin-Hui Lee

Abstract: This study presents an audio-visual information fusion approach to sound event localization and detection (SELD) in low-resource scenarios. We aim at utilizing audio and video modality information through cross-modal learning and multi-modal fusion. First, we propose a cross-modal teacher-student learning (TSL) framework to transfer information from an audio-only teacher model, trained on a rich collection of audio data with multiple data augmentation techniques, to an audio-visual student model trained with only a limited set of multi-modal data. Next, we propose a two-stage audio-visual fusion strategy, consisting of an early feature fusion and a late video-guided decision fusion to exploit synergies between audio and video modalities. Finally, we introduce an innovative video pixel swapping (VPS) technique to extend an audio channel swapping (ACS) method to an audio-visual joint augmentation. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge data set demonstrate significant improvements in SELD performances. Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranks first place by effectively integrating the proposed techniques into a model ensemble.

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted by ICME 2024

arXiv:2406.13793 [pdf, other]

Exploring the Optimal Time Window for Predicting Cognitive Load Using Physiological Sensor Data

Authors: Minghao Cai, Carrie Demmans Epp

Abstract: Learning analytics has begun to use physiological signals because these have been linked with learners' cognitive and affective states. These signals, when interpreted through machine learning techniques, offer a nuanced understanding of the temporal dynamics of student learning experiences and processes. However, there is a lack of clear guidance on the optimal time window to use for analyzing physiological signals within predictive models. We conducted an empirical investigation of different time windows (ranging from 60 to 210 seconds) when analysing multichannel physiological sensor data for predicting cognitive load. Our results demonstrate a preference for longer time windows, with optimal window length typically exceeding 90 seconds. These findings challenge the conventional focus on immediate physiological responses, suggesting that a broader temporal scope could provide a more comprehensive understanding of cognitive processes. In addition, the variation in which time windows best supported prediction across classifiers underscores the complexity of integrating physiological measures. Our findings provide new insights for developing educational technologies that more accurately reflect and respond to the dynamic nature of learner cognitive load in complex learning environments.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Presented at PhysioCHI: Towards Best Practices for Integrating Physiological Signals in HCI, May 11, 2024, Honolulu, HI, USA

arXiv:2406.09400 [pdf, other]

Yo'LLaVA: Your Personalized Language and Vision Assistant

Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee

Abstract: Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?"; as opposed to a generic inquiry about "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities (e.g., "my friend is holding a cat"), rather than merely observing generic human actions (e.g., "a man is holding a cat"). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

Submitted 13 June, 2024; originally announced June 2024.

Comments: Project page: https://thaoshibe.github.io/YoLLaVA

arXiv:2406.03038 [pdf]

Study on layout of double rotated serpentine springs for vertical-comb-driven torsional micromirror

Authors: Biyun Ling, Yuhu Xia, Minli Cai, Xiaoyue Wang, Yaming Wu

Abstract: The combination of double rotated serpentine springs (RSS) and vertical comb-drive is a suitable solution for the development of torsional micromirrors with high fill factor, low fabrication difficulty and good performance. However, the alignment error between the upper and lower comb sets caused by fabrication can induce force in an unexpected direction, and the cross-axis coupled spring constants in double rotated serpentine springs (DRSSs) make the micromirror more susceptible to this alignment error. Herein, in order to minimize the unexpected deflection caused by alignment error of a vertical-comb-driven micromirror, this paper, for the first time, studies the effect of the layout (centrosymmetrically-arranged and axisymmetrically-arranged) of DRSSs on cross-axis coupled spring constants. Both theoretical analysis and finite element analysis (FEA) simulation are conducted to reveal this phenomenon. With an example, centrosymmetrically-arranged DRSSs are proved to be more resistant to pull-in of two comb sets. Finally, the relationship between key structure parameters and cross-axis coupled spring constants of centrosymmetrically-arranged DRSSs is presented.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.02721 [pdf, other]

Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Authors: Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu

Abstract: We propose Self-Control, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed in suffix string and the model's self-assessment of adherence, Self-Control computes the gradient of this self-judgment concerning the model's hidden states, directly influencing the auto-regressive generation process towards desired behaviors. To enhance efficiency, we introduce Self-Control_{prefix}, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control for various LLM behaviors. Our experiments demonstrate Self-Control's efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Especially, Self-Control_{prefix} enables a plug-and-play control and jointly controls multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.

Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: 41 pages, 12 figures, 41 tables; Website: https://llm-self-control.github.io/

arXiv:2405.20718 [pdf, other]

Popularity-Aware Alignment and Contrast for Mitigating Popularity Bias

Authors: Miaomiao Cai, Lei Chen, Yifan Wang, Haoyue Bai, Peijie Sun, Le Wu, Min Zhang, Meng Wang

Abstract: Collaborative Filtering (CF) typically suffers from the significant challenge of popularity bias due to the uneven distribution of items in real-world datasets. This bias leads to a significant accuracy gap between popular and unpopular items. It not only hinders accurate user preference understanding but also exacerbates the Matthew effect in recommendation systems. To alleviate popularity bias, existing efforts focus on emphasizing unpopular items or separating the correlation between item representations and their popularity. Despite their effectiveness, existing works still face two persistent challenges: (1) how to extract common supervision signals from popular items to improve the unpopular item representations, and (2) how to alleviate the representation separation caused by popularity bias. In this work, we conduct an empirical analysis of popularity bias and propose Popularity-Aware Alignment and Contrast (PAAC) to address these two challenges. Specifically, we use the common supervisory signals modeled in popular item representations and propose a novel popularity-aware supervised alignment module to learn unpopular item representations. Additionally, we suggest re-weighting the contrastive learning loss to mitigate the representation separation from a popularity-centric perspective. Finally, we validate the effectiveness and rationale of PAAC in mitigating popularity bias through extensive experiments on three real-world datasets. Our code is available at https://github.com/miaomiao-cai2/KDD2024-PAAC.

Submitted 11 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

Comments: Accepted by KDD 2024

arXiv:2405.17430 [pdf, other]

Matryoshka Multimodal Models

Authors: Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

Abstract: Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed them into a Large Language Model (LLM). However, this design causes an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning/merging methods do exist, they produce a single length output for each image and do not afford flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka Dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) One can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) Our approach provides a foundation to explore the best trade-off between performance and visual token length at sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Project Page: https://matryoshka-mm.github.io/

arXiv:2405.15783 [pdf, other]

doi 10.1145/3626772.3658596

Multimodality Invariant Learning for Multimedia-Based New Item Recommendation

Authors: Haoyue Bai, Le Wu, Min Hou, Miaomiao Cai, Zhuangzhuang He, Yuyang Zhou, Richang Hong, Meng Wang

Abstract: Multimedia-based recommendation provides personalized item suggestions by learning the content preferences of users. With the proliferation of digital devices and apps, a huge number of new items are created rapidly over time. How to quickly provide recommendations for new items at inference time is challenging. What's worse, real-world items exhibit varying degrees of modality missing (e.g., many short videos are uploaded without text descriptions). Though many efforts have been devoted to multimedia-based recommendations, they either could not deal with new multimedia items or assumed modality completeness in the modeling process. In this paper, we highlight the necessity of tackling the modality missing issue for new item recommendation. We argue that users' inherent content preference is stable and better kept invariant to arbitrary modality missing environments. Therefore, we approach this problem from a novel perspective of invariant learning. However, how to construct environments from finite user behavior training data to generalize to any modality missing is challenging. To tackle this issue, we propose a novel Multimodality Invariant Learning reCommendation (a.k.a. MILK) framework. Specifically, MILK first designs a cross-modality alignment module to keep semantic consistency from pretrained multimedia item features. After that, MILK designs multi-modal heterogeneous environments with cyclic mixup to augment training data, in order to mimic any modality missing for invariant user preference learning. Extensive experiments on three real datasets verify the superiority of our proposed framework. The code is available at https://github.com/HaoyueBai98/MILK.

Submitted 28 April, 2024; originally announced May 2024.

arXiv:2405.10818 [pdf]

Modeling Supply Chain Interaction and Disruption: Insights from Real-world Data and Complex Adaptive System

Authors: Jiawei Feng, Mengsi Cai, Fangze Dai, Tianci Bu, Xiaoyu Zhang, Huijun Zheng, Xin Lu

Abstract: In the rapidly evolving automotive industry, Systems-on-Chips (SoCs) are playing an increasingly crucial role in enhancing vehicle intelligence, connectivity, and safety features. For enterprises whose business encompasses automotive SoCs, the sustained and stable provision and receipt of SoC relevant goods or services are essential. Considering the imperative for a resilient and adaptable supply network, enterprises are concentrating their efforts on formulating strategies to address risks stemming from supply chain disruptions caused by technological obsolescence, natural disasters, and geopolitical tensions. This study presents an open supply knowledge extraction and complement approach and builds a supply chain network of automotive SoC enterprises in China, which incorporates cross-domain named entity recognition under limited information, fuzzy matching of firm entities, and supply relation inferring based on knowledge graph. Subsequently, we exhibit the degree and registered capital distribution across firms, and analyze the correlations between centrality metrics in the supply chain network. Finally, based on recovery capacity and risk transfer, two interaction disruption models (IDMs) are developed to elucidate the adaptive behaviors and effect of network disruptions under various business and attack strategies. This research not only aids in exploring the complexities of the Chinese automotive SoC supply chain but also enriches our understanding of the dynamics of firm behavior in this crucial industry sector.

Submitted 17 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2304.10428 by other authors

arXiv:2405.06670 [pdf, other]

TLINet: Differentiable Neural Network Temporal Logic Inference

Authors: Danyang Li, Mingyu Cai, Cristian-Ioan Vasile, Roberto Tron

Abstract: There has been a growing interest in extracting formal descriptions of the system behaviors from data. Signal Temporal Logic (STL) is an expressive formal language used to describe spatial-temporal properties with interpretability. This paper introduces TLINet, a neural-symbolic framework for learning STL formulas. The computation in TLINet is differentiable, enabling the usage of off-the-shelf gradient-based tools during the learning process. In contrast to existing approaches, we introduce approximation methods for max operator designed specifically for temporal logic-based gradient techniques, ensuring the correctness of STL satisfaction evaluation. Our framework not only learns the structure but also the parameters of STL formulas, allowing flexible combinations of operators and various logical structures. We validate TLINet against state-of-the-art baselines, demonstrating that our approach outperforms these baselines in terms of interpretability, compactness, rich expressibility, and computational efficiency.

Submitted 14 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.05543 [pdf, ps, other]

Predicting Cognitive Load Using Sensor Data in a Literacy Game

Authors: Minghao Cai, Carrie Demmans Epp

Abstract: Educational games are being increasingly used to support self-paced learning. However, educators and system designers often face challenges in monitoring student affect and cognitive load. Existing assessments in game-based learning environments (GBLEs) tend to focus more on outcomes rather than processes, potentially overlooking key aspects of the learning journey that include learner affect and cognitive load. To address this issue, we collected data and trained a model to track learner cognitive load while they used an online literacy game for English. We collected affect-related physiological data and pupil data during gameplay to enable the development of models that identify these latent characteristics of learner processes. Our model indicates the feasibility of using these data to track cognitive load in GBLEs. Our multimodal model distinguished different levels of cognitive load, achieving the highest Kappa (.417) and accuracy (70%). Our model reveals the importance of including affect-related features (i.e., EDA and heart rate) when predicting cognitive load and extends recent findings suggesting the benefit of using multiple channels when modeling latent aspects of learner processes. Findings also suggest that cognitive load tracking could now be used to facilitate the creation of personalized learning experiences.

Submitted 9 May, 2024; originally announced May 2024.

Comments: This work has been accepted by the 17th International Conference on Educational Data Mining

arXiv:2404.14673 [pdf, other]

High-Dimensional Two-Photon Quantum Controlled Phase-Flip Gate

Authors: Mingyuan Chen, Jiangshan Tang, Miao Cai, Franco Nori, Keyu Xia

Abstract: High-dimensional quantum systems have been used to reveal interesting fundamental physics and to improve information capacity and noise resilience in quantum information processing. However, it remains a significant challenge to realize universal two-photon quantum gates in high dimensions with high success probability. Here, by considering an ion-cavity QED system, we theoretically propose, to the best of our knowledge, the first high-dimensional, deterministic and universal two-photon quantum gate. By using an optical cavity embedded with a single trapped 40Ca+ ion, we achieve a high average fidelity larger than 98% for a quantum controlled phase-flip gate in four-dimensional space, spanned by photonic spin angular momenta and orbital angular momenta. Our proposed system can be an essential building block for high-dimensional quantum information processing, and also provides a platform for studying high-dimensional cavity QED.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.14219 [pdf, other]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Authors: Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra , et al. (90 additional authors not shown)

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

Submitted 23 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 19 pages

arXiv:2404.12860 [pdf, other]

Nonreciprocal PT-symmetric phase transition in a non-Hermitian chiral quantum optical system

Authors: Miao Cai, Jiang-Shan Tang, Ming-Yuan Chen, Keyu Xia

Abstract: Phase transitions, non-Hermiticity and nonreciprocity play central roles in fundamental physics. However, the triple interplay of these three fields is lacking in the quantum domain. Here, we show a nonreciprocal parity-time-symmetric phase transition in a non-Hermitian chiral quantum electrodynamical system, caused by the directional system dissipation. In remarkable contrast to previously reported nonreciprocal phase transitions, the nonreciprocal parity-time-symmetric phases appear even when the atom-resonator coupling is reciprocal. Nonreciprocal photon blockade is obtained in the nonreciprocal phase region. These results may deepen the fundamental insight into nonreciprocal and non-Hermitian quantum physics, and also open a new door for unconventional quantum manipulation.

Submitted 21 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

Comments: 6 pages, 4 figures

arXiv:2404.11193 [pdf, other]

Photonic indistinguishability characterization and optimization for cavity-based single-photon source

Authors: Miao Cai, Mingyuan Chen, Jiangshan Tang, Keyu Xia

Abstract: Indistinguishability of single photons from independent sources is critically important for scalable quantum technologies. We provide a comprehensive comparison of single-photon indistinguishability of different kinds of cavity quantum electrodynamics (CQED) systems by numerically simulating Hong-Ou-Mandel (HOM) two-photon interference. We find that CQED systems using natural atoms exhibit superiority in indistinguishability, benefiting from their inherently identical features. Moreover, $\Lambda$-type three-level atoms show essential robustness against variation of various system parameters because they exploit the two ground states with considerably smaller decay rates for single-photon generation. Furthermore, a machine learning-based framework is proposed to significantly and robustly improve single-photon indistinguishability for two non-identical CQED systems. This work may pave the way for designing and engineering reliable and scalable photon-based quantum technologies.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.08121 [pdf, other]

Symmetric Tropical Rank 2 Matrices

Authors: May Cai, Kisun Lee, Josephine Yu

Abstract: We study the tropicalization of the space of symmetric rank two matrices. Analogously to the result of Markwig and Yu for general tropical rank two matrices, we show that it has a simplicial complex structure as the space of symmetric bicolored trees and that this simplicial complex is shellable. We also discuss some matroid structures arising from this space and present generating functions for the number of symmetric bicolored trees.

Submitted 11 April, 2024; originally announced April 2024.

Comments: 20 pages, 8 figures

MSC Class: 14T15

arXiv:2404.00101 [pdf, other]

Quandle Action Quivers

Authors: Mason Cai, Sam Nelson

Abstract: Quandle Coloring Quivers are directed graph-valued invariants of classical and virtual knots and links associated to finite quandles. Quandle action quivers are subquivers of the full quandle coloring quiver associated to quandle actions by elements of the coloring quandle. These quivers provide a categorification of the quandle counting invariant associated to each element of the quandle. We obtain new polynomial invariants called quandle action polynomials from these quivers as decategorifications.

Submitted 29 March, 2024; originally announced April 2024.

Comments: 8 pages

MSC Class: 57K12

arXiv:2403.19927 [pdf, ps, other]

Parameter choice strategies for regularized least squares approximation of noisy continuous functions on the unit circle

Authors: Congpei An, Mou Cai

Abstract: In this paper, we consider a trigonometric polynomial reconstruction of continuous periodic functions from their noisy values at equidistant nodes of the unit circle by a regularized least squares method. We indicate that the constructed trigonometric polynomial can be determined explicitly owing to the exactness of the trapezoidal rule. Then a concrete error bound is derived based on the estimation of the Lebesgue constant. In particular, we analyze three regularization parameter choice strategies: Morozov's discrepancy principle, L-curve and generalized cross-validation. Finally, numerical examples are given to show that parameters well chosen by the above strategies can improve approximation quality.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.19770 [pdf, other]

Hierarchical Deep Learning for Intention Estimation of Teleoperation Manipulation in Assembly Tasks

Authors: Mingyu Cai, Karankumar Patel, Soshi Iba, Songpo Li

Abstract: In human-robot collaboration, shared control presents an opportunity to teleoperate robotic manipulation to improve the efficiency of manufacturing and assembly processes. Robots are expected to assist in executing the user's intentions. To this end, robust and prompt intention estimation is needed, relying on behavioral observations. The framework presents an intention estimation technique at hierarchical levels, i.e., low-level actions and high-level tasks, by incorporating multi-scale hierarchical information in neural networks. Technically, we employ hierarchical dependency loss to boost overall accuracy. Furthermore, we propose a multi-window method that assigns proper hierarchical prediction windows of input data. An analysis of the predictive power with various inputs demonstrates the predominance of the deep hierarchical model in the sense of prediction accuracy and early intention identification. We implement the algorithm on a virtual reality (VR) setup to teleoperate robotic hands in a simulation with various assembly tasks to show the effectiveness of online estimation.

Submitted 28 March, 2024; originally announced March 2024.

Comments: ICRA 2024

arXiv:2403.18348 [pdf, other]

Sequential Recommendation with Latent Relations based on Large Language Model

Authors: Shenghao Yang, Weizhi Ma, Peijie Sun, Qingyao Ai, Yiqun Liu, Mingchen Cai, Min Zhang

Abstract: Sequential recommender systems predict items that may interest users by modeling their preferences based on historical interactions. Traditional sequential recommendation methods rely on capturing implicit collaborative filtering signals among items. Recent relation-aware sequential recommendation models have achieved promising performance by explicitly incorporating item relations into the modeling of user historical sequences, where most relations are extracted from knowledge graphs. However, existing methods rely on manually predefined relations and suffer from the sparsity issue, limiting the generalization ability in diverse scenarios with varied item relations. In this paper, we propose a novel relation-aware sequential recommendation framework with Latent Relation Discovery (LRD). Different from previous relation-aware models that rely on predefined rules, we propose to leverage the Large Language Model (LLM) to provide new types of relations and connections between items. The motivation is that the LLM contains abundant world knowledge, which can be adopted to mine latent relations of items for recommendation. Specifically, inspired by the fact that humans can describe relations between items using natural language, LRD harnesses the LLM that has demonstrated human-like knowledge to obtain language knowledge representations of items. These representations are fed into a latent relation discovery module based on the discrete state variational autoencoder (DVAE). Then the self-supervised relation discovery tasks and recommendation tasks are jointly optimized.
Experimental results on multiple public datasets demonstrate that our proposed latent relation discovery method can be incorporated with existing relation-aware sequential recommendation models and significantly improve the performance. Further analysis experiments indicate the effectiveness and reliability of the discovered latent relations.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted by SIGIR 2024

arXiv:2403.18325 [pdf, other]

Common Sense Enhanced Knowledge-based Recommendation with Large Language Model

Authors: Shenghao Yang, Weizhi Ma, Peijie Sun, Min Zhang, Qingyao Ai, Yiqun Liu, Mingchen Cai

Abstract: Knowledge-based recommendation models effectively alleviate the data sparsity issue by leveraging the side information in the knowledge graph, and have achieved considerable performance. Nevertheless, the knowledge graphs used in previous work, namely metadata-based knowledge graphs, are usually constructed based on the attributes of items and co-occurring relations (e.g., also buy), in which the former provides limited information and the latter relies on sufficient interaction data and still suffers from the cold-start issue. Common sense, as a form of knowledge with generality and universality, can be used as a supplement to the metadata-based knowledge graph and provides a new perspective for modeling users' preferences. Recently, benefiting from the emergent world knowledge of the large language model, efficient acquisition of common sense has become possible. In this paper, we propose a novel knowledge-based recommendation framework incorporating common sense, CSRec, which can be flexibly coupled to existing knowledge-based methods. Considering the challenge of the knowledge gap between the common sense-based knowledge graph and metadata-based knowledge graph, we propose a knowledge fusion approach based on mutual information maximization theory. Experimental results on public datasets demonstrate that our approach significantly improves the performance of existing knowledge-based recommendation models.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted by DASFAA 2024

arXiv:2403.15388 [pdf, other]

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Authors: Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

Abstract: Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. Specifically, to measure the importance of each token, we exploit the sparsity observed in the visual encoder, characterized by the sparse distribution of attention scores between the class token and visual tokens. This sparsity enables us to dynamically select the most crucial visual tokens to retain. Subsequently, we cluster the selected (unpruned) tokens based on their key similarity and merge them with the unpruned tokens, effectively supplementing and enhancing their informational content.
Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 14 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.

Submitted 22 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

Comments: Project page: https://llava-prumerge.github.io/

arXiv:2403.14125 [pdf, other]

Learning causal graphs using variable grouping according to ancestral relationship

Authors: Ming Cai, Hisayuki Hara

Abstract: Several causal discovery algorithms have been proposed. However, when the sample size is small relative to the number of variables, the accuracy of estimating causal graphs using existing methods decreases. And some methods are not feasible when the sample size is smaller than the number of variables. To circumvent these problems, some researchers proposed causal structure learning algorithms using divide-and-conquer approaches. For learning the entire causal graph, the approaches first split variables into several subsets according to the conditional independence relationships among the variables, then apply a conventional causal discovery algorithm to each subset and merge the estimated results. Since the divide-and-conquer approach reduces the number of variables to which a causal structure learning algorithm is applied, it is expected to improve the estimation accuracy of causal graphs, especially when the sample size is small relative to the number of variables and the model is sparse. However, existing methods are either computationally expensive or do not provide sufficient accuracy when the sample size is small. This paper proposes a new algorithm for grouping variables based on the ancestral relationships among the variables, under the LiNGAM assumption, where the causal relationships are linear, and the mutually independent noise terms follow continuous non-Gaussian distributions. We call the proposed algorithm CAG. The time complexity of the ancestor finding in CAG is shown to be cubic in the number of variables.
Extensive computer experiments confirm that the proposed method outperforms the original DirectLiNGAM without grouping variables and other divide-and-conquer approaches not only in estimation accuracy but also in computation time when the sample size is small relative to the number of variables and the model is sparse.

Submitted 21 March, 2024; originally announced March 2024.

Comments: 12 pages, 5 figures

arXiv:2403.04369 [pdf, other]

From Graph to Word Bag: Introducing Domain Knowledge to Confusing Charge Prediction

Authors: Ang Li, Qiangchao Chen, Yiquan Wu, Ming Cai, Xiang Zhou, Fei Wu, Kun Kuang

Abstract: Confusing charge prediction is a challenging task in legal AI, which involves predicting confusing charges based on fact descriptions. While existing charge prediction methods have shown impressive performance, they face significant challenges when dealing with confusing charges, such as Snatch and Robbery. In the legal domain, constituent elements play a pivotal role in distinguishing confusing charges. Constituent elements are fundamental behaviors underlying criminal punishment and have subtle distinctions among charges. In this paper, we introduce a novel From Graph to Word Bag (FWGB) approach, which introduces domain knowledge regarding constituent elements to guide the model in making judgments on confusing charges, much like a judge's reasoning process. Specifically, we first construct a legal knowledge graph containing constituent elements to help select keywords for each charge, forming a word bag. Subsequently, to guide the model's attention towards the differentiating information for each charge within the context, we expand the attention mechanism and introduce a new loss function with attention supervision through words in the word bag. We construct the confusing charges dataset from real-world judicial documents. Experiments demonstrate the effectiveness of our method, especially in maintaining exceptional performance in imbalanced label distributions.

Submitted 24 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.04366 [pdf, other]

Enhancing Court View Generation with Knowledge Injection and Guidance

Authors: Ang Li, Yiquan Wu, Yifei Liu, Fei Wu, Ming Cai, Kun Kuang

Abstract: Court View Generation (CVG) is a challenging task in the field of Legal Artificial Intelligence (LegalAI), which aims to generate court views based on the plaintiff claims and the fact descriptions. While Pretrained Language Models (PLMs) have showcased their prowess in natural language generation, their application to the complex, knowledge-intensive domain of CVG often reveals inherent limitations. In this paper, we present a novel approach, named Knowledge Injection and Guidance (KIG), designed to bolster CVG using PLMs. To efficiently incorporate domain knowledge during the training stage, we introduce a knowledge-injected prompt encoder for prompt tuning, thereby reducing computational overhead. Moreover, to further enhance the model's ability to utilize domain knowledge, we employ a generating navigator, which dynamically guides the text generation process in the inference stage without altering the model's architecture, making it readily transferable. Comprehensive experiments on real-world data demonstrate the effectiveness of our approach compared to several established baselines, especially in the responsivity of claims, where it outperforms the best baseline by 11.87%.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.03790 [pdf, other]

Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery

Authors: Wei Zhang, Miaoxin Cai, Tong Zhang, Guoqiang Lei, Yin Zhuang, Xuerui Mao

Abstract: Ship detection needs to identify ship locations from remote sensing (RS) scenes. Due to different imaging payloads, various appearances of ships, and complicated background interference from the bird's eye view, it is difficult to set up a unified paradigm for achieving multi-source ship detection. To address this challenge, in this article, leveraging the powerful generalization ability of large language models (LLMs), a unified visual-language model called Popeye is proposed for multi-source ship detection from RS imagery. Specifically, to bridge the interpretation gap between the multi-source images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection ways, i.e., horizontal bounding box (HBB) and oriented bounding box (OBB). Subsequently, the hybrid experts encoder is designed to refine multi-scale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to enhance interactive comprehension ability between visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring the pre-trained visual-language knowledge from the nature scene into the RS domain for multi-source ship detection. In addition, the segment anything model (SAM) is also seamlessly integrated into the proposed Popeye to achieve pixel-level ship segmentation without additional training costs.
Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that the proposed Popeye outperforms current specialist, open-vocabulary, and other visual-language models for zero-shot multi-source ship detection.

Submitted 13 June, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.03730 [pdf, other]

Learning 3D object-centric representation through prediction

Authors: John Day, Tushar Arora, Jirui Liu, Li Erran Li, Ming Bo Cai

Abstract: As part of human core knowledge, the representation of objects is the building block of mental representation that supports high-level concepts and symbolic reasoning. While humans develop the ability of perceiving objects situated in 3D environments without supervision, models that learn the same set of abilities with similar constraints faced by human infants are lacking. Towards this end, we developed a novel network architecture that simultaneously learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth, all while using only information directly available to the brain as training data, namely: sequences of images and self-motion. The core idea is treating objects as latent causes of visual input which the brain uses to make efficient predictions of future scenes. This results in object representations being learned as an essential byproduct of learning to predict.

Submitted 6 March, 2024; originally announced March 2024.

Comments: 21 pages, 11 figures. Project webpage can be found at https://jday54.github.io/opple_site/

ACM Class: I.2.10; I.4.8; I.4.6; I.4.10; I.2.6

arXiv:2403.02614 [pdf, other]

doi 10.1364/OE.509601

Generation of True Quantum Random Numbers with On-Demand Probability Distributions via Single-Photon Quantum Walks

Authors: Chaoying Meng, Miao Cai, Yufang Yang, Haodong Wu, Zhixiang Li, Ya** Ruan, Yong Zhang, Han Zhang, Keyu Xia, Franco Nori

Abstract: Random numbers are at the heart of diverse fields, ranging from simulations of stochastic processes to classical and quantum cryptography. The requirement for true randomness in these applications has motivated various proposals for generating random numbers based on the inherent randomness of quantum systems. The generation of true random numbers with arbitrarily defined probability distributions is highly desirable for applications, but it is very challenging. Here we show that single-photon quantum walks can generate multi-bit random numbers with on-demand probability distributions, when the required "coin" parameters are found with the gradient descent (GD) algorithm. Our theoretical and experimental results exhibit high fidelity for various selected distributions. This GD-enhanced single-photon system provides a convenient way for building flexible and reliable quantum random number generators. Multi-bit random numbers are a necessary resource for high-dimensional quantum key distribution.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2403.01686 [pdf, other]

doi 10.3847/2041-8213/ad319

AT2023lli: A Tidal Disruption Event with Prominent Optical Early Bump and Delayed Episodic X-ray Emission

Authors: Shifeng Huang, Ning Jiang, Jiazheng Zhu, Yibo Wang, Tinggui Wang, Shan-Qin Wang, Wen-Pei Gan, En-Wei Liang, Yu-**g Qin, Zheyu Lin, Lin-Na Xu, Min-Xuan Cai, Ji-An Jiang, Xu Kong, Jiaxun Li, Long Li, Jian-Guo Wang, Ze-Lin Xu, Yongquan Xue, Ye-Fei Yuan, **gquan Cheng, Lulu Fan, Jie Gao, Lei Hu, Weida Hu , et al. (20 additional authors not shown)

Abstract: High-cadence, multiwavelength observations have continuously revealed the diversity of tidal disruption events (TDEs), thus greatly advancing our knowledge and understanding of TDEs. In this work, we conducted an intensive optical-UV and X-ray follow-up campaign of TDE AT2023lli, and found a remarkable month-long bump in its UV/optical light curve nearly two months prior to maximum brightness. The bump represents the longest separation time from the main peak among known TDEs to date. The main UV/optical outburst declines as $t^{-4.10}$, making it one of the fastest decaying optically selected TDEs. Furthermore, we detected sporadic X-ray emission 30 days after the UV/optical peak, accompanied by a reduction in the period of inactivity. It is proposed that the UV/optical bump could be caused by the self-intersection of the stream debris, whereas the primary peak is generated by the reprocessed emission of the accretion process. In addition, our results suggest that episodic X-ray radiation during the initial phase of decline may be due to the patched obscurer surrounding the accretion disk, a phenomenon associated with the inhomogeneous reprocessing process. The double TDE scenario, in which two stars are disrupted in sequence, is also a possible explanation for producing the observed early bump and main peak.
We anticipate that the multicolor light curves of TDEs, especially in the very early stages, and the underlying physics can be better understood in the near future with the assistance of dedicated surveys such as the deep high-cadence survey of the 2.5-meter Wide Field Survey Telescope (WFST).

Submitted 26 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 14 pages, 8 figures, accepted for publication by ApJL

arXiv:2402.18166 [pdf, other]

Sequence-level Semantic Representation Fusion for Recommender Systems

Authors: Lanling Xu, Zhen Tian, Bingqian Li, Junjie Zhang, **peng Wang, Mingchen Cai, Wayne Xin Zhao

Abstract: With the rapid development of recommender systems, there is increasing side information that can be employed to improve the recommendation performance. Specifically, we focus on the utilization of the associated textual data of items (e.g., product title) and study how text features can be effectively fused with ID features in sequential recommendation. However, there exist distinct data characteristics for the two kinds of item features, making a direct fusion method (e.g., adding text and ID embeddings as item representation) less effective. To address this issue, we propose a novel Text-ID semantic fusion approach for sequential Recommendation, namely TedRec. The core idea of our approach is to conduct a sequence-level semantic fusion approach by better integrating global contexts. The key strategy lies in that we transform the text embeddings and ID embeddings by Fourier Transform from the time domain to the frequency domain. In the frequency domain, the global sequential characteristics of the original sequences are inherently aggregated into the transformed representations, so that we can employ simple multiplicative operations to effectively fuse the two kinds of item features. Our fusion approach can be proved to have the same effects as contextual convolution, thereby achieving sequence-level semantic fusion.
In order to further improve the fusion performance, we propose to enhance the discriminability of the text embeddings from the text encoder, by adaptively injecting positional information via a mixture-of-experts (MoE) modulation method. Our implementation is available at this repository: https://github.com/RUCAIBox/TedRec.

Submitted 28 February, 2024; originally announced February 2024.

Comments: 8 pages, 5 figures
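
Note: the abstract's statement that multiplicative fusion in the frequency domain has the effect of a convolution is an instance of the convolution theorem. The following is a minimal pure-Python sketch of that identity on toy one-dimensional sequences. It is not the authors' implementation: the names `text_seq`, `id_seq`, and `fuse_in_frequency_domain` are hypothetical, and real embeddings would be multi-dimensional and processed with an FFT library.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def circular_convolution(a, b):
    """Direct circular convolution in the time domain."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]


def fuse_in_frequency_domain(a, b):
    """Multiply the two spectra element-wise, then transform back."""
    A, B = dft(a), dft(b)
    fused = idft([p * q for p, q in zip(A, B)])
    # The inputs are real, so the imaginary parts are only rounding noise.
    return [v.real for v in fused]


# Toy stand-ins for a text-feature sequence and an ID-feature sequence.
text_seq = [1.0, 2.0, 0.0, -1.0]
id_seq = [0.5, 0.0, 1.0, 0.0]

fused = fuse_in_frequency_domain(text_seq, id_seq)
direct = circular_convolution(text_seq, id_seq)
```

Under these assumptions, `fused` and `direct` agree up to floating-point rounding, which is why each fused position depends on every position of both input sequences.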

arXiv:2402.13254 [pdf, other]

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

Authors: Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee

Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two critical under-explored problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V. To facilitate future research, we release our code, dataset, benchmark, and checkpoints at https://countercurate.github.io.

Submitted 12 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 15 pages, 6 figures, 12 tables, Project Page: https://countercurate.github.io/

arXiv:2402.13194 [pdf, other]

Quantum Wiretap Channel Coding Assisted by Noisy Correlation

Authors: Minglai Cai, Andreas Winter

Abstract: We consider the private classical capacity of a quantum wiretap channel, where the users (sender Alice, receiver Bob, and eavesdropper Eve) have access to the resource of a shared quantum state, in addition to their channel inputs and outputs. An extreme case is maximal entanglement or a secret key between Alice and Bob, both of which would allow for one-time padding the message. But here both the wiretap channel and the shared state are general. In the other extreme case that the state is trivial, we recover the wiretap channel and its private capacity [N. Cai, A. Winter and R. W. Yeung, Probl. Inform. Transm. 40(4):318-336, 2004]. We show how to use the given resource state to build a code for secret classical communication. Our main result is a lower bound on the assisted private capacity, which asymptotically meets the multi-letter converse and which encompasses all sorts of previous results as special cases.

Submitted 11 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Journal ref: Proc. ISIT 2024, Athens (Greece), 7-12 July 2024

arXiv:2402.01137 [pdf, ps, other]

Long-time dynamics of stochastic wave equation with dissipative damping and its full discretization: exponential ergodicity and strong law of large numbers

Authors: Meng Cai, Chuchu Chen, Jialin Hong, Tau Zhou

Abstract: For the stochastic wave equation, when the dissipative damping is a non-globally Lipschitz function of the velocity, there are, to our knowledge, few results on the long-time dynamics of the equation and its numerical discretization, in particular on exponential ergodicity and the strong law of large numbers. Focusing on this issue, the main contributions of this paper are as follows. First, based on constructing novel Lyapunov functionals, we show the unique invariant measure and exponential ergodicity of the underlying equation and its full discretization. Second, the error estimates of invariant measures both in Wasserstein distance and in the weak sense are obtained. Third, the strong laws of large numbers of the equation and the full discretization are obtained, which state that the time averages of the exact and numerical solutions converge to the ergodic limit almost surely.

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2401.16822 [pdf, other]

EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain

Authors: Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, Xuerui Mao

Abstract: Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain. Owing to the significant differences between natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in its infancy. To fill the gap, a pioneering MLLM named EarthGPT, which uniformly integrates various multi-sensor RS interpretation tasks, is proposed in this paper for universal RS image comprehension. In EarthGPT, three key techniques are developed, including a visual-enhanced perception mechanism, a cross-modal mutual comprehension approach, and a unified instruction tuning method for multi-sensor multi-task in the RS domain. More importantly, a dataset named MMRS-1M featuring large-scale multi-sensor multi-modal RS instruction-following is constructed, comprising over 1M image-text pairs based on 34 existing diverse RS datasets and including multi-sensor images such as optical, synthetic aperture radar (SAR), and infrared. The MMRS-1M dataset addresses the drawback of MLLMs on RS expert knowledge and stimulates the development of MLLMs in the RS domain. Extensive experiments are conducted, demonstrating EarthGPT's superior performance in various RS visual interpretation tasks compared with the other specialist models and MLLMs, proving the effectiveness of the proposed EarthGPT and offering a versatile paradigm for open-set reasoning tasks.

Submitted 8 March, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.11613 [pdf, other]

Hot Jupiter Formation in Dense Star Clusters

Authors: Leonard Benkendorff, Francesco Flammini Dotti, Katja Stock, Maxwell Xu Cai, Rainer Spurzem

Abstract: Hot Jupiters (HJ) are defined as Jupiter-mass exoplanets orbiting around their host star with an orbital period < 10 days. It is assumed that HJ do not form in-situ but ex-situ. Recent discoveries show that star clusters contribute to the formation of HJ. We present direct $N$-body simulations of planetary systems in star clusters and analyze the formation of HJ in them. We combine two direct $N$-body codes: NBODY6++GPU for the dynamics of dense star clusters with 32 000 and 64 000 stellar members and LonelyPlanets used to follow 200 identical planetary systems around solar mass stars in those star clusters. We use different sets with 3, 4, or 5 planets and with the innermost planet at a semi-major axis of 5 au or 1 au and follow them for 100 Myr in our simulations. The results indicate that HJs are generated with high efficiency in dense star clusters if the innermost planet is already close to the host star at a semi-major axis of 1 au. If the innermost planet is initially beyond a semi-major axis of 5 au, the probability of a potential HJ ranges between $1.5-4.5$ percent. Very dense stellar neighborhoods tend to eject planets rather than forming HJs. A correlation between HJ formation and angular momentum deficit (AMD) is not witnessed. Young Hot Jupiters ($t_{\rm age} < 100$ Myrs) have only been found, in our simulations, in planetary systems with the innermost planet at a semi-major axis of 1 au.

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2401.04997 [pdf, other]

Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis

Authors: Lanling Xu, Junjie Zhang, Bingqian Li, Jinpeng Wang, Mingchen Cai, Wayne Xin Zhao, Ji-Rong Wen

Abstract: Recently, large language models such as ChatGPT have showcased remarkable abilities in solving general tasks, demonstrating the potential for applications in recommender systems. To assess how effectively LLMs can be used in recommendation tasks, our study primarily focuses on employing LLMs as recommender systems through prompt engineering. We propose a general framework for utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders. To conduct our analysis, we formalize the input of LLMs for recommendation into natural language prompts with two key aspects, and explain how our framework can be generalized to various recommendation scenarios. As for the use of LLMs as recommenders, we analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task descriptions, user interest modeling, candidate items construction and prompting strategies. In each section, we first define and categorize concepts in line with the existing literature. Then, we propose inspiring research questions followed by experiments to systematically analyze the impact of different factors on two public datasets. Finally, we summarize promising directions to shed light on future research.

Submitted 10 January, 2024; originally announced January 2024.

Comments: 40 pages, under review

arXiv:2401.01731 [pdf, other]

Extracting double-quantum coherence in two-dimensional electronic spectroscopy under pump-probe geometry

Authors: Mao-Rui Cai, Xue Zhang, Zi-Qian Cheng, Teng-Fei Yan, Hui Dong

Abstract: Two-dimensional electronic spectroscopy (2DES) can be implemented with different geometries, e.g., BOXCARS, collinear and pump-probe geometries. The pump-probe geometry has the advantage of overlapping only two beams and reducing phase cycling steps. However, its applications are typically limited to observing the dynamics with single-quantum coherence and population, leaving the challenge to measure the dynamics of the double-quantum (2Q) coherence, which reflects the many-body interactions. We propose an experimental technique in 2DES under pump-probe geometry with a designed pulse sequence and the signal processing method to extract 2Q coherence. In the designed pulse sequence with the probe pulse arriving earlier than pump pulses, our measured signal includes the 2Q signal as well as the zero-quantum (0Q) signal. With phase cycling and the data processing using causality enforcement, we extract the 2Q signal. The proposal is demonstrated with rubidium atoms, and we observe the collective resonances of two-body dipole-dipole interactions of both $D_{1}$ and $D_{2}$ lines.

Submitted 1 March, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: 7 pages, 5 figures

arXiv:2312.15154 [pdf, other]

Completions to Discrete Probability Distributions in Log-linear Models

Authors: May Cai, Cecilie Olesen Recke, Thomas Yahl

Abstract: Completion problems, of recovering a point from a set of observed coordinates, are abundant in applications to image reconstruction, phylogenetics, and data science. We consider a completion problem coming from algebraic statistics: to describe the completions of a point to a probability distribution lying in a given log-linear model. When there are finitely many completions, we show that these points either have a unique completion or two completions to the log-linear model depending on the set of observed coordinates. We additionally describe the region of points which have a completion to the log-linear model.

Submitted 13 February, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: This work was conducted as a part of the Algebraic Statistics and Our Changing World Program hosted by the Institute for Mathematical and Statistical Innovation (IMSI). 21 pages, 5 figures

arXiv:2312.10897 [pdf, other]

Generalized Category Discovery with Large Language Models in the Loop

Authors: Wenbin An, Wenkai Shi, Feng Tian, Haonan Lin, QianYing Wang, Yaqiang Wu, Mingxiang Cai, Luyan Wang, Yan Chen, Haiping Zhu, Ping Chen

Abstract: Generalized Category Discovery (GCD) is a crucial task that aims to recognize both known and novel categories from a set of unlabeled data by utilizing a few labeled data with only known categories. Due to the lack of supervision and category information, current methods usually perform poorly on novel categories and struggle to reveal semantic meanings of the discovered clusters, which limits their applications in the real world. To mitigate the above issues, we propose Loop, an end-to-end active-learning framework that introduces Large Language Models (LLMs) into the training loop, which can boost model performance and generate category names without relying on any human effort. Specifically, we first propose Local Inconsistent Sampling (LIS) to select samples that have a higher probability of falling into wrong clusters, based on neighborhood prediction consistency and entropy of cluster assignment probabilities. Then we propose a Scalable Query strategy to allow LLMs to choose true neighbors of the selected samples from multiple candidate samples. Based on the feedback from LLMs, we perform Refined Neighborhood Contrastive Learning (RNCL) to pull samples and their neighbors closer to learn clustering-friendly representations. Finally, we select representative samples from clusters corresponding to novel categories to allow LLMs to generate category names for them. Extensive experiments on three benchmark datasets show that Loop outperforms SOTA models by a large margin and generates accurate category names for the discovered clusters. Code and data are available at https://github.com/Lackel/LOOP.

Submitted 26 May, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: Accepted by ACL 2024 Findings, code and data are available at https://github.com/Lackel/LOOP

arXiv:2312.08153 [pdf, other]

$ρ$-Diffusion: A diffusion-based density estimation framework for computational physics

Authors: Maxwell X. Cai, Kin Long Kelvin Lee

Abstract: In physics, density $ρ(\cdot)$ is a fundamentally important scalar function to model, since it describes a scalar field or a probability density function that governs a physical process. Modeling $ρ(\cdot)$ typically scales poorly with parameter space, however, and quickly becomes prohibitively difficult and computationally expensive. One promising avenue to bypass this is to leverage the capabilities of denoising diffusion models often used in high-fidelity image generation to parameterize $ρ(\cdot)$ from existing scientific data, from which new samples can be trivially drawn. In this paper, we propose $ρ$-Diffusion, an implementation of denoising diffusion probabilistic models for multidimensional density estimation in physics, which is currently in active development and, from our results, performs well on physically motivated 2D and 3D density functions. Moreover, we propose a novel hashing technique that allows $ρ$-Diffusion to be conditioned by arbitrary amounts of physical parameters of interest.

Submitted 13 December, 2023; originally announced December 2023.

Comments: 6 pages, 2 figures, accepted for publication at the NeurIPS 2023 workshop "Machine Learning and the Physical Sciences"

arXiv:2312.00784 [pdf, other]

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Authors: Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee

Abstract: While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.

Submitted 26 April, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Accepted to CVPR2024. Project page: https://vip-llava.github.io/

arXiv:2311.17318 [pdf]

Impact of Indoor Mobility Behavior on the Respiratory Infectious Diseases Transmission Trends

Authors: Ziwei Cui, Ming Cai, Zheng Zhu, Gongbo Chen, Yao Xiao

Abstract: The importance of indoor human mobility in the transmission dynamics of respiratory infectious diseases has been acknowledged. Previous studies have predominantly addressed a single type of mobility behavior such as queueing and a series of behaviors under specific scenarios. However, these studies ignore the abstraction of mobility behavior in various scenes and the critical examination of how these abstracted behaviors impact disease propagation. To address these problems, this study considers people's mobility behaviors in a general scenario, abstracting them into two main categories: crowding behavior, related to the spatial aspect, and stopping behavior, related to the temporal aspect. Accordingly, this study investigates their impacts on disease spreading and the impact of individual spatio-temporal distribution resulting from these mobility behaviors on epidemic transmission. First, a point of interest (POI) method is introduced to quantify the crowding-related spatial POI factors (i.e., the number of crowdings and the distance between crowdings) and stopping-related temporal POI factors (i.e., the number of stoppings and the duration of each stopping). Besides, a personal space determined with Voronoi diagrams is used to construct the individual spatio-temporal distribution factor. Second, two indicators (i.e., the daily number of new cases and the average exposure risk of people) are applied to quantify epidemic transmission. These indicators are derived from a fundamental model which accurately predicts disease transmission between moving individuals. Third, a set of 200 indoor scenarios is constructed and simulated to help determine variable values. Concurrently, the influences and underlying mechanisms of these behavioral factors on disease transmission are examined using structural equation modeling and causal inference modeling......

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.01487 [pdf, other]

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Authors: Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, Ji-Rong Wen

Abstract: Visual instruction tuning is an essential approach to improving the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). A surge of visual instruction datasets with various focuses and characteristics have been proposed recently, enabling MLLMs to achieve surprising results on evaluation benchmarks. To develop more capable MLLMs, in this paper, we aim to investigate a more fundamental question: "what makes for good visual instructions?". By conducting a comprehensive empirical study, we find that instructions focused on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs on evaluation benchmarks. Building upon this finding, we design a systematic approach to automatically creating high-quality complex visual reasoning instructions. Our approach employs a synthesis-complication-reformulation paradigm, leveraging multiple stages to gradually increase the complexity of the instructions while guaranteeing quality. Based on this approach, we create the synthetic visual reasoning instruction dataset consisting of 32K examples, namely ComVint, and fine-tune four MLLMs on it. Experimental results demonstrate that our dataset consistently enhances the performance of all the compared MLLMs, e.g., improving the performance of MiniGPT-4 and BLIP-2 on MME-Cognition by 32.6% and 28.8%, respectively. Our code and data are publicly available at the link: https://github.com/RUCAIBox/ComVint.

Submitted 2 November, 2023; originally announced November 2023.

Comments: Work in progress

arXiv:2310.20398 [pdf, other]

doi 10.1016/j.jcp.2023.112596

A hybrid approach for solving the gravitational N-body problem with Artificial Neural Networks

Authors: Veronica Saz Ulibarrena, Philipp Horn, Simon Portegies Zwart, Elena Sellentin, Barry Koren, Maxwell X. Cai

Abstract: Simulating the evolution of the gravitational N-body problem becomes extremely computationally expensive as N increases since the problem complexity scales quadratically with the number of bodies. We study the use of Artificial Neural Networks (ANNs) to replace expensive parts of the integration of planetary systems. Neural networks that include physical knowledge have grown in popularity in the last few years, although few attempts have been made to use them to speed up the simulation of the motion of celestial bodies. We study the advantages and limitations of using Hamiltonian Neural Networks to replace computationally expensive parts of the numerical simulation. We compare the results of the numerical integration of a planetary system with asteroids with those obtained by a Hamiltonian Neural Network and a conventional Deep Neural Network, with special attention to understanding the challenges of this problem. Due to the non-linear nature of the gravitational equations of motion, errors in the integration propagate. To increase the robustness of a method that uses neural networks, we propose a hybrid integrator that evaluates the prediction of the network and replaces it with the numerical solution if considered inaccurate. Hamiltonian Neural Networks can make predictions that resemble the behavior of symplectic integrators but are challenging to train and in our case fail when the inputs differ by ~7 orders of magnitude. In contrast, Deep Neural Networks are easy to train but fail to conserve energy, leading to fast divergence from the reference solution. The hybrid integrator designed to include the neural networks increases the reliability of the method and prevents large energy errors without increasing the computing cost significantly. For this problem, the use of neural networks results in faster simulations when the number of asteroids is >70.

Submitted 31 October, 2023; originally announced October 2023.

Comments: Accepted for publication in the Journal of Computational Physics
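
Note: the hybrid integrator described in the abstract accepts a network prediction only when it is judged accurate and otherwise falls back to the numerical solution. The sketch below illustrates that control flow on a toy one-dimensional harmonic oscillator. Everything specific here is an assumption rather than the paper's method: the acceptance test (relative energy error against a tolerance `tol`), the leapfrog fallback, and `sloppy_surrogate`, a forward-Euler stand-in for a trained network.

```python
def energy(q, p):
    """Energy of a unit-mass, unit-frequency harmonic oscillator."""
    return 0.5 * p * p + 0.5 * q * q


def leapfrog_step(q, p, dt):
    """Reference numerical integrator (kick-drift-kick leapfrog)."""
    p_half = p - 0.5 * dt * q
    q_new = q + dt * p_half
    p_new = p_half - 0.5 * dt * q_new
    return q_new, p_new


def sloppy_surrogate(q, p, dt):
    """Hypothetical stand-in for a neural network: a forward-Euler step,
    which inflates the energy by a factor (1 + dt**2) on every call."""
    return q + dt * p, p - dt * q


def hybrid_step(q, p, dt, surrogate, tol=1e-3):
    """Try the surrogate first; fall back to leapfrog if the relative
    energy error of its prediction exceeds `tol` (assumed criterion)."""
    q_s, p_s = surrogate(q, p, dt)
    e0 = energy(q, p)
    rel_err = abs(energy(q_s, p_s) - e0) / max(abs(e0), 1e-300)
    if rel_err <= tol:
        return q_s, p_s, True    # prediction accepted
    q_n, p_n = leapfrog_step(q, p, dt)
    return q_n, p_n, False       # prediction rejected, numerical step used


# A large step makes the surrogate too inaccurate, so it is rejected;
# a small step keeps its energy error within tolerance, so it is accepted.
_, _, accepted_large_dt = hybrid_step(1.0, 0.0, 0.1, sloppy_surrogate)
_, _, accepted_small_dt = hybrid_step(1.0, 0.0, 0.01, sloppy_surrogate)
```

In this toy setting the check costs one extra energy evaluation per step, which reflects the trade-off named in the abstract: reliability is improved without a large increase in computing cost.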

arXiv:2310.15457 [pdf, ps, other]

An Unconditionally Stable Iterative Decoupled Algorithm for Multiple-Network Poroelasticity

Authors: Meng Lei, Mingchao Cai, Feng Wang

Abstract: In this work, we introduce an iterative decoupled algorithm designed for addressing the quasi-static multiple-network poroelasticity problem. This problem pertains to the simultaneous modeling of fluid flow and deformations within an elastic porous medium permeated by multiple fluid networks, each with distinct characteristics. Our approach focuses on the total-pressure-based formulation, which treats the solid displacement, total pressure, and network pressures as primary unknowns. This formulation transforms the original problem into a combination of the generalized Stokes problem and the parabolic problem, offering certain advantages such as mitigating elastic locking effects and streamlining the discretization process. Notably, the algorithm ensures unconditional convergence to the solution of the total-pressure-based coupled algorithm. To validate the accuracy and efficiency of our method, we present numerical experiments. The robustness of the algorithm with respect to the physical parameters and the discretization parameters is carefully investigated.

Submitted 27 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: to be submitted

arXiv:2310.13610 [pdf, other]

Make Your Decision Convincing! A Unified Two-Stage Framework: Self-Attribution and Decision-Making

Authors: Yanrui Du, Sendong Zhao, Haochun Wang, Yuhan Chen, Rui Bai, Zewen Qiang, Muzhen Cai, Bing Qin

Abstract: Explaining black-box model behavior with natural language has achieved impressive results in various NLP tasks. Recent research has explored the utilization of subsequences from the input text as a rationale, providing users with evidence to support the model decision. Although existing frameworks excel in generating high-quality rationales while achieving high task performance, they neglect to account for the unreliable link between the generated rationale and model decision. In simpler terms, a model may make correct decisions while attributing wrong rationales, or make poor decisions while attributing correct rationales. To mitigate this issue, we propose a unified two-stage framework known as Self-Attribution and Decision-Making (SADM). Through extensive experiments on five reasoning datasets from the ERASER benchmark, we demonstrate that our framework not only establishes a more reliable link between the generated rationale and model decision but also achieves competitive results in task performance and the quality of rationale. Furthermore, we explore the potential of our framework in semi-supervised scenarios.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.05405 [pdf, other]

Subthreshold production of $J/ψ$ mesons from the deuteron with SoLID

Authors: T. Liu, Z. W. Zhao, M. Cai, D. Byer, H. Gao

Abstract: The electro- and photo-production of $J/ψ$ meson near the threshold from the proton is relevant to the search of hidden charm pentaquark candidates reported by the LHCb collaboration, and the study of the QCD trace anomaly's contribution to the proton mass. It is also expected to be sensitive to the QCD van der Waals interaction, that is mediated by multi-gluon exchanges and expected to dominate the interaction between two hadrons with no common valence quarks. Subthreshold production of $J/ψ$ from a nuclear target is expected to enhance such attractive interaction, and also allows for a direct probe of short range correlations inside a nucleus. With the high luminosity capability of the 12-GeV CEBAF facility at Jefferson Lab, high-precision data on $J/ψ$ meson production from the proton is becoming available, providing also a reference for subthreshold $J/ψ$ production from the deuteron. Data from the deuteron will establish the baseline for subthreshold $J/ψ$ production from other nuclear targets. In this paper, we present our findings from a feasibility study of subthreshold $J/ψ$ production from the deuteron using the proposed Solenoidal Large Intensity Device (SoLID), and discuss the potential physics impact of such data.

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2310.05036 [pdf, other]

AvalonBench: Evaluating LLMs Playing the Game of Avalon

Authors: Jonathan Light, Min Cai, Sheng Shen, Ziniu Hu

Abstract: In this paper, we explore the potential of Large Language Models (LLMs) Agents in playing the strategic social deduction game, Resistance Avalon. Players in Avalon are challenged not only to make informed decisions based on dynamically evolving game phases, but also to engage in discussions where they must deceive, deduce, and negotiate with other players. These characteristics make Avalon a compelling test-bed to study the decision-making and language-processing capabilities of LLM Agents. To facilitate research in this line, we introduce AvalonBench - a comprehensive game environment tailored for evaluating multi-agent LLM Agents. This benchmark incorporates: (1) a game environment for Avalon, (2) rule-based bots as baseline opponents, and (3) ReAct-style LLM agents with tailored prompts for each role. Notably, our evaluations based on AvalonBench highlight a clear capability gap. For instance, models like ChatGPT playing good-role got a win rate of 22.2% against rule-based bots playing evil, while good-role bot achieves 38.2% win rate in the same setting. We envision AvalonBench could be a good test-bed for developing more advanced LLMs (with self-playing) and agent frameworks that can effectively model the layered complexities of such game environments.

Submitted 8 November, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

Showing 1–50 of 291 results for author: Cai, M