-
SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR
Authors:
Shuaishuai Ye,
Shunfei Chen,
Xinhui Hu,
Xinkang Xu
Abstract:
In this work, we propose a Switch-Conformer-based MoE system named SC-MoE for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR), where we design a streaming MoE layer consisting of three language experts, which correspond to Mandarin, English, and blank, respectively, and equipped with a language identification (LID) network with a Connectionist Temporal Classification (CTC) loss as a router in the encoder of SC-MoE to achieve a real-time streaming CS ASR system. To further utilize the language information embedded in text, we also incorporate MoE layers into the decoder of SC-MoE. In addition, we introduce routers into every MoE layer of the encoder and the decoder and achieve better recognition performance. Experimental results show that the SC-MoE significantly improves CS ASR performances over baseline with comparable computational efficiency.
Submitted 25 June, 2024;
originally announced June 2024.
-
Indications of superconductivities in blend of variant apatite and covellite
Authors:
Hongyang Wang,
Yijing Zhao,
Hao Wu,
Ling Wang,
Zhixing Wu,
Zhihui Geng,
Jiewen Xiao,
Weiwei Xue,
Shufeng Ye,
Ning Chen,
Xianfeng Qiao,
Yao Yao
Abstract:
Through heavily doping sulfur into an apatite framework, we synthesize a new blend mainly comprising variant apatite and covellite (copper sulfide). Magnetic measurement exhibits that significant diamagnetism appears at around 260 K and drops dramatically below 30 K implying coexistence of two superconducting phases. The upper critical magnetic field is larger than 1000 Oe at 250 K. Electric measurement manifests that the current-voltage curves deviate from the normal linear lineshape suggesting the presence of zero-resistance effect, and the critical current is around 50 $μ$A at 140 K. These exotic magnetic and electric features strongly indicate these two components, variant apatite and covellite, individually trigger two superconducting phases at near-room and low temperatures.
Submitted 25 June, 2024;
originally announced June 2024.
-
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Authors:
Hoyeon Chang,
Jinho Park,
Seonghyeon Ye,
Sohee Yang,
Youngkyung Seo,
Du-Seong Chang,
Minjoon Seo
Abstract:
Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
Submitted 17 June, 2024;
originally announced June 2024.
-
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Authors:
Qingyun Li,
Zhe Chen,
Weiyun Wang,
Wenhai Wang,
Shenglong Ye,
Zhenjiang Jin,
Guanzhou Chen,
Yinan He,
Zhangwei Gao,
Erfei Cui,
Jiashuo Yu,
Hao Tian,
Jiasheng Zhou,
Chao Xu,
Bin Wang,
Xingjian Wei,
Wei Li,
Wenjian Zhang,
Bo Zhang,
Pinlong Cai,
Licheng Wen,
Xiangchao Yan,
Zhenxiang Li,
Pei Chu,
Yi Wang
, et al. (15 additional authors not shown)
Abstract:
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
Submitted 13 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Authors:
Seungone Kim,
Juyoung Suk,
Ji Yong Cho,
Shayne Longpre,
Chaeeun Kim,
Dongkeun Yoon,
Guijin Son,
Yejin Cho,
Sheikh Shafayat,
Jinheon Baek,
Sue Hyun Park,
Hyeonbin Hwang,
Jinkyung Jo,
Hyowon Cho,
Haebin Shin,
Seongyun Lee,
Hanseok Oh,
Noah Lee,
Namgyu Ho,
Se June Joo,
Miyoung Ko,
Yoonjoo Lee,
Hyungjoo Chae,
Jamin Shin,
Joel Jang
, et al. (7 additional authors not shown)
Abstract:
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 103 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval/tree/main/BiGGen-Bench.
Submitted 9 June, 2024;
originally announced June 2024.
-
Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference
Authors:
Shengyuan Ye,
Jiangsu Du,
Liekang Zeng,
Wenzhong Ou,
Xiaowen Chu,
Yutong Lu,
Xu Chen
Abstract:
Transformer-based models have unlocked a plethora of powerful intelligent applications at the edge, such as voice assistant in smart home. Traditional deployment approaches offload the inference workloads to the remote cloud server, which would induce substantial pressure on the backbone network as well as raise users' privacy concerns. To address that, in-situ inference has been recently recognized for edge intelligence, but it still confronts significant challenges stemming from the conflict between intensive workloads and limited on-device computing resources. In this paper, we leverage our observation that many edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources and propose Galaxy, a collaborative edge AI system that breaks the resource walls across heterogeneous edge devices for efficient Transformer inference acceleration. Galaxy introduces a novel hybrid model parallelism to orchestrate collaborative inference, along with a heterogeneity-aware parallelism planning for fully exploiting the resource potential. Furthermore, Galaxy devises a tile-based fine-grained overlapping of communication and computation to mitigate the impact of tensor synchronizations on inference latency under bandwidth-constrained edge environments. Extensive evaluation based on prototype implementation demonstrates that Galaxy remarkably outperforms state-of-the-art approaches under various edge environment setups, achieving up to 2.5x end-to-end latency reduction.
Submitted 27 May, 2024;
originally announced May 2024.
-
The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge
Authors:
Jingguang Tian,
Shuaishuai Ye,
Shunfei Chen,
Yang Xiang,
Zhaohui Yin,
Xinhui Hu,
Xinkang Xu
Abstract:
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58\% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93\% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88\% on the track 2 evaluation set.
Submitted 8 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Deep Space Separable Distillation for Lightweight Acoustic Scene Classification
Authors:
ShuQi Ye,
Yuan Tian
Abstract:
Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are currently not lightweight enough as well as their performance is not satisfactory. To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC, including Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to the currently popular deep learning methods, while also having smaller parameter count and computational complexity.
Submitted 6 May, 2024;
originally announced May 2024.
-
Monetary Policies on Green Financial Markets: Evidence from a Multi-Moment Connectedness Network
Authors:
Tingguo Zheng,
Hongyin Zhang,
Shiqi Ye
Abstract:
This paper introduces a novel multi-moment connectedness network approach for analyzing the interconnectedness of green financial market. Focusing on the impact of monetary policy shocks, our study reveals that connectedness within the green bond and equity markets varies with different moments (returns, volatility, skewness, and kurtosis) and changes significantly around Federal Open Market Committee (FOMC) events. Static analysis shows a decrease in connectedness with higher moments, while dynamic analysis highlights increased sensitivity to event-driven shocks. We find that both tight and loose monetary policy shocks initially elevate connectedness within the first six months. However, the effects of tight shocks gradually fade, whereas loose shocks may reduce connectedness after one year. These results offer insight to policymakers in regulating sustainable economies and investment managers in strategizing asset allocation and risk management, especially in environmentally focused markets. Our study contributes to understanding the complex dynamics of the green financial market in response to monetary policies, helping in decision-making for sustainable economic development and financial stability.
Submitted 4 May, 2024;
originally announced May 2024.
-
Implementation of Big AI Models for Wireless Networks with Collaborative Edge Computing
Authors:
Liekang Zeng,
Shengyuan Ye,
Xu Chen,
Yang Yang
Abstract:
Big Artificial Intelligence (AI) models have emerged as a crucial element in various intelligent applications at the edge, such as voice assistants in smart homes and autonomous robotics in smart factories. Training big AI models, e.g., for personalized fine-tuning and continual model refinement, poses significant challenges to edge devices due to the inherent conflict between limited computing resources and intensive workload associated with training. Despite the constraints of on-device training, traditional approaches usually resort to aggregating training data and sending it to a remote cloud for centralized training. Nevertheless, this approach is neither sustainable, which strains long-range backhaul transmission and energy-consuming datacenters, nor safely private, which shares users' raw data with remote infrastructures. To address these challenges, we alternatively observe that prevalent edge environments usually contain a diverse collection of trusted edge devices with untapped idle resources, which can be leveraged for edge training acceleration. Motivated by this, in this article, we propose collaborative edge training, a novel training mechanism that orchestrates a group of trusted edge devices as a resource pool for expedited, sustainable big AI model training at the edge. As an initial step, we present a comprehensive framework for building collaborative edge training systems and analyze in-depth its merits and sustainable scheduling choices following its workflow. To further investigate the impact of its parallelism design, we empirically study a case of four typical parallelisms from the perspective of energy demand with realistic testbeds. Finally, we discuss open challenges for sustainable collaborative edge training to point to future directions of edge-centric big AI model training.
Submitted 26 April, 2024;
originally announced April 2024.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Authors:
Zhe Chen,
Weiyun Wang,
Hao Tian,
Shenglong Ye,
Zhangwei Gao,
Erfei Cui,
Wenwen Tong,
Kongzhi Hu,
Jiapeng Luo,
Zheng Ma,
Ji Ma,
Jiaqi Wang,
Xiaoyi Dong,
Hang Yan,
Hewei Guo,
Conghui He,
Botian Shi,
Zhenjiang Jin,
Chao Xu,
Bin Wang,
Xingjian Wei,
Wei Li,
Wenjian Zhang,
Bo Zhang,
Pinlong Cai
, et al. (10 additional authors not shown)
Abstract:
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
Submitted 29 April, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks
Authors:
Changho Lee,
Janghoon Han,
Seonghyeon Ye,
Stanley Jungkyu Choi,
Honglak Lee,
Kyunghoon Bae
Abstract:
Instruction tuning has shown its ability to not only enhance zero-shot generalization across various tasks but also its effectiveness in improving the performance of specific tasks. A crucial aspect in instruction tuning for a particular task is a strategic selection of related tasks that offer meaningful supervision, thereby enhancing efficiency and preventing performance degradation from irrelevant tasks. Our research reveals that leveraging instruction information \textit{alone} enables the identification of pertinent tasks for instruction tuning. This approach is notably simpler compared to traditional methods that necessitate complex measurements of pairwise transferability between tasks or the creation of data samples for the target task. Furthermore, by additionally learning the unique instructional template style of the meta-dataset, we observe an improvement in task selection accuracy, which contributes to enhanced overall performance. Experimental results demonstrate that training on a small set of tasks, chosen solely based on the instructions, leads to substantial performance improvements on benchmarks like P3, Big-Bench, NIV2, and Big-Bench Hard. Significantly, these improvements exceed those achieved by prior task selection methods, highlighting the efficacy of our approach.
Submitted 25 April, 2024;
originally announced April 2024.
-
Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards
Authors:
Hyeonbin Hwang,
Doyoung Kim,
Seungone Kim,
Seonghyeon Ye,
Minjoon Seo
Abstract:
Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs). However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study the problem of whether LLMs could self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test set, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
Submitted 16 May, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Knowledge-Reuse Transfer Learning Methods in Molecular and Material Science
Authors:
An Chen,
Zhilong Wang,
Karl Luigi Loza Vidaurre,
Yanqiang Han,
Simin Ye,
Kehao Tao,
Shiwei Wang,
Jing Gao,
Jinjin Li
Abstract:
Molecules and materials are the foundation for the development of modern advanced industries such as energy storage systems and semiconductor devices. However, traditional trial-and-error methods or theoretical calculations are highly resource-intensive, and extremely long R&D (Research and Development) periods cannot meet the urgent need for molecules/materials in industrial development. Machine learning (ML) methods based on big data are expected to break this dilemma. However, the difficulty in constructing large-scale datasets of new molecules/materials due to the high cost of data acquisition and annotation limits the development of machine learning. The application of transfer learning lowers the data requirements for model training, which makes transfer learning stand out in researches addressing data quality issues. In this review, we summarize recent advances in transfer learning related to molecular and materials science. We focus on the application of transfer learning methods for the discovery of advanced molecules/materials, particularly, the construction of transfer learning frameworks for different systems, and how transfer learning can enhance the performance of models. In addition, the challenges of transfer learning are also discussed.
Submitted 2 March, 2024;
originally announced March 2024.
-
Observation of diamagnetic strange-metal phase in sulfur-copper codoped lead apatite
Authors:
Hongyang Wang,
Hao Wu,
Ning Chen,
Xianfeng Qiao,
Ling Wang,
Zhixing Wu,
Zhihui Geng,
Weiwei Xue,
Shufeng Ye,
Yao Yao
Abstract:
By codoping sulfur and copper into lead apatite, the crystal grains are directionally stacked and the room-temperature resistivity is reduced from insulating to $2\times10^{-5}~Ω\cdot$m. The resistance-temperature curve exhibits a nearly linear relationship at low temperature suggesting the presence of strange-metal phase, and a second-order phase transition is then observed at around 230~K during cooling the samples. A possible Meissner effect is present in dc magnetic measurements. Further hydrothermal lead-free synthesis results in smaller resistance and stronger diamagnetism, demonstrating the essential component might be sulfur-substituted copper apatite and the alkalis matter as well. A clear pathway towards superconductivity in this material is subsequently benchmarked.
Submitted 6 May, 2024; v1 submitted 17 March, 2024;
originally announced March 2024.
-
Efficient Trajectory Forecasting and Generation with Conditional Flow Matching
Authors:
Sean Ye,
Matthew Gombolay
Abstract:
Trajectory prediction and generation are vital for autonomous robots navigating dynamic environments. While prior research has typically focused on either prediction or generation, our approach unifies these tasks to provide a versatile framework and achieve state-of-the-art performance. Diffusion models, which are currently state-of-the-art for learned trajectory generation in long-horizon planning and offline reinforcement learning tasks, rely on a computationally intensive iterative sampling process. This slow process impedes the dynamic capabilities of robotic systems. In contrast, we introduce Trajectory Conditional Flow Matching (T-CFM), a novel data-driven approach that utilizes flow matching techniques to learn a solver time-varying vector field for efficient and fast trajectory generation. We demonstrate the effectiveness of T-CFM on three separate tasks: adversarial tracking, real-world aircraft trajectory forecasting, and long-horizon planning. Our model outperforms state-of-the-art baselines with an increase of 35% in predictive accuracy and 142% increase in planning performance. Notably, T-CFM achieves up to 100$\times$ speed-up compared to diffusion-based models without sacrificing accuracy, which is crucial for real-time decision making in robotics.
Submitted 16 March, 2024;
originally announced March 2024.
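The flow-matching idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a straight-line probability path between a noise sample and a data trajectory, which is a common choice in the flow matching literature; the abstract does not say which path, network, or solver T-CFM uses. The names `cfm_training_pair`, `cfm_loss`, and `euler_sample` are hypothetical and trajectories are flattened to plain lists of floats.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the straight-line path from x0 (noise) to x1 (data) at time t,
    together with the regression target for the vector field.

    Assumption: linear interpolation path, so the target velocity is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(model, x0, x1, t):
    """Mean squared error between the model's predicted velocity and the target."""
    x_t, u_t = cfm_training_pair(x0, x1, t)
    pred = model(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, u_t)) / len(u_t)


def euler_sample(vector_field, x0, n_steps=10):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1
    with fixed-step forward Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = vector_field(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch, sampling costs `n_steps` evaluations of the learned field, whereas a diffusion sampler typically needs many more denoising iterations; that difference is the kind of cost the abstract's speed-up claim refers to.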
-
Diffusion-Reinforcement Learning Hierarchical Motion Planning in Adversarial Multi-agent Games
Authors:
Zixuan Wu,
Sean Ye,
Manisha Natarajan,
Matthew C. Gombolay
Abstract:
Reinforcement Learning- (RL-)based motion planning has recently shown the potential to outperform traditional approaches from autonomous navigation to robot manipulation. In this work, we focus on a motion planning task for an evasive target in a partially observable multi-agent adversarial pursuit-evasion game (PEG). These pursuit-evasion problems are relevant to various applications, such as search and rescue operations and surveillance robots, where robots must effectively plan their actions to gather intelligence or accomplish mission tasks while avoiding detection or capture themselves. We propose a hierarchical architecture that integrates a high-level diffusion model to plan global paths responsive to environment data while a low-level RL algorithm reasons about evasive versus global path-following behavior. Our approach outperforms baselines by 51.2% by leveraging the diffusion model to guide the RL algorithm for more efficient exploration, and it improves explainability and predictability.
Submitted 15 March, 2024;
originally announced March 2024.
-
MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts
Authors:
Zinan Zeng,
Sen Ye,
Zijian Cai,
Heng Wang,
Yuhan Liu,
Haokai Zhang,
Minnan Luo
Abstract:
Online movie review websites are valuable for information and discussion about movies. However, the large number of spoiler reviews detracts from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews' text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user's information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta features from the user-movie network, the review's textual content, and the review's metadata, respectively. To handle genre-specific spoilers, we then adopt a Mixture-of-Experts architecture to process information in the three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56% and 8.41% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE's superiority in robustness and generalization.
Submitted 13 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Scalable Community Search with Accuracy Guarantee on Attributed Graphs
Authors:
Yuxiang Wang,
Shuzhan Ye,
Xiaoliang Xu,
Yuxia Geng,
Zhenghe Zhao,
Xiangyu Ke,
Tianxing Wu
Abstract:
Given an attributed graph $G$ and a query node $q$, \underline{C}ommunity \underline{S}earch over \underline{A}ttributed \underline{G}raphs (CS-AG) aims to find a structure- and attribute-cohesive subgraph from $G$ that contains $q$. Although CS-AG has been widely studied, existing methods still face three challenges. (1) Exact methods based on graph traversal are time-consuming, especially for large graphs. Some tailored indices can improve efficiency, but introduce nonnegligible storage and maintenance overhead. (2) Approximate methods with a loose approximation ratio only provide a coarse-grained evaluation of a community's quality, rather than a reliable evaluation with an accuracy guarantee at runtime. (3) Attribute cohesiveness metrics often ignore the important correlation with the query node $q$. We formally define our CS-AG problem atop a $q$-centric attribute cohesiveness metric considering both textual and numerical attributes, for the $k$-core model on homogeneous graphs. We show the problem is NP-hard. To solve it, we first propose an exact baseline with three pruning strategies. Then, we propose an index-free sampling-estimation-based method to quickly return an approximate community with an accuracy guarantee, in the form of a confidence interval. Once a good result satisfying a user-desired error bound is reached, we terminate it early. We extend it to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Comprehensive experimental studies on ten real-world datasets show its superiority, e.g., at least 1.54$\times$ (41.1$\times$ on average) faster in response time, and a reliable relative error (within a user-specified error bound) of attribute cohesiveness is achieved.
Submitted 29 February, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
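The "sample, estimate, stop once the error bound is met" loop described in this abstract can be sketched generically. The example below is not the paper's estimator: it assumes a normal-approximation confidence interval on the mean of independently drawn scores, and the function name `estimate_with_early_stop`, the `draw` callback, and all default parameters are hypothetical. Note that checking the interval after every batch weakens its nominal coverage, so this is an illustration of the control flow rather than a rigorous guarantee.

```python
import math
import random
import statistics


def estimate_with_early_stop(draw, error_bound, z=1.96, batch=30, max_samples=10_000):
    """Estimate the mean of `draw()` and stop as soon as the confidence-interval
    half-width is at most `error_bound`.

    draw        -- zero-argument callable returning one sampled score
    error_bound -- user-desired half-width of the confidence interval
    z           -- normal quantile (1.96 corresponds to roughly 95% confidence)
    batch       -- samples added between checks (must be at least 2)
    Returns (mean, half_width, number_of_samples_used).
    """
    samples = []
    while len(samples) < max_samples:
        samples.extend(draw() for _ in range(batch))
        mean = statistics.fmean(samples)
        half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
        if half_width <= error_bound:
            break  # early termination: the interval is already tight enough
    return mean, half_width, len(samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    mean, half_width, n = estimate_with_early_stop(rng.random, error_bound=0.05)
    print(f"estimate {mean:.3f} +/- {half_width:.3f} after {n} samples")
```

A tighter `error_bound` costs more samples (roughly quadratically), which is the trade-off between response time and accuracy that a user-specified bound exposes.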
-
INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models
Authors:
Hanseok Oh,
Hyunji Lee,
Seonghyeon Ye,
Haebin Shin,
Hansol Jang,
Changwook Jun,
Minjoon Seo
Abstract:
Despite the critical need to align search targets with users' intention, retrievers often only prioritize query information without delving into the users' intended search context. Enhancing the capability of retrievers to understand intentions and preferences of users, akin to language model instructions, has the potential to yield more aligned search targets. Prior studies restrict the application of instructions in information retrieval to a task description format, neglecting the broader context of diverse and evolving search scenarios. Furthermore, the prevailing benchmarks utilized for evaluation lack explicit tailoring to assess instruction-following ability, thereby hindering progress in this field. In response to these limitations, we propose a novel benchmark, INSTRUCTIR, specifically designed to evaluate instruction-following ability in information retrieval tasks. Our approach focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios. Through experimental analysis, we observe that retrievers fine-tuned to follow task-style instructions, such as INSTRUCTOR, can underperform compared to their non-instruction-tuned counterparts. This underscores potential overfitting issues inherent in constructing retrievers trained on existing instruction-aware retrieval datasets.
Submitted 22 February, 2024;
originally announced February 2024.
-
The Redshift Evolution of the Binary Black Hole Mass Distribution from Dense Star Clusters
Authors:
Claire S. Ye,
Maya Fishbach
Abstract:
Gravitational-wave detectors are unveiling a population of binary black hole (BBH) mergers out to redshifts $z \approx 1$, and are starting to constrain how the BBH population evolves with redshift. We present predictions for the redshift evolution of the BBH mass and spin distributions for systems originating from dense star clusters. Utilizing a grid of 144 state-of-the-art dynamical models for globular clusters, we demonstrate that BBH merger rates peak at higher redshifts for larger black hole primary masses $M_1$. Specifically, for $M_1\gtrsim40\,M_{\odot}$, the BBH merger rate reaches its peak at redshift $z\approx2.1$, while for $M_1\lesssim20\,M_{\odot}$, the peak occurs at $z\approx1.1$, assuming that the cluster formation rate peaks at $z=2.2$. The average BBH primary mass also increases from $\sim 10\,M_{\odot}$ at $z=0$ to $\sim 30\,M_{\odot}$ at $z=10$. We show that $\sim 20\%$ BBHs contain massive remnants from next-generation mergers, with this fraction increasing (decreasing) for larger (smaller) primary masses. This difference is not large enough to significantly alter the effective spins of the BBH population originating from globular clusters, and we find that their effective spin distribution does not evolve across cosmic time. These findings can be used to distinguish BBHs from dense star clusters by future gravitational wave observations.
Submitted 3 June, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
Dual-modal Dynamic Traceback Learning for Medical Report Generation
Authors:
Shuchang Ye,
Mingyuan Meng,
Mingjian Li,
Dagan Feng,
Jinman Kim
Abstract:
With increasing reliance on medical imaging in clinical practices, automated report generation from medical images is in great demand. Existing report generation methods typically adopt an encoder-decoder deep learning framework to build a uni-directional image-to-report mapping. However, such a framework ignores the bi-directional mutual associations between images and reports, thus incurring difficulties in associating the intrinsic medical meanings between them. Recent generative representation learning methods have demonstrated the benefits of dual-modal learning from both image and text modalities. However, these methods exhibit two major drawbacks for medical report generation: 1) they tend to capture morphological information and have difficulties in capturing subtle pathological semantic information, and 2) they predict masked text by relying on both unmasked images and text, inevitably degrading performance when inference is based solely on images. In this study, we propose a new report generation framework with dual-modal dynamic traceback learning (DTrace) to overcome the two identified drawbacks and enable dual-modal learning for medical report generation. To achieve this, our DTrace introduces a traceback mechanism to control the semantic validity of generated content via self-assessment. Further, our DTrace introduces a dynamic learning strategy to adapt to various proportions of image and text input, enabling report generation without reliance on textual input during inference. Extensive experiments on two well-benchmarked datasets (IU-Xray and MIMIC-CXR) show that our DTrace outperforms state-of-the-art medical report generation methods.
Submitted 6 March, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Possible Meissner effect near room temperature in copper-substituted lead apatite
Authors:
Hongyang Wang,
Yao Yao,
Ke Shi,
Yijing Zhao,
Hao Wu,
Zhixing Wu,
Zhihui Geng,
Shufeng Ye,
Ning Chen
Abstract:
With copper-substituted lead apatite below room temperature, we observe diamagnetic dc magnetization under a magnetic field of 25 Oe with remarkable bifurcation between zero-field-cooling and field-cooling measurements, and under 200 Oe the response becomes paramagnetic. A glassy memory effect is found during cooling. Typical hysteresis loops for superconductors are detected below 250 K, along with an asymmetry between forward and backward sweeps of the magnetic field. Our experiment suggests that the Meissner effect is possibly present in this material at room temperature.
Submitted 1 January, 2024;
originally announced January 2024.
-
Fluid Antenna Array Enhanced Over-the-Air Computation
Authors:
Deyou Zhang,
Sicong Ye,
Ming Xiao,
Kezhi Wang,
Marco Di Renzo,
Mikael Skoglund
Abstract:
Over-the-air computation (AirComp) has emerged as a promising technology for fast wireless data aggregation by harnessing the superposition property of wireless multiple-access channels. This paper investigates a fluid antenna (FA) array-enhanced AirComp system, employing the new degrees of freedom achieved by antenna movements. Specifically, we jointly optimize the transceiver design and antenna position vector (APV) to minimize the mean squared error (MSE) between target and estimated function values. To tackle the resulting highly non-convex problem, we adopt an alternating optimization technique to decompose it into three subproblems. These subproblems are then iteratively solved until convergence, leading to a locally optimal solution. Numerical results show that FA arrays with the proposed transceiver and APV design significantly outperform the traditional fixed-position antenna arrays in terms of MSE.
Submitted 23 December, 2023;
originally announced December 2023.
-
Splittings and poly-freeness of triangle Artin groups
Authors:
Xiaolei Wu,
Shengkui Ye
Abstract:
We prove that the triangle Artin group $\mathrm{Art}_{23M}$ splits as a graph of free groups if and only if $M$ is greater than $5$ and even. This answers two questions of Jankiewicz \cite[Question 2.2, Question 2.3]{Jan21} in the negative. Combined with the results of Squier and Jankiewicz, this completely determines when a triangle Artin group splits as a graph of free groups. Furthermore, we prove that the triangle Artin groups are virtually poly-free when the labels are not of the form $(2,3, 2k+1)$ with $k\geq 3$. This partially answers a question of Bestvina \cite{Be99}.
Submitted 14 December, 2023;
originally announced December 2023.
-
Integrating the PanDA Workload Management System with the Vera C. Rubin Observatory
Authors:
Edward Karavakis,
Wen Guan,
Zhaoyu Yang,
Tadashi Maeno,
Torre Wenaus,
Jennifer Adelman-McCarthy,
Fernando Barreiro Megino,
Kaushik De,
Richard Dubois,
Michelle Gower,
Tim Jenness,
Alexei Klimentov,
Tatiana Korchuganova,
Mikolaj Kowalik,
Fa-Hui Lin,
Paul Nilsson,
Sergey Padolski,
Wei Yang,
Shuwei Ye
Abstract:
The Vera C. Rubin Observatory will produce an unprecedented astronomical data set for studies of the deep and dynamic universe. Its Legacy Survey of Space and Time (LSST) will image the entire southern sky every three to four days and produce tens of petabytes of raw image data and associated calibration data over the course of the experiment's run. More than 20 terabytes of data must be stored every night, and annual campaigns to reprocess the entire dataset since the beginning of the survey will be conducted over ten years. The Production and Distributed Analysis (PanDA) system was evaluated by the Rubin Observatory Data Management team and selected to serve the Observatory's needs due to its demonstrated scalability and flexibility over the years, for its Directed Acyclic Graph (DAG) support, its support for multi-site processing, and its highly scalable complex workflows via the intelligent Data Delivery Service (iDDS). PanDA is also being evaluated for prompt processing where data must be processed within 60 seconds after image capture. This paper will briefly describe the Rubin Data Management system and its Data Facilities (DFs). Finally, it will describe in depth the work performed in order to integrate the PanDA system with the Rubin Observatory to be able to run the Rubin Science Pipelines using PanDA.
Submitted 8 December, 2023;
originally announced December 2023.
-
Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models
Authors:
Yujin Kim,
Jaehong Yoon,
Seonghyeon Ye,
Sangmin Bae,
Namgyu Ho,
Sung Ju Hwang,
Se-young Yun
Abstract:
The dynamic nature of knowledge in an ever-changing world presents challenges for language models trained on static data; a model deployed in the real world often needs not only to acquire new knowledge but also to overwrite outdated information with updated information. To study the ability of language models to handle these time-dependent dynamics in human language, we introduce a novel task, EvolvingQA, a temporally evolving question-answering benchmark designed for training and evaluating LMs on an evolving Wikipedia database. The construction of EvolvingQA is automated with our pipeline using large language models. We uncover that existing continual learning baselines struggle with updating and removing outdated knowledge. Our analysis suggests that models fail to rectify knowledge due to small weight gradients. In addition, we elucidate that language models particularly struggle to reflect changes in numerical or temporal information. Our work aims to model the dynamic nature of real-world information, suggesting faithful evaluations of the evolution-adaptability of language models.
Submitted 20 April, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Evaluating Large Language Models in Ophthalmology
Authors:
Jason Holmes,
Shuyuan Ye,
Yiwei Li,
Shi-Nan Wu,
Zhengliang Liu,
Zihao Wu,
Jinyu Hu,
Huan Zhao,
Xi Jiang,
Wei Liu,
Hong Wei,
Jie Zou,
Tianming Liu,
Yi Shao
Abstract:
Purpose: The performance of three different large language models (LLMs) (GPT-3.5, GPT-4, and PaLM2) in answering ophthalmology professional questions was evaluated and compared with that of three different professional populations (medical undergraduates, medical masters, and attending physicians). Methods: A 100-item ophthalmology single-choice test was administered to three different LLMs (GPT-3.5, GPT-4, and PaLM2) and three different professional levels (medical undergraduates, medical masters, and attending physicians). The performance of the LLMs was comprehensively evaluated and compared with the human groups in terms of average score, stability, and confidence. Results: Each LLM outperformed undergraduates in general, with GPT-3.5 and PaLM2 being slightly below the master's level, while GPT-4 showed a level comparable to that of attending physicians. In addition, GPT-4 showed significantly higher answer stability and confidence than GPT-3.5 and PaLM2. Conclusion: Our study shows that LLMs, represented by GPT-4, perform well in the field of ophthalmology. With further improvements, LLMs may bring unexpected benefits to medical education and clinical decision making in the near future.
Submitted 7 November, 2023;
originally announced November 2023.
-
Post-Layout Simulation Driven Analog Circuit Sizing
Authors:
Xiaohan Gao,
Haoyi Zhang,
Siyuan Ye,
Mingjie Liu,
David Z. Pan,
Linxiao Shen,
Runsheng Wang,
Yibo Lin,
Ru Huang
Abstract:
Post-layout simulation provides accurate guidance for analog circuit design, but post-layout performance is hard to optimize directly at early design stages. Prior work on analog circuit sizing often utilizes pre-layout simulation results as the optimization objective. In this work, we propose a post-layout-simulation-driven (post-simulation-driven for short) analog circuit sizing framework that directly optimizes the post-layout simulation performance. The framework integrates automated layout generation into the optimization loop of transistor sizing and leverages a coupled Bayesian optimization algorithm to search for the best post-simulation performance. Experimental results demonstrate that our framework can achieve over 20% better post-layout performance, in competitive time, than manual design and a method that only considers pre-layout optimization.
Submitted 21 October, 2023;
originally announced October 2023.
-
Generative AI May Prefer to Present National-level Characteristics of Cities Based on Stereotypical Geographic Impressions at the Continental Level
Authors:
Shan Ye
Abstract:
A simple experiment was conducted to test the ability of the Chinese-based generative artificial intelligence (AI) platform, Wenxin Yige, to render images of urban street views of different countries. The study found that images generated by this AI platform may contain continental-level stereotypes in terms of showing the level of economic development and modernization. Street view images generated from Wenxin Yige do not adequately represent the diverse range of urban landscapes found across different nations. Using these generated images for geography education or outreach initiatives could inadvertently strengthen people's existing stereotypical views about individual countries.
Submitted 7 October, 2023;
originally announced October 2023.
-
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models
Authors:
Zhiyao Sun,
Tian Lv,
Sheng Ye,
Matthieu Lin,
Jenny Sheng,
Yu-Hui Wen,
Minjing Yu,
Yong-Jin Liu
Abstract:
The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and a user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io .
Submitted 14 May, 2024; v1 submitted 30 September, 2023;
originally announced October 2023.
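For readers unfamiliar with classifier-free guidance, the standard single-condition form combines a conditional and an unconditional denoiser output at each sampling step. The sketch below shows only that generic rule; how DiffPoseTalk combines its two conditions (speech and style) is not stated in the abstract, and the name `cfg_combine` and the list-of-floats representation are assumptions made for illustration.

```python
def cfg_combine(pred_uncond, pred_cond, scale):
    """Standard classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional prediction.

    scale = 0 gives the unconditional output, scale = 1 the conditional one,
    and scale > 1 amplifies the effect of the condition.
    """
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]  # denoiser output with the condition dropped
    cond = [1.0, 3.0]    # denoiser output with the condition supplied
    print(cfg_combine(uncond, cond, 2.0))
```

Larger guidance scales generally trade sample diversity for closer adherence to the condition, so the scale is usually tuned per task.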
-
The dominant mechanism(s) for populating the outskirts of star clusters with neutron star binaries
Authors:
Nathan W. C. Leigh,
Claire S. Ye,
Steffani M. Grondin,
Giacomo Fragione,
Jeremy J. Webb,
Craig O. Heinke
Abstract:
It has been argued that heavy binaries composed of neutron stars (NSs) and millisecond pulsars (MSPs) can end up in the outskirts of star clusters via an interaction with a massive black hole (BH) binary expelling them from the core. We argue here, however, that this mechanism will rarely account for such observed objects. Only for primary masses $\lesssim$ 100 M$_{\odot}$ and a narrow range of orbital separations should a BH-BH binary be both dynamically hard and produce a sufficiently low recoil velocity to retain the NS binary in the cluster. Hence, BH binaries are in general likely to eject NSs from clusters. We explore several alternative mechanisms that would cause NS/MSP binaries to be observed in the outskirts of their host clusters after a Hubble time. The most likely mechanism is a three-body interaction involving the NS/MSP binary and a normal star. We compare to Monte Carlo simulations of cluster evolution for the globular clusters NGC 6752 and 47 Tuc, and show that the models not only confirm that normal three-body interactions involving all stellar-mass objects are the dominant mechanism for putting NS/MSP binaries into the cluster outskirts, they also reproduce the observed NS/MSP binary radial distributions without needing to invoke the presence of a massive BH binary. Higher central densities and an episode of core-collapse can broaden the radial distributions of NSs/MSPs and NS/MSP binaries due to three-body interactions, making these clusters more likely to host NSs in the cluster outskirts.
Submitted 22 September, 2023;
originally announced September 2023.
-
Score Mismatching for Generative Modeling
Authors:
Senmao Ye,
Fei Liu
Abstract:
We propose a new score-based model with one-step sampling. Previously, score-based models were burdened with heavy computations due to iterative sampling. To replace the iterative process, we train a standalone generator to compress all the time steps with the gradient backpropagated from the score network. In order to produce meaningful gradients for the generator, the score network is trained to simultaneously match the real data distribution and mismatch the fake data distribution. This model has the following advantages: 1) For sampling, it generates a fake image with only one forward step. 2) For training, it only needs 10 diffusion steps. 3) Compared with the consistency model, it is free of the ill-posed problem caused by the consistency loss. On the popular CIFAR-10 dataset, our model outperforms Consistency Model and Denoising Score Matching, which demonstrates the potential of the framework. We further provide more examples on the MNIST and LSUN datasets. The code is available on GitHub.
Submitted 19 September, 2023;
originally announced September 2023.
-
Visualizing the Zhang-Rice singlet, molecular orbitals and pair formation in cuprate
Authors:
Shusen Ye,
Jianfa Zhao,
Zhiheng Yao,
Sixuan Chen,
Zehao Dong,
Xintong Li,
Luchuan Shi,
Qingqing Liu,
Changqing Jin,
Yayu Wang
Abstract:
The parent compound of cuprates is a charge-transfer-type Mott insulator with strong hybridization between the Cu $3d_{x^2-y^2}$ and O $2p$ orbitals. A key question concerning the pairing mechanism is the behavior of doped holes in the antiferromagnetic (AF) Mott insulator background, which is a prototypical quantum many-body problem. It was proposed that a doped hole on the O site tends to form a singlet, known as the Zhang-Rice singlet (ZRS), with the unpaired Cu spin. But experimentally little is known about the properties of a single hole and the interplay between holes that leads to superconductivity. Here we use scanning tunneling microscopy to visualize the electronic states in hole-doped $\mathrm{Ca_2CuO_2Cl_2}$, aiming to establish the atomic-scale local basis for pair formation. A single doped hole is shown to have an in-gap state and a clover-shaped spatial distribution that can be attributed to a localized ZRS. When the dopants are close enough, they develop delocalized molecular orbitals with characteristic stripe- and ladder-shaped patterns, accompanied by the opening of a small gap around the Fermi level ($E_{\mathrm F}$). With increasing doping, the molecular orbitals proliferate in space and gradually form densely packed plaquettes, but the stripe and ladder patterns remain nearly the same. The low-energy electronic states of the molecular orbitals are intimately related to the local pairing properties, and thus play a vitally important role in the emergence of superconductivity. We propose that the Cooper pair is formed by two holes occupying the stripe-like molecular orbital, while the attractive interaction is mediated by the AF spin background.
Submitted 17 September, 2023;
originally announced September 2023.
-
AdSEE: Investigating the Impact of Image Style Editing on Advertisement Attractiveness
Authors:
Liyao Jiang,
Chenglin Li,
Haolan Chen,
Xiaodong Gao,
Xinwang Zhong,
Yang Qiu,
Shani Ye,
Di Niu
Abstract:
Online advertisements are important elements in e-commerce sites, social media platforms, and search engines. With the increasing popularity of mobile browsing, many online ads are displayed with visual information in the form of a cover image in addition to text descriptions to grab the attention of users. Various recent studies have focused on predicting the click rates of online advertisements aware of visual features or composing optimal advertisement elements to enhance visibility. In this paper, we propose Advertisement Style Editing and Attractiveness Enhancement (AdSEE), which explores whether semantic editing to ads images can affect or alter the popularity of online advertisements. We introduce StyleGAN-based facial semantic editing and inversion to ads images and train a click rate predictor attributing GAN-based face latent representations in addition to traditional visual and textual features to click rates. Through a large collected dataset named QQ-AD, containing 20,527 online ads, we perform extensive offline tests to study how different semantic directions and their edit coefficients may impact click rates. We further design a Genetic Advertisement Editor to efficiently search for the optimal edit directions and intensity given an input ad cover image to enhance its projected click rates. Online A/B tests performed over a period of 5 days have verified the increased click-through rates of AdSEE-edited samples as compared to a control group of original ads, verifying the relation between image styles and ad popularity. We open source the code for AdSEE research at https://github.com/LiyaoJiang1998/adsee.
Submitted 15 September, 2023;
originally announced September 2023.
-
Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions
Authors:
Tianxu Wu,
Shuo Ye,
Shuhuang Chen,
Qinmu Peng,
Xinge You
Abstract:
The challenge in fine-grained visual categorization lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve the objective. However, when only a limited number of samples is available, similar methods may become less effective. Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation. However, the high level of detail required for fine-grained images makes it challenging for existing methods to be directly employed. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between different subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.
Submitted 15 May, 2024; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement
Authors:
Sheng Ye,
Yubin Hu,
Matthieu Lin,
Yu-Hui Wen,
Wang Zhao,
Yong-Jin Liu,
Wenping Wang
Abstract:
The reconstruction of indoor scenes from multi-view RGB images is challenging due to the coexistence of flat and texture-less regions alongside delicate and fine-grained regions. Recent methods leverage neural radiance fields aided by predicted surface normal priors to recover the scene geometry. These methods excel in producing complete and smooth results for floor and wall areas. However, they struggle to capture complex surfaces with high-frequency structures due to the inadequate neural representation and the inaccurately predicted normal priors. This work aims to reconstruct high-fidelity surfaces with fine-grained details by addressing the above limitations. To improve the capacity of the implicit representation, we propose a hybrid architecture to represent low-frequency and high-frequency regions separately. To enhance the normal priors, we introduce a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normal vectors. Identifying such uncertainty can prevent our model from being misled by unreliable surface normal supervisions that hinder the accurate reconstruction of intricate geometries. Experiments on the benchmark datasets show that our method outperforms existing methods in terms of reconstruction quality. Furthermore, the proposed method also generalizes well to real-world indoor scenarios captured by our hand-held mobile phones. Our code is publicly available at: https://github.com/yec22/Fine-Grained-Indoor-Recon.
Submitted 25 December, 2023; v1 submitted 14 September, 2023;
originally announced September 2023.
-
Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization
Authors:
Zhenguang Liu,
Xinyang Yu,
Ruili Wang,
Shuai Ye,
Zhe Ma,
Jianfeng Dong,
Sifeng He,
Feng Qian,
Xiaobo Zhang,
Roger Zimmermann,
Lei Yang
Abstract:
The self-media era provides us with a tremendous number of high-quality videos. Unfortunately, frequent video copyright infringements are now seriously damaging the interests and enthusiasm of video creators. Identifying infringing videos is therefore a compelling task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features.
In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle an original high-dimensional feature into multiple sub-features, explicitly disentangling the feature into exclusive lower-dimensional components. We expect the sub-features to encode non-overlapping semantics of the original feature and remove redundant information.
(2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyze the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature.
Extensive experiments on two large-scale benchmark datasets (i.e., SVD and VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset. Our code and model have been released at https://github.com/yyyooooo/DMI/, hoping to contribute to the community.
Submitted 13 September, 2023;
originally announced September 2023.
-
Excitation of extraordinary modes inside the source of Saturn's kilometric radiation
Authors:
Hao Ning,
Yao Chen,
Chuanyang Li,
Shengyi Ye,
Alexey Kuznetsov,
Siyuan Wu
Abstract:
The electron cyclotron maser instability (ECMI) of extraordinary mode waves was investigated with the parameters observed in Saturn's kilometric radiation (SKR) sources. Previous studies employed simplified dispersion relations, and did not consider the excitation of the relativistic (R) mode. This mode is introduced by considering the relativistic effect in plasmas consisting of both cold and hot electrons. Using particle-in-cell simulations, we investigated the excitation of R and X modes based on the measured data. Using the reported value of the density ratio of energetic to total electrons $n_e/n_0=24\%$, the most unstable mode is the R mode. The escaping X-mode emissions are amplified only if the energetic electrons are dominant with $n_e/n_0 \ge 90\%$. For these cases, only the X mode is excited and the R mode disappears due to its strong coupling. The results are well in line with the linear kinetic theory of ECMI. The properties of both the R and X modes are consistent with the observed SKR emissions. This raises questions about the nature of the measured electric field fluctuations within "presumed" SKR sources. The study provides new insights into the ECMI process relevant to SKR emission mechanisms.
Submitted 2 September, 2023;
originally announced September 2023.
-
Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model
Authors:
Hongtai Jing,
Zhengtao Gao,
Sheng Xu,
Tao Shen,
Zhangzhi Peng,
Shwai He,
Tao You,
Shuang Ye,
Wei Lin,
Siqi Sun
Abstract:
In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% non-redundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold, and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials.
Submitted 31 August, 2023;
originally announced August 2023.
-
O$^2$-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model
Authors:
Yubin Hu,
Sheng Ye,
Wang Zhao,
Matthieu Lin,
Yuze He,
Yu-Hui Wen,
Ying He,
Yong-Jin Liu
Abstract:
Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects and presenting an ongoing problem. In this paper, we propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects. Specifically, we utilize a pre-trained diffusion model to fill in the hidden areas of 2D images. Then we use these in-painted images to optimize a neural implicit surface representation for each instance for 3D reconstruction. Since creating the in-painting masks needed for this process is tricky, we adopt a human-in-the-loop strategy that involves very little human engagement to generate high-quality masks. Moreover, some parts of objects can be totally hidden because the videos are usually shot from limited perspectives. To ensure that these invisible areas are recovered, we develop a cascaded network architecture for predicting the signed distance field, making use of different frequency bands of positional encoding and maintaining overall smoothness. Besides the commonly used rendering loss, Eikonal loss, and silhouette loss, we adopt a CLIP-based semantic consistency loss to guide the surface from unseen camera angles. Experiments on ScanNet scenes show that our proposed framework achieves state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos. Code: https://github.com/THU-LYJ-Lab/O2-Recon.
Submitted 19 March, 2024; v1 submitted 18 August, 2023;
originally announced August 2023.
-
Single Millisecond Pulsars from Dynamical Interaction Processes in Dense Star Clusters
Authors:
Claire S. Ye,
Kyle Kremer,
Scott M. Ransom,
Frederic A. Rasio
Abstract:
Globular clusters (GCs) are particularly efficient at forming millisecond pulsars. Among these pulsars, about half lack a companion star, a significantly higher fraction than in the Galactic field. This fraction increases further in some of the densest GCs, especially those that have undergone core collapse, suggesting that dynamical interaction processes play a key role. For the first time, we create N-body models that reproduce the ratio of single-to-binary pulsars in Milky-Way-like GCs. We focus especially on NGC 6752, a typical core-collapsed cluster with many observed millisecond pulsars. Previous studies suggested that an increased rate of neutron star binary disruption in the densest clusters could explain the overabundance of single pulsars in these systems. Here, we demonstrate that binary disruption is ineffective and instead we propose that two additional dynamical processes play the dominant role: (1) tidal disruption of main-sequence stars by neutron stars; and (2) gravitational collapse of heavy white-dwarf-binary merger remnants. Neutron stars formed through these processes may also be associated with fast radio bursts similar to those observed recently in an extragalactic GC.
Submitted 19 January, 2024; v1 submitted 28 July, 2023;
originally announced July 2023.
-
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Authors:
Seonghyeon Ye,
Doyoung Kim,
Sungdong Kim,
Hyeonbin Hwang,
Seungone Kim,
Yongrae Jo,
James Thorne,
Juho Kim,
Minjoon Seo
Abstract:
Evaluation of Large Language Models (LLMs) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction. However, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction. We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. Using FLASK, we compare multiple open-source and proprietary LLMs and observe a high correlation between model-based and human-based evaluations. We publicly release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
Submitted 14 April, 2024; v1 submitted 20 July, 2023;
originally announced July 2023.
-
Diffusion Models for Multi-target Adversarial Tracking
Authors:
Sean Ye,
Manisha Natarajan,
Zixuan Wu,
Matthew Gombolay
Abstract:
Target tracking plays a crucial role in real-world scenarios, particularly in drug-trafficking interdiction, where the knowledge of an adversarial target's location is often limited. Improving autonomous tracking systems will enable unmanned aerial, surface, and underwater vehicles to better assist in interdicting smugglers that use manned surface, semi-submersible, and aerial vessels. As unmanned drones proliferate, accurate autonomous target estimation is even more crucial for security and safety. This paper presents Constrained Agent-based Diffusion for Enhanced Multi-Agent Tracking (CADENCE), an approach aimed at generating comprehensive predictions of adversary locations by leveraging past sparse state information. To assess the effectiveness of this approach, we evaluate predictions on single-target and multi-target pursuit environments, employing Monte-Carlo sampling of the diffusion model to estimate the probability associated with each generated trajectory. We propose a novel cross-attention based diffusion model that utilizes constraint-based sampling to generate multimodal track hypotheses. Our single-target model surpasses the performance of all baseline methods on Average Displacement Error (ADE) for predictions across all time horizons.
Submitted 12 January, 2024; v1 submitted 12 July, 2023;
originally announced July 2023.
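The Average Displacement Error (ADE) named in the abstract above is conventionally the mean Euclidean distance between predicted and ground-truth positions over the time steps of a trajectory. The following is a minimal Python sketch of that standard definition, not the authors' evaluation code; the function name and the 2D example points are illustrative assumptions.

```python
import math


def average_displacement_error(predicted, actual):
    """Mean Euclidean distance between matched trajectory points.

    `predicted` and `actual` are equal-length sequences of coordinate
    tuples, one per time step (hypothetical input format).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, a) for p, a in zip(predicted, actual))
    return total / len(predicted)


# Two time steps: the first prediction is exact, the second is off by 5 units.
predicted_track = [(0.0, 0.0), (3.0, 4.0)]
actual_track = [(0.0, 0.0), (0.0, 0.0)]
print(average_displacement_error(predicted_track, actual_track))  # 2.5
```

Lower values indicate predictions that stay closer to the true track on average.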
-
AmadeusGPT: a natural language interface for interactive animal behavioral analysis
Authors:
Shaokai Ye,
Jessy Lauer,
Mu Zhou,
Alexander Mathis,
Mackenzie W. Mathis
Abstract:
The process of quantifying and analyzing animal behavior involves translating the naturally occurring descriptive language of their actions into machine-readable code. Yet, codifying behavior analysis is often challenging without deep understanding of animal behavior and technical machine learning knowledge. To narrow this gap, we introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code. Large-language models (LLMs) such as GPT3.5 and GPT4 allow for interactive language-based queries that are potentially well suited for interactive behavior analysis. However, the comprehension capability of these LLMs is limited by the context window size, which prevents them from remembering distant conversations. To overcome the context window limitation, we implement a novel dual-memory mechanism to allow communication between short-term and long-term memory using symbols as context pointers for retrieval and saving. Concretely, users directly use language-based definitions of behavior and our augmented GPT develops code based on the core AmadeusGPT API, which contains machine learning, computer vision, spatio-temporal reasoning, and visualization modules. Users then can interactively refine results, and seamlessly add new behavioral modules as needed. We benchmark AmadeusGPT and show we can produce state-of-the-art performance on the MABE 2022 behavior challenge tasks. Note that an end-user would not need to write any code to achieve this. Thus, collectively AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system. Code and demos can be found at: https://github.com/AdaptiveMotorControlLab/AmadeusGPT.
Submitted 10 July, 2023;
originally announced July 2023.
-
Augmenting Sports Videos with VisCommentator
Authors:
Chen Zhu-Tian,
Shuainan Ye,
Xiangtong Chu,
Haijun Xia,
Hui Zhang,
Huamin Qu,
Yingcai Wu
Abstract:
Visualizing data in sports videos is gaining traction in sports analytics, given its ability to communicate insights and explicate player strategies engagingly. However, augmenting sports videos with such data visualizations is challenging, especially for sports analysts, as it requires considerable expertise in video editing. To ease the creation process, we present a design space that characterizes augmented sports videos at an element-level (what the constituents are) and clip-level (how those constituents are organized). We do so by systematically reviewing 233 examples of augmented sports videos collected from TV channels, teams, and leagues. The design space guides selection of data insights and visualizations for various purposes. Informed by the design space and close collaboration with domain experts, we design VisCommentator, a fast prototyping tool, to ease the creation of augmented table tennis videos by leveraging machine learning-based data extractors and design space-based visualization recommendations. With VisCommentator, sports analysts can create an augmented video by selecting the data to visualize instead of manually drawing the graphical marks. Our system can be generalized to other racket sports (e.g., tennis, badminton) once the underlying datasets and models are available. A user study with seven domain experts shows high satisfaction with our system, confirms that the participants can reproduce augmented sports videos in a short period, and provides insightful implications into future improvements and opportunities.
Submitted 10 May, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
HOFA: Twitter Bot Detection with Homophily-Oriented Augmentation and Frequency Adaptive Attention
Authors:
Sen Ye,
Zhaoxuan Tan,
Zhenyu Lei,
Ruijie He,
Hongrui Wang,
Qinghua Zheng,
Minnan Luo
Abstract:
Twitter bot detection has become an increasingly important and challenging task to combat online misinformation, facilitate social content moderation, and safeguard the integrity of social platforms. Though existing graph-based Twitter bot detection methods achieved state-of-the-art performance, they are all based on the homophily assumption, which assumes users with the same label are more likely to be connected, making it easy for Twitter bots to disguise themselves by following a large number of genuine users. To address this issue, we proposed HOFA, a novel graph-based Twitter bot detection framework that combats the heterophilous disguise challenge with a homophily-oriented graph augmentation module (Homo-Aug) and a frequency adaptive attention module (FaAt). Specifically, the Homo-Aug extracts user representations and computes a k-NN graph using an MLP and improves Twitter's homophily by injecting the k-NN graph. For the FaAt, we propose an attention mechanism that adaptively serves as a low-pass filter along a homophilic edge and a high-pass filter along a heterophilic edge, preventing user features from being over-smoothed by their neighborhood. We also introduce a weight guidance loss to guide the frequency adaptive attention module. Our experiments demonstrate that HOFA achieves state-of-the-art performance on three widely-acknowledged Twitter bot detection benchmarks, which significantly outperforms vanilla graph-based bot detection techniques and strong heterophilic baselines. Furthermore, extensive studies confirm the effectiveness of our Homo-Aug and FaAt module, and HOFA's ability to demystify the heterophilous disguise challenge.
Submitted 22 June, 2023;
originally announced June 2023.
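The abstract above describes building a k-NN graph over learned user representations and injecting it into the original graph. As a rough illustration of the k-NN step only, the Python sketch below connects each node to its k nearest neighbours. It assumes plain Euclidean distance on already-computed feature vectors; the distance metric, the function name, and the toy features are assumptions made here, not details taken from HOFA.

```python
import math


def knn_edges(features, k):
    """Return directed edges (i, j) linking each node i to its k nearest nodes j.

    `features` is a list of equal-length coordinate tuples standing in for
    user representations (hypothetical input). Ties are broken by node index.
    """
    edges = []
    for i, feature_i in enumerate(features):
        candidates = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: (math.dist(feature_i, features[j]), j),
        )
        edges.extend((i, j) for j in candidates[:k])
    return edges


# Three users on a line: users 0 and 1 are close, user 2 is far away.
user_features = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(knn_edges(user_features, k=1))  # [(0, 1), (1, 0), (2, 1)]
```

This brute-force version compares every pair of nodes, which is fine for a toy example but would need an approximate nearest-neighbour index on graphs of realistic size.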
-
Adversarial Search and Tracking with Multiagent Reinforcement Learning in Sparsely Observable Environment
Authors:
Zixuan Wu,
Sean Ye,
Manisha Natarajan,
Letian Chen,
Rohan Paleja,
Matthew C. Gombolay
Abstract:
We study a search and tracking (S&T) problem where a team of dynamic search agents must collaborate to track an adversarial, evasive agent. The heterogeneous search team may only have access to a limited number of past adversary trajectories within a large search space. This problem is challenging for both model-based searching and reinforcement learning (RL) methods since the adversary exhibits reactionary and deceptive evasive behaviors in a large space leading to sparse detections for the search agents. To address this challenge, we propose a novel Multi-Agent RL (MARL) framework that leverages the estimated adversary location from our learnable filtering model. We show that our MARL architecture can outperform all baselines and achieves a 46% increase in detection rate.
Submitted 20 October, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Learning Models of Adversarial Agent Behavior under Partial Observability
Authors:
Sean Ye,
Manisha Natarajan,
Zixuan Wu,
Rohan Paleja,
Letian Chen,
Matthew C. Gombolay
Abstract:
The need for opponent modeling and tracking arises in several real-world scenarios, such as professional sports, video game design, and drug-trafficking interdiction. In this work, we present Graph based Adversarial Modeling with Mutual Information (GrAMMI) for modeling the behavior of an adversarial opponent agent. GrAMMI is a novel graph neural network (GNN) based approach that uses mutual information maximization as an auxiliary objective to predict the current and future states of an adversarial opponent with partial observability. To evaluate GrAMMI, we design two large-scale, pursuit-evasion domains inspired by real-world scenarios, where a team of heterogeneous agents is tasked with tracking and interdicting a single adversarial agent, and the adversarial agent must evade detection while achieving its own objectives. With the mutual information formulation, GrAMMI outperforms all baselines in both domains and achieves 31.68% higher log-likelihood on average for future adversarial state predictions across both domains.
Submitted 5 July, 2023; v1 submitted 19 June, 2023;
originally announced June 2023.