Search | arXiv e-print repository

VFIMamba: Video Frame Interpolation with State Space Models

Authors: Guozhen Zhang, Chunxu Liu, Yutao Cui, Xiaotong Zhao, Kai Ma, Limin Wang

Abstract: Inter-frame modeling is pivotal in generating intermediate frames for video frame interpolation (VFI). Current approaches predominantly rely on convolution or attention-based models, which often either lack sufficient receptive fields or entail significant computational overheads. Recently, Selective State Space Models (S6) have emerged, tailored specifically for long sequence modeling, offering both linear complexity and data-dependent modeling capabilities. In this paper, we propose VFIMamba, a novel frame interpolation method for efficient and dynamic inter-frame modeling by harnessing the S6 model. Our approach introduces the Mixed-SSM Block (MSB), which initially rearranges tokens from adjacent frames in an interleaved fashion and subsequently applies multi-directional S6 modeling. This design facilitates the efficient transmission of information across frames while upholding linear complexity. Furthermore, we introduce a novel curriculum learning strategy that progressively cultivates proficiency in modeling inter-frame dynamics across varying motion magnitudes, fully unleashing the potential of the S6 model. Experimental findings showcase that our method attains state-of-the-art performance across diverse benchmarks, particularly excelling in high-resolution scenarios. In particular, on the X-TEST dataset, VFIMamba demonstrates a noteworthy improvement of 0.80 dB for 4K frames and 0.96 dB for 2K frames.

Submitted 2 July, 2024; originally announced July 2024.
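
The interleaved rearrangement mentioned in the VFIMamba abstract can be pictured with a minimal sketch. It assumes that tokens from two adjacent frames are alternated one by one and that "multi-directional" scanning can be stood in for by a forward and a backward pass over the merged sequence; the abstract does not specify either detail. The names `interleave_tokens` and `scan_orders` are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_tokens(frame_a: Sequence[T], frame_b: Sequence[T]) -> List[T]:
    """Alternate tokens from two frames: a0, b0, a1, b1, ...

    Assumption: both frames contribute the same number of tokens.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("both frames must have the same number of tokens")
    merged: List[T] = []
    for token_a, token_b in zip(frame_a, frame_b):
        merged.append(token_a)
        merged.append(token_b)
    return merged


def scan_orders(tokens: Sequence[T]) -> List[List[T]]:
    """Return a forward and a backward ordering of the merged sequence.

    This is only a stand-in for multi-directional sequence modeling; a real
    model would run a sequence layer over each ordering and fuse the results.
    """
    return [list(tokens), list(reversed(tokens))]


if __name__ == "__main__":
    merged = interleave_tokens(["a0", "a1"], ["b0", "b1"])
    print(merged)
    print(scan_orders(merged))
```

Because every token sits next to a token from the other frame, a sequential model that processes the merged list in order sees both frames within a short span, which is one way to read the abstract's claim about efficient information transfer across frames.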

arXiv:2407.01964 [pdf, other]

Enabling Discriminative Reasoning in LLMs for Legal Judgment Prediction

Authors: Chenlong Deng, Kelong Mao, Yuyao Zhang, Zhicheng Dou

Abstract: Legal judgment prediction is essential for enhancing judicial efficiency. In this work, we identify that existing large language models (LLMs) underperform in this domain due to challenges in understanding case complexities and distinguishing between similar charges. To adapt LLMs for effective legal judgment prediction, we introduce the Ask-Discriminate-Predict (ADAPT) reasoning framework inspired by human judicial reasoning. ADAPT involves decomposing case facts, discriminating among potential charges, and predicting the final judgment. We further enhance LLMs through fine-tuning with multi-task synthetic trajectories to improve legal judgment prediction accuracy and efficiency under our ADAPT framework. Extensive experiments conducted on two widely-used datasets demonstrate the superior performance of our framework in legal judgment prediction, particularly when dealing with complex and confusing charges.

Submitted 2 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01916 [pdf, other]

doi 10.1109/TPAMI.2024.3416710

Sequential Manipulation Against Rank Aggregation: Theory and Algorithm

Authors: Ke Ma, Qianqian Xu, Jinshan Zeng, Wei Liu, Xiaochun Cao, Yingfei Sun, Qingming Huang

Abstract: Rank aggregation with pairwise comparisons is widely encountered in sociology, politics, economics, psychology, sports, etc. Given the enormous social impact and the consequent incentives, the potential adversary has a strong motivation to manipulate the ranking list. However, the ideal attack opportunity and the excessive adversarial capability cause the existing methods to be impractical. To fully explore the potential risks, we leverage an online attack on the vulnerable data collection process. Since it is independent of rank aggregation and lacks effective protection mechanisms, we disrupt the data collection process by fabricating pairwise comparisons without knowledge of the future data or the true distribution. From the game-theoretic perspective, the confrontation scenario between the online manipulator and the ranker who takes control of the original data source is formulated as a distributionally robust game that deals with the uncertainty of knowledge. Then we demonstrate that the equilibrium in the above game is potentially favorable to the adversary by analyzing the vulnerability of the sampling algorithms such as Bernoulli and reservoir methods. According to the above theoretical analysis, different sequential manipulation policies are proposed under a Bayesian decision framework and a large class of parametric pairwise comparison models. For attackers with complete knowledge, we establish the asymptotic optimality of the proposed policies. To increase the success rate of the sequential manipulation with incomplete knowledge, a distributionally robust estimator, which replaces the maximum likelihood estimation in a saddle point problem, provides a conservative data generation solution. Finally, the corroborating empirical evidence shows that the proposed method manipulates the results of rank aggregation methods in a sequential manner.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted by IEEE TPAMI URL: https://ieeexplore.ieee.org/document/10564181
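
For readers unfamiliar with the two sampling schemes named in the abstract above, the following is a minimal sketch of textbook Bernoulli sampling and reservoir sampling (Algorithm R) over a stream of pairwise comparisons. It illustrates only the generic sampling algorithms, not the paper's attack or its analysis; the function names and the tuple encoding of a comparison as `(winner, loser)` are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, Tuple

Comparison = Tuple[int, int]  # hypothetical encoding: (winner_id, loser_id)


def bernoulli_sample(stream: Iterable[Comparison], p: float,
                     rng: Optional[random.Random] = None) -> List[Comparison]:
    """Keep each incoming comparison independently with probability p."""
    rng = rng or random.Random()
    return [item for item in stream if rng.random() < p]


def reservoir_sample(stream: Iterable[Comparison], k: int,
                     rng: Optional[random.Random] = None) -> List[Comparison]:
    """Algorithm R: a uniform sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[Comparison] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # both endpoints inclusive
            if slot < k:
                reservoir[slot] = item
    return reservoir


if __name__ == "__main__":
    comparisons = [(i, i + 1) for i in range(100)]
    print(len(bernoulli_sample(comparisons, 0.1, random.Random(0))))
    print(reservoir_sample(comparisons, 5, random.Random(0)))
```

Both schemes decide whether to keep an item as it arrives, without seeing future data, which matches the online data collection setting the abstract describes.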

arXiv:2407.00565 [pdf, other]

Joint Task Allocation and Scheduling for Multi-Hop Distributed Computing

Authors: Ke Ma, Junfei Xie

Abstract: The rise of the Internet of Things and edge computing has shifted computing resources closer to end-users, benefiting numerous delay-sensitive, computation-intensive applications. To speed up computation, distributed computing is a promising technique that allows parallel execution of tasks across multiple compute nodes. However, current research predominantly revolves around the master-worker paradigm, limiting resource sharing within one-hop neighborhoods. This limitation can render distributed computing ineffective in scenarios with limited nearby resources or constrained/dynamic connectivity. In this paper, we address this limitation by introducing a new distributed computing framework that extends resource sharing beyond one-hop neighborhoods through exploring layered network structures and multi-hop routing. Our framework involves transforming the network graph into a sink tree and formulating a joint optimization problem based on the layered tree structure for task allocation and scheduling. To solve this problem, we propose two exact methods that find optimal solutions and three heuristic strategies to improve efficiency and scalability. The performances of these methods are analyzed and evaluated through theoretical analyses and comprehensive simulation studies. The results demonstrate their promising performances over the traditional distributed computing and computation offloading strategies.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19853 [pdf, other]

YuLan: An Open-source Large Language Model

Authors: Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, et al. (13 additional authors not shown)

Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was finished in January 2024, and the model has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at https://github.com/RUC-GSAI/YuLan-Chat.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19760 [pdf, other]

Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation

Authors: Chenlong Deng, Kelong Mao, Zhicheng Dou

Abstract: Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and robustness on complex legal case queries of KELLER over existing methods.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.14515 [pdf, other]

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Authors: Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

Abstract: The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.11247 [pdf, other]

STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

Authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Ke Ma, Kewei Chen, Dongxu Guo, Tian Ye, Yanting Zhang, Hongwei Wang, Gaoang Wang

Abstract: Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challenging tasks such as navigation and even creative tasks, with an efficiency far exceeding previous state-of-the-art methods by a factor of $2.5\times$ to $7.3\times$. We begin our exploration with a vanilla large language model, augmenting it with a vision encoder and an action codebase trained on our collected high-quality dataset STEVE-21K. Subsequently, we enhanced it with a Critic and memory to transform it into a complex system. Finally, we constructed a hierarchical multi-agent system. Our recent work explored how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world.

Submitted 17 June, 2024; originally announced June 2024.

Comments: CVPR 2024 Embodied AI Workshop

arXiv:2406.09688 [pdf, other]

FreeCtrl: Constructing Control Centers with Feedforward Layers for Learning-Free Controllable Text Generation

Authors: Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Kezhi Mao

Abstract: Controllable text generation (CTG) seeks to craft texts adhering to specific attributes, traditionally employing learning-based techniques such as training, fine-tuning, or prefix-tuning with attribute-specific datasets. These approaches, while effective, demand extensive computational and data resources. In contrast, some proposed learning-free alternatives circumvent learning but often yield inferior results, exemplifying the fundamental machine learning trade-off between computational expense and model efficacy. To overcome these limitations, we propose FreeCtrl, a learning-free approach that dynamically adjusts the weights of selected feedforward neural network (FFN) vectors to steer the outputs of large language models (LLMs). FreeCtrl hinges on the principle that the weights of different FFN vectors influence the likelihood of different tokens appearing in the output. By identifying and adaptively adjusting the weights of attribute-related FFN vectors, FreeCtrl can control the output likelihood of attribute keywords in the generated content. Extensive experiments on single- and multi-attribute control reveal that the learning-free FreeCtrl outperforms other learning-free and learning-based methods, successfully resolving the dilemma between learning costs and model performance.

Submitted 13 June, 2024; originally announced June 2024.

Comments: ACL 2024

arXiv:2406.08187 [pdf, other]

Learning-based Traversability Costmap for Autonomous Off-road Navigation

Authors: Qiumin Zhu, Zhen Sun, Songpengcheng Xia, Guoqing Liu, Kehui Ma, Ling Pei, Zheng Gong

Abstract: Traversability estimation in off-road terrains is an essential procedure for autonomous navigation. However, creating reliable labels for complex interactions between the robot and the surface is still a challenging problem in learning-based costmap generation. To address this, we propose a method that predicts traversability costmaps by leveraging both visual and geometric information of the environment. To quantify the surface properties like roughness and bumpiness, we introduce a novel way of risk-aware labelling with proprioceptive information for network training. We validate our method in costmap prediction and navigation tasks for complex off-road scenarios. Our results demonstrate that our costmap prediction method excels in terms of average accuracy and MSE. The navigation results indicate that using our learned costmaps leads to safer and smoother driving, outperforming previous methods in terms of the highest success rate, lowest normalized trajectory length, lowest time cost, and highest mean stability across two scenarios.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.05013 [pdf, other]

CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search

Authors: Fengran Mo, Abbas Ghaddar, Kelong Mao, Mehdi Rezagholizadeh, Boxing Chen, Qun Liu, Jian-Yun Nie

Abstract: In this paper, we study how open-source large language models (LLMs) can be effectively deployed for improving query rewriting in conversational search, especially for ambiguous queries. We introduce CHIQ, a two-step method that leverages the capabilities of LLMs to resolve ambiguities in the conversation history before query rewriting. This approach contrasts with prior studies that predominantly use closed-source LLMs to directly generate search queries from conversation history. We demonstrate on five well-established benchmarks that CHIQ leads to state-of-the-art results across most settings, showing highly competitive performances with systems leveraging closed-source LLMs. Our study provides a first step towards leveraging open-source LLMs in conversational search, as a competitive alternative to the prevailing reliance on commercial LLMs. Data, models, and source code will be publicly available upon acceptance at https://github.com/fengranMark/CHIQ.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04548 [pdf, other]

GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks

Authors: Hsiao-Ying Lu, Yiran Li, Ujwal Pratap Krishna Kaluvakolanu Thyagarajan, Kwan-Liu Ma

Abstract: Graph Neural Networks (GNNs) have proven highly effective in various machine learning (ML) tasks involving graphs, such as node/graph classification and link prediction. However, explaining the decisions made by GNNs poses challenges because of the aggregated relational information based on graph structure, leading to complex data transformations. Existing methods for explaining GNNs often face limitations in systematically exploring diverse substructures and evaluating results in the absence of ground truths. To address this gap, we introduce GNNAnatomy, a model- and dataset-agnostic visual analytics system designed to facilitate the generation and evaluation of multi-level explanations for GNNs. In GNNAnatomy, we employ graphlets to elucidate GNN behavior in graph-level classification tasks. By analyzing the associations between GNN classifications and graphlet frequencies, we formulate hypothesized factual and counterfactual explanations. To validate a hypothesized graphlet explanation, we introduce two metrics: (1) the correlation between its frequency and the classification confidence, and (2) the change in classification confidence after removing this substructure from the original graph. To demonstrate the effectiveness of GNNAnatomy, we conduct case studies on both real-world and synthetic graph datasets from various domains. Additionally, we qualitatively compare GNNAnatomy with a state-of-the-art GNN explainer, demonstrating the utility and versatility of our design.

Submitted 6 June, 2024; originally announced June 2024.
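
The two validation metrics listed in the GNNAnatomy abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract says only "correlation", so the Pearson coefficient is assumed here, and the change in confidence is taken as a simple difference. The function names and the sample numbers are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between graphlet frequencies and confidences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def confidence_change(conf_original: float, conf_without: float) -> float:
    """Confidence before removing the substructure minus confidence after."""
    return conf_original - conf_without


if __name__ == "__main__":
    # Made-up values: per-graph graphlet frequency and classifier confidence.
    frequencies = [0.0, 1.0, 2.0, 3.0, 4.0]
    confidences = [0.52, 0.61, 0.70, 0.83, 0.90]
    print(pearson(frequencies, confidences))
    print(confidence_change(0.90, 0.55))
```

Under this reading, a graphlet whose frequency correlates strongly with confidence, and whose removal lowers confidence substantially, would count as a well-supported explanation.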

arXiv:2406.00009 [pdf, other]

ULTra-AV: A Unified Longitudinal Trajectory Dataset for Automated Vehicle

Authors: Hang Zhou, Ke Ma, Shixiao Liang, Xiaopeng Li, Xiaobo Qu

Abstract: Automated Vehicles (AVs) promise significant advances in transportation. Critical to these improvements is understanding AVs' longitudinal behavior, relying heavily on real-world trajectory data. Existing open-source trajectory datasets of AV, however, often fall short in refinement, reliability, and completeness, hindering effective performance metrics analysis and model development. This study addresses these challenges by creating a Unified Longitudinal TRAjectory dataset for AVs (Ultra-AV) to analyze their microscopic longitudinal driving behaviors. This dataset compiles data from 13 distinct sources, encompassing various AV types, test sites, and experiment scenarios. We established a three-step data processing: 1. extraction of longitudinal trajectory data, 2. general data cleaning, and 3. data-specific cleaning to obtain the longitudinal trajectory data and car-following trajectory data. The validity of the processed data is affirmed through performance evaluations across safety, mobility, stability, and sustainability, along with an analysis of the relationships between variables in car-following models. Our work not only furnishes researchers with standardized data and metrics for longitudinal AV behavior studies but also sets guidelines for data collection and model development.

Submitted 16 May, 2024; originally announced June 2024.

Comments: NA

arXiv:2405.20890 [pdf, other]

Constraining Gluonic Contact Interaction of a Neutrino-philic Dark Fermion at Hadron Colliders and Direct Detection Experiments

Authors: Kai Ma, Lin-Yun He

Abstract: A fermion that interacts weakly with the Standard Model particles is a promising candidate for genuine dark matter. In this paper, we study signatures of the gluonic interactions of a dark fermion and a neutrino at hadron colliders and direct detection experiments. The lowest-order interactions are described by contact operators of dimension 7. At hadron colliders, mono-jet production is the most sensitive channel. These operators can also induce both spin-independent and spin-dependent absorption of the dark fermion at nuclear targets. We show that for a nearly massless dark fermion, the energy scales are constrained to be higher than 500 GeV and 1.2 TeV by the current LHC and HE-LHC searches, respectively. Furthermore, we also find that almost all the parameter space accessible by the spin-independent absorption has been excluded by the current LHC constraints. In contrast, for spin-dependent absorption at light nuclear targets there is still some parameter space which cannot be reached by current and upcoming LHC searches.

Submitted 31 May, 2024; originally announced May 2024.

Comments: 32 pages, 7 captioned figures; 1 figure and 3 tables in the Appendix

arXiv:2405.20612 [pdf, other]

UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation

Authors: Hanzhang Zhou, Zijian Feng, Zixiao Zhu, Junlang Qian, Kezhi Mao

Abstract: Large language models (LLMs) have demonstrated impressive capabilities in various tasks using the in-context learning (ICL) paradigm. However, their effectiveness is often compromised by inherent bias, leading to prompt brittleness, i.e., sensitivity to design settings such as example selection, order, and prompt formatting. Previous studies have addressed LLM bias through external adjustment of model outputs, but the internal mechanisms that lead to such bias remain unexplored. Our work delves into these mechanisms, particularly investigating how feedforward neural networks (FFNs) and attention heads result in the bias of LLMs. By interpreting the contribution of individual FFN vectors and attention heads, we identify the biased LLM components that skew LLMs' prediction toward specific labels. To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads. Extensive experiments across 12 NLP datasets demonstrate that UniBias significantly enhances ICL performance and alleviates prompt brittleness of LLMs.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.20343 [pdf, other]

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

Authors: Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, Kaisheng Ma

Abstract: In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization time and inconsistency issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.

Submitted 13 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: Project page: https://wukailu.github.io/Unique3D

ACM Class: I.2.10

arXiv:2405.19885 [pdf, other]

Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning

Authors: Hengkai Tan, Songming Liu, Kai Ma, Chengyang Ying, Xingxing Zhang, Hang Su, Jun Zhu

Abstract: Transformer has shown promise in reinforcement learning to model time-varying features for obtaining generalized low-level robot policies on diverse robotics datasets in embodied learning. However, it still suffers from the issues of low data efficiency and high inference latency. In this paper, we propose to investigate the task from a new perspective of the frequency domain. We first observe that the energy density in the frequency domain of a robot's trajectory is mainly concentrated in the low-frequency part. Then, we present the Fourier Controller Network (FCNet), a new network that uses Short-Time Fourier Transform (STFT) to extract and encode time-varying features through frequency domain interpolation. In order to do real-time decision-making, we further adopt FFT and Sliding DFT methods in the model architecture to achieve parallel training and efficient recurrent inference. Extensive results in both simulated (e.g., D4RL) and real-world environments (e.g., robot locomotion) demonstrate FCNet's substantial efficiency and effectiveness over existing methods such as Transformer, e.g., FCNet outperforms Transformer on multi-environmental robotics datasets of all types of sizes (from 1.9M to 120M). The project page and code can be found at https://thkkk.github.io/fcnet.

Submitted 5 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19327 [pdf, other]

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Authors: Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, et al. (20 additional authors not shown)

Abstract: Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model's weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.

Submitted 2 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

Comments: https://map-neo.github.io/

arXiv:2405.18435 [pdf, other]

QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

Authors: Hongwei Bran Li, Fernando Navarro, Ivan Ezhov, Amirhossein Bayat, Dhritiman Das, Florian Kofler, Suprosanna Shit, Diana Waldmannstetter, Johannes C. Paetzold, Xiaobin Hu, Benedikt Wiestler, Lucas Zimmer, Tamaz Amiranashvili, Chinmay Prabhakar, Christoph Berger, Jonas Weidner, Michelle Alonso-Basant, Arif Rashid, Ujjwal Baid, Wesam Adel, Deniz Ali, Bhakti Baheti, Yingbin Bai, Ishaan Bhatt, Sabri Can Cetindag, et al. (55 additional authors not shown)

Abstract: Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks.

Submitted 24 June, 2024; v1 submitted 19 March, 2024; originally announced May 2024.

Comments: initial technical report

arXiv:2405.16886 [pdf, other]

Hawk: Learning to Understand Open-World Video Anomalies

Authors: Jiaqi Tang, Hao Lu, Ruizheng Wu, Xiaogang Xu, Ke Ma, Cheng Fang, Bin Guo, Jiangbo Lu, Qifeng Chen, Ying-Cong Chen

Abstract: Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs. However, current VAD systems are often limited by their superficial semantic understanding of scenes and minimal user interaction. Additionally, the prevalent data scarcity in existing datasets restricts their applicability in open-world scenarios. In this paper, we introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely. Recognizing the difference in motion information between abnormal and normal videos, Hawk explicitly integrates motion modality to enhance anomaly identification. To reinforce motion attention, we construct an auxiliary consistency loss within the motion and video space, guiding the video branch to focus on the motion modality. Moreover, to improve the interpretation of motion-to-language, we establish a clear supervisory relationship between motion and its linguistic representation. Furthermore, we have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions. The final results demonstrate that Hawk achieves SOTA performance, surpassing existing baselines in both video description generation and question-answering. Our codes/dataset/demo will be released at https://github.com/jqtangust/hawk.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16878 [pdf, other]

Complementary Search of Fermionic Absorption Operators at Hadron Collider and Direct Detection Experiments

Authors: Kai Ma, Shao-Feng Ge, Lin-Yun He, Ning Zhou

Abstract: Instead of the energy recoil signal at direct detection experiments, dark matter appears always as missing energy at high energy colliders. For a fermionic dark matter coupled with quarks and neutrino via absorption operators, its production is always accompanied by an invisible neutrino. We study in detail the mono-X (photon, jet, and $Z$) productions at the Large Hadron Collider (LHC). To make easy comparison between the collider and DM direct detection experiments, we start from the quark-level absorption operators for the first time. In other words, we study the model-independent constraints on a generic fermionic absorption dark fermion. In addition, the interplay and comparison with the possible detection at the neutrino experiment especially Borexino is also briefly discussed. We find that for both spin-dependent and spin-independent absorption of the dark matter, the experiments with light nuclear target can provide the strongest constraints.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 46 pages, 20 captioned figures, 4 tables. The main results of this paper have been reported at the workshop "Roadmap of Dark Matter models for Run 3"

arXiv:2405.15318 [pdf, other]

Are Long-LLMs A Necessity For Long-Context Tasks?

Authors: Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Yujia Zhou, Xu Chen, Zhicheng Dou

Abstract: The learning and deployment of long-LLMs remains a challenging problem despite recent progress. In this work, we argue that the long-LLMs are not a necessity to solve long-context tasks, as common long-context tasks are short-context solvable, i.e. they can be solved by purely working with oracle short-contexts within the long-context tasks' inputs. On top of this argument, we propose a framework called LC-Boost (Long-Context Bootstrapper), which enables a short-LLM to address the long-context tasks in a bootstrapping manner. In our framework, the short-LLM prompts itself to reason for two critical decisions: 1) how to access the appropriate part of context within the input, 2) how to make effective use of the accessed context. By adaptively accessing and utilizing the context based on the presented tasks, LC-Boost can serve as a general framework to handle diversified long-context processing problems. We comprehensively evaluate different types of tasks from popular long-context benchmarks, where LC-Boost is able to achieve a substantially improved performance with a much smaller consumption of resource.

Submitted 24 May, 2024; originally announced May 2024.

Comments: 18 pages

arXiv:2405.13113 [pdf, other]

MAMMOTH-Subaru. II. Diverse Populations of Circumgalactic Ly$α$ Nebulae at Cosmic Noon

Authors: Mingyu Li, Haibin Zhang, Zheng Cai, Yongming Liang, Nobunari Kashikawa, Ke Ma, Xiaohui Fan, J. Xavier Prochaska, Bjorn H. C. Emonts, Xin Wang, Yunjing Wu, Shiwu Zhang, Qiong Li, Sean D. Johnson, Minghao Yue, Fabrizio Arrigoni Battaia, Sebastiano Cantalupo, Joseph F. Hennawi, Satoshi Kikuta, Yuanhang Ning, Masami Ouchi, Rhythm Shimakawa, Ben Wang, Weichen Wang, Zheng Zheng, et al. (1 additional authors not shown)

Abstract: Circumgalactic Lyman-alpha (Ly$α$) nebulae are gaseous halos around galaxies exhibiting luminous extended Ly$α$ emission. This work investigates Ly$α$ nebulae from deep imaging of $\sim12~\mathrm{deg}^2$ sky, targeted by the MAMMOTH-Subaru survey. Utilizing the wide-field capability of Hyper Suprime-Cam (HSC), we present one of the largest blind Ly$α$ nebula selections, including QSO nebulae, Ly$α$ blobs, and radio galaxy nebulae down to typical $2σ$ Ly$α$ surface brightness of $(5-10)\times10^{-18}\mathrm{~erg~s^{-1}~cm^{-2}~arcsec^{-2}}$. The sample contains 117 nebulae with Ly$α$ sizes of 40 - 400 kpc, and the most gigantic one spans about 365 kpc, referred to as the Ivory Nebula. Combining with multiwavelength data, we investigate diverse nebula populations and associated galaxies. We find a small fraction of Ly$α$ nebulae have QSOs ($\sim7\%$), luminous infrared galaxies ($\sim1\%$), and radio galaxies ($\sim 2\%$). Remarkably, among the 28 enormous Ly$α$ nebulae (ELANe) exceeding 100 kpc, about $80\%$ are associated with UV-faint galaxies ($M_\mathrm{UV} > -22$), categorized as Type II ELANe. We underscore that Type II ELANe constitute the majority but remain largely hidden in current galaxy and QSO surveys. Dusty starburst and obscured AGN activity are proposed to explain the nature of Type II ELANe. The SED of stacking all Ly$α$ nebulae also reveals signs of massive dusty star-forming galaxies with obscured AGNs. We propose a model to explain the dusty nature where the diverse populations of Ly$α$ nebula capture massive galaxies at different evolutionary stages undergoing violent assembling. Ly$α$ nebulae provide critical insights into the formation and evolution of today's massive cluster galaxies at cosmic noon.

Submitted 21 May, 2024; originally announced May 2024.

Comments: 26 pages, 10 figures, 3 tables, submitted to ApJS, comments welcome

arXiv:2405.12569 [pdf, other]

TypeII-CsiNet: CSI Feedback with TypeII Codebook

Authors: Yiliang Sang, Ke Ma, Yang Ming, Jin Lian, Zhaocheng Wang

Abstract: The latest TypeII codebook selects partial strongest angular-delay ports for the feedback of downlink channel state information (CSI), whereas its performance is limited due to the deficiency of utilizing the correlations among the port coefficients. To tackle this issue, we propose a tailored autoencoder named TypeII-CsiNet to effectively integrate the TypeII codebook with deep learning, wherein three novel designs are developed for sufficiently boosting the sum rate performance. Firstly, a dedicated pre-processing module is designed to sort the selected ports for reserving the correlations of their corresponding coefficients. Secondly, a position-filling layer is developed in the decoder to fill the feedback coefficients into their ports in the recovered CSI matrix, so that the corresponding angular-delay-domain structure is adequately leveraged to enhance the reconstruction accuracy. Thirdly, a two-stage loss function is proposed to improve the sum rate performance while avoiding trapping in local optima during model training. Simulation results verify that our proposed TypeII-CsiNet outperforms the TypeII codebook and existing deep learning benchmarks.

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.11891 [pdf, ps, other]

Unveiling and Manipulating Prompt Influence in Large Language Models

Authors: Zijian Feng, Hanzhang Zhou, Zixiao Zhu, Junlang Qian, Kezhi Mao

Abstract: Prompts play a crucial role in guiding the responses of Large Language Models (LLMs). However, the intricate role of individual tokens in prompts, known as input saliency, in shaping the responses remains largely underexplored. Existing saliency methods either misalign with LLM generation objectives or rely heavily on linearity assumptions, leading to potential inaccuracies. To address this, we propose Token Distribution Dynamics (TDD), a simple yet effective approach to unveil and manipulate the role of prompts in generating LLM outputs. TDD leverages the robust interpreting capabilities of the language model head (LM head) to assess input saliency. It projects input tokens into the embedding space and then estimates their significance based on distribution dynamics over the vocabulary. We introduce three TDD variants: forward, backward, and bidirectional, each offering unique insights into token relevance. Extensive experiments reveal that TDD surpasses state-of-the-art baselines by a large margin in elucidating the causal relationships between prompts and LLM outputs. Beyond mere interpretation, we apply TDD to two prompt manipulation tasks for controlled text generation: zero-shot toxic language suppression and sentiment steering. Empirical results underscore TDD's proficiency in identifying both toxic and sentimental cues in prompts, subsequently mitigating toxicity or modulating sentiment in the generated content.

Submitted 20 May, 2024; originally announced May 2024.

Comments: ICLR 2024

arXiv:2405.11672 [pdf]

Interpretable Machine Learning Enhances Disease Prognosis: Applications on COVID-19 and Onward

Authors: Jinzhi Shen, Ke Ma

Abstract: In response to the COVID-19 pandemic, the integration of interpretable machine learning techniques has garnered significant attention, offering transparent and understandable insights crucial for informed clinical decision making. This literature review delves into the applications of interpretable machine learning in predicting the prognosis of respiratory diseases, particularly focusing on COVID-19 and its implications for future research and clinical practice. We reviewed various machine learning models that are not only capable of incorporating existing clinical domain knowledge but also have the learning capability to explore new information from the data. These models and experiences not only aid in managing the current crisis but also hold promise for addressing future disease outbreaks. By harnessing interpretable machine learning, healthcare systems can enhance their preparedness and response capabilities, thereby improving patient outcomes and mitigating the impact of respiratory diseases in the years to come.

Submitted 20 May, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.10988 [pdf, other]

Flow Score Distillation for Diverse Text-to-3D Generation

Authors: Runjie Yan, Kailu Wu, Kaisheng Ma

Abstract: Recent advancements in Text-to-3D generation have yielded remarkable progress, particularly through methods that rely on Score Distillation Sampling (SDS). While SDS exhibits the capability to create impressive 3D assets, it is hindered by its inherent maximum-likelihood-seeking essence, resulting in limited diversity in generation outcomes. In this paper, we discover that the Denoising Diffusion Implicit Models (DDIM) generation process (i.e., PF-ODE) can be succinctly expressed using an analogue of SDS loss. One step further, one can see SDS as a generalized DDIM generation process. Following this insight, we show that the noise sampling strategy in the noise addition stage significantly restricts the diversity of generation results. To address this limitation, we present an innovative noise sampling approach and introduce a novel text-to-3D method called Flow Score Distillation (FSD). Our validation experiments across various text-to-image Diffusion Models demonstrate that FSD substantially enhances generation diversity without compromising quality.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.08487 [pdf, other]

Semantic Contextualization of Face Forgery: A New Definition, Dataset, and Detection Method

Authors: Mian Zou, Baosheng Yu, Yibing Zhan, Siwei Lyu, Kede Ma

Abstract: In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes to exceed human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.06600 [pdf, other]

Multi-Object Tracking in the Dark

Authors: Xinzhe Wang, Kang Ma, Qiankun Liu, Yunhao Zou, Ying Fu

Abstract: Low-light scenes are prevalent in real-world applications (e.g. autonomous driving and surveillance at night). Recently, multi-object tracking in various practical use cases has received much attention, but multi-object tracking in dark scenes is rarely considered. In this paper, we focus on multi-object tracking in dark scenes. To address the lack of datasets, we first build a Low-light Multi-Object Tracking (LMOT) dataset. LMOT provides well-aligned low-light video pairs captured by our dual-camera system, and high-quality multi-object tracking annotations for all videos. Then, we propose a low-light multi-object tracking method, termed LTrack. We introduce the adaptive low-pass downsample module to enhance low-frequency components of images outside the sensor noises. The degradation suppression learning strategy enables the model to learn invariant information under noise disturbance and image quality degradation. These components improve the robustness of multi-object tracking in dark scenes. We conducted a comprehensive analysis of our LMOT dataset and proposed LTrack. Experimental results demonstrate the superiority of the proposed method and its competitiveness in real night low-light scenes. Dataset and Code: https://github.com/ying-fu/LMOT

Submitted 10 May, 2024; originally announced May 2024.

Comments: Accepted by CVPR2024

arXiv:2405.03234 [pdf, other]

A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series

Authors: Ziquan Deng, Xiwei Xuan, Kwan-Liu Ma, Zhaodan Kong

Abstract: Time series anomaly detection is a critical machine learning task for numerous applications, such as finance, healthcare, and industrial systems. However, even high-performing models may exhibit potential issues such as biases, leading to unreliable outcomes and misplaced confidence. While model explanation techniques, particularly visual explanations, offer valuable insights to detect such issues by elucidating model attributions of their decision, many limitations still exist: they are primarily instance-based and not scalable across datasets, and they provide one-directional information from the model to the human side, lacking a mechanism for users to address detected issues. To fill these gaps, we introduce HILAD, a novel framework designed to foster a dynamic and bidirectional collaboration between humans and AI for enhancing anomaly detection models in time series. Through our visual interface, HILAD empowers domain experts to detect, interpret, and correct unexpected model behaviors at scale. Our evaluation with two time series datasets and user studies demonstrates the effectiveness of HILAD in fostering a deeper human understanding, immediate corrective actions, and the reliability enhancement of models.

Submitted 7 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: The manuscript is currently under review

arXiv:2405.02345 [pdf, other]

Exploring the Capabilities of Large Language Models for Generating Diverse Design Solutions

Authors: Kevin Ma, Daniele Grandi, Christopher McComb, Kosa Goucher-Lambert

Abstract: Access to large amounts of diverse design solutions can support designers during the early stage of the design process. In this paper, we explore the efficacy of large language models (LLM) in producing diverse design solutions, investigating the level of impact that parameter tuning and various prompt engineering techniques can have on the diversity of LLM-generated design solutions. Specifically, LLMs are used to generate a total of 4,000 design solutions across five distinct design topics, eight combinations of parameters, and eight different types of prompt engineering techniques, comparing each combination of parameter and prompt engineering method across four different diversity metrics. LLM-generated solutions are compared against 100 human-crowdsourced solutions in each design topic using the same set of diversity metrics. Results indicate that human-generated solutions consistently have greater diversity scores across all design topics. Using a post hoc logistic regression analysis we investigate whether these differences primarily exist at the semantic level. Results show that there is a divide in some design topics between humans and LLM-generated solutions, while others have no clear divide. Taken together, these results contribute to the understanding of LLMs' capabilities in generating a large volume of diverse design solutions and offer insights for future research that leverages LLMs to generate diverse design solutions for a broad range of design tasks (e.g., inspirational stimuli).

Submitted 2 May, 2024; originally announced May 2024.

Comments: preprint of journal paper

arXiv:2404.18402 [pdf, other]

Entanglement enhancement of two giant atoms with multiple connection points in bidirectional-chiral quantum waveguide-QED system

Authors: Jie Liu, Yue Cai, Kang-Jie Ma, Lei Tan, Wu-Ming Liu

Abstract: We study the entanglement generation of two giant atoms within a one-dimensional bidirectional-chiral waveguide quantum electrodynamics (QED) system, where the initial state of the two giant atoms is $|e_a,g_b\rangle$. Here, each giant atom is coupled to the waveguide through three connection points, with the configurations divided into five types based on the arrangement of coupling points between the giant atoms and the waveguide: separate, fully braided, partially braided, fully nested, and partially nested. We explore the entanglement generation process within each configuration in both nonchiral and chiral coupling cases. It is demonstrated that entanglement can be controlled as needed by either adjusting the phase shift or selecting different configurations. For nonchiral coupling, the entanglement of each configuration exhibits steady-state properties attributable to the presence of a dark state. In addition, we find that steady-state entanglement can be obtained at more phase shifts in certain configurations by increasing the number of coupling points between the giant atoms and the bidirectional waveguide. In the case of chiral coupling, the entanglement is maximally enhanced compared to the nonchiral case. Especially in the fully braided configuration, the concurrence reaches its peak value 1, which is robust to chirality. We further show the influence of atomic initial states on the evolution of interatomic entanglement. Our scheme can be used for entanglement generation in chiral quantum networks of giant-atom waveguide-QED systems, with potential applications in quantum networks and quantum communications.

Submitted 28 April, 2024; originally announced April 2024.

Comments: 10 pages, 8 figures

arXiv:2404.16068 [pdf, other]

SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Authors: Yifan Jiang, Filip Ilievski, Kaixin Ma

Abstract: While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models' lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support fine-tuning setting and present SemEval Task 9: BRAIN-TEASER(S), the first task at this competition designed to test the system's reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)'s two subtasks receive 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally. We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.13591 [pdf, other]

MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Authors: Yifan Jiang, Jiarui Zhang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara

Abstract: While multi-modal large language models (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only considered a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even count the panels in the puzzle (<45%), hindering their ability for abstract reasoning. We release our entire code and dataset.

Submitted 24 April, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.13556 [pdf, other]

ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval

Authors: Kelong Mao, Chenlong Deng, Haonan Chen, Fengran Mo, Zheng Liu, Tetsuya Sakai, Zhicheng Dou

Abstract: Conversational search requires accurate interpretation of user intent from complex multi-turn contexts. This paper presents ChatRetriever, which inherits the strong generalization capability of large language models to robustly represent complex conversational sessions for dense retrieval. To achieve this, we propose a simple and effective dual-learning approach that adapts LLM for retrieval via contrastive learning while enhancing the complex session understanding through masked instruction tuning on high-quality conversational instruction tuning data. Extensive experiments on five conversational search benchmarks demonstrate that ChatRetriever substantially outperforms existing conversational dense retrievers, achieving state-of-the-art performance on par with LLM-based rewriting approaches. Furthermore, ChatRetriever exhibits superior robustness in handling diverse conversational contexts. Our work highlights the potential of adapting LLMs for retrieval with complex inputs like conversational search sessions and proposes an effective approach to advance this research direction.

Submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.12347 [pdf, other]

AniClipart: Clipart Animation with Text-to-Video Priors

Authors: Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao

Abstract: Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

Submitted 18 April, 2024; originally announced April 2024.

Comments: Project Page: https://aniclipart.github.io/

arXiv:2404.08008 [pdf, other]

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

Authors: Kehua Feng, Keyan Ding, Kede Ma, Zhihua Wang, Qiang Zhang, Huajun Chen

Abstract: The past years have witnessed a proliferation of large language models (LLMs). Yet, automated and unbiased evaluation of LLMs is challenging due to the inaccuracy of standard metrics in reflecting human preferences and the inefficiency in sampling informative and diverse test examples. While human evaluation remains the gold standard, it is expensive and time-consuming, especially when dealing with a large number of testing samples. To address this problem, we propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition. MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs, whose responses are subject to three-alternative forced choice by human subjects. The pairwise comparison results are then aggregated into a global ranking using the Elo rating system. We select eight representative LLMs and compare them in terms of four skills: knowledge understanding, mathematical reasoning, writing, and coding. Experimental results show that the proposed method achieves a reliable and sensible ranking of LLMs' capabilities, identifies their relative strengths and weaknesses, and offers valuable insights for further LLM advancement.

Submitted 9 April, 2024; originally announced April 2024.

Comments: 32 pages, 6 figures

arXiv:2404.06419 [pdf, other]

Exploring Four Fermion Contact Couplings of a Dark Fermion and an Electron at Hadron Colliders and Direct Detection Experiments

Authors: Kai Ma

Abstract: Both collider searches and direct detection are promising approaches to probe fermionic dark matter. In this paper we study signatures of the four-fermion contact operators involving a dark fermion, an electron and a quark pair. We show that the mono-electron production channel at hadron colliders can provide strong constraints. Associated production of a charged electron with a photon/jet and missing energy is also studied. Using the current LHC data at $\sqrt{s} = 13$\,TeV, the lower bound on the energy scale of the (axial-)vector operator can reach $12$\,TeV for a massless dark fermion. It can be further improved to about $24$\,TeV at the HE-LHC with $\sqrt{s} = 25$\,TeV and a total luminosity of $20\,{\rm ab}^{-1}$. For direct detection, the signal operators can generate induced $\beta^\pm$ decays. For the induced $\beta^-$ decay, we show that the constraints are weaker than the ones from the collider searches in almost all of the parameter space, and the accessible parameter space is already excluded by the current LHC data. In the case of a relatively heavy dark fermion (a few MeV), the induced $\beta^+$ decay is more sensitive than the collider search. Besides its advantage that a much wider range of the dark fermion mass can be investigated, the collider search is also complementary to direct detection.

Submitted 23 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: 32 pages, 15 captioned figures and 2 tables; v2: invisibility of the dark fermion at hadron collider is discussed, and the title is improved

arXiv:2404.04619 [pdf, other]

Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

Authors: Zhonghan Zhao, Ke Ma, Wenhao Chai, Xuan Wang, Kewei Chen, Dongxu Guo, Yanting Zhang, Hongwei Wang, Gaoang Wang

Abstract: With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models (MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks more delicately. However, existing works: 1) operate independently by agents, each containing multiple LLMs, from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4 \times$ - $7.3 \times$ in performance.

Submitted 6 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2403.08282

arXiv:2404.03543 [pdf, other]

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Authors: Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4) outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

Submitted 6 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.01672 [pdf, other]

The Meta Distribution of the SIR in Joint Communication and Sensing Networks

Authors: Kun Ma, Chenyuan Feng, Giovanni Geraci, Howard H. Yang

Abstract: In this paper, we introduce a novel mathematical framework for assessing the performance of joint communication and sensing (JCAS) in wireless networks, employing stochastic geometry as an analytical tool. We focus on deriving the meta distribution of the signal-to-interference ratio (SIR) for JCAS networks. This approach enables a fine-grained quantification of individual user or radar performance intrinsic to these networks. Our work involves the modeling of JCAS networks and the derivation of mathematical expressions for the JCAS SIR meta distribution. Through simulations, we both validate our theoretical analysis and illustrate how the JCAS SIR meta distribution varies with the network deployment density.

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.00417 [pdf, other]

Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation

Authors: HongWei Yan, Liyuan Wang, Kaisheng Ma, Yi Zhong

Abstract: To accommodate real-world dynamics, artificial intelligence systems need to cope with sequentially arriving content in an online manner. Beyond regular Continual Learning (CL) attempting to address catastrophic forgetting with offline training of each task, Online Continual Learning (OCL) is a more challenging yet realistic setting that performs CL in a one-pass data stream. Current OCL methods primarily rely on memory replay of old training samples. However, a notable gap from CL to OCL stems from the additional overfitting-underfitting dilemma associated with the use of rehearsal buffers: the inadequate learning of new training samples (underfitting) and the repeated learning of a few old training samples (overfitting). To this end, we introduce a novel approach, Multi-level Online Sequential Experts (MOSE), which cultivates the model as stacked sub-experts, integrating multi-level supervision and reverse self-distillation. Supervision signals across multiple stages facilitate appropriate convergence of the new task while gathering various strengths from experts by knowledge distillation mitigates the performance decline of old tasks. MOSE demonstrates remarkable efficacy in learning new samples and preserving past knowledge through multi-level experts, thereby significantly advancing OCL performance over state-of-the-art baselines (e.g., up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).

Submitted 30 March, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.00252 [pdf, other]

Learned Scanpaths Aid Blind Panoramic Video Quality Assessment

Authors: Kanglong Fan, Wen Wen, Mu Li, Yifan Peng, Kede Ma

Abstract: Panoramic videos have the advantage of providing an immersive and interactive viewing experience. Nevertheless, their spherical nature gives rise to various and uncertain user viewing behaviors, which poses significant challenges for panoramic video quality assessment (PVQA). In this work, we propose an end-to-end optimized, blind PVQA method with explicit modeling of user viewing patterns through visual scanpaths. Our method consists of two modules: a scanpath generator and a quality assessor. The scanpath generator is initially trained to predict future scanpaths by minimizing their expected code length and then jointly optimized with the quality assessor for quality prediction. Our blind PVQA method enables direct quality assessment of panoramic images by treating them as videos composed of identical frames. Experiments on three public panoramic image and video quality datasets, encompassing both synthetic and authentic distortions, validate the superiority of our blind PVQA model over existing methods.

Submitted 15 May, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.19417 [pdf, other]

OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion

Authors: Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, Cewu Lu

Abstract: We present OAKINK2, a dataset of bimanual object manipulation tasks for complex daily activities. In pursuit of constructing the complex tasks into a structured representation, OAKINK2 introduces three levels of abstraction to organize the manipulation tasks: Affordance, Primitive Task, and Complex Task. OAKINK2 adopts an object-centric perspective for decoding the complex tasks, treating them as a sequence of object affordance fulfillment. The first level, Affordance, outlines the functionalities that objects in the scene can afford; the second level, Primitive Task, describes the minimal interaction units through which humans interact with the object to achieve its affordance; and the third level, Complex Task, illustrates how Primitive Tasks are composed and interdependent. The OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body, hands and various interacting objects. This extensive collection supports applications such as interaction reconstruction and motion synthesis. Based on the 3-level abstraction of OAKINK2, we explore a task-oriented framework for Complex Task Completion (CTC). CTC aims to generate a sequence of bimanual manipulation to achieve task objectives. Within the CTC framework, we employ Large Language Models (LLMs) to decompose the complex task objectives into sequences of Primitive Tasks and have developed a Motion Fulfillment Model that generates bimanual hand motion for each Primitive Task. OAKINK2 datasets and models are available at https://oakink.net/v2.

Submitted 28 March, 2024; originally announced March 2024.

Comments: To appear in CVPR 2024. 26 pages

arXiv:2403.13500 [pdf, other]

doi 10.1051/0004-6361/202348993

The Galactic latitude dependency of Faraday complexity in the S-PASS/ATCA RM catalogue

Authors: S. Ranchod, S. A. Mao, R. Deane, S. S. Sridhar, A. Damas-Segovia, J. D. Livingston, Y. K. Ma

Abstract: The S-band Polarisation All Sky Survey (SPASS/ATCA) rotation measure (RM) catalogue is the largest broadband RM catalogue to date, increasing the RM density in the sparse southern sky. Through analysis of this catalogue, we report a latitude dependency of the Faraday complexity of polarised sources in this catalogue within 10$^\circ$ of the Galactic plane towards the inner Galaxy. In this study, we aim to investigate this trend with follow-up observations using the Australia Telescope Compact Array (ATCA). We observe 95 polarised sources from the SPASS/ATCA RM catalogue at 1.1 - 3.1 GHz with ATCA's 6 km configuration. We present Stokes QU fitting results and a comparative analysis with the SPASS/ATCA catalogue. We find an overall decrease in complexity in these sources with the higher angular resolution observations, with a complexity fraction of 42\%, establishing that the majority of the complexity in the SPASS/ATCA sample is due to the mixing-in of diffuse Galactic emission at scales $\theta > 2.8'$. Furthermore, we find a correlation between our observed small-scale complexity $\theta < 2.8'$ and the Galactic spiral arms, which we interpret to be due to Galactic turbulence or small-scale polarised emission. These results emphasise the importance of considering the maximum angular scale to which the observations are sensitive in the classification of Faraday complexity; the effect of which can be more carefully investigated with SKA-precursor and pathfinder arrays (e.g. MeerKAT and ASKAP).

Submitted 20 March, 2024; originally announced March 2024.

Comments: 16 pages, 16 figures

Journal ref: A&A 686, A104 (2024)

arXiv:2403.12504 [pdf, other]

TON-VIO: Online Time Offset Modeling Networks for Robust Temporal Alignment in High Dynamic Motion VIO

Authors: Chaoran Xiong, Guoqing Liu, Qi Wu, Songpengcheng Xia, Tong Hua, Kehui Ma, Zhen Sun, Yan Xiang, Ling Pei

Abstract: Temporal misalignment (time offset) between sensors is common in low-cost visual-inertial odometry (VIO) systems. Such temporal misalignment introduces inconsistent constraints for state estimation, leading to a significant positioning drift, especially in high dynamic motion scenarios. In this article, we focus on online temporal calibration to reduce the positioning drift caused by the time offset for high dynamic motion VIO. For the time offset observation model, most existing methods rely on accurate state estimation or stable visual tracking. For the prediction model, current methods oversimplify the time offset as a constant value with white Gaussian noise. However, these ideal conditions are seldom satisfied in real high dynamic scenarios, resulting in poor performance. In this paper, we introduce online time offset modeling networks (TON) to enhance real-time temporal calibration. TON improves the accuracy of time offset observation and prediction modeling. Specifically, for observation modeling, we propose feature velocity observation networks to enhance velocity computation for features in unstable visual tracking conditions. For prediction modeling, we present time offset prediction networks to learn its evolution pattern. To highlight the effectiveness of our method, we integrate the proposed TON into both optimization-based and filter-based VIO systems. Simulation and real-world experiments are conducted to demonstrate the enhanced performance of our approach. Additionally, to contribute to the VIO community, we will open-source the code of our method at https://github.com/Franky-X/FVON-TPN.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12369 [pdf, other]

Block-Dominant Compressed Sensing for Near-Field Communications: Fundamentals, Solutions and Future Directions

Authors: Liyang Lu, Ke Ma, Zhaocheng Wang

Abstract: Near-field (NF) communications draw much attention in the context of extremely large-scale antenna arrays (ELAA). Owing to a large number of antennas and high carrier frequency, the NF coverage distance is quite substantial, where the electromagnetic radiation propagates by spherical waves, in contrast to the conventional planar waves of the far-field. Motivated by these facts, the block-dominant compressed sensing (BD-CS) assisted NF communications are proposed. Specifically, we elucidate why block sparsity exists in the distance-limited NF region. Then, block-dominant side-information (BD-SI) is introduced in support of the actual NF communication implementation. We validate that BD-CS is capable of providing exceptional channel estimation accuracy and high spectral efficiency, where the associated challenges, opportunities and its actual implementation in NF communications need to be carefully addressed.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Submitted to IEEE for possible publication

arXiv:2403.12105 [pdf, ps, other]

doi 10.1515/phys-2015-0043

2-D isotropic negative refractive index in a N-type four-level atomic system

Authors: Shun-Cai Zhao, Qi-Xuan Wu, Kun Ma

Abstract: 2-D (two-dimensional) isotropic negative refractive index (NRI) is explicitly realized via the orthogonal signal and coupling standing-wave fields coupling the N-type four-level atomic system. Under some key parameters of the dense vapor media, the atomic system exhibits isotropic NRI with simultaneous negative permittivity and permeability (i.e., left-handedness) in the 2-D x-y plane. Compared with other 2-D NRI schemes, the coherent atomic vapor media in our scheme may be an ideal 2-D isotropic NRI candidate and has some potential advantages, significance or applications in further investigation.

Submitted 17 March, 2024; originally announced March 2024.

Comments: 6 pages, 4 figures

Journal ref: Open Phys. 2015; 13:349-354

arXiv:2403.11335 [pdf, other]

ConvSDG: Session Data Generation for Conversational Search

Authors: Fengran Mo, Bole Yi, Kelong Mao, Chen Qu, Kaiyu Huang, Jian-Yun Nie

Abstract: Conversational search provides a more convenient interface for users to search by allowing multi-turn interaction with the search engine. However, the effectiveness of the conversational dense retrieval methods is limited by the scarcity of training data required for their fine-tuning. Thus, generating more training conversational sessions with relevant labels could potentially improve search performance. Based on the promising capabilities of large language models (LLMs) on text generation, we propose ConvSDG, a simple yet effective framework to explore the feasibility of boosting conversational search by using LLM for session data generation. Within this framework, we design dialogue/session-level and query-level data generation with unsupervised and semi-supervised learning, according to the availability of relevance judgments. The generated data are used to fine-tune the conversational dense retriever. Extensive experiments on four widely used datasets demonstrate the effectiveness and broad applicability of our ConvSDG framework compared with several strong baselines.

Submitted 17 March, 2024; originally announced March 2024.

Comments: Accepted by WWW 2024 Workshop

arXiv:2403.10854 [pdf, other]

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Authors: Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, Lei Zhang

Abstract: While Multimodal Large Language Models (MLLMs) have experienced significant advancement on visual understanding and reasoning, their potentials to serve as powerful, flexible, interpretable, and text-driven models for Image Quality Assessment (IQA) remains largely unexplored. In this paper, we conduct a comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we first investigate nine prompting systems for MLLMs as the combinations of three standardized testing procedures in psychophysics (i.e., the single-stimulus, double-stimulus, and multiple-stimulus methods) and three popular prompting strategies in natural language processing (i.e., the standard, in-context, and chain-of-thought prompting). We then present a difficult sample selection procedure, taking into account sample diversity and uncertainty, to further challenge MLLMs equipped with the respective optimal prompting systems. We assess three open-source and one close-source MLLMs on several visual attributes of image quality (e.g., structural and textural distortions, color differences, and geometric transformations) in both full-reference and no-reference scenarios. Experimental results show that only the close-source GPT-4V provides a reasonable account for human perception of image quality, but is weak at discriminating fine-grained quality variations (e.g., color differences) and at comparing visual quality of multiple images, tasks humans can perform effortlessly.

Submitted 16 March, 2024; originally announced March 2024.

Showing 1–50 of 610 results for author: Ma, K