-
SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection
Authors:
Dingkang Liang,
Wei Hua,
Chunsheng Shi,
Zhikang Zou,
Xiaoqing Ye,
Xiang Bai
Abstract:
Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Ther…
▽ More
Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images are usually arbitrary orientations, small scales, and aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.92, +2.39, and +2.57 mAP under 10%, 20%, and 30% labeled data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The code will be made available.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
Authors:
Haiquan Zhao,
Lingyu Li,
Shisong Chen,
Shuqi Kong,
Jiaan Wang,
Kexin Huang,
Tianle Gu,
Yixu Wang,
Dandan Liang,
Zhixu Li,
Yan Teng,
Yanghua Xiao,
Yingchun Wang
Abstract:
Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of ro…
▽ More
Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.
△ Less
Submitted 24 June, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting
Authors:
Daojun Liang,
Haixia Zhang,
Dongfeng Yuan
Abstract:
Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes…
▽ More
Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes for sample generation. We propose a novel approach called DistPred for regression and forecasting tasks, which overcomes the limitations of existing methods while remaining simple and powerful. Specifically, we transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form and use it as a loss function to train the model end-to-end. This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable. We have compared our method with several existing approaches on multiple datasets and achieved state-of-the-art performance. Additionally, our method significantly improves computational efficiency. For example, compared to state-of-the-art models, DistPred has a 90x faster inference speed. Experimental results can be reproduced through https://github.com/Anoise/DistPred.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Authors:
Tianle Gu,
Zeyang Zhou,
Kexin Huang,
Dandan Liang,
Yixu Wang,
Haiquan Zhao,
Yuanqi Yao,
Xingge Qiao,
Keqing Wang,
Yujiu Yang,
Yan Teng,
Yu Qiao,
Yingchun Wang
Abstract:
Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often la…
▽ More
Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.
△ Less
Submitted 13 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
Authors:
Xingkui Zhu,
Yiran Guan,
Dingkang Liang,
Yuchao Chen,
Yuliang Liu,
Xiang Bai
Abstract:
The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for M…
▽ More
The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Intrinsic Dynamics-Driven Generalizable Scene Representations for Vision-Oriented Decision-Making Applications
Authors:
Dayang Liang,
**yang Lai,
Yunlong Liu
Abstract:
How to improve the ability of scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually learn task-relevant state representations within visual reinforcement learning to address this problem. While prior work typically introduces one-step behavioral similarity metrics with elements (e.g., rewards and actions) to extract task-relevant state…
▽ More
How to improve the ability of scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually learn task-relevant state representations within visual reinforcement learning to address this problem. While prior work typically introduces one-step behavioral similarity metrics with elements (e.g., rewards and actions) to extract task-relevant state information from observations, they often ignore the inherent dynamics relationships among the elements that are essential for learning accurate representations, which further impedes the discrimination of short-term similar task/behavior information in long-term dynamics transitions. To alleviate this problem, we propose an intrinsic dynamics-driven representation learning method with sequence models in visual reinforcement learning, namely DSR. Concretely, DSR optimizes the parameterized encoder by the state-transition dynamics of the underlying system, which prompts the latent encoding information to satisfy the state-transition process and then the state space and the noise space can be distinguished. In the implementation and to further improve the representation ability of DSR on encoding similar tasks, sequential elements' frequency domain and multi-step prediction are adopted for sequentially modeling the inherent dynamics. Finally, experimental results show that DSR has achieved significant performance improvements in the visual Distracting DMControl control tasks, especially with an average of 78.9\% over the backbone baseline. Further results indicate that it also achieves the best performances in real-world autonomous driving applications on the CARLA simulator. Moreover, qualitative analysis results validate that our method possesses the superior ability to learn generalizable scene representations on visual tasks. The source code is available at https://github.com/DMU-XMU/DSR.
△ Less
Submitted 30 June, 2024; v1 submitted 30 May, 2024;
originally announced May 2024.
-
Enhancing Interaction Modeling with Agent Selection and Physical Methods for Trajectory Prediction
Authors:
Shiji Huang,
Lei Ye,
Min Chen,
Wenhai Luo,
Chenqi Xu,
Deyuan Liang,
Dihong Wang
Abstract:
In this study, we address the limitations inherent in most existing vehicle trajectory prediction methodologies that indiscriminately incorporate all agents within a predetermined proximity when accounting for inter-agent interactions. These approaches commonly employ attention-based architecture or graph neural networks for encoding interactions, which introduces three challenges: (i) The indiscr…
▽ More
In this study, we address the limitations inherent in most existing vehicle trajectory prediction methodologies that indiscriminately incorporate all agents within a predetermined proximity when accounting for inter-agent interactions. These approaches commonly employ attention-based architecture or graph neural networks for encoding interactions, which introduces three challenges: (i) The indiscriminate selection of all nearby agents substantially escalates the computational demands of the model, particularly in those interaction-rich scenarios. (ii) Moreover, the simplistic feature extraction of current time agents falls short of adequately capturing the nuanced dynamics of interactions. (iii) Compounded by the inherently low interpretability of attention mechanism and graph neural networks, there is a propensity for the model to allocate unreliable correlation coefficients to certain agents, adversely impacting the accuracy of trajectory predictions. To mitigate these issues, we introduce ASPILin, a novel approach that enhances the selection of interacting agents by considering their current and future lanes, extending this consideration across all historical frames. Utilizing the states of the agents, we estimate the nearest future distance between agents and the time needed to reach this distance. Then, combine these with their current distances to derive a physical correlation coefficient to encode interactions. Experiments conducted on popular trajectory prediction datasets demonstrate that our method is efficient and straightforward, outperforming other state-of-the-art methods.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference
Authors:
Yonghao Liu,
Mengyu Li,
Di Liang,
Ximing Li,
Fausto Giunchiglia,
Lan Huang,
Xiaoyue Feng,
Renchu Guan
Abstract:
Natural Language Inference (NLI) is a crucial task in natural language processing that involves determining the relationship between two sentences, typically referred to as the premise and the hypothesis. However, traditional NLI models solely rely on the semantic information inherent in independent sentences and lack relevant situational visual information, which can hinder a complete understandi…
▽ More
Natural Language Inference (NLI) is a crucial task in natural language processing that involves determining the relationship between two sentences, typically referred to as the premise and the hypothesis. However, traditional NLI models solely rely on the semantic information inherent in independent sentences and lack relevant situational visual information, which can hinder a complete understanding of the intended meaning of the sentences due to the ambiguity and vagueness of language. To address this challenge, we propose an innovative ScenaFuse adapter that simultaneously integrates large-scale pre-trained linguistic knowledge and relevant visual information for NLI tasks. Specifically, we first design an image-sentence interaction module to incorporate visuals into the attention mechanism of the pre-trained model, allowing the two modalities to interact comprehensively. Furthermore, we introduce an image-sentence fusion module that can adaptively integrate visual information from images and semantic information from sentences. By incorporating relevant visual information and leveraging linguistic knowledge, our approach bridges the gap between language and vision, leading to improved understanding and inference capabilities in NLI tasks. Extensive benchmark experiments demonstrate that our proposed ScenaFuse, a scenario-guided approach, consistently boosts NLI performance.
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation
Authors:
Zhankui He,
Zhouhang Xie,
Harald Steck,
Dawen Liang,
Rahul Jha,
Nathan Kallus,
Julian McAuley
Abstract:
Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item p…
▽ More
Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item popularity, on targeted conversational recommendation platforms. In conversational recommendation, LLMs recommend items by generating the titles (as multiple tokens) autoregressively, making it difficult to obtain and control the recommendations over all items. Thus, we propose a Reindex-Then-Adapt (RTA) framework, which converts multi-token item titles into single tokens within LLMs, and then adjusts the probability distributions over these single-token item titles accordingly. The RTA framework marries the benefits of both LLMs and traditional recommender systems (RecSys): understanding complex queries as LLMs do; while efficiently controlling the recommended item distributions in conversational recommendations as traditional RecSys do. Our framework demonstrates improved accuracy metrics across three different conversational recommendation datasets and two adaptation settings
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
DP-MDM: Detail-Preserving MR Reconstruction via Multiple Diffusion Models
Authors:
Mengxiao Geng,
Jiahao Zhu,
Xiaolin Zhu,
Qiqing Liu,
Dong Liang,
Qiegen Liu
Abstract:
Detail features of magnetic resonance images play a cru-cial role in accurate medical diagnosis and treatment, as they capture subtle changes that pose challenges for doc-tors when performing precise judgments. However, the widely utilized naive diffusion model has limitations, as it fails to accurately capture more intricate details. To en-hance the quality of MRI reconstruction, we propose a com…
▽ More
Detail features of magnetic resonance images play a cru-cial role in accurate medical diagnosis and treatment, as they capture subtle changes that pose challenges for doc-tors when performing precise judgments. However, the widely utilized naive diffusion model has limitations, as it fails to accurately capture more intricate details. To en-hance the quality of MRI reconstruction, we propose a comprehensive detail-preserving reconstruction method using multiple diffusion models to extract structure and detail features in k-space domain instead of image do-main. Moreover, virtual binary modal masks are utilized to refine the range of values in k-space data through highly adaptive center windows, which allows the model to focus its attention more efficiently. Last but not least, an inverted pyramid structure is employed, where the top-down image information gradually decreases, ena-bling a cascade representation. The framework effective-ly represents multi-scale sampled data, taking into ac-count the sparsity of the inverted pyramid architecture, and utilizes cascade training data distribution to repre-sent multi-scale data. Through a step-by-step refinement approach, the method refines the approximation of de-tails. Finally, the proposed method was evaluated by con-ducting experiments on clinical and public datasets. The results demonstrate that the proposed method outper-forms other methods.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues
Authors:
Xianhua He,
Dashuang Liang,
Song Yang,
Zhanlong Hao,
Hui Ma,
Binjie Mao,
Xi Li,
Yao Wang,
Pengfei Yan,
Ajian Liu
Abstract:
Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to dev…
▽ More
Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge.
△ Less
Submitted 12 April, 2024;
originally announced April 2024.
-
Pathology-genomic fusion via biologically informed cross-modality graph learning for survival analysis
Authors:
Zeyu Zhang,
Yuanshen Zhao,
**gxian Duan,
Yaou Liu,
Hairong Zheng,
Dong Liang,
Zhenyu Zhang,
Zhi-Cheng Li
Abstract:
The diagnosis and prognosis of cancer are typically based on multi-modal clinical data, including histology images and genomic data, due to the complex pathogenesis and high heterogeneity. Despite the advancements in digital pathology and high-throughput genome sequencing, establishing effective multi-modal fusion models for survival prediction and revealing the potential association between histo…
▽ More
The diagnosis and prognosis of cancer are typically based on multi-modal clinical data, including histology images and genomic data, due to the complex pathogenesis and high heterogeneity. Despite the advancements in digital pathology and high-throughput genome sequencing, establishing effective multi-modal fusion models for survival prediction and revealing the potential association between histopathology and transcriptomics remains challenging. In this paper, we propose Pathology-Genome Heterogeneous Graph (PGHG) that integrates whole slide images (WSI) and bulk RNA-Seq expression data with heterogeneous graph neural network for cancer survival analysis. The PGHG consists of biological knowledge-guided representation learning network and pathology-genome heterogeneous graph. The representation learning network utilizes the biological prior knowledge of intra-modal and inter-modal data associations to guide the feature extraction. The node features of each modality are updated through attention-based graph learning strategy. Unimodal features and bi-modal fused features are extracted via attention pooling module and then used for survival prediction. We evaluate the model on low-grade gliomas, glioblastoma, and kidney renal papillary cell carcinoma datasets from the Cancer Genome Atlas (TCGA) and the First Affiliated Hospital of Zhengzhou University (FAHZU). Extensive experimental results demonstrate that the proposed method outperforms both unimodal and other multi-modal fusion models. For demonstrating the model interpretability, we also visualize the attention heatmap of pathological images and utilize integrated gradient algorithm to identify important tissue structure, biological pathways and key genes.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
PIE: Physics-inspired Low-light Enhancement
Authors:
Dong Liang,
Zhengyan Xu,
Ling Li,
Mingqiang Wei,
Songcan Chen
Abstract:
In this paper, we propose a physics-inspired contrastive learning paradigm for low-light enhancement, called PIE. PIE primarily addresses three issues: (i) To resolve the problem of existing learning-based methods often training a LLE model with strict pixel-correspondence image pairs, we eliminate the need for pixel-correspondence paired training data and instead train with unpaired images. (ii)…
▽ More
In this paper, we propose a physics-inspired contrastive learning paradigm for low-light enhancement, called PIE. PIE primarily addresses three issues: (i) To resolve the problem of existing learning-based methods often training a LLE model with strict pixel-correspondence image pairs, we eliminate the need for pixel-correspondence paired training data and instead train with unpaired images. (ii) To address the disregard for negative samples and the inadequacy of their generation in existing methods, we incorporate physics-inspired contrastive learning for LLE and design the Bag of Curves (BoC) method to generate more reasonable negative samples that closely adhere to the underlying physical imaging principle. (iii) To overcome the reliance on semantic ground truths in existing methods, we propose an unsupervised regional segmentation module, ensuring regional brightness consistency while eliminating the dependency on semantic ground truths. Overall, the proposed PIE can effectively learn from unpaired positive/negative samples and smoothly realize non-semantic regional enhancement, which is clearly different from existing LLE efforts. Besides the novel architecture of PIE, we explore the gain of PIE on downstream tasks such as semantic segmentation and face detection. Training on readily available open data and extensive experiments demonstrate that our method surpasses the state-of-the-art LLE models over six independent cross-scenes datasets. PIE runs fast with reasonable GFLOPs in test time, making it easy to use on mobile devices.
△ Less
Submitted 6 April, 2024;
originally announced April 2024.
-
Agnostic Tomography of Stabilizer Product States
Authors:
Sabee Grewal,
Vishnu Iyer,
William Kretschmer,
Daniel Liang
Abstract:
We define a quantum learning task called agnostic tomography, where given copies of an arbitrary state $ρ$ and a class of quantum states $\mathcal{C}$, the goal is to output a succinct description of a state that approximates $ρ$ at least as well as any state in $\mathcal{C}$ (up to some small error $\varepsilon$). This task generalizes ordinary quantum tomography of states in $\mathcal{C}$ and is…
▽ More
We define a quantum learning task called agnostic tomography, where given copies of an arbitrary state $ρ$ and a class of quantum states $\mathcal{C}$, the goal is to output a succinct description of a state that approximates $ρ$ at least as well as any state in $\mathcal{C}$ (up to some small error $\varepsilon$). This task generalizes ordinary quantum tomography of states in $\mathcal{C}$ and is more challenging because the learning algorithm must be robust to perturbations of $ρ$.
We give an efficient agnostic tomography algorithm for the class $\mathcal{C}$ of $n$-qubit stabilizer product states. Assuming $ρ$ has fidelity at least $τ$ with a stabilizer product state, the algorithm runs in time $n^{O(1 + \log(1/τ))} / \varepsilon^2$. This runtime is quasipolynomial in all parameters, and polynomial if $τ$ is a constant.
△ Less
Submitted 4 April, 2024;
originally announced April 2024.
-
Multi-Level Label Correction by Distilling Proximate Patterns for Semi-supervised Semantic Segmentation
Authors:
Hui Xiao,
Yuting Hong,
Li Dong,
Diqun Yan,
Jiayan Zhuang,
Junjie Xiong,
Dongtai Liang,
Chengbin Peng
Abstract:
Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (…
▽ More
Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (MLLC), which aims to use graph neural networks to capture structural relationships in Semantic-Level Graphs (SLGs) and Class-Level Graphs (CLGs) to rectify erroneous pseudo-labels. Specifically, SLGs represent semantic affinities between pairs of pixel features, and CLGs describe classification consistencies between pairs of pixel labels. With the support of proximate pattern information from graphs, MLLC can rectify incorrectly predicted pseudo-labels and can facilitate discriminative feature representations. We design an end-to-end network to train and perform this effective label corrections mechanism. Experiments demonstrate that MLLC can significantly improve supervised baselines and outperforms state-of-the-art approaches in different scenarios on Cityscapes and PASCAL VOC 2012 datasets. Specifically, MLLC improves the supervised baseline by at least 5% and 2% with DeepLabV2 and DeepLabV3+ respectively under different partition protocols.
△ Less
Submitted 9 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Pseudoentanglement Ain't Cheap
Authors:
Sabee Grewal,
Vishnu Iyer,
William Kretschmer,
Daniel Liang
Abstract:
We show that any pseudoentangled state ensemble with a gap of $t$ bits of entropy requires $Ω(t)$ non-Clifford gates to prepare. This bound is tight up to polylogarithmic factors if linear-time quantum-secure pseudorandom functions exist. Our result follows from a polynomial-time algorithm to estimate the entanglement entropy of a quantum state across any cut of qubits. When run on an $n$-qubit st…
▽ More
We show that any pseudoentangled state ensemble with a gap of $t$ bits of entropy requires $Ω(t)$ non-Clifford gates to prepare. This bound is tight up to polylogarithmic factors if linear-time quantum-secure pseudorandom functions exist. Our result follows from a polynomial-time algorithm to estimate the entanglement entropy of a quantum state across any cut of qubits. When run on an $n$-qubit state that is stabilized by at least $2^{n-t}$ Pauli operators, our algorithm produces an estimate that is within an additive factor of $\frac{t}{2}$ bits of the true entanglement entropy.
△ Less
Submitted 11 April, 2024; v1 submitted 29 March, 2024;
originally announced April 2024.
-
Transfer CLIP for Generalizable Image Denoising
Authors:
Jun Cheng,
Dong Liang,
Shan Tan
Abstract:
Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-w…
▽ More
Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet, the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. The progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method.
△ Less
Submitted 22 March, 2024;
originally announced March 2024.
-
A Picture Is Worth a Graph: Blueprint Debate on Graph for Multimodal Reasoning
Authors:
Changmeng Zheng,
Dayong Liang,
Wengyu Zhang,
Xiao-Yong Wei,
Tat-Seng Chua,
Qing Li
Abstract:
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the iss…
▽ More
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate BDoG, achieving state-of-the-art results in Science QA and MMBench with significant improvements over previous methods.
△ Less
Submitted 22 March, 2024;
originally announced March 2024.
-
Anomaly Detection by Adapting a pre-trained Vision Language Model
Authors:
Yuxuan Cai,
Xinwei He,
Dingkang Liang,
Ao Tong,
Xiang Bai
Abstract:
Recently, large vision and language models have shown their success when adapting them to many downstream tasks. In this paper, we present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model. To this end, we make two important improvements: 1) To acquire unified anomaly detection across industrial images of multiple categories, we introduce the learnable p…
▽ More
Recently, large vision and language models have shown their success when adapting them to many downstream tasks. In this paper, we present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model. To this end, we make two important improvements: 1) To acquire unified anomaly detection across industrial images of multiple categories, we introduce the learnable prompt and propose to associate it with abnormal patterns through self-supervised learning. 2) To fully exploit the representation power of CLIP, we introduce an anomaly region refinement strategy to refine the localization quality. During testing, the anomalies are localized by directly calculating the similarity between the representation of the learnable prompt and the image. Comprehensive experiments demonstrate the superiority of our framework, e.g., we achieve the state-of-the-art 97.5/55.6 and 89.3/33.1 on MVTec-AD and VisA for anomaly detection and localization. In addition, the proposed method also achieves encouraging performance with marginal training data, which is more challenging.
△ Less
Submitted 14 March, 2024;
originally announced March 2024.
-
Risk-Sensitive RL with Optimized Certainty Equivalents via Reduction to Standard RL
Authors:
Kaiwen Wang,
Dawen Liang,
Nathan Kallus,
Wen Sun
Abstract:
We study Risk-Sensitive Reinforcement Learning (RSRL) with the Optimized Certainty Equivalent (OCE) risk, which generalizes Conditional Value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. Using an augmented Markov Decision Process (MDP), we propose two general meta-algorithms via reductions to standard RL: one based on optimistic algorithms and another based on policy optimization.…
▽ More
We study Risk-Sensitive Reinforcement Learning (RSRL) with the Optimized Certainty Equivalent (OCE) risk, which generalizes Conditional Value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. Using an augmented Markov Decision Process (MDP), we propose two general meta-algorithms via reductions to standard RL: one based on optimistic algorithms and another based on policy optimization. Our optimistic meta-algorithm generalizes almost all prior RSRL theory with entropic risk or CVaR. Under discrete rewards, our optimistic theory also certifies the first RSRL regret bounds for MDPs with bounded coverability, e.g., exogenous block MDPs. Under discrete rewards, our policy optimization meta-algorithm enjoys both global convergence and local improvement guarantees in a novel metric that lower bounds the true OCE risk. Finally, we instantiate our framework with PPO, construct an MDP, and show that it learns the optimal risk-sensitive policy while prior algorithms provably fail.
△ Less
Submitted 10 March, 2024;
originally announced March 2024.
-
Switching the Loss Reduces the Cost in Batch Reinforcement Learning
Authors:
Alex Ayoub,
Kaiwen Wang,
Vincent Liu,
Samuel Robertson,
James McInerney,
Dawen Liang,
Nathan Kallus,
Csaba Szepesvári
Abstract:
We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving…
▽ More
We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving $\textit{small-cost}$ bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-LOG uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
△ Less
Submitted 12 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
TripoSR: Fast 3D Object Reconstruction from a Single Image
Authors:
Dmitry Tochilkin,
David Pankratz,
Zexiang Liu,
Zixuan Huang,
Adam Letts,
Yangguang Li,
Ding Liang,
Christian Laforte,
Varun Jampani,
Yan-Pei Cao
Abstract:
This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exh…
▽ More
This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exhibits superior performance, both quantitatively and qualitatively, compared to other open-source alternatives. Released under the MIT license, TripoSR is intended to empower researchers, developers, and creatives with the latest advancements in 3D generative AI.
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis
Authors:
Xin Zhou,
Dingkang Liang,
Wei Xu,
Xingkui Zhu,
Yihan Xu,
Zhikang Zou,
Xiang Bai
Abstract:
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However, existing methods for model adaptation usually update all model parameters, i.e., full fine-tuning paradigm, which is inefficient as it relies on high computational costs (e.g., training GPU memory) and massive storage space. In this paper, we aim to study parameter-efficient transfer…
▽ More
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However, existing methods for model adaptation usually update all model parameters, i.e., full fine-tuning paradigm, which is inefficient as it relies on high computational costs (e.g., training GPU memory) and massive storage space. In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency. To achieve this goal, we freeze the parameters of the default pre-trained models and then propose the Dynamic Adapter, which generates a dynamic scale for each token, considering the token significance to the downstream task. We further seamlessly integrate Dynamic Adapter with Prompt Tuning (DAPT) by constructing Internal Prompts, capturing the instance-specific features for interaction. Extensive experiments conducted on five challenging datasets demonstrate that the proposed DAPT achieves superior performance compared to the full fine-tuning counterparts while significantly reducing the trainable parameters and training GPU memory by 95% and 35%, respectively. Code is available at https://github.com/LMD0311/DAPT.
△ Less
Submitted 5 April, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
AVS-Net: Point Sampling with Adaptive Voxel Size for 3D Scene Understanding
Authors:
Hongcheng Yang,
Dingkang Liang,
Dingyuan Zhang,
Zhe Liu,
Zhikang Zou,
Xingyu Jiang,
Yingying Zhu
Abstract:
The recent advancements in point cloud learning have enabled intelligent vehicles and robots to comprehend 3D environments better. However, processing large-scale 3D scenes remains a challenging problem, such that efficient downsampling methods play a crucial role in point cloud learning. Existing downsampling methods either require a huge computational burden or sacrifice fine-grained geometric i…
▽ More
The recent advancements in point cloud learning have enabled intelligent vehicles and robots to comprehend 3D environments better. However, processing large-scale 3D scenes remains a challenging problem, such that efficient downsampling methods play a crucial role in point cloud learning. Existing downsampling methods either require a huge computational burden or sacrifice fine-grained geometric information. For such purpose, this paper presents an advanced sampler that achieves both high accuracy and efficiency. The proposed method utilizes voxel centroid sampling as a foundation but effectively addresses the challenges regarding voxel size determination and the preservation of critical geometric cues. Specifically, we propose a Voxel Adaptation Module that adaptively adjusts voxel sizes with the reference of point-based downsampling ratio. This ensures that the sampling results exhibit a favorable distribution for comprehending various 3D objects or scenes. Meanwhile, we introduce a network compatible with arbitrary voxel sizes for sampling and feature extraction while maintaining high efficiency. The proposed approach is demonstrated with 3D object detection and 3D semantic segmentation. Compared to existing state-of-the-art methods, our approach achieves better accuracy on outdoor and indoor large-scale datasets, e.g. Waymo and ScanNet, with promising efficiency.
△ Less
Submitted 15 April, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
Brain-inspired Distributed Memorization Learning for Efficient Feature-free Unsupervised Domain Adaptation
Authors:
Jianming Lv,
Depin Liang,
Zequan Liang,
Yaobin Zhang,
Sijun Xia
Abstract:
Compared with gradient based artificial neural networks, biological neural networks usually show a more powerful generalization ability to quickly adapt to unknown environments without using any gradient back-propagation procedure. Inspired by the distributed memory mechanism of human brains, we propose a novel gradient-free Distributed Memorization Learning mechanism, namely DML, to support quick…
▽ More
Compared with gradient based artificial neural networks, biological neural networks usually show a more powerful generalization ability to quickly adapt to unknown environments without using any gradient back-propagation procedure. Inspired by the distributed memory mechanism of human brains, we propose a novel gradient-free Distributed Memorization Learning mechanism, namely DML, to support quick domain adaptation of transferred models. In particular, DML adopts randomly connected neurons to memorize the association of input signals, which are propagated as impulses, and makes the final decision by associating the distributed memories based on their confidence. More importantly, DML is able to perform reinforced memorization based on unlabeled data to quickly adapt to a new domain without heavy fine-tuning of deep features, which makes it very suitable for deploying on edge devices. Experiments based on four cross-domain real-world datasets show that DML can achieve superior performance of real-time domain adaptation compared with traditional gradient based MLP with more than 10% improvement of accuracy while reducing 87% of the timing cost of optimization.
△ Less
Submitted 4 February, 2024;
originally announced February 2024.
-
Question Calibration and Multi-Hop Modeling for Temporal Question Answering
Authors:
Chao Xue,
Di Liang,
Pengfei Wang,
**g Zhang
Abstract:
Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PL…
▽ More
Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PLMs) to obtain question representations, while PLMs tend to focus on entity information and ignore entity transfer caused by temporal constraints, and finally fail to learn specific temporal representations of entities. (II) They neither emphasize the graph structure between entities nor explicitly model the multi-hop relationship in the graph, which will make it difficult to solve complex multi-hop question answering. To alleviate this problem, we propose a novel Question Calibration and Multi-Hop Modeling (QC-MHM) approach. Specifically, We first calibrate the question representation by fusing the question and the time-constrained concepts in KG. Then, we construct the GNN layer to complete multi-hop message passing. Finally, the question representation is combined with the embedding output by the GNN to generate the final prediction. Empirical results verify that the proposed model achieves better performance than the state-of-the-art models in the benchmark dataset. Notably, the Hits@1 and Hits@10 results of QC-MHM on the CronQuestions dataset's complex questions are absolutely improved by 5.1% and 1.2% compared to the best-performing baseline. Moreover, QC-MHM can generate interpretable and trustworthy predictions.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
PointMamba: A Simple State Space Model for Point Cloud Analysis
Authors:
Dingkang Liang,
Xin Zhou,
Wei Xu,
Xingkui Zhu,
Zhikang Zou,
Xiaoqing Ye,
Xiao Tan,
Xiang Bai
Abstract:
Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity, making the design of a linear complexity method with global modeling appealing. In this paper, we propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM)…
▽ More
Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity, making the design of a linear complexity method with global modeling appealing. In this paper, we propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks. Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs. Specifically, our method leverages space-filling curves for effective point tokenization and adopts an extremely simple, non-hierarchical Mamba encoder as the backbone. Comprehensive evaluations demonstrate that PointMamba achieves superior performance across multiple datasets while significantly reducing GPU memory usage and FLOPs. This work underscores the potential of SSMs in 3D vision-related tasks and presents a simple yet effective Mamba-based baseline for future research. The code is available at https://github.com/LMD0311/PointMamba.
△ Less
Submitted 29 May, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
EasyFS: an Efficient Model-free Feature Selection Framework via Elastic Transformation of Features
Authors:
Jianming Lv,
Sijun Xia,
Depin Liang,
Wei Chen
Abstract:
Traditional model-free feature selection methods treat each feature independently while disregarding the interrelationships among features, which leads to relatively poor performance compared with the model-aware methods. To address this challenge, we propose an efficient model-free feature selection framework via elastic expansion and compression of the features, namely EasyFS, to achieve better…
▽ More
Traditional model-free feature selection methods treat each feature independently while disregarding the interrelationships among features, which leads to relatively poor performance compared with the model-aware methods. To address this challenge, we propose an efficient model-free feature selection framework via elastic expansion and compression of the features, namely EasyFS, to achieve better performance than state-of-the-art model-aware methods while sharing the characters of efficiency and flexibility with the existing model-free methods. In particular, EasyFS expands the feature space by using the random non-linear projection network to achieve the non-linear combinations of the original features, so as to model the interrelationships among the features and discover most correlated features. Meanwhile, a novel redundancy measurement based on the change of coding rate is proposed for efficient filtering of redundant features. Comprehensive experiments on 21 different datasets show that EasyFS outperforms state-of-the art methods up to 10.9\% in the regression tasks and 5.7\% in the classification tasks while saving more than 94\% of the time.
△ Less
Submitted 4 February, 2024;
originally announced February 2024.
-
CounterCLR: Counterfactual Contrastive Learning with Non-random Missing Data in Recommendation
Authors:
Jun Wang,
Haoxuan Li,
Chi Zhang,
Dongxu Liang,
Enyun Yu,
Wenwu Ou,
Wenjia Wang
Abstract:
Recommender systems are designed to learn user preferences from observed feedback and comprise many fundamental tasks, such as rating prediction and post-click conversion rate (pCVR) prediction. However, the observed feedback usually suffer from two issues: selection bias and data sparsity, where biased and insufficient feedback seriously degrade the performance of recommender systems in terms of…
▽ More
Recommender systems are designed to learn user preferences from observed feedback and comprise many fundamental tasks, such as rating prediction and post-click conversion rate (pCVR) prediction. However, the observed feedback usually suffer from two issues: selection bias and data sparsity, where biased and insufficient feedback seriously degrade the performance of recommender systems in terms of accuracy and ranking. Existing solutions for handling the issues, such as data imputation and inverse propensity score, are highly susceptible to additional trained imputation or propensity models. In this work, we propose a novel counterfactual contrastive learning framework for recommendation, named CounterCLR, to tackle the problem of non-random missing data by exploiting the advances in contrast learning. Specifically, the proposed CounterCLR employs a deep representation network, called CauNet, to infer non-random missing data in recommendations and perform user preference modeling by further introducing a self-supervised contrastive learning task. Our CounterCLR mitigates the selection bias problem without the need for additional models or estimators, while also enhancing the generalization ability in cases of sparse data. Experiments on real-world datasets demonstrate the effectiveness and superiority of our method.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals
Authors:
Daojun Liang,
Haixia Zhang,
Dongfeng Yuan,
Bingzheng Zhang,
Minggao Zhang
Abstract:
In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. And the vanilla Transformer is ren…
▽ More
In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. And the vanilla Transformer is renovated by reorienting the information aggregation mechanism from addition to subtraction. Then, we incorporate an auxiliary output branch into each block of the original model to construct a highway leading to the ultimate prediction. The output of subsequent modules in this branch will subtract the previously learned results, enabling the model to learn the residuals of the supervision signal, layer by layer. This designing facilitates the learning-driven implicit progressive decomposition of the input and output streams, empowering the model with heightened versatility, interpretability, and resilience against overfitting. Since all aggregations in the model are minus signs, which is called Minusformer. Extensive experiments demonstrate the proposed method outperform existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.The code has been released at https://github.com/Anoise/Minusformer.
△ Less
Submitted 17 June, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Anything in Any Scene: Photorealistic Video Object Insertion
Authors:
Chen Bai,
Zeman Shao,
Guoxiang Zhang,
Di Liang,
Jie Yang,
Zhuorui Zhang,
Yujian Guo,
Chengzhang Zhong,
Yiqiao Qiu,
Zhendong Wang,
Yichen Guan,
Xiaoyin Zheng,
Tao Wang,
Cheng Lu
Abstract:
Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high level…
▽ More
Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high levels of photorealism. In this paper, we propose Anything in Any Scene, a novel and generic framework for realistic video simulation that seamlessly inserts any object into an existing dynamic video with a strong emphasis on physical realism. Our proposed general framework encompasses three key processes: 1) integrating a realistic object into a given scene video with proper placement to ensure geometric realism; 2) estimating the sky and environmental lighting distribution and simulating realistic shadows to enhance the light realism; 3) employing a style transfer network that refines the final video output to maximize photorealism. We experimentally demonstrate that Anything in Any Scene framework produces simulated videos of great geometric realism, lighting realism, and photorealism. By significantly mitigating the challenges associated with video data generation, our framework offers an efficient and cost-effective solution for acquiring high-quality videos. Furthermore, its applications extend well beyond video data augmentation, showing promising potential in virtual reality, video editing, and various other video-centric applications. Please check our project website https://anythinginanyscene.github.io for access to our project code and more high-resolution video results.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
You Only Look Bottom-Up for Monocular 3D Object Detection
Authors:
Kaixin Xiong,
Dingyuan Zhang,
Dingkang Liang,
Zhe Liu,
Hongcheng Yang,
Wondimu Dikubab,
Jianwei Cheng,
Xiang Bai
Abstract:
Monocular 3D Object Detection is an essential task for autonomous driving. Meanwhile, accurate 3D object detection from pure images is very challenging due to the loss of depth information. Most existing image-based methods infer objects' location in 3D space based on their 2D sizes on the image plane, which usually ignores the intrinsic position clues from images, leading to unsatisfactory perfor…
▽ More
Monocular 3D Object Detection is an essential task for autonomous driving. Meanwhile, accurate 3D object detection from pure images is very challenging due to the loss of depth information. Most existing image-based methods infer objects' location in 3D space based on their 2D sizes on the image plane, which usually ignores the intrinsic position clues from images, leading to unsatisfactory performances. Motivated by the fact that humans could leverage the bottom-up positional clues to locate objects in 3D space from a single image, in this paper, we explore the position modeling from the image feature column and propose a new method named You Only Look Bottum-Up (YOLOBU). Specifically, our YOLOBU leverages Column-based Cross Attention to determine how much a pixel contributes to pixels above it. Next, the Row-based Reverse Cumulative Sum (RRCS) is introduced to build the connections of pixels in the bottom-up direction. Our YOLOBU fully explores the position clues for monocular 3D detection via building the relationship of pixels from the bottom-up way. Extensive experiments on the KITTI dataset demonstrate the effectiveness and superiority of our method.
△ Less
Submitted 27 January, 2024;
originally announced January 2024.
-
Episodic Reinforcement Learning with Expanded State-reward Space
Authors:
Dayang Liang,
Yaru Zhang,
Yunlong Liu
Abstract:
Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical successes in various domains, including games, health care, and autonomous driving. Despite these advancements, DRL is still identified as data-inefficient as effective policies demand vast numbers of environmental samples. Recently, episodic control (EC)-based model-free DRL methods enable s…
▽ More
Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical successes in various domains, including games, health care, and autonomous driving. Despite these advancements, DRL is still identified as data-inefficient as effective policies demand vast numbers of environmental samples. Recently, episodic control (EC)-based model-free DRL methods enable sample efficiency by recalling past experiences from episodic memory. However, existing EC-based methods suffer from the limitation of potential misalignment between the state and reward spaces for neglecting the utilization of (past) retrieval states with extensive information, which probably causes inaccurate value estimation and degraded policy performance. To tackle this issue, we introduce an efficient EC-based DRL framework with expanded state-reward space, where the expanded states used as the input and the expanded rewards used in the training both contain historical and current information. To be specific, we reuse the historical states retrieved by EC as part of the input states and integrate the retrieved MC-returns into the immediate reward in each interactive transition. As a result, our method is able to simultaneously achieve the full utilization of retrieval information and the better evaluation of state values by a Temporal Difference (TD) loss. Empirical results on challenging Box2d and Mujoco tasks demonstrate the superiority of our method over a recent sibling method and common baselines. Further, we also verify our method's effectiveness in alleviating Q-value overestimation by additional experiments of Q-value comparison.
△ Less
Submitted 19 January, 2024;
originally announced January 2024.
-
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
Authors:
Zi-Xin Zou,
Zhipeng Yu,
Yuan-Chen Guo,
Yangguang Li,
Ding Liang,
Yan-Pei Cao,
Song-Hai Zhang
Abstract:
Recent advancements in 3D reconstruction from single images have been driven by the evolution of generative models. Prominent among these are methods based on Score Distillation Sampling (SDS) and the adaptation of diffusion models in the 3D domain. Despite their progress, these techniques often face limitations due to slow optimization or rendering processes, leading to extensive training and opt…
▽ More
Recent advancements in 3D reconstruction from single images have been driven by the evolution of generative models. Prominent among these are methods based on Score Distillation Sampling (SDS) and the adaptation of diffusion models in the 3D domain. Despite their progress, these techniques often face limitations due to slow optimization or rendering processes, leading to extensive training and optimization times. In this paper, we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation. This hybrid representation strikes a balance, achieving a faster rendering speed compared to implicit representations while simultaneously delivering superior rendering quality than explicit representations. The point decoder is designed for generating point clouds from single images, offering an explicit representation which is then utilized by the triplane decoder to query Gaussian features for each point. This design choice addresses the challenges associated with directly regressing explicit 3D Gaussian attributes characterized by their non-structural nature. Subsequently, the 3D Gaussians are decoded by an MLP to enable rapid rendering through splatting. Both decoders are built upon a scalable, transformer-based architecture and have been efficiently trained on large-scale 3D datasets. The evaluations conducted on both synthetic datasets and real-world images demonstrate that our method not only achieves higher quality but also ensures a faster runtime in comparison to previous state-of-the-art techniques. Please see our project page at https://zouzx.github.io/TriplaneGaussian/.
△ Less
Submitted 15 December, 2023; v1 submitted 14 December, 2023;
originally announced December 2023.
-
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation
Authors:
Zexiang Liu,
Yangguang Li,
Youtian Lin,
Xin Yu,
Sida Peng,
Yan-Pei Cao,
Xiaojuan Qi,
Xiaoshui Huang,
Ding Liang,
Wanli Ouyang
Abstract:
Recent advancements in text-to-3D generation technology have significantly advanced the conversion of textual descriptions into imaginative well-geometrical and finely textured 3D objects. Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with inherent lighting and shadows effects that detract fro…
▽ More
Recent advancements in text-to-3D generation technology have significantly advanced the conversion of textual descriptions into imaginative well-geometrical and finely textured 3D objects. Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with inherent lighting and shadows effects that detract from their realism, thereby limiting their usability in applications that demand accurate relighting capabilities. To bridge this gap, we present UniDream, a text-to-3D generation framework by incorporating unified diffusion priors. Our approach consists of three main components: (1) a dual-phase training process to get albedo-normal aligned multi-view diffusion and reconstruction models, (2) a progressive generation procedure for geometry and albedo-textures based on Score Distillation Sample (SDS) using the trained reconstruction and diffusion models, and (3) an innovative application of SDS for finalizing PBR generation while kee** a fixed albedo based on Stable Diffusion model. Extensive evaluations demonstrate that UniDream surpasses existing methods in generating 3D objects with clearer albedo textures, smoother surfaces, enhanced realism, and superior relighting capabilities.
△ Less
Submitted 14 December, 2023;
originally announced December 2023.
-
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
Authors:
Zehuan Huang,
Hao Wen,
Junting Dong,
Yaohui Wang,
Yangguang Li,
Xinyuan Chen,
Yan-Pei Cao,
Ding Liang,
Yu Qiao,
Bo Dai,
Lu Sheng
Abstract:
Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image. Recent methods that introduce 3D global representation into diffusion models have shown the potential to generate consistent multiviews, but they have reduced generation speed and face challenges in maintaining generalizability and quality. To address this issue, we propose E…
▽ More
Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image. Recent methods that introduce 3D global representation into diffusion models have shown the potential to generate consistent multiviews, but they have reduced generation speed and face challenges in maintaining generalizability and quality. To address this issue, we propose EpiDiff, a localized interactive multiview diffusion model. At the core of the proposed approach is to insert a lightweight epipolar attention block into the frozen diffusion model, leveraging epipolar constraints to enable cross-view interaction among feature maps of neighboring views. The newly initialized 3D modeling module preserves the original feature distribution of the diffusion model, exhibiting compatibility with a variety of base diffusion models. Experiments show that EpiDiff generates 16 multiview images in just 12 seconds, and it surpasses previous methods in quality evaluation metrics, including PSNR, SSIM and LPIPS. Additionally, EpiDiff can generate a more diverse distribution of views, improving the reconstruction quality from generated multiviews. Please see our project page at https://huanngzh.github.io/EpiDiff/.
△ Less
Submitted 2 April, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training
Authors:
Jaehyung Kim,
Yuning Mao,
Rui Hou,
Hanchao Yu,
Davis Liang,
Pascale Fung,
Qifan Wang,
Fuli Feng,
Lifu Huang,
Madian Khabsa
Abstract:
Fine-tuning pre-trained language models (LMs) has become the de facto standard in many NLP tasks. Nevertheless, fine-tuned LMs are still prone to robustness issues, such as adversarial robustness and model calibration. Several perspectives of robustness for LMs have been studied independently, but lacking a unified consideration in multiple perspectives. In this paper, we propose Robustifying LMs…
▽ More
Fine-tuning pre-trained language models (LMs) has become the de facto standard in many NLP tasks. Nevertheless, fine-tuned LMs are still prone to robustness issues, such as adversarial robustness and model calibration. Several perspectives of robustness for LMs have been studied independently, but lacking a unified consideration in multiple perspectives. In this paper, we propose Robustifying LMs via Adversarial perturbation with Selective Training (RoAST), a simple yet effective fine-tuning technique to enhance the multi-perspective robustness of LMs in a unified way. RoAST effectively incorporates two important sources for the model robustness, robustness on the perturbed inputs and generalizable knowledge in pre-trained LMs. To be specific, RoAST introduces adversarial perturbation during fine-tuning while the model parameters are selectively updated upon their relative importance to minimize unnecessary deviation. Under a unified evaluation of fine-tuned LMs by incorporating four representative perspectives of model robustness, we demonstrate the effectiveness of RoAST compared to state-of-the-art fine-tuning methods on six different types of LMs, which indicates its usefulness in practice.
△ Less
Submitted 6 December, 2023;
originally announced December 2023.
-
Joint Diffusion: Mutual Consistency-Driven Diffusion Model for PET-MRI Co-Reconstruction
Authors:
Taofeng Xie,
Zhuo-Xu Cui,
Chen Luo,
Huayu Wang,
Congcong Liu,
Yuanzhi Zhang,
Xuemei Wang,
Yanjie Zhu,
Qiyu **,
Guoqing Chen,
Yihang Zhou,
Dong Liang,
Haifeng Wang
Abstract:
Positron Emission Tomography and Magnetic Resonance Imaging (PET-MRI) systems can obtain functional and anatomical scans. PET suffers from a low signal-to-noise ratio. Meanwhile, the k-space data acquisition process in MRI is time-consuming. The study aims to accelerate MRI and enhance PET image quality. Conventional approaches involve the separate reconstruction of each modality within PET-MRI sy…
▽ More
Positron Emission Tomography and Magnetic Resonance Imaging (PET-MRI) systems can obtain functional and anatomical scans. PET suffers from a low signal-to-noise ratio. Meanwhile, the k-space data acquisition process in MRI is time-consuming. The study aims to accelerate MRI and enhance PET image quality. Conventional approaches involve the separate reconstruction of each modality within PET-MRI systems. However, there exists complementary information among multi-modal images. The complementary information can contribute to image reconstruction. In this study, we propose a novel PET-MRI joint reconstruction model employing a mutual consistency-driven diffusion mode, namely MC-Diffusion. MC-Diffusion learns the joint probability distribution of PET and MRI for utilizing complementary information. We conducted a series of contrast experiments about LPLS, Joint ISAT-net and MC-Diffusion by the ADNI dataset. The results underscore the qualitative and quantitative improvements achieved by MC-Diffusion, surpassing the state-of-the-art method.
△ Less
Submitted 24 November, 2023;
originally announced November 2023.
-
Should they? Mobile Biometrics and Technopolicy meet Queer Community Considerations
Authors:
Anaelia Ovalle,
Davi Liang,
Alicia Boyd
Abstract:
Smartphones are integral to our daily lives and activities, providing us with basic functions like texting and phone calls to more complex motion-based functionalities like navigation, mobile gaming, and fitness-tracking. To facilitate these functionalities, smartphones rely on integrated sensors like accelerometers and gyroscopes. These sensors provide personalized measurements that, in turn, con…
▽ More
Smartphones are integral to our daily lives and activities, providing us with basic functions like texting and phone calls to more complex motion-based functionalities like navigation, mobile gaming, and fitness-tracking. To facilitate these functionalities, smartphones rely on integrated sensors like accelerometers and gyroscopes. These sensors provide personalized measurements that, in turn, contribute to tasks such as analyzing biometric data for mobile health purposes. In addition to benefiting smartphone users, biometric data holds significant value for researchers engaged in biometric identification research. Nonetheless, utilizing this user data for biometric identification tasks, such as gait and gender recognition, raises serious privacy, normative, and ethical concerns, particularly within the queer community. Concerns of algorithmic bias and algorithmically-driven dysphoria surface from a historical backdrop of marginalization, surveillance, harassment, discrimination, and violence against the queer community. In this position paper, we contribute to the timely discourse on safeguarding human rights within AI-driven systems by providing a sense of challenges, tensions, and opportunities for new data protections and biometric collection practices in a way that grapples with the sociotechnical realities of the queer community.
△ Less
Submitted 6 October, 2023;
originally announced November 2023.
-
Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in Meta
Authors:
Wei Zhang,
Dai Li,
Chen Liang,
Fang Zhou,
Zhongke Zhang,
Xuewei Wang,
Ru Li,
Yi Zhou,
Yaning Huang,
Dong Liang,
Kai Wang,
Zhangyuan Wang,
Zhengxing Chen,
Fenggang Wu,
Minghai Chen,
Huayu Li,
Yunnan Wu,
Zhan Shu,
Mindi Yuan,
Sri Reddy
Abstract:
Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory, often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user represe…
▽ More
Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory, often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representation across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed SUM Online Asynchronous Platform (SOAP), a latency free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily, yielding significant online metric gains and improved infrastructure efficiency.
△ Less
Submitted 22 May, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection
Authors:
Wenxin Wang,
Zhuo-Xu Cui,
Guanxun Cheng,
Chentao Cao,
Xi Xu,
Ziwei Liu,
Haifeng Wang,
Yulong Qi,
Dong Liang,
Yanjie Zhu
Abstract:
Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework Two-Stage Generative Model (TSGM) that combines Cyc…
▽ More
Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework Two-Stage Generative Model (TSGM) that combines Cycle Generative Adversarial Network (CycleGAN) and Variance Exploding stochastic differential equation using joint probability (VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is trained on unpaired data to generate abnormal images from healthy images as data prior. Then VE-JP is implemented to reconstruct healthy images using synthetic paired abnormal images as a guide, which alters only pathological regions but not regions of healthy. Notably, our method directly learned the joint probability distribution for conditional generation. The residual between input and reconstructed images suggests the abnormalities and a thresholding method is subsequently applied to obtain segmentation results. Furthermore, the multimodal results are weighted with different weights to improve the segmentation accuracy further. We validated our method on three datasets, and compared with other unsupervised methods for anomaly detection and segmentation. The DSC score of 0.8590 in BraTs2020 dataset, 0.6226 in ITCS dataset and 0.7403 in In-house dataset show that our method achieves better segmentation performance and has better generalization.
△ Less
Submitted 6 November, 2023;
originally announced November 2023.
-
Co-training and Co-distillation for Quality Improvement and Compression of Language Models
Authors:
Hayeon Lee,
Rui Hou,
Jongpil Kim,
Davis Liang,
Hongbo Zhang,
Sung Ju Hwang,
Alexander Min
Abstract:
Knowledge Distillation (KD) compresses computationally expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models, allowing their use in resource-constrained or real-time settings. However, most smaller models fail to surpass the performance of the original larger model, resulting in sacrificing performance to improve inference speed. To address this issue, we p…
▽ More
Knowledge Distillation (KD) compresses computationally expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models, allowing their use in resource-constrained or real-time settings. However, most smaller models fail to surpass the performance of the original larger model, resulting in sacrificing performance to improve inference speed. To address this issue, we propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models while mutually distilling knowledge. The CTCD framework successfully achieves this based on two significant findings: 1) Distilling knowledge from the smaller model to the larger model during co-training improves the performance of the larger model. 2) The enhanced performance of the larger model further boosts the performance of the smaller model. The CTCD framework shows promise as it can be combined with existing techniques like architecture design or data augmentation, replacing one-way KD methods, to achieve further performance improvement. Extensive ablation studies demonstrate the effectiveness of CTCD, and the small model distilled by CTCD outperforms the original larger model by a significant margin of 1.66 on the GLUE benchmark.
△ Less
Submitted 7 November, 2023; v1 submitted 5 November, 2023;
originally announced November 2023.
-
Text-to-3D with Classifier Score Distillation
Authors:
Xin Yu,
Yuan-Chen Guo,
Yangguang Li,
Ding Liang,
Song-Hai Zhang,
Xiaojuan Qi
Abstract:
Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS) that leverages pre-trained 2D diffusion models. While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the ro…
▽ More
Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS) that leverages pre-trained 2D diffusion models. While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, we re-evaluate the role of classifier-free guidance in score distillation and discover a surprising finding: the guidance alone is enough for effective text-to-3D generation tasks. We name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This new perspective reveals new insights for understanding existing techniques. We validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods. Our project page is https://xinyu-andy.github.io/Classifier-Score-Distillation
△ Less
Submitted 31 October, 2023; v1 submitted 30 October, 2023;
originally announced October 2023.
-
TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
Authors:
Nan He,
Hanyu Lai,
Chenyang Zhao,
Zirui Cheng,
Junting Pan,
Ruoyu Qin,
Ruofan Lu,
Rui Lu,
Yunchen Zhang,
Gangming Zhao,
Zhaohui Hou,
Zhiyuan Huang,
Shaoqing Lu,
Ding Liang,
Mingjie Zhan
Abstract:
Large Language Models (LLMs) exhibit impressive reasoning and data augmentation capabilities in various NLP tasks. However, what about small models? In this work, we propose TeacherLM-7.1B, capable of annotating relevant fundamentals, chain of thought, and common mistakes for most NLP samples, which makes annotation more than just an answer, thus allowing other models to learn "why" instead of jus…
▽ More
Large Language Models (LLMs) exhibit impressive reasoning and data augmentation capabilities in various NLP tasks. However, what about small models? In this work, we propose TeacherLM-7.1B, capable of annotating relevant fundamentals, chain of thought, and common mistakes for most NLP samples, which makes annotation more than just an answer, thus allowing other models to learn "why" instead of just "what". The TeacherLM-7.1B model achieved a zero-shot score of 52.3 on MMLU, surpassing most models with over 100B parameters. Even more remarkable is its data augmentation ability. Based on TeacherLM-7.1B, we augmented 58 NLP datasets and taught various student models with different parameters from OPT and BLOOM series in a multi-task setting. The experimental results indicate that the data augmentation provided by TeacherLM has brought significant benefits. We will release the TeacherLM series of models and augmented datasets as open-source.
△ Less
Submitted 31 October, 2023; v1 submitted 29 October, 2023;
originally announced October 2023.
-
Off-Policy Evaluation for Large Action Spaces via Policy Convolution
Authors:
Noveen Sachdeva,
Lequn Wang,
Dawen Liang,
Nathan Kallus,
Julian McAuley
Abstract:
Develo** accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiase…
▽ More
Develo** accurate off-policy estimators is crucial for both evaluating and optimizing for new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiased value estimation but often comes with the trade-off of high variance, even in the simpler case of one-step contextual bandits. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators. These methods leverage latent structure within actions -- made available through action embeddings -- to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off, which can be controlled by adjusting the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC, especially when either the action space or policy mismatch becomes large, with gains of up to 5 - 6 orders of magnitude over existing estimators.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
A Discrepancy Aware Framework for Robust Anomaly Detection
Authors:
Yuxuan Cai,
Dingkang Liang,
Dongliang Luo,
Xinwei He,
Xin Yang,
Xiang Bai
Abstract:
Defect detection is a critical research area in artificial intelligence. Recently, synthetic data-based self-supervised learning has shown great potential on this task. Although many sophisticated synthesizing strategies exist, little research has been done to investigate the robustness of models when faced with different strategies. In this paper, we focus on this issue and find that existing met…
▽ More
Defect detection is a critical research area in artificial intelligence. Recently, synthetic data-based self-supervised learning has shown great potential on this task. Although many sophisticated synthesizing strategies exist, little research has been done to investigate the robustness of models when faced with different strategies. In this paper, we focus on this issue and find that existing methods are highly sensitive to them. To alleviate this issue, we present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies across different anomaly detection benchmarks. We hypothesize that the high sensitivity to synthetic data of existing self-supervised methods arises from their heavy reliance on the visual appearance of synthetic data during decoding. In contrast, our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance. To this end, inspired by existing knowledge distillation methods, we employ a teacher-student network, which is trained based on synthesized outliers, to compute the discrepancy map as the cue. Extensive experiments on two challenging datasets prove the robustness of our method. Under the simple synthesis strategies, it outperforms existing methods by a large margin. Furthermore, it also achieves the state-of-the-art localization performance. Code is available at: https://github.com/caiyuxuan1120/DAF.
△ Less
Submitted 11 October, 2023;
originally announced October 2023.
-
InvKA: Gait Recognition via Invertible Koopman Autoencoder
Authors:
Fan Li,
Dong Liang,
**g Lian,
Qidong Liu,
Hegui Zhu,
Jizhao Liu
Abstract:
Most current gait recognition methods suffer from poor interpretability and high computational cost. To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory. The transition matrix in this space captures complex kinematic features of gait cycles, namely the Koopman operator. The diagonal elements of the operator matrix can represent the over…
▽ More
Most current gait recognition methods suffer from poor interpretability and high computational cost. To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory. The transition matrix in this space captures complex kinematic features of gait cycles, namely the Koopman operator. The diagonal elements of the operator matrix can represent the overall motion trend, providing a physically meaningful descriptor. To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers to compress its depth, resulting in fewer floating-point operations. Experimental results on multiple datasets show that our method reduces computational cost to 1% compared to state-of-the-art methods while achieving competitive recognition accuracy 98% on non-occlusion datasets.
△ Less
Submitted 27 September, 2023; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors
Authors:
Di Liang,
Nian Shao,
Xiaofei Li
Abstract:
This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method in a frame-in-frame-out fashion. To frame-wisely detect a flexible number of speakers and extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted to…
▽ More
This work proposes a frame-wise online/streaming end-to-end neural diarization (FS-EEND) method in a frame-in-frame-out fashion. To frame-wisely detect a flexible number of speakers and extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an online non-autoregressive self-attention-based attractor decoder. A look-ahead mechanism is adopted to allow leveraging some future frames for effectively detecting new speakers in real time and adaptively updating speaker attractors. The proposed method processes the audio stream frame by frame, and has a low inference latency caused by the look-ahead frames. Experiments show that, compared with the recently proposed block-wise online methods, our method FS-EEND achieves state-of-the-art diarization results, with a low inference latency and computational cost.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
Matrix Completion-Informed Deep Unfolded Equilibrium Models for Self-Supervised k-Space Interpolation in MRI
Authors:
Chen Luo,
Huayu Wang,
Taofeng Xie,
Qiyu **,
Guoqing Chen,
Zhuo-Xu Cui,
Dong Liang
Abstract:
Recently, regularization model-driven deep learning (DL) has gained significant attention due to its ability to leverage the potent representational capabilities of DL while retaining the theoretical guarantees of regularization models. However, most of these methods are tailored for supervised learning scenarios that necessitate fully sampled labels, which can pose challenges in practical MRI app…
▽ More
Recently, regularization model-driven deep learning (DL) has gained significant attention due to its ability to leverage the potent representational capabilities of DL while retaining the theoretical guarantees of regularization models. However, most of these methods are tailored for supervised learning scenarios that necessitate fully sampled labels, which can pose challenges in practical MRI applications. To tackle this challenge, we propose a self-supervised DL approach for accelerated MRI that is theoretically guaranteed and does not rely on fully sampled labels. Specifically, we achieve neural network structure regularization by exploiting the inherent structural low-rankness of the $k$-space data. Simultaneously, we constrain the network structure to resemble a nonexpansive map**, ensuring the network's convergence to a fixed point. Thanks to this well-defined network structure, this fixed point can completely reconstruct the missing $k$-space data based on matrix completion theory, even in situations where full-sampled labels are unavailable. Experiments validate the effectiveness of our proposed method and demonstrate its superiority over existing self-supervised approaches and traditional regularization methods, achieving performance comparable to that of supervised learning methods in certain scenarios.
△ Less
Submitted 24 September, 2023;
originally announced September 2023.
-
Sequential Action-Induced Invariant Representation for Reinforcement Learning
Authors:
Dayang Liang,
Qihang Chen,
Yunlong Liu
Abstract:
How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning methods based on bisimulation metrics, contrast, prediction, and reconstruction have shown the ability for task-relevant information extraction. However, due to…
▽ More
How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning methods based on bisimulation metrics, contrast, prediction, and reconstruction have shown the ability for task-relevant information extraction. However, due to the lack of appropriate mechanisms for the extraction of task information in the prediction, contrast, and reconstruction-related approaches and the limitations of bisimulation-related methods in domains with sparse rewards, it is still difficult for these methods to be effectively extended to environments with distractions. To alleviate these problems, in the paper, the action sequences, which contain task-intensive signals, are incorporated into representation learning. Specifically, we propose a Sequential Action--induced invariant Representation (SAR) method, in which the encoder is optimized by an auxiliary learner to only preserve the components that follow the control signals of sequential actions, so the agent can be induced to learn the robust representation against distractions. We conduct extensive experiments on the DeepMind Control suite tasks with distractions while achieving the best performance over strong baselines. We also demonstrate the effectiveness of our method at disregarding task-irrelevant information by deploying SAR to real-world CARLA-based autonomous driving with natural distractions. Finally, we provide the analysis results of generalization drawn from the generalization decay and t-SNE visualization. Code and demo videos are available at https://github.com/DMU-XMU/SAR.git.
△ Less
Submitted 22 September, 2023;
originally announced September 2023.