-
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback
Authors:
Bofei Gao,
Zefan Cai,
Runxin Xu,
Peiyi Wang,
Ce Zheng,
Runji Lin,
Keming Lu,
Junyang Lin,
Chang Zhou,
Wen Xiao,
Junjie Hu,
Tianyu Liu,
Baobao Chang
Abstract:
Mathematical verfier achieves success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate the aforementioned insufficiency of binary labels, we introduce step-wise natural language feedbacks as rationale la…
▽ More
Mathematical verfier achieves success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate the aforementioned insufficiency of binary labels, we introduce step-wise natural language feedbacks as rationale labels (i.e., the correctness of the current step and the explanations). In this paper, we propose \textbf{Math-Minos}, a natural language feedback enhanced verifier by constructing automatically-generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set (30k) of natural language feedbacks can significantly boost the performance of the verifier by the accuracy of 1.6\% (86.6\% $\rightarrow$ 88.2\%) on GSM8K and 0.8\% (37.8\% $\rightarrow$ 38.6\%) on MATH. We have released our code and data for further exploration.
△ Less
Submitted 30 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension
Authors:
Runfeng Lin,
Dacheng Xu,
Huijiang Wang,
Zebiao Chen,
Yating Wang,
Shouqiang Liu
Abstract:
When evaluating a learner's knowledge proficiency, the multiple-choice question is an efficient and widely used format in standardized tests. Nevertheless, generating these questions, particularly plausible distractors (incorrect options), poses a considerable challenge. Generally, the distractor generation can be classified into cloze-style distractor generation (CDG) and natural questions distra…
▽ More
When evaluating a learner's knowledge proficiency, the multiple-choice question is an efficient and widely used format in standardized tests. Nevertheless, generating these questions, particularly plausible distractors (incorrect options), poses a considerable challenge. Generally, the distractor generation can be classified into cloze-style distractor generation (CDG) and natural questions distractor generation (NQDG). In contrast to the CDG, utilizing pre-trained language models (PLMs) for NQDG presents three primary challenges: (1) PLMs are typically trained to generate ``correct'' content, like answers, while rarely trained to generate ``plausible" content, like distractors; (2) PLMs often struggle to produce content that aligns well with specific knowledge and the style of exams; (3) NQDG necessitates the model to produce longer, context-sensitive, and question-relevant distractors. In this study, we introduce a fine-tuning framework named DGRC for NQDG in Chinese multi-choice reading comprehension from authentic examinations. DGRC comprises three major components: hard chain-of-thought, multi-task learning, and generation mask patterns. The experiment results demonstrate that DGRC significantly enhances generation performance, achieving a more than 2.5-fold improvement in BLEU scores.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario
Authors:
Yuhan Li,
Hao Zhou,
Wenxiang Shang,
Ran Lin,
Xuanhong Chen,
Bingbing Ni
Abstract:
While image-based virtual try-on has made significant strides, emerging approaches still fall short of delivering high-fidelity and robust fitting images across various scenarios, as their models suffer from issues of ill-fitted garment styles and quality degrading during the training process, not to mention the lack of support for various combinations of attire. Therefore, we first propose a ligh…
▽ More
While image-based virtual try-on has made significant strides, emerging approaches still fall short of delivering high-fidelity and robust fitting images across various scenarios, as their models suffer from issues of ill-fitted garment styles and quality degrading during the training process, not to mention the lack of support for various combinations of attire. Therefore, we first propose a lightweight, scalable, operator known as Hydra Block for attire combinations. This is achieved through a parallel attention mechanism that facilitates the feature injection of multiple garments from conditionally encoded branches into the main network. Secondly, to significantly enhance the model's robustness and expressiveness in real-world scenarios, we evolve its potential across diverse settings by synthesizing the residuals of multiple models, as well as implementing a mask region boost strategy to overcome the instability caused by information leakage in existing models. Equipped with the above design, AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a large gap, excelling in producing well-fitting garments replete with photorealistic and rich details. Furthermore, AnyFit's impressive performance on high-fidelity virtual try-ons in any scenario from any image, paves a new path for future research within the fashion community.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Graph Threading with Turn Costs
Authors:
Erik D. Demaine,
Yael Kirkpatrick,
Rebecca Lin
Abstract:
How should we thread a single string through a set of tubes so that pulling the string taut self-assembles the tubes into a desired graph? While prior work [ITCS 2024] solves this problem with the goal of minimizing the length of string, we study here the objective of minimizing the total turn cost. The frictional force required to pull the string through the tubes grows exponentially with the tot…
▽ More
How should we thread a single string through a set of tubes so that pulling the string taut self-assembles the tubes into a desired graph? While prior work [ITCS 2024] solves this problem with the goal of minimizing the length of string, we study here the objective of minimizing the total turn cost. The frictional force required to pull the string through the tubes grows exponentially with the total absolute turn angles (by the Capstan equation), so this metric often dominates the friction in real-world applications such as deployable structures. We show that minimum-turn threading is NP-hard, even for graphs of maximum degree 4, and even when restricted to some special cases of threading. On the other hand, we show that these special cases can in fact be solved efficiently for graphs of maximum degree 4, thereby fully characterizing their dependence on maximum degree. We further provide polynomial-time exact and approximation algorithms for variants of turn-cost threading: restricting to threading each edge exactly twice, and on rectangular grid graphs.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
Authors:
Keming Lu,
Bowen Yu,
Fei Huang,
Yang Fan,
Runji Lin,
Chang Zhou
Abstract:
Effectively aligning Large Language Models (LLMs) with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and…
▽ More
Effectively aligning Large Language Models (LLMs) with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose integrating the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, introducing the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between SFT and pretrained models, effectively steering the gradient towards maximizing rewards in the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families, such as Qwen and LLaMA, across various model sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Layer-Aware Analysis of Catastrophic Overfitting: Revealing the Pseudo-Robust Shortcut Dependency
Authors:
Runqi Lin,
Chaojian Yu,
Bo Han,
Hang Su,
Tongliang Liu
Abstract:
Catastrophic overfitting (CO) presents a significant challenge in single-step adversarial training (AT), manifesting as highly distorted deep neural networks (DNNs) that are vulnerable to multi-step adversarial attacks. However, the underlying factors that lead to the distortion of decision boundaries remain unclear. In this work, we delve into the specific changes within different DNN layers and…
▽ More
Catastrophic overfitting (CO) presents a significant challenge in single-step adversarial training (AT), manifesting as highly distorted deep neural networks (DNNs) that are vulnerable to multi-step adversarial attacks. However, the underlying factors that lead to the distortion of decision boundaries remain unclear. In this work, we delve into the specific changes within different DNN layers and discover that during CO, the former layers are more susceptible, experiencing earlier and greater distortion, while the latter layers show relative insensitivity. Our analysis further reveals that this increased sensitivity in former layers stems from the formation of pseudo-robust shortcuts, which alone can impeccably defend against single-step adversarial attacks but bypass genuine-robust learning, resulting in distorted decision boundaries. Eliminating these shortcuts can partially restore robustness in DNNs from the CO state, thereby verifying that dependence on them triggers the occurrence of CO. This understanding motivates us to implement adaptive weight perturbations across different layers to hinder the generation of pseudo-robust shortcuts, consequently mitigating CO. Extensive experiments demonstrate that our proposed method, Layer-Aware Adversarial Weight Perturbation (LAP), can effectively prevent CO and further enhance robustness.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
COBias and Debias: Minimizing Language Model Pairwise Accuracy Bias via Nonlinear Integer Programming
Authors:
Ruixi Lin,
Yang You
Abstract:
For language model classification, would you prefer having only one workable class or having every class working? The latter makes more practical uses. Especially for large language models (LLMs), the fact that they achieve a fair overall accuracy by in-context learning (ICL) obscures a large difference in individual class accuracies. In this work, we uncover and tackle language models' imbalance…
▽ More
For language model classification, would you prefer having only one workable class or having every class working? The latter makes more practical uses. Especially for large language models (LLMs), the fact that they achieve a fair overall accuracy by in-context learning (ICL) obscures a large difference in individual class accuracies. In this work, we uncover and tackle language models' imbalance in per-class prediction accuracy by reconceptualizing it as the Contextual Oddity Bias (COBias), and we are the first to engage nonlinear integer programming (NIP) to debias it. Briefly, COBias refers to the difference in accuracy by a class A compared to its ''odd'' class, which holds the majority wrong predictions of class A. With the COBias metric, we reveal that LLMs of varied scales and families exhibit large per-class accuracy differences. Then we propose Debiasing as Nonlinear Integer Programming (DNIP) to correct ICL per-class probabilities for lower bias and higher overall accuracy. Our optimization objective is directly based on the evaluation scores by COBias and accuracy metrics, solved by simulated annealing. Evaluations on three LLMs across seven NLP classification tasks show that DNIP simultaneously achieves significant COBias reduction ($-27\%$) and accuracy improvement ($+12\%$) over the conventional ICL approach, suggesting that modeling pairwise class accuracy differences is a direction in pushing forward more accurate, more reliable LLM predictions.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
Eliminating Catastrophic Overfitting Via Abnormal Adversarial Examples Regularization
Authors:
Runqi Lin,
Chaojian Yu,
Tongliang Liu
Abstract:
Single-step adversarial training (SSAT) has demonstrated the potential to achieve both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO), a phenomenon that leads to a severely distorted classifier, making it vulnerable to multi-step adversarial attacks. In this work, we observe that some adversarial examples generated on the SSAT-trained network exhibit anomalous…
▽ More
Single-step adversarial training (SSAT) has demonstrated the potential to achieve both efficiency and robustness. However, SSAT suffers from catastrophic overfitting (CO), a phenomenon that leads to a severely distorted classifier, making it vulnerable to multi-step adversarial attacks. In this work, we observe that some adversarial examples generated on the SSAT-trained network exhibit anomalous behaviour, that is, although these training samples are generated by the inner maximization process, their associated loss decreases instead, which we named abnormal adversarial examples (AAEs). Upon further analysis, we discover a close relationship between AAEs and classifier distortion, as both the number and outputs of AAEs undergo a significant variation with the onset of CO. Given this observation, we re-examine the SSAT process and uncover that before the occurrence of CO, the classifier already displayed a slight distortion, indicated by the presence of few AAEs. Furthermore, the classifier directly optimizing these AAEs will accelerate its distortion, and correspondingly, the variation of AAEs will sharply increase as a result. In such a vicious circle, the classifier rapidly becomes highly distorted and manifests as CO within a few iterations. These observations motivate us to eliminate CO by hindering the generation of AAEs. Specifically, we design a novel method, termed Abnormal Adversarial Examples Regularization (AAER), which explicitly regularizes the variation of AAEs to hinder the classifier from becoming distorted. Extensive experiments demonstrate that our method can effectively eliminate CO and further boost adversarial robustness with negligible additional computational overhead.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
Utilizing Computer Vision for Continuous Monitoring of Vaccine Side Effects in Experimental Mice
Authors:
Chuang Li,
Shuai Shao,
Willian Mikason,
Rubing Lin,
Yantong Liu
Abstract:
The demand for improved efficiency and accuracy in vaccine safety assessments is increasing. Here, we explore the application of computer vision technologies to automate the monitoring of experimental mice for potential side effects after vaccine administration. Traditional observation methods are labor-intensive and lack the capability for continuous monitoring. By deploying a computer vision sys…
▽ More
The demand for improved efficiency and accuracy in vaccine safety assessments is increasing. Here, we explore the application of computer vision technologies to automate the monitoring of experimental mice for potential side effects after vaccine administration. Traditional observation methods are labor-intensive and lack the capability for continuous monitoring. By deploying a computer vision system, our research aims to improve the efficiency and accuracy of vaccine safety assessments. The methodology involves training machine learning models on annotated video data of mice behaviors pre- and post-vaccination. Preliminary results indicate that computer vision effectively identify subtle changes, signaling possible side effects. Therefore, our approach has the potential to significantly enhance the monitoring process in vaccine trials in animals, providing a practical solution to the limitations of human observation.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models
Authors:
Haoran Sun,
Lixin Liu,
Junjie Li,
Fengyu Wang,
Baohua Dong,
Ran Lin,
Ruohui Huang
Abstract:
The ability of large language models (LLMs) to follow instructions is crucial to real-world applications. Despite recent advances, several studies have highlighted that LLMs struggle when faced with challenging instructions, especially those that include complex constraints, hindering their effectiveness in various tasks. To address this challenge, we introduce Conifer, a novel instruction tuning…
▽ More
The ability of large language models (LLMs) to follow instructions is crucial to real-world applications. Despite recent advances, several studies have highlighted that LLMs struggle when faced with challenging instructions, especially those that include complex constraints, hindering their effectiveness in various tasks. To address this challenge, we introduce Conifer, a novel instruction tuning dataset, designed to enhance LLMs to follow multi-level instructions with complex constraints. Utilizing GPT-4, we curate the dataset by a series of LLM-driven refinement processes to ensure high quality. We also propose a progressive learning scheme that emphasizes an easy-to-hard progression, and learning from process feedback. Models trained with Conifer exhibit remarkable improvements in instruction-following abilities, especially for instructions with complex constraints. On several instruction-following benchmarks, our 7B model outperforms the state-of-the-art open-source 7B models, even exceeds the performance of models 10 times larger on certain metrics. All the code and Conifer dataset are available at https://www.github.com/ConiferLM/Conifer.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
Facilitating Reinforcement Learning for Process Control Using Transfer Learning: Perspectives
Authors:
Runze Lin,
Junghui Chen,
Lei Xie,
Hongye Su,
Biao Huang
Abstract:
This paper provides insights into deep reinforcement learning (DRL) for process control from the perspective of transfer learning. We analyze the challenges of applying DRL in the field of process industries and the necessity of introducing transfer learning. Furthermore, recommendations and prospects are provided for future research directions on how transfer learning can be integrated with DRL t…
▽ More
This paper provides insights into deep reinforcement learning (DRL) for process control from the perspective of transfer learning. We analyze the challenges of applying DRL in the field of process industries and the necessity of introducing transfer learning. Furthermore, recommendations and prospects are provided for future research directions on how transfer learning can be integrated with DRL to empower process control.
△ Less
Submitted 1 May, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Authors:
Alexander Khazatsky,
Karl Pertsch,
Suraj Nair,
Ashwin Balakrishna,
Sudeep Dasari,
Siddharth Karamcheti,
Soroush Nasiriany,
Mohan Kumar Srirama,
Lawrence Yunliang Chen,
Kirsty Ellis,
Peter David Fagan,
Joey Hejna,
Masha Itkina,
Marion Lepert,
Yecheng Jason Ma,
Patrick Tree Miller,
Jimmy Wu,
Suneel Belkhale,
Shivin Dass,
Huy Ha,
Arhan Jain,
Abraham Lee,
Youngwoon Lee,
Marius Memmel,
Sungjae Park
, et al. (74 additional authors not shown)
Abstract:
The creation of large, diverse, high-quality robot manipulation datasets is an important step** stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a resu…
▽ More
The creation of large, diverse, high-quality robot manipulation datasets is an important step** stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
A Spatio-temporal Aligned SUNet Model for Low-light Video Enhancement
Authors:
Ruirui Lin,
Nantheera Anantrasirichai,
Alexandra Malyugina,
David Bull
Abstract:
Distortions caused by low-light conditions are not only visually unpleasant but also degrade the performance of computer vision tasks. The restoration and enhancement have proven to be highly beneficial. However, there are only a limited number of enhancement methods explicitly designed for videos acquired in low-light conditions. We propose a Spatio-Temporal Aligned SUNet (STA-SUNet) model using…
▽ More
Distortions caused by low-light conditions are not only visually unpleasant but also degrade the performance of computer vision tasks. The restoration and enhancement have proven to be highly beneficial. However, there are only a limited number of enhancement methods explicitly designed for videos acquired in low-light conditions. We propose a Spatio-Temporal Aligned SUNet (STA-SUNet) model using a Swin Transformer as a backbone to capture low light video features and exploit their spatio-temporal correlations. The STA-SUNet model is trained on a novel, fully registered dataset (BVI), which comprises dynamic scenes captured under varying light conditions. It is further analysed comparatively against various other models over three test datasets. The model demonstrates superior adaptivity across all datasets, obtaining the highest PSNR and SSIM values. It is particularly effective in extreme low-light conditions, yielding fairly good visualisation results.
△ Less
Submitted 9 April, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
Authors:
Weiyi Lv,
Yuhang Huang,
Ning Zhang,
Ruei-Sung Lin,
Mei Han,
Dan Zeng
Abstract:
In Multiple Object Tracking, objects often exhibit non-linear motion of acceleration and deceleration, with irregular direction changes. Tacking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations when multiple objects perform non-linear and diverse motion simultaneously. To tackle the complex non-linear m…
▽ More
In Multiple Object Tracking, objects often exhibit non-linear motion of acceleration and deceleration, with irregular direction changes. Tacking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations when multiple objects perform non-linear and diverse motion simultaneously. To tackle the complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor (D$^2$MP). It models the entire distribution of various motion presented by the data as a whole. It also predicts an individual object's motion conditioning on an individual's historical motion information. Furthermore, it optimizes the diffusion process with much fewer sampling steps. As a MOT tracker, the DiffMOT is real-time at 22.7FPS, and also outperforms the state-of-the-art on DanceTrack and SportsMOT datasets with $62.3\%$ and $76.2\%$ in HOTA metrics, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into the MOT to tackle non-linear motion prediction.
△ Less
Submitted 20 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Multi-modal preference alignment remedies regression of visual instruction tuning on language model
Authors:
Shengzhi Li,
Rongyu Lin,
Shichao Pei
Abstract:
In production, multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities. However, the current MLLMs trained with visual-question-answering (VQA) datasets could suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets which the underlying language model had been trained…
▽ More
In production, multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities. However, the current MLLMs trained with visual-question-answering (VQA) datasets could suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets which the underlying language model had been trained with. To address this challenging degradation, we first collect a lightweight (6k entries) VQA preference dataset where answers were annotated by Gemini for 5 quality metrics in a granular fashion, and investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM. Our findings indicate that the with DPO we are able to surpass instruction-following capabilities of the language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99 despite small data scale. This enhancement in textual instruction proficiency correlates with boosted visual instruction performance (+4.9\% on MM-Vet, +6\% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that reconciles the textual and visual performance of MLLMs, restoring and boosting language capability after visual instruction tuning.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Bidirectional Autoregressive Diffusion Model for Dance Generation
Authors:
Canyu Zhang,
Youbao Tang,
Ning Zhang,
Ruei-Sung Lin,
Mei Han,
**g Xiao,
Song Wang
Abstract:
Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create…
▽ More
Dance serves as a powerful medium for expressing human emotions, but the lifelike generation of dance is still a considerable challenge. Recently, diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless, current diffusion-based motion generation models often create entire motion sequences directly and unidirectionally, lacking focus on the motion with local and bidirectional enhancement. When choreographing high-quality dance movements, people need to take into account not only the musical context but also the nearby music-aligned dance motions. To authentically capture human behavior, we propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation, where a bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions. To make the generated dance motion smoother, a local information decoder is built for local motion enhancement. The proposed framework is able to generate new motions based on the input conditions and nearby motions, which foresees individual motion slices iteratively and consolidates all predictions. To further refine the synchronicity between the generated dance and the beat, the beat information is incorporated as an input to generate better music-aligned dance movements. Experimental results demonstrate that the proposed model achieves state-of-the-art performance compared to existing unidirectional approaches on the prominent benchmark for music-to-dance generation.
△ Less
Submitted 22 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
BVI-Lowlight: Fully Registered Benchmark Dataset for Low-Light Video Enhancement
Authors:
Nantheera Anantrasirichai,
Ruirui Lin,
Alexandra Malyugina,
David Bull
Abstract:
Low-light videos often exhibit spatiotemporal incoherent noise, leading to poor visibility and compromised performance across various computer vision applications. One significant challenge in enhancing such content using modern technologies is the scarcity of training data. This paper introduces a novel low-light video dataset, consisting of 40 scenes captured in various motion scenarios under tw…
▽ More
Low-light videos often exhibit spatiotemporal incoherent noise, leading to poor visibility and compromised performance across various computer vision applications. One significant challenge in enhancing such content using modern technologies is the scarcity of training data. This paper introduces a novel low-light video dataset, consisting of 40 scenes captured in various motion scenarios under two distinct low-lighting conditions, incorporating genuine noise and temporal artifacts. We provide fully registered ground truth data captured in normal light using a programmable motorized dolly, and subsequently, refine them via image-based post-processing to ensure the pixel-wise alignment of frames in different light levels. This paper also presents an exhaustive analysis of the low-light dataset, and demonstrates the extensive and representative nature of our dataset in the context of supervised learning. Our experimental results demonstrate the significance of fully registered video pairs in the development of low-light video enhancement methods and the need for comprehensive evaluation. Our dataset is available at DOI:10.21227/mzny-8c77.
△ Less
Submitted 25 May, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
A New Class of Algorithms for Finding Short Vectors in Lattices Lifted from Co-dimension $k$ Codes
Authors:
Robert Lin,
Peter W. Shor
Abstract:
We introduce a new class of algorithms for finding a short vector in lattices defined by codes of co-dimension $k$ over $\mathbb{Z}_P^d$, where $P$ is prime. The co-dimension $1$ case is solved by exploiting the packing properties of the projections mod $P$ of an initial set of non-lattice vectors onto a single dual codeword. The technical tools we introduce are sorting of the projections followed…
▽ More
We introduce a new class of algorithms for finding a short vector in lattices defined by codes of co-dimension $k$ over $\mathbb{Z}_P^d$, where $P$ is prime. The co-dimension $1$ case is solved by exploiting the packing properties of the projections mod $P$ of an initial set of non-lattice vectors onto a single dual codeword. The technical tools we introduce are sorting of the projections followed by single-step pairwise Euclidean reduction of the projections, resulting in monotonic convergence of the positive-valued projections to zero. The length of vectors grows by a geometric factor each iteration. For fixed $P$ and $d$, and large enough user-defined input sets, we show that it is possible to minimize the number of iterations, and thus the overall length expansion factor, to obtain a short lattice vector. Thus we obtain a novel approach for controlling the output length, which resolves an open problem posed by Noah Stephens-Davidowitz (the possibility of an approximation scheme for the shortest-vector problem (SVP) which does not reduce to near-exact SVP). In our approach, one may obtain short vectors even when the lattice dimension is quite large, e.g., 8000. For fixed $P$, the algorithm yields shorter vectors for larger $d$. We additionally present a number of extensions and generalizations of our fundamental co-dimension $1$ method. These include a method for obtaining many different lattice vectors by multiplying the dual codeword by an integer and then modding by $P$; a co-dimension $k$ generalization; a large input set generalization; and finally, a "block" generalization, which involves the replacement of pairwise (Euclidean) reduction by a $k$-party (non-Euclidean) reduction. The $k$-block generalization of our algorithm constitutes a class of polynomial-time algorithms indexed by $k\geq 2$, which yield successively improved approximations for the short vector problem.
△ Less
Submitted 22 January, 2024;
originally announced January 2024.
-
Cross-Age and Cross-Site Domain Shift Impacts on Deep Learning-Based White Matter Fiber Estimation in Newborn and Baby Brains
Authors:
Rizhong Lin,
Ali Gholipour,
Jean-Philippe Thiran,
Davood Karimi,
Hamza Kebiri,
Meritxell Bach Cuadra
Abstract:
Deep learning models have shown great promise in estimating tissue microstructure from limited diffusion magnetic resonance imaging data. However, these models face domain shift challenges when test and train data are from different scanners and protocols, or when the models are applied to data with inherent variations such as the develo** brains of infants and children scanned at various ages.…
▽ More
Deep learning models have shown great promise in estimating tissue microstructure from limited diffusion magnetic resonance imaging data. However, these models face domain shift challenges when test and train data are from different scanners and protocols, or when the models are applied to data with inherent variations such as the develo** brains of infants and children scanned at various ages. Several techniques have been proposed to address some of these challenges, such as data harmonization or domain adaptation in the adult brain. However, those techniques remain unexplored for the estimation of fiber orientation distribution functions in the rapidly develo** brains of infants. In this work, we extensively investigate the age effect and domain shift within and across two different cohorts of 201 newborns and 165 babies using the Method of Moments and fine-tuning strategies. Our results show that reduced variations in the microstructural development of babies in comparison to newborns directly impact the deep learning models' cross-age performance. We also demonstrate that a small number of target domain samples can significantly mitigate domain shift problems.
△ Less
Submitted 22 December, 2023;
originally announced December 2023.
-
Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction
Authors:
Da Luo,
Yanglei Gan,
Rui Hou,
Run Lin,
Qiao Liu,
Yuxiang Cai,
Wannian Gao
Abstract:
Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs to encompass the learne…
▽ More
Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs to encompass the learned representation with semantic richness in this learning paradigm is not fully explored. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framework is motivated by the insight that the diverse viewpoints conveyed through instance-label pairs capture incomplete yet complementary intrinsic textual semantics. Specifically, our framework involves a symmetrical contrastive objective that encompasses both sentence-anchored and label-anchored contrastive losses. By combining these two losses, the model establishes a robust and uniform representation space. This space effectively captures the reciprocal alignment of feature distributions among instances and relational facts, simultaneously enhancing the maximization of mutual information across diverse perspectives within the same relation. Experimental results demonstrate that our framework achieves significant performance enhancements compared to baseline models in downstream FSRE tasks. Furthermore, our approach exhibits superior adaptability to handle the challenges of domain shift and zero-shot relation extraction. Our code is available online at https://github.com/AONE-NLP/FSRE-SaCon.
△ Less
Submitted 11 March, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach
Authors:
Weiyu Ma,
Qirui Mi,
Yongcheng Zeng,
Xue Yan,
Yuqiao Wu,
Runji Lin,
Haifeng Zhang,
Jun Wang
Abstract:
StarCraft II is a challenging benchmark for AI agents due to the necessity of both precise micro level operations and strategic macro awareness. Previous works, such as Alphastar and SCC, achieve impressive performance on tackling StarCraft II , however, still exhibit deficiencies in long term strategic planning and strategy interpretability. Emerging large language model (LLM) agents, such as Voy…
▽ More
StarCraft II is a challenging benchmark for AI agents due to the necessity of both precise micro level operations and strategic macro awareness. Previous works, such as Alphastar and SCC, achieve impressive performance on tackling StarCraft II , however, still exhibit deficiencies in long term strategic planning and strategy interpretability. Emerging large language model (LLM) agents, such as Voyage and MetaGPT, presents the immense potential in solving intricate tasks. Motivated by this, we aim to validate the capabilities of LLMs on StarCraft II, a highly complex RTS game.To conveniently take full advantage of LLMs` reasoning abilities, we first develop textual StratCraft II environment, called TextStarCraft II, which LLM agent can interact. Secondly, we propose a Chain of Summarization method, including single frame summarization for processing raw observations and multi frame summarization for analyzing game information, providing command recommendations, and generating strategic decisions. Our experiment consists of two parts: first, an evaluation by human experts, which includes assessing the LLMs`s mastery of StarCraft II knowledge and the performance of LLM agents in the game; second, the in game performance of LLM agents, encompassing aspects like win rate and the impact of Chain of Summarization.Experiment results demonstrate that: 1. LLMs possess the relevant knowledge and complex planning abilities needed to address StarCraft II scenarios; 2. Human experts consider the performance of LLM agents to be close to that of an average player who has played StarCraft II for eight years; 3. LLM agents are capable of defeating the built in AI at the Harder(Lv5) difficulty level. We have open sourced the code and released demo videos of LLM agent playing StarCraft II.
△ Less
Submitted 17 June, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Evaluating Language-Model Agents on Realistic Autonomous Tasks
Authors:
Megan Kinniment,
Lucas Jun Koba Sato,
Haoxing Du,
Brian Goodrich,
Max Hasin,
Lawrence Chan,
Luke Harold Miles,
Tao R. Lin,
Hjalmar Wijk,
Joel Burget,
Aaron Ho,
Elizabeth Barnes,
Paul Christiano
Abstract:
In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting…
▽ More
In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ARA may be useful for informing measures around security, monitoring, and alignment. Additionally, once a system is capable of ARA, placing bounds on a system's capabilities may become significantly more difficult.
We construct four simple example agents that combine language models with tools that allow them to take actions in the world. We then evaluate these agents on 12 tasks relevant to ARA. We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. Unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ARA. In particular, we do not think that these evaluations provide good assurance that the ``next generation'' of language models (e.g. 100x effective compute scaleup on existing models) will not yield agents capable of ARA, unless intermediate evaluations are performed during pretraining. Relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ARA.
△ Less
Submitted 4 January, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
A Unifying Tensor View for Lightweight CNNs
Authors:
Jason Chun Lok Li,
Rui Lin,
Jiajun Zhou,
Edmund Yin Mun Lam,
Ngai Wong
Abstract:
Despite the decomposition of convolutional kernels for lightweight CNNs being well studied, existing works that rely on tensor network diagrams or hyperdimensional abstraction lack geometry intuition. This work devises a new perspective by linking a 3D-reshaped kernel tensor to its various slice-wise and rank-1 decompositions, permitting a straightforward connection between various tensor approxim…
▽ More
Despite the decomposition of convolutional kernels for lightweight CNNs being well studied, existing works that rely on tensor network diagrams or hyperdimensional abstraction lack geometry intuition. This work devises a new perspective by linking a 3D-reshaped kernel tensor to its various slice-wise and rank-1 decompositions, permitting a straightforward connection between various tensor approximations and efficient CNN modules. Specifically, it is discovered that a pointwise-depthwise-pointwise (PDP) configuration constitutes a viable construct for lightweight CNNs. Moreover, a novel link to the latest ShiftNet is established, inspiring a first-ever shift layer pruning that achieves nearly 50% compression with < 1% drop in accuracy for ShiftResNet.
△ Less
Submitted 15 December, 2023;
originally announced December 2023.
-
BER Analysis of SCMA-OFDM Systems in the Presence of Carrier Frequency Offset
Authors:
Haibo Liu,
Qu Luo,
Zilong Liu,
Shan Luo,
Pei Xiao,
Rong** Lin
Abstract:
Sparse code multiple access (SCMA) building upon orthogonal frequency division multiplexing (OFDM) is a promising wireless technology for supporting massive connectivity in future machine-type communication networks. However, the sensitivity of OFDM to carrier frequency offset (CFO) poses a major challenge because it leads to orthogonality loss and incurs intercarrier interference (ICI). In this p…
▽ More
Sparse code multiple access (SCMA) building upon orthogonal frequency division multiplexing (OFDM) is a promising wireless technology for supporting massive connectivity in future machine-type communication networks. However, the sensitivity of OFDM to carrier frequency offset (CFO) poses a major challenge because it leads to orthogonality loss and incurs intercarrier interference (ICI). In this paper, we investigate the bit error rate (BER) performance of SCMA-OFDM systems in the presence of CFO over both Gaussian and multipath Rayleigh fading channels. We first model the ICI in SCMA-OFDM as Gaussian variables conditioned on a single channel realization for fading channels. The BER is then evaluated by averaging over all codeword pairs considering the fading statistics. Through simulations, we validate the accuracy of our BER analysis and reveal that there is a significant BER degradation for SCMA-OFDM systems when the normalized CFO exceeds 0.02.
△ Less
Submitted 2 December, 2023;
originally announced December 2023.
-
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
Authors:
Keming Lu,
Hongyi Yuan,
Runji Lin,
Junyang Lin,
Zheng Yuan,
Chang Zhou,
**gren Zhou
Abstract:
The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complemen…
▽ More
The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate it by mining latent expertise with off-the-shelf reward models. We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. Zooter shows computation efficiency in inference as it introduces only a minor computation overhead of a routing function compared with reward model ranking methods. We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks. Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Lite it fly: An All-Deformable-Butterfly Network
Authors:
Rui Lin,
Jason Chun Lok Li,
Jiajun Zhou,
Binxiao Huang,
Jie Ran,
Ngai Wong
Abstract:
Most deep neural networks (DNNs) consist fundamentally of convolutional and/or fully connected layers, wherein the linear transform can be cast as the product between a filter matrix and a data matrix obtained by arranging feature tensors into columns. The lately proposed deformable butterfly (DeBut) decomposes the filter matrix into generalized, butterflylike factors, thus achieving network compr…
▽ More
Most deep neural networks (DNNs) consist fundamentally of convolutional and/or fully connected layers, wherein the linear transform can be cast as the product between a filter matrix and a data matrix obtained by arranging feature tensors into columns. The lately proposed deformable butterfly (DeBut) decomposes the filter matrix into generalized, butterflylike factors, thus achieving network compression orthogonal to the traditional ways of pruning or low-rank decomposition. This work reveals an intimate link between DeBut and a systematic hierarchy of depthwise and pointwise convolutions, which explains the empirically good performance of DeBut layers. By develo** an automated DeBut chain generator, we show for the first time the viability of homogenizing a DNN into all DeBut layers, thus achieving an extreme sparsity and compression. Various examples and hardware benchmarks verify the advantages of All-DeBut networks. In particular, we show it is possible to compress a PointNet to < 5% parameters with < 5% accuracy drop, a record not achievable by other compression schemes.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Learning Lens Blur Fields
Authors:
Esther Y. H. Lin,
Zhecheng Wang,
Rebecca Lin,
Daniel Miau,
Florian Kainz,
Jiawen Chen,
Xuaner Cecilia Zhang,
David B. Lindell,
Kiriakos N. Kutulakos
Abstract:
Optical blur is an inherent property of any lens system and is challenging to model in modern cameras because of their complex optical elements. To tackle this challenge, we introduce a high-dimensional neural representation of blur$-$$\textit{the lens blur field}$$-$and a practical method for acquiring it. The lens blur field is a multilayer perceptron (MLP) designed to (1) accurately capture var…
▽ More
Optical blur is an inherent property of any lens system and is challenging to model in modern cameras because of their complex optical elements. To tackle this challenge, we introduce a high-dimensional neural representation of blur$-$$\textit{the lens blur field}$$-$and a practical method for acquiring it. The lens blur field is a multilayer perceptron (MLP) designed to (1) accurately capture variations of the lens 2D point spread function over image plane location, focus setting and, optionally, depth and (2) represent these variations parametrically as a single, sensor-specific function. The representation models the combined effects of defocus, diffraction, aberration, and accounts for sensor features such as pixel color filters and pixel-specific micro-lenses. To learn the real-world blur field of a given device, we formulate a generalized non-blind deconvolution problem that directly optimizes the MLP weights using a small set of focal stacks as the only input. We also provide a first-of-its-kind dataset of 5D blur fields$-$for smartphone cameras, camera bodies equipped with a variety of lenses, etc. Lastly, we show that acquired 5D blur fields are expressive and accurate enough to reveal, for the first time, differences in optical behavior of smartphone devices of the same make and model.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Authors:
Open X-Embodiment Collaboration,
Abby O'Neill,
Abdul Rehman,
Abhinav Gupta,
Abhiram Maddukuri,
Abhishek Gupta,
Abhishek Padalkar,
Abraham Lee,
Acorn Pooley,
Agrim Gupta,
Ajay Mandlekar,
A**kya Jain,
Albert Tung,
Alex Bewley,
Alex Herzog,
Alex Irpan,
Alexander Khazatsky,
Anant Rai,
Anchit Gupta,
Andrew Wang,
Andrey Kolobov,
Anikait Singh,
Animesh Garg,
Aniruddha Kembhavi,
Annie Xie
, et al. (267 additional authors not shown)
Abstract:
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…
▽ More
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
△ Less
Submitted 1 June, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
On the Over-Memorization During Natural, Robust and Catastrophic Overfitting
Authors:
Runqi Lin,
Chaojian Yu,
Bo Han,
Tongliang Liu
Abstract:
Overfitting negatively impacts the generalization ability of deep neural networks (DNNs) in both natural and adversarial training. Existing methods struggle to consistently address different types of overfitting, typically designing strategies that focus separately on either natural or adversarial patterns. In this work, we adopt a unified perspective by solely focusing on natural patterns to expl…
▽ More
Overfitting negatively impacts the generalization ability of deep neural networks (DNNs) in both natural and adversarial training. Existing methods struggle to consistently address different types of overfitting, typically designing strategies that focus separately on either natural or adversarial patterns. In this work, we adopt a unified perspective by solely focusing on natural patterns to explore different types of overfitting. Specifically, we examine the memorization effect in DNNs and reveal a shared behaviour termed over-memorization, which impairs their generalization capacity. This behaviour manifests as DNNs suddenly becoming high-confidence in predicting certain training patterns and retaining a persistent memory for them. Furthermore, when DNNs over-memorize an adversarial pattern, they tend to simultaneously exhibit high-confidence prediction for the corresponding natural pattern. These findings motivate us to holistically mitigate different types of overfitting by hindering the DNNs from over-memorization training patterns. To this end, we propose a general framework, Distraction Over-Memorization (DOM), which explicitly prevents over-memorization by either removing or augmenting the high-confidence natural patterns. Extensive experiments demonstrate the effectiveness of our proposed method in mitigating overfitting across various training paradigms.
△ Less
Submitted 11 April, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
TensorMD: Scalable Tensor-Diagram based Machine Learning Interatomic Potential on Heterogeneous Many-Core Processors
Authors:
Xin Chen,
Yucheng Ouyang,
Xin Chen,
Zhenchuan Chen,
Rongfen Lin,
Xingyu Gao,
Lifang Wang,
Fang Li,
Yin Liu,
Honghui Shang,
Haifeng Song
Abstract:
Molecular dynamics simulations have emerged as a potent tool for investigating the physical properties and kinetic behaviors of materials at the atomic scale, particularly in extreme conditions. Ab initio accuracy is now achievable with machine learning based interatomic potentials. With recent advancements in high-performance computing, highly accurate and large-scale simulations become feasible.…
▽ More
Molecular dynamics simulations have emerged as a potent tool for investigating the physical properties and kinetic behaviors of materials at the atomic scale, particularly in extreme conditions. Ab initio accuracy is now achievable with machine learning based interatomic potentials. With recent advancements in high-performance computing, highly accurate and large-scale simulations become feasible. This study introduces TensorMD, a new machine learning interatomic potential (MLIP) model that integrates physical principles and tensor diagrams. The tensor formalism provides a more efficient computation and greater flexibility for use with other scientific codes. Additionally, we proposed several portable optimization strategies and developed a highly optimized version for the new Sunway supercomputer. Our optimized TensorMD can achieve unprecedented performance on the new Sunway, enabling simulations of up to 52 billion atoms with a time-to-solution of 31 ps/step/atom, setting new records for HPC + AI + MD.
△ Less
Submitted 12 October, 2023; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Qwen Technical Report
Authors:
**ze Bai,
Shuai Bai,
Yunfei Chu,
Zeyu Cui,
Kai Dang,
Xiaodong Deng,
Yang Fan,
Wenbin Ge,
Yu Han,
Fei Huang,
Binyuan Hui,
Luo Ji,
Mei Li,
Junyang Lin,
Runji Lin,
Dayiheng Liu,
Gao Liu,
Chengqiang Lu,
Keming Lu,
Jianxin Ma,
Rui Men,
Xingzhang Ren,
Xuancheng Ren,
Chuanqi Tan,
Sinan Tan
, et al. (23 additional authors not shown)
Abstract:
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Q…
▽ More
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
Cluster-based Method for Eavesdrop** Identification and Localization in Optical Links
Authors:
Haokun Song,
Rui Lin,
Andrea Sgambelluri,
Filippo Cugini,
Yajie Li,
Jie Zhang,
Paolo Monti
Abstract:
We propose a cluster-based method to detect and locate eavesdrop** events in optical line systems characterized by small power losses. Our findings indicate that detecting such subtle losses from eavesdrop** can be accomplished solely through optical performance monitoring (OPM) data collected at the receiver. On the other hand, the localization of such events can be effectively achieved by le…
▽ More
We propose a cluster-based method to detect and locate eavesdrop** events in optical line systems characterized by small power losses. Our findings indicate that detecting such subtle losses from eavesdrop** can be accomplished solely through optical performance monitoring (OPM) data collected at the receiver. On the other hand, the localization of such events can be effectively achieved by leveraging in-line OPM data.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
Graph Threading
Authors:
Erik D. Demaine,
Yael Kirkpatrick,
Rebecca Lin
Abstract:
Inspired by artistic practices such as beadwork and himmeli, we study the problem of threading a single string through a set of tubes, so that pulling the string forms a desired graph. More precisely, given a connected graph (where edges represent tubes and vertices represent junctions where they meet), we give a polynomial-time algorithm to find a minimum-length closed walk (representing a thread…
▽ More
Inspired by artistic practices such as beadwork and himmeli, we study the problem of threading a single string through a set of tubes, so that pulling the string forms a desired graph. More precisely, given a connected graph (where edges represent tubes and vertices represent junctions where they meet), we give a polynomial-time algorithm to find a minimum-length closed walk (representing a threading of string) that induces a connected graph of string at every junction. The algorithm is based on a surprising reduction to minimum-weight perfect matching. Along the way, we give tight worst-case bounds on the length of the optimal threading and on the maximum number of times this threading can visit a single edge. We also give more efficient solutions to two special cases: cubic graphs and the case when each edge can be visited at most twice.
△ Less
Submitted 28 May, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains
Authors:
Peiyang Li,
Ye Wang,
Qi Li,
Zhuotao Liu,
Ke Xu,
Ju Ren,
Zhiying Liu,
Ruilin Lin
Abstract:
Recently unsupervised machine learning based systems have been developed to detect zero-day Web attacks, which can effectively enhance existing Web Application Firewalls (WAFs). However, prior arts only consider detecting attacks on specific domains by training particular detection models for the domains. These systems require a large amount of training data, which causes a long period of time for…
▽ More
Recently unsupervised machine learning based systems have been developed to detect zero-day Web attacks, which can effectively enhance existing Web Application Firewalls (WAFs). However, prior arts only consider detecting attacks on specific domains by training particular detection models for the domains. These systems require a large amount of training data, which causes a long period of time for model training and deployment. In this paper, we propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains in an organization with limited training data. Specifically, it utilizes meta-learning to share knowledge across these domains, e.g., the relationship between HTTP requests in heterogeneous domains, to efficiently train detection models. Moreover, we develop an adaptive preprocessing module to facilitate semantic analysis of Web requests across different domains and design a multi-domain representation method to capture semantic correlations between different domains for cross-domain model training. We conduct experiments using four real-world datasets on different domains with a total of 293M Web requests. The experimental results demonstrate that RETSINA outperforms the existing unsupervised Web attack detection methods with limited training data, e.g., RETSINA needs only 5-minute training data to achieve comparable detection performance to the existing methods that train separate models for different domains using 1-day training data. We also conduct real-world deployment in an Internet company. RETSINA captures on average 126 and 218 zero-day attack requests per day in two domains, respectively, in one month.
△ Less
Submitted 7 September, 2023;
originally announced September 2023.
-
#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models
Authors:
Keming Lu,
Hongyi Yuan,
Zheng Yuan,
Runji Lin,
Junyang Lin,
Chuanqi Tan,
Chang Zhou,
**gren Zhou
Abstract:
Foundation language models obtain the instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, while their definitions remain obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions an…
▽ More
Foundation language models obtain the instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, while their definitions remain obscure and lack quantitative analyses. In this work, we propose InsTag, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions and define instruction diversity and complexity regarding tags. We obtain 6.6K tags to describe comprehensive user queries. Then we analyze popular open-sourced SFT datasets and find that the model ability grows with more diverse and complex data. Based on this observation, we propose a data selector based on InsTag to select 6K diverse and complex samples from open-source datasets and fine-tune models on InsTag-selected data. The resulting models, TagLM, outperform open-source models based on considerably larger SFT data evaluated by MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag in https://github.com/OFA-Sys/InsTag.
△ Less
Submitted 15 August, 2023; v1 submitted 14 August, 2023;
originally announced August 2023.
-
Surrogate Empowered Sim2Real Transfer of Deep Reinforcement Learning for ORC Superheat Control
Authors:
Runze Lin,
Yangyang Luo,
Xialai Wu,
Junghui Chen,
Biao Huang,
Lei Xie,
Hongye Su
Abstract:
The Organic Rankine Cycle (ORC) is widely used in industrial waste heat recovery due to its simple structure and easy maintenance. However, in the context of smart manufacturing in the process industry, traditional model-based optimization control methods are unable to adapt to the varying operating conditions of the ORC system or sudden changes in operating modes. Deep reinforcement learning (DRL…
▽ More
The Organic Rankine Cycle (ORC) is widely used in industrial waste heat recovery due to its simple structure and easy maintenance. However, in the context of smart manufacturing in the process industry, traditional model-based optimization control methods are unable to adapt to the varying operating conditions of the ORC system or sudden changes in operating modes. Deep reinforcement learning (DRL) has significant advantages in situations with uncertainty as it directly achieves control objectives by interacting with the environment without requiring an explicit model of the controlled plant. Nevertheless, direct application of DRL to physical ORC systems presents unacceptable safety risks, and its generalization performance under model-plant mismatch is insufficient to support ORC control requirements. Therefore, this paper proposes a Sim2Real transfer learning-based DRL control method for ORC superheat control, which aims to provide a new simple, feasible, and user-friendly solution for energy system optimization control. Experimental results show that the proposed method greatly improves the training speed of DRL in ORC control problems and solves the generalization performance issue of the agent under multiple operating conditions through Sim2Real transfer.
△ Less
Submitted 4 August, 2023;
originally announced August 2023.
-
A Message Passing Detection based Affine Frequency Division Multiplexing Communication System
Authors:
Lifan Wu,
Shan Luo,
Dongxiao Song,
Fan Yang,
Rong** Lin
Abstract:
The next generation of wireless communication technology is anticipated to address the communication reliability challenges encountered in high-speed mobile communication scenarios. An Orthogonal Time Frequency Space (OTFS) system has been introduced as a solution that effectively mitigates these issues. However, OTFS is associated with relatively high pilot overhead and multiuser multiplexing ove…
▽ More
The next generation of wireless communication technology is anticipated to address the communication reliability challenges encountered in high-speed mobile communication scenarios. An Orthogonal Time Frequency Space (OTFS) system has been introduced as a solution that effectively mitigates these issues. However, OTFS is associated with relatively high pilot overhead and multiuser multiplexing overhead. In response to these concerns within the OTFS framework, a novel modulation technology known as Affine Frequency Division Multiplexing (AFDM) which is based on the discrete affine Fourier transform has emerged. AFDM effectively resolves the challenges by achieving full diversity through parameter adjustments aligned with the channel's delay-Doppler profile. Consequently, AFDM is capable of achieving performance levels comparable to OTFS. As the research on AFDM detection is currently limited, we present a low-complexity yet efficient message passing (MP) algorithm. This algorithm handles joint interference cancellation and detection while capitalizing on the inherent sparsity of the channel. Based on simulation results, the MP detection algorithm outperforms Minimum Mean Square Error (MMSE) and Maximal Ratio Combining (MRC) detection techniques.
△ Less
Submitted 30 August, 2023; v1 submitted 29 July, 2023;
originally announced July 2023.
-
A Spectral Perspective towards Understanding and Improving Adversarial Robustness
Authors:
Binxiao Huang,
Rui Lin,
Chaofan Tao,
Ngai Wong
Abstract:
Deep neural networks (DNNs) are incredibly vulnerable to crafted, imperceptible adversarial perturbations. While adversarial training (AT) has proven to be an effective defense approach, the AT mechanism for robustness improvement is not fully understood. This work investigates AT from a spectral perspective, adding new insights to the design of effective defenses. In particular, we show that AT i…
▽ More
Deep neural networks (DNNs) are incredibly vulnerable to crafted, imperceptible adversarial perturbations. While adversarial training (AT) has proven to be an effective defense approach, the AT mechanism for robustness improvement is not fully understood. This work investigates AT from a spectral perspective, adding new insights to the design of effective defenses. In particular, we show that AT induces the deep model to focus more on the low-frequency region, which retains the shape-biased representations, to gain robustness. Further, we find that the spectrum of a white-box attack is primarily distributed in regions the model focuses on, and the perturbation attacks the spectral bands where the model is vulnerable. Based on this observation, to train a model tolerant to frequency-varying perturbation, we propose a spectral alignment regularization (SAR) such that the spectral output inferred by an attacked adversarial input stays as close as possible to its natural input counterpart. Experiments demonstrate that SAR and its weight averaging (WA) extension could significantly improve the robust accuracy by 1.14% ~ 3.87% relative to the standard AT, across multiple datasets (CIFAR-10, CIFAR-100 and Tiny ImageNet), and various attacks (PGD, C&W and Autoattack), without any extra data.
△ Less
Submitted 25 June, 2023;
originally announced June 2023.
-
Kernel Support Vector Machine Classifiers with the $\ell_0$-Norm Hinge Loss
Authors:
Rongrong Lin,
Yingjia Yao,
Yulan Liu
Abstract:
Support Vector Machine (SVM) has been one of the most successful machine learning techniques for binary classification problems. The key idea is to maximize the margin from the data to the hyperplane subject to correct classification on training samples. The commonly used hinge loss and its variations are sensitive to label noise, and unstable for resampling due to its unboundedness. This paper is…
▽ More
Support Vector Machine (SVM) has been one of the most successful machine learning techniques for binary classification problems. The key idea is to maximize the margin from the data to the hyperplane subject to correct classification on training samples. The commonly used hinge loss and its variations are sensitive to label noise, and unstable for resampling due to its unboundedness. This paper is concentrated on the kernel SVM with the $\ell_0$-norm hinge loss (referred as $\ell_0$-KSVM), which is a composite function of hinge loss and $\ell_0$-norm and then could overcome the difficulties mentioned above. In consideration of the nonconvexity and nonsmoothness of $\ell_0$-norm hinge loss, we first characterize the limiting subdifferential of the $\ell_0$-norm hinge loss and then derive the equivalent relationship among the proximal stationary point, the Karush-Kuhn-Tucker point, and the local optimal solution of $\ell_0$-KSVM. Secondly, we develop an ADMM algorithm for $\ell_0$-KSVM, and obtain that any limit point of the sequence generated by the proposed algorithm is a locally optimal solution. Lastly, some experiments on the synthetic and real datasets are illuminated to show that $\ell_0$-KSVM can achieve comparable accuracy compared with the standard KSVM while the former generally enjoys fewer support vectors.
△ Less
Submitted 24 June, 2023;
originally announced June 2023.
-
Large Sequence Models for Sequential Decision-Making: A Survey
Authors:
Muning Wen,
Runji Lin,
Han**g Wang,
Yaodong Yang,
Ying Wen,
Luo Mai,
Jun Wang,
Haifeng Zhang,
Weinan Zhang
Abstract:
Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally designed for prediction problems, it is natural to inquire about their suitability for sequential decision-making and reinforcement learning problems, which are ty…
▽ More
Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally designed for prediction problems, it is natural to inquire about their suitability for sequential decision-making and reinforcement learning problems, which are typically beset by long-standing issues involving sample efficiency, credit assignment, and partial observability. In recent years, sequence models, especially the Transformer, have attracted increasing interest in the RL communities, spawning numerous approaches with notable effectiveness and generalizability. This survey presents a comprehensive overview of recent works aimed at solving sequential decision-making tasks with sequence models such as the Transformer, by discussing the connection between sequential decision-making and sequence modeling, and categorizing them based on the way they utilize the Transformer. Moreover, this paper puts forth various potential avenues for future research intending to improve the effectiveness of large sequence models for sequential decision-making, encompassing theoretical foundations, network architectures, algorithms, and efficient training systems. As this article has been accepted by the Frontiers of Computer Science, here is an early version, and the most up-to-date version can be found at https://journal.hep.com.cn/fcs/EN/10.1007/s11704-023-2689-5
△ Less
Submitted 24 June, 2023;
originally announced June 2023.
-
PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Authors:
Hejie Cui,
Rongmei Lin,
Nasser Zalmout,
Chenwei Zhang,
**gbo Shang,
Carl Yang,
Xian Li
Abstract:
Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes can benefit from image-based extraction, like color, shape, pattern, among others. The visual modality has long been underutilized, mainly due to multimodal annotation difficulty. In this paper, we aim to patch the visual modality to the textual-establi…
▽ More
Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes can benefit from image-based extraction, like color, shape, pattern, among others. The visual modality has long been underutilized, mainly due to multimodal annotation difficulty. In this paper, we aim to patch the visual modality to the textual-established attribute information extractor. The cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired intra-sample and inter-samples; (C2) images usually contain rich backgrounds that can mislead the prediction; (C3) weakly supervised labels from textual-established extractors are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias reduction schemes: (S1) Augmented label-smoothed contrast to improve the cross-modality alignment for loosely-paired image and text; (S2) Attention-pruning that adaptively distinguishes the visual foreground; (S3) Two-level neighborhood regularization that mitigates the label textual bias via reliability estimation. Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
△ Less
Submitted 1 June, 2023;
originally announced June 2023.
-
Prediction of solar wind speed by applying convolutional neural network to potential field source surface (PFSS) magnetograms
Authors:
Rong Lin,
Zhekai Luo,
Jiansen He,
Lun Xie,
Chuanpeng Hou,
Shuwei Chen
Abstract:
An accurate solar wind speed model is important for space weather predictions, catastrophic event warnings, and other issues concerning solar wind - magnetosphere interaction. In this work, we construct a model based on convolutional neural network (CNN) and Potential Field Source Surface (PFSS) magnetograms, considering a solar wind source surface of $R_{\rm SS}=2.5R_\odot$, aiming to predict the…
▽ More
An accurate solar wind speed model is important for space weather predictions, catastrophic event warnings, and other issues concerning solar wind - magnetosphere interaction. In this work, we construct a model based on convolutional neural network (CNN) and Potential Field Source Surface (PFSS) magnetograms, considering a solar wind source surface of $R_{\rm SS}=2.5R_\odot$, aiming to predict the solar wind speed at the Lagrange 1 (L1) point of the Sun-Earth system. The input of our model consists of four Potential Field Source Surface (PFSS) magnetograms at $R_{\rm SS}$, which are 7, 6, 5, and 4 days before the target epoch. Reduced magnetograms are used to promote the model's efficiency. We use the Global Oscillation Network Group (GONG) photospheric magnetograms and the potential field extrapolation model to generate PFSS magnetograms at the source surface. The model provides predictions of the continuous test dataset with an averaged correlation coefficient (CC) of 0.52 and a root mean square error (RMSE) of 80.8 km/s in an eight-fold validation training scheme with the time resolution of the data as small as one hour. The model also has the potential to forecast high speed streams of the solar wind, which can be quantified with a general threat score of 0.39.
△ Less
Submitted 3 April, 2023;
originally announced April 2023.
-
Bagging by Learning to Singulate Layers Using Interactive Perception
Authors:
Lawrence Yunliang Chen,
Baiyu Shi,
Roy Lin,
Daniel Seita,
Ayah Ahmad,
Richard Cheng,
Thomas Kollar,
David Held,
Ken Goldberg
Abstract:
Many fabric handling and 2D deformable material tasks in homes and industry require singulating layers of material such as opening a bag or arranging garments for sewing. In contrast to methods requiring specialized sensing or end effectors, we use only visual observations with ordinary parallel jaw grippers. We propose SLIP: Singulating Layers using Interactive Perception, and apply SLIP to the t…
▽ More
Many fabric handling and 2D deformable material tasks in homes and industry require singulating layers of material such as opening a bag or arranging garments for sewing. In contrast to methods requiring specialized sensing or end effectors, we use only visual observations with ordinary parallel jaw grippers. We propose SLIP: Singulating Layers using Interactive Perception, and apply SLIP to the task of autonomous bagging. We develop SLIP-Bagging, a bagging algorithm that manipulates a plastic or fabric bag from an unstructured state, and uses SLIP to grasp the top layer of the bag to open it for object insertion. In physical experiments, a YuMi robot achieves a success rate of 67% to 81% across bags of a variety of materials, shapes, and sizes, significantly improving in success rate and generality over prior work. Experiments also suggest that SLIP can be applied to tasks such as singulating layers of folded cloth and garments. Supplementary material is available at https://sites.google.com/view/slip-bagging/.
△ Less
Submitted 1 September, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation
Authors:
Xiaoye Qian,
Youbao Tang,
Ning Zhang,
Mei Han,
**g Xiao,
Ming-Chun Huang,
Ruei-Sung Lin
Abstract:
Transformer-based approaches have been successfully proposed for 3D human pose estimation (HPE) from 2D pose sequence and achieved state-of-the-art (SOTA) performance. However, current SOTAs have difficulties in modeling spatial-temporal correlations of joints at different levels simultaneously. This is due to the poses' spatial-temporal complexity. Poses move at various speeds temporarily with va…
▽ More
Transformer-based approaches have been successfully proposed for 3D human pose estimation (HPE) from 2D pose sequence and achieved state-of-the-art (SOTA) performance. However, current SOTAs have difficulties in modeling spatial-temporal correlations of joints at different levels simultaneously. This is due to the poses' spatial-temporal complexity. Poses move at various speeds temporarily with various joints and body-parts movement spatially. Hence, a cookie-cutter transformer is non-adaptable and can hardly meet the "in-the-wild" requirement. To mitigate this issue, we propose Hierarchical Spatial-Temporal transFormers (HSTFormer) to capture multi-level joints' spatial-temporal correlations from local to global gradually for accurate 3D HPE. HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion. Extensive experiments on three datasets (i.e., Human3.6M, MPI-INF-3DHP, and HumanEva) demonstrate that HSTFormer achieves competitive and consistent performance on benchmarks with various scales and difficulties. Specifically, it surpasses recent SOTAs on the challenging MPI-INF-3DHP dataset and small-scale HumanEva dataset, with a highly generalized systematic approach. The code is available at: https://github.com/qianxiaoye825/HSTFormer.
△ Less
Submitted 18 January, 2023;
originally announced January 2023.
-
Freeform Islamic Geometric Patterns
Authors:
Rebecca Lin,
Craig S. Kaplan
Abstract:
Islamic geometric patterns are a rich and venerable ornamental tradition. Many classic designs feature periodic arrangements of rosettes: star shapes surrounded by rings of hexagonal petals. We present a new technique for generating 'freeform' compositions of rosettes: finite designs that freely mix rosettes of unusual sizes while retaining the aesthetics of traditional patterns. We use a circle p…
▽ More
Islamic geometric patterns are a rich and venerable ornamental tradition. Many classic designs feature periodic arrangements of rosettes: star shapes surrounded by rings of hexagonal petals. We present a new technique for generating 'freeform' compositions of rosettes: finite designs that freely mix rosettes of unusual sizes while retaining the aesthetics of traditional patterns. We use a circle packing as a scaffolding for develo** a patch of polygons and fill each polygon with a motif based on established constructions from Islamic art.
△ Less
Submitted 4 January, 2023;
originally announced January 2023.
-
Frequency Regularization for Improving Adversarial Robustness
Authors:
Binxiao Huang,
Chaofan Tao,
Rui Lin,
Ngai Wong
Abstract:
Deep neural networks are incredibly vulnerable to crafted, human-imperceptible adversarial perturbations. Although adversarial training (AT) has proven to be an effective defense approach, we find that the AT-trained models heavily rely on the input low-frequency content for judgment, accounting for the low standard accuracy. To close the large gap between the standard and robust accuracies during…
▽ More
Deep neural networks are incredibly vulnerable to crafted, human-imperceptible adversarial perturbations. Although adversarial training (AT) has proven to be an effective defense approach, we find that the AT-trained models heavily rely on the input low-frequency content for judgment, accounting for the low standard accuracy. To close the large gap between the standard and robust accuracies during AT, we investigate the frequency difference between clean and adversarial inputs, and propose a frequency regularization (FR) to align the output difference in the spectral domain. Besides, we find Stochastic Weight Averaging (SWA), by smoothing the kernels over epochs, further improves the robustness. Among various defense schemes, our method achieves the strongest robustness against attacks by PGD-20, C\&W and Autoattack, on a WideResNet trained on CIFAR-10 without any extra data.
△ Less
Submitted 24 December, 2022;
originally announced December 2022.
-
CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion
Authors:
**yuan Liu,
Runjia Lin,
Guanyao Wu,
Risheng Liu,
Zhongxuan Luo,
Xin Fan
Abstract:
Infrared and visible image fusion targets to provide an informative image by combining complementary information from different sensors. Existing learning-based fusion approaches attempt to construct various loss functions to preserve complementary features, while neglecting to discover the inter-relationship between the two modalities, leading to redundant or even invalid information on the fusio…
▽ More
Infrared and visible image fusion targets to provide an informative image by combining complementary information from different sensors. Existing learning-based fusion approaches attempt to construct various loss functions to preserve complementary features, while neglecting to discover the inter-relationship between the two modalities, leading to redundant or even invalid information on the fusion results. Moreover, most methods focus on strengthening the network with an increase in depth while neglecting the importance of feature transmission, causing vital information degeneration. To alleviate these issues, we propose a coupled contrastive learning network, dubbed CoCoNet, to realize infrared and visible image fusion in an end-to-end manner. Concretely, to simultaneously retain typical features from both modalities and to avoid artifacts emerging on the fused result, we develop a coupled contrastive constraint in our loss function. In a fused image, its foreground target / background detail part is pulled close to the infrared / visible source and pushed far away from the visible / infrared source in the representation space. We further exploit image characteristics to provide data-sensitive weights, allowing our loss function to build a more reliable relationship with source images. A multi-level attention module is established to learn rich hierarchical feature representation and to comprehensively transfer features in the fusion process. We also apply the proposed CoCoNet on medical image fusion of different types, e.g., magnetic resonance image, positron emission tomography image, and single photon emission computed tomography image. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance under both subjective and objective evaluation, especially in preserving prominent targets and recovering vital textural details.
△ Less
Submitted 14 October, 2023; v1 submitted 20 November, 2022;
originally announced November 2022.
-
Contextual Transformer for Offline Meta Reinforcement Learning
Authors:
Runji Lin,
Ye Li,
Xidong Feng,
Zhaowei Zhang,
Xian Hong Wu Fung,
Haifeng Zhang,
Jun Wang,
Yali Du,
Yaodong Yang
Abstract:
The pretrain-finetuning paradigm in large-scale sequence models has made significant progress in natural language processing and computer vision tasks. However, such a paradigm is still hindered by several challenges in Reinforcement Learning (RL), including the lack of self-supervised pretraining algorithms based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks…
▽ More
The pretrain-finetuning paradigm in large-scale sequence models has made significant progress in natural language processing and computer vision tasks. However, such a paradigm is still hindered by several challenges in Reinforcement Learning (RL), including the lack of self-supervised pretraining algorithms based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can improve sequence modeling-based offline reinforcement learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional policy generation. As such, we can pretrain a model on the offline dataset with self-supervised loss and learn a prompt to guide the policy towards desired actions. Secondly, we extend our framework to Meta-RL settings and propose Contextual Meta Transformer (CMT); CMT leverages the context among different tasks as the prompt to improve generalization on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. Superior results validate the strong performance, and generality of our methods.
△ Less
Submitted 15 November, 2022;
originally announced November 2022.
-
Prior-enhanced Temporal Action Localization using Subject-aware Spatial Attention
Authors:
Yifan Liu,
Youbao Tang,
Ning Zhang,
Ruei-Sung Lin,
Haoqian Wang
Abstract:
Temporal action localization (TAL) aims to detect the boundary and identify the class of each action instance in a long untrimmed video. Current approaches treat video frames homogeneously, and tend to give background and key objects excessive attention. This limits their sensitivity to localize action boundaries. To this end, we propose a prior-enhanced temporal action localization method (PETAL)…
▽ More
Temporal action localization (TAL) aims to detect the boundary and identify the class of each action instance in a long untrimmed video. Current approaches treat video frames homogeneously, and tend to give background and key objects excessive attention. This limits their sensitivity to localize action boundaries. To this end, we propose a prior-enhanced temporal action localization method (PETAL), which only takes in RGB input and incorporates action subjects as priors. This proposal leverages action subjects' information with a plug-and-play subject-aware spatial attention module (SA-SAM) to generate an aggregated and subject-prioritized representation. Experimental results on THUMOS-14 and ActivityNet-1.3 datasets demonstrate that the proposed PETAL achieves competitive performance using only RGB features, e.g., boosting mAP by 2.41% or 0.25% over the state-of-the-art approach that uses RGB features or with additional optical flow features on the THUMOS-14 dataset.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Recovering Sign Bits of DCT Coefficients in Digital Images as an Optimization Problem
Authors:
Ruiyuan Lin,
Sheng Liu,
Jun Jiang,
Shujun Li,
Chengqing Li,
C. -C. Jay Kuo
Abstract:
Recovering unknown, missing, damaged, distorted, or lost information in DCT coefficients is a common task in multiple applications of digital image processing, including image compression, selective image encryption, and image communication. This paper investigates the recovery of sign bits in DCT coefficients of digital images, by proposing two different approximation methods to solve a mixed int…
▽ More
Recovering unknown, missing, damaged, distorted, or lost information in DCT coefficients is a common task in multiple applications of digital image processing, including image compression, selective image encryption, and image communication. This paper investigates the recovery of sign bits in DCT coefficients of digital images, by proposing two different approximation methods to solve a mixed integer linear programming (MILP) problem, which is NP-hard in general. One method is a relaxation of the MILP problem to a linear programming (LP) problem, and the other splits the original MILP problem into some smaller MILP problems and an LP problem. We considered how the proposed methods can be applied to JPEG-encoded images and conducted extensive experiments to validate their performances. The experimental results showed that the proposed methods outperformed other existing methods by a substantial margin, both according to objective quality metrics and our subjective evaluation.
△ Less
Submitted 8 January, 2024; v1 submitted 2 November, 2022;
originally announced November 2022.