-
Diffusion Actor-Critic with Entropy Regulator
Authors:
Yinuo Wang,
Likun Wang,
Yuxuan Jiang,
Wenjun Zou,
Tong Liu,
Xujie Song,
Wenxuan Wang,
Liming Xiao,
Jiang Wu,
**gliang Duan,
Shengbo Eben Li
Abstract:
Reinforcement learning (RL) has proven highly effective in addressing complex decision-making and control tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution with learned mean and variance, which constrains their capability to acquire complex policies. In response to this problem, we propose an online RL algorithm termed diff…
▽ More
Reinforcement learning (RL) has proven highly effective in addressing complex decision-making and control tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution with learned mean and variance, which constrains their capability to acquire complex policies. In response to this problem, we propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER). This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function and leverages the capability of the diffusion model to fit multimodal distributions, thereby enhancing the representational capacity of the policy. Since the distribution of the diffusion policy lacks an analytical expression, its entropy cannot be determined analytically. To mitigate this, we propose a method to estimate the entropy of the diffusion policy utilizing Gaussian mixture model. Building on the estimated entropy, we can learn a parameter $α$ that modulates the degree of exploration and exploitation. Parameter $α$ will be employed to adaptively regulate the variance of the added noise, which is applied to the action output by the diffusion model. Experimental trials on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting a stronger representational capacity of the diffusion policy.
△ Less
Submitted 15 June, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Why Not Transform Chat Large Language Models to Non-English?
Authors:
Xiang Geng,
Ming Zhu,
Jiahuan Li,
Zhejian Lai,
Wei Zou,
Shuaijie She,
Jiaxin Guo,
Xiaofeng Zhao,
Yinglu Li,
Yuang Li,
Chang Su,
Yanqing Zhao,
Xinglin Lyu,
Min Zhang,
Jiajun Chen,
Hao Yang,
Shujian Huang
Abstract:
The scarcity of non-English data limits the development of non-English large language models (LLMs). Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method. Previous works start from base LLMs and perform knowledge distillation (KD) with data generated by stronger LLMs, e.g. GPT-4. Compared to base LLMs, chat LLMs are further optimized fo…
▽ More
The scarcity of non-English data limits the development of non-English large language models (LLMs). Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method. Previous works start from base LLMs and perform knowledge distillation (KD) with data generated by stronger LLMs, e.g. GPT-4. Compared to base LLMs, chat LLMs are further optimized for advanced abilities, e.g. multi-turn conversation and human preference alignment, and thus more powerful in both helpfulness and safety. However, transforming a chat LLM involves two critical issues: (1) How can we effectively transfer advanced abilities without their supervised data? (2) How can we prevent the original knowledge from catastrophic forgetting during transformation? We target these issues by introducing a simple framework called TransLLM. For the first issue, TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought, which uses the translation as the bridge between English and non-English step-by-step. We further enhance the performance of sub-tasks with publicly available data. For the second issue, we propose a method comprising two synergistic components: low-rank adaptation for training to maintain the original LLM parameters, and recovery KD, which utilizes data generated by the chat LLM itself to recover the original knowledge from the frozen parameters. In the experiments, we transform the LLaMA-2-chat-7B to the Thai language. Our method, using only single-turn data, outperforms strong baselines and ChatGPT on multi-turn benchmark MT-bench. Furthermore, our method, without safety data, rejects more harmful queries of safety benchmark AdvBench than both ChatGPT and GPT-4.
△ Less
Submitted 31 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Multi-Level Feature Fusion Network for Lightweight Stereo Image Super-Resolution
Authors:
Yunxiang Li,
Wenbin Zou,
Qiaomu Wei,
Feng Huang,
**g Wu
Abstract:
Stereo image super-resolution utilizes the cross-view complementary information brought by the disparity effect of left and right perspective images to reconstruct higher-quality images. Cascading feature extraction modules and cross-view feature interaction modules to make use of the information from stereo images is the focus of numerous methods. However, this adds a great deal of network parame…
▽ More
Stereo image super-resolution utilizes the cross-view complementary information brought by the disparity effect of left and right perspective images to reconstruct higher-quality images. Cascading feature extraction modules and cross-view feature interaction modules to make use of the information from stereo images is the focus of numerous methods. However, this adds a great deal of network parameters and structural redundancy. To facilitate the application of stereo image super-resolution in downstream tasks, we propose an efficient Multi-Level Feature Fusion Network for Lightweight Stereo Image Super-Resolution (MFFSSR). Specifically, MFFSSR utilizes the Hybrid Attention Feature Extraction Block (HAFEB) to extract multi-level intra-view features. Using the channel separation strategy, HAFEB can efficiently interact with the embedded cross-view interaction module. This structural configuration can efficiently mine features inside the view while improving the efficiency of cross-view information sharing. Hence, reconstruct image details and textures more accurately. Abundant experiments demonstrate the effectiveness of MFFSSR. We achieve superior performance with fewer parameters. The source code is available at https://github.com/KarosLYX/MFFSSR.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results
Authors:
Xiaoning Liu,
Zongwei Wu,
Ao Li,
Florin-Alexandru Vasluianu,
Yulun Zhang,
Shuhang Gu,
Le Zhang,
Ce Zhu,
Radu Timofte,
Zhi **,
Hongjun Wu,
Chenxi Wang,
Haitao Ling,
Yuanhao Cai,
Hao Bian,
Yuxin Zheng,
**g Lin,
Alan Yuille,
Ben Shao,
** Guo,
Tianli Liu,
Mohao Wu,
Yixu Feng,
Shuo Hou,
Haotian Lin
, et al. (87 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlig…
▽ More
This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlighting, extreme darkness, and night scenes. A notable total of 428 participants registered for the challenge, with 22 teams ultimately making valid submissions. This paper meticulously evaluates the state-of-the-art advancements in enhancing low-light images, reflecting the significant progress and creativity in this field.
△ Less
Submitted 22 April, 2024;
originally announced April 2024.
-
Enforcing Paraphrase Generation via Controllable Latent Diffusion
Authors:
Wei Zou,
Ziyuan Zhuang,
Shujian Huang,
Jia Liu,
Jiajun Chen
Abstract:
Paraphrase generation aims to produce high-quality and diverse utterances of a given text. Though state-of-the-art generation via the diffusion model reconciles generation quality and diversity, textual diffusion suffers from a truncation issue that hinders efficiency and quality control. In this work, we propose \textit{L}atent \textit{D}iffusion \textit{P}araphraser~(LDP), a novel paraphrase gen…
▽ More
Paraphrase generation aims to produce high-quality and diverse utterances of a given text. Though state-of-the-art generation via the diffusion model reconciles generation quality and diversity, textual diffusion suffers from a truncation issue that hinders efficiency and quality control. In this work, we propose \textit{L}atent \textit{D}iffusion \textit{P}araphraser~(LDP), a novel paraphrase generation by modeling a controllable diffusion process given a learned latent space. LDP achieves superior generation efficiency compared to its diffusion counterparts. It facilitates only input segments to enforce paraphrase semantics, which further improves the results without external features. Experiments show that LDP achieves improved and diverse paraphrase generation compared to baselines. Further analysis shows that our method is also helpful to other similar text generations and domain adaptations. Our code and data are available at https://github.com/NIL-zhuang/ld4pg.
△ Less
Submitted 13 April, 2024;
originally announced April 2024.
-
FCert: Certifiably Robust Few-Shot Classification in the Era of Foundation Models
Authors:
Yanting Wang,
Wei Zou,
**yuan Jia
Abstract:
Few-shot classification with foundation models (e.g., CLIP, DINOv2, PaLM-2) enables users to build an accurate classifier with a few labeled training samples (called support samples) for a classification task. However, an attacker could perform data poisoning attacks by manipulating some support samples such that the classifier makes the attacker-desired, arbitrary prediction for a testing input.…
▽ More
Few-shot classification with foundation models (e.g., CLIP, DINOv2, PaLM-2) enables users to build an accurate classifier with a few labeled training samples (called support samples) for a classification task. However, an attacker could perform data poisoning attacks by manipulating some support samples such that the classifier makes the attacker-desired, arbitrary prediction for a testing input. Empirical defenses cannot provide formal robustness guarantees, leading to a cat-and-mouse game between the attacker and defender. Existing certified defenses are designed for traditional supervised learning, resulting in sub-optimal performance when extended to few-shot classification. In our work, we propose FCert, the first certified defense against data poisoning attacks to few-shot classification. We show our FCert provably predicts the same label for a testing input under arbitrary data poisoning attacks when the total number of poisoned support samples is bounded. We perform extensive experiments on benchmark few-shot classification datasets with foundation models released by OpenAI, Meta, and Google in both vision and text domains. Our experimental results show our FCert: 1) maintains classification accuracy without attacks, 2) outperforms existing state-of-the-art certified defenses for data poisoning attacks, and 3) is efficient and general.
△ Less
Submitted 12 April, 2024;
originally announced April 2024.
-
MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models
Authors:
Yanting Wang,
Hongye Fu,
Wei Zou,
**yuan Jia
Abstract:
Different from a unimodal model whose input is from a single modality, the input (called multi-modal input) of a multi-modal model is from multiple modalities such as image, 3D points, audio, text, etc. Similar to unimodal models, many existing studies show that a multi-modal model is also vulnerable to adversarial perturbation, where an attacker could add small perturbation to all modalities of a…
▽ More
Different from a unimodal model whose input is from a single modality, the input (called multi-modal input) of a multi-modal model is from multiple modalities such as image, 3D points, audio, text, etc. Similar to unimodal models, many existing studies show that a multi-modal model is also vulnerable to adversarial perturbation, where an attacker could add small perturbation to all modalities of a multi-modal input such that the multi-modal model makes incorrect predictions for it. Existing certified defenses are mostly designed for unimodal models, which achieve sub-optimal certified robustness guarantees when extended to multi-modal models as shown in our experimental results. In our work, we propose MMCert, the first certified defense against adversarial attacks to a multi-modal model. We derive a lower bound on the performance of our MMCert under arbitrary adversarial attacks with bounded perturbations to both modalities (e.g., in the context of auto-driving, we bound the number of changed pixels in both RGB image and depth image). We evaluate our MMCert using two benchmark datasets: one for the multi-modal road segmentation task and the other for the multi-modal emotion recognition task. Moreover, we compare our MMCert with a state-of-the-art certified defense extended from unimodal models. Our experimental results show that our MMCert outperforms the baseline.
△ Less
Submitted 1 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
Manifold Regularization Classification Model Based On Improved Diffusion Map
Authors:
Hongfu Guo,
Wencheng Zou,
Zeyu Zhang,
Shuishan Zhang,
Ruitong Wang,
**tao Zhang
Abstract:
Manifold regularization model is a semi-supervised learning model that leverages the geometric structure of a dataset, comprising a small number of labeled samples and a large number of unlabeled samples, to generate classifiers. However, the original manifold norm limits the performance of models to local regions. To address this limitation, this paper proposes an approach to improve manifold reg…
▽ More
Manifold regularization model is a semi-supervised learning model that leverages the geometric structure of a dataset, comprising a small number of labeled samples and a large number of unlabeled samples, to generate classifiers. However, the original manifold norm limits the performance of models to local regions. To address this limitation, this paper proposes an approach to improve manifold regularization based on a label propagation model. We initially enhance the probability transition matrix of the diffusion map algorithm, which can be used to estimate the Neumann heat kernel, enabling it to accurately depict the label propagation process on the manifold. Using this matrix, we establish a label propagation function on the dataset to describe the distribution of labels at different time steps. Subsequently, we extend the label propagation function to the entire data manifold. We prove that the extended label propagation function converges to a stable distribution after a sufficiently long time and can be considered as a classifier. Building upon this concept, we propose a viable improvement to the manifold regularization model and validate its superiority through experiments.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
Policy Bifurcation in Safe Reinforcement Learning
Authors:
Wenjun Zou,
Yao Lyu,
Jie Li,
Yujie Yang,
Shengbo Eben Li,
**gliang Duan,
Xianyuan Zhan,
**g**g Liu,
Yaqin Zhang,
Keqiang Li
Abstract:
Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume continuity in policy functions, where policies map states to actions in a smooth, uninterrupted manner; however, our research finds that in some scenarios, the feasible policy should be discontinuous or multi-valued, interpolating between discontinuous l…
▽ More
Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume continuity in policy functions, where policies map states to actions in a smooth, uninterrupted manner; however, our research finds that in some scenarios, the feasible policy should be discontinuous or multi-valued, interpolating between discontinuous local optima can inevitably lead to constraint violations. We are the first to identify the generating mechanism of such a phenomenon, and employ topological analysis to rigorously prove the existence of policy bifurcation in safe RL, which corresponds to the contractibility of the reachable tuple. Our theorem reveals that in scenarios where the obstacle-free state space is non-simply connected, a feasible policy is required to be bifurcated, meaning its output action needs to change abruptly in response to the varying state. To train such a bifurcated policy, we propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output. The bifurcated behavior can be achieved by selecting the Gaussian component with the highest mixing coefficient. Besides, MUPO also integrates spectral normalization and forward KL divergence to enhance the policy's capability of exploring different modes. Experiments with vehicle control tasks show that our algorithm successfully learns the bifurcated policy and ensures satisfying safety, while a continuous policy suffers from inevitable constraint violations.
△ Less
Submitted 28 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
Authors:
Yuxin Guo,
Shijie Ma,
Hu Su,
Zhiqing Wang,
Yuhao Zhao,
Wei Zou,
Siyang Sun,
Yun Zheng
Abstract:
Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives.…
▽ More
Audio-Visual Source Localization (AVSL) aims to locate sounding objects within video frames given the paired audio clips. Existing methods predominantly rely on self-supervised contrastive learning of audio-visual correspondence. Without any bounding-box annotations, they struggle to achieve precise localization, especially for small objects, and suffer from blurry boundaries and false positives. Moreover, the naive semi-supervised method is poor in fully leveraging the information of abundant unlabeled data. In this paper, we propose a novel semi-supervised learning framework for AVSL, namely Dual Mean-Teacher (DMT), comprising two teacher-student structures to circumvent the confirmation bias issue. Specifically, two teachers, pre-trained on limited labeled data, are employed to filter out noisy samples via the consensus between their predictions, and then generate high-quality pseudo-labels by intersecting their confidence maps. The sufficient utilization of both labeled and unlabeled data and the proposed unbiased framework enable DMT to outperform current state-of-the-art methods by a large margin, with CIoU of 90.4% and 48.8% on Flickr-SoundNet and VGG-Sound Source, obtaining 8.9%, 9.6% and 4.6%, 6.4% improvements over self- and semi-supervised methods respectively, given only 3% positional-annotations. We also extend our framework to some existing AVSL methods and consistently boost their performance.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization
Authors:
Yuxin Guo,
Shijie Ma,
Yuhao Zhao,
Hu Su,
Wei Zou
Abstract:
Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in the scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla hard pseudo-labels including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn fro…
▽ More
Audio-Visual Source Localization (AVSL) is the task of identifying specific sounding objects in the scene given audio cues. In our work, we focus on semi-supervised AVSL with pseudo-labeling. To address the issues with vanilla hard pseudo-labels including bias accumulation, noise sensitivity, and instability, we propose a novel method named Cross Pseudo-Labeling (XPL), wherein two models learn from each other with the cross-refine mechanism to avoid bias accumulation. We equip XPL with two effective components. Firstly, the soft pseudo-labels with sharpening and pseudo-label exponential moving average mechanisms enable models to achieve gradual self-improvement and ensure stable training. Secondly, the curriculum data selection module adaptively selects pseudo-labels with high quality during training to mitigate potential bias. Experimental results demonstrate that XPL significantly outperforms existing methods, achieving state-of-the-art performance while effectively mitigating confirmation bias and ensuring training stability.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
A Piece of Theatre: Investigating How Teachers Design LLM Chatbots to Assist Adolescent Cyberbullying Education
Authors:
Michael A. Hedderich,
Natalie N. Bazarova,
Wenting Zou,
Ryun Shim,
Xinda Ma,
Qian Yang
Abstract:
Cyberbullying harms teenagers' mental health, and teaching them upstanding intervention is crucial. Wizard-of-Oz studies show chatbots can scale up personalized and interactive cyberbullying education, but implementing such chatbots is a challenging and delicate task. We created a no-code chatbot design tool for K-12 teachers. Using large language models and prompt chaining, our tool allows teache…
▽ More
Cyberbullying harms teenagers' mental health, and teaching them upstanding intervention is crucial. Wizard-of-Oz studies show chatbots can scale up personalized and interactive cyberbullying education, but implementing such chatbots is a challenging and delicate task. We created a no-code chatbot design tool for K-12 teachers. Using large language models and prompt chaining, our tool allows teachers to prototype bespoke dialogue flows and chatbot utterances. In offering this tool, we explore teachers' distinctive needs when designing chatbots to assist their teaching, and how chatbot design tools might better support them. Our findings reveal that teachers welcome the tool enthusiastically. Moreover, they see themselves as playwrights guiding both the students' and the chatbot's behaviors, while allowing for some improvisation. Their goal is to enable students to rehearse both desirable and undesirable reactions to cyberbullying in a safe environment. We discuss the design opportunities LLM-Chains offer for empowering teachers and the research opportunities this work opens up.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models
Authors:
Wei Zou,
Runpeng Geng,
Binghui Wang,
**yuan Jia
Abstract:
Large language models (LLMs) have achieved remarkable success due to their exceptional generative capabilities. Despite their success, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. Retrieval-Augmented Generation (RAG) is a state-of-the-art technique to mitigate those limitations. In particular, given a question, RAG retrieves relevant knowledge from…
▽ More
Large language models (LLMs) have achieved remarkable success due to their exceptional generative capabilities. Despite their success, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. Retrieval-Augmented Generation (RAG) is a state-of-the-art technique to mitigate those limitations. In particular, given a question, RAG retrieves relevant knowledge from a knowledge database to augment the input of the LLM. For instance, the retrieved knowledge could be a set of top-k texts that are most semantically similar to the given question when the knowledge database contains millions of texts collected from Wikipedia. As a result, the LLM could utilize the retrieved knowledge as the context to generate an answer for the given question. Existing studies mainly focus on improving the accuracy or efficiency of RAG, leaving its security largely unexplored. We aim to bridge the gap in this work. Particularly, we propose PoisonedRAG , a set of knowledge poisoning attacks to RAG, where an attacker could inject a few poisoned texts into the knowledge database such that the LLM generates an attacker-chosen target answer for an attacker-chosen target question. We formulate knowledge poisoning attacks as an optimization problem, whose solution is a set of poisoned texts. Depending on the background knowledge (e.g., black-box and white-box settings) of an attacker on the RAG, we propose two solutions to solve the optimization problem, respectively. Our results on multiple benchmark datasets and LLMs show our attacks could achieve 90% attack success rates when injecting 5 poisoned texts for each target question into a database with millions of texts. We also evaluate recent defenses and our results show they are insufficient to defend against our attacks, highlighting the need for new defenses.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code
Authors:
Wenjie Qu,
Dong Yin,
Zixin He,
Wei Zou,
Tianyang Tao,
**yuan Jia,
Jiaheng Zhang
Abstract:
Large Language Models (LLMs) have been widely deployed for their remarkable capability to generate texts resembling human language. However, they could be misused by criminals to create deceptive content, such as fake news and phishing emails, which raises ethical concerns. Watermarking is a key technique to mitigate the misuse of LLMs, which embeds a watermark (e.g., a bit string) into a text gen…
▽ More
Large Language Models (LLMs) have been widely deployed for their remarkable capability to generate texts resembling human language. However, they could be misused by criminals to create deceptive content, such as fake news and phishing emails, which raises ethical concerns. Watermarking is a key technique to mitigate the misuse of LLMs, which embeds a watermark (e.g., a bit string) into a text generated by a LLM. Consequently, this enables the detection of texts generated by a LLM as well as the tracing of generated texts to a specific user. The major limitation of existing watermark techniques is that they cannot accurately or efficiently extract the watermark from a text, especially when the watermark is a long bit string. This key limitation impedes their deployment for real-world applications, e.g., tracing generated texts to a specific user.
This work introduces a novel watermarking method for LLM-generated text grounded in \textbf{error-correction codes} to address this challenge. We provide strong theoretical analysis, demonstrating that under bounded adversarial word/token edits (insertion, deletion, and substitution), our method can correctly extract watermarks, offering a provable robustness guarantee. This breakthrough is also evidenced by our extensive experimental results. The experiments show that our method substantially outperforms existing baselines in both accuracy and robustness on benchmark datasets. For instance, when embedding a bit string of length 12 into a 200-token generated text, our approach attains an impressive match rate of $98.4\%$, surpassing the performance of Yoo et al. (state-of-the-art baseline) at $85.6\%$. When subjected to a copy-paste attack involving the injection of 50 tokens to generated texts with 200 words, our method maintains a substantial match rate of $90.8\%$, while the match rate of Yoo et al. diminishes to below $65\%$.
△ Less
Submitted 15 April, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization
Authors:
Shuaijie She,
Wei Zou,
Shujian Huang,
Wenhao Zhu,
Xiang Liu,
Xiang Geng,
Jiajun Chen
Abstract:
Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning abilities across different languages, e.g., reasoning in the dominant language like English is superior to other languages due to the imbalance of multilingual training data. To enhance reasoning abilities in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimizatio…
▽ More
Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning abilities across different languages, e.g., reasoning in the dominant language like English is superior to other languages due to the imbalance of multilingual training data. To enhance reasoning abilities in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO), aiming to align the reasoning processes in other languages with the dominant language. Specifically, we harness an off-the-shelf translation model for the consistency between answers in non-dominant and dominant languages, which we adopt as the preference for optimization, e.g., Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models on all three benchmarks (MSVAMP +16.2%, MGSM +6.1%, and MNumGLUESub +13.3%), with improved reasoning consistency across languages.
△ Less
Submitted 13 April, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models
Authors:
Na Liu,
Liangyu Chen,
Xiaoyu Tian,
Wei Zou,
Kaijiang Chen,
Ming Cui
Abstract:
This paper introduces RAISE (Reasoning and Acting through Scratchpad and Examples), an advanced architecture enhancing the integration of Large Language Models (LLMs) like GPT-4 into conversational agents. RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. I…
▽ More
This paper introduces RAISE (Reasoning and Acting through Scratchpad and Examples), an advanced architecture enhancing the integration of Large Language Models (LLMs) like GPT-4 into conversational agents. RAISE, an enhancement of the ReAct framework, incorporates a dual-component memory system, mirroring human short-term and long-term memory, to maintain context and continuity in conversations. It entails a comprehensive agent construction scenario, including phases like Conversation Selection, Scene Extraction, CoT Completion, and Scene Augmentation, leading to the LLMs Training phase. This approach appears to enhance agent controllability and adaptability in complex, multi-turn dialogues. Our preliminary evaluations in a real estate sales context suggest that RAISE has some advantages over traditional agents, indicating its potential for broader applications. This work contributes to the AI field by providing a robust framework for develo** more context-aware and versatile conversational agents.
△ Less
Submitted 30 January, 2024; v1 submitted 5 January, 2024;
originally announced January 2024.
-
A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Software Engineering Tasks
Authors:
Wentao Zou,
Qi Li,
Jidong Ge,
Chuanyi Li,
Xiaoyu Shen,
Liguo Huang,
Bin Luo
Abstract:
Pre-trained models (PTMs) have achieved great success in various Software Engineering (SE) downstream tasks following the ``pre-train then fine-tune'' paradigm. As fully fine-tuning all parameters of PTMs can be computationally expensive, a widely used solution is parameter-efficient fine-tuning (PEFT), which freezes PTMs while introducing extra parameters. Though work has been done to test PEFT m…
▽ More
Pre-trained models (PTMs) have achieved great success in various Software Engineering (SE) downstream tasks following the ``pre-train then fine-tune'' paradigm. As fully fine-tuning all parameters of PTMs can be computationally expensive, a widely used solution is parameter-efficient fine-tuning (PEFT), which freezes PTMs while introducing extra parameters. Though work has been done to test PEFT methods in the SE field, a comprehensive evaluation is still lacking. This paper aims to fill in this gap by evaluating the effectiveness of five PEFT methods on eight PTMs and four SE downstream tasks. For different tasks and PEFT methods, we seek answers to the following research questions: 1) Is it more effective to use PTMs trained specifically on source code, or is it sufficient to use PTMs trained on natural language text? 2) What is the impact of varying model sizes? 3) How does the model architecture affect the performance? Besides effectiveness, we also discuss the efficiency of PEFT methods, concerning the costs of required training time and GPU resource consumption. We hope that our findings can provide a deeper understanding of PEFT methods on various PTMs and SE downstream tasks. All the codes and data are available at \url{https://github.com/zwtnju/PEFT.git}.
△ Less
Submitted 25 December, 2023;
originally announced December 2023.
-
VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook
Authors:
Wenbin Zou,
Hongxia Gao,
Tian Ye,
Liang Chen,
Weipeng Yang,
Shasha Huang,
Hongsheng Chen,
Sixiang Chen
Abstract:
Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fitting end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby bringing the greater error to the final result. We believe in the str…
▽ More
Night photography often struggles with challenges like low light and blurring, stemming from dark environments and prolonged exposures. Current methods either disregard priors and directly fitting end-to-end networks, leading to inconsistent illumination, or rely on unreliable handcrafted priors to constrain the network, thereby bringing the greater error to the final result. We believe in the strength of data-driven high-quality priors and strive to offer a reliable and consistent prior, circumventing the restrictions of manual priors. In this paper, we propose Clearer Night Image Restoration with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent restoration outcomes on real-world and synthetic benchmarks. To ensure the faithful restoration of details and illumination, we propose the incorporation of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM leverages the inter-channel correlation of features to dynamically maintain illumination consistency between degraded features and high-quality codebook features. Meanwhile, the DBCA module effectively integrates texture and structural information through bi-directional cross-attention and deformable convolution, resulting in enhanced fine-grained detail and structural fidelity across parallel decoders. Extensive experiments validate the remarkable benefits of VQCNIR in enhancing image quality under low-light conditions, showcasing its state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/AlexZou14/VQCNIR.
△ Less
Submitted 16 December, 2023; v1 submitted 13 December, 2023;
originally announced December 2023.
-
On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering
Authors:
Linyong Nan,
Ellen Zhang,
Wei** Zou,
Yilun Zhao,
Wenfei Zhou,
Arman Cohan
Abstract:
This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task necessitates LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings high…
▽ More
This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task necessitates LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings highlight that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction. A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
HeLM: Highlighted Evidence augmented Language Model for Enhanced Table-to-Text Generation
Authors:
Junyi Bian,
Xiaolei Qin,
Wuhe Zou,
Mengzuo Huang,
Congyi Luo,
Ke Zhang,
Weidong Zhang
Abstract:
Large models have demonstrated significant progress across various domains, particularly in tasks related to text generation. In the domain of Table to Text, many Large Language Model (LLM)-based methods currently resort to modifying prompts to invoke public APIs, incurring potential costs and information leaks. With the advent of open-source large models, fine-tuning LLMs has become feasible. In…
▽ More
Large models have demonstrated significant progress across various domains, particularly in tasks related to text generation. In the domain of Table to Text, many Large Language Model (LLM)-based methods currently resort to modifying prompts to invoke public APIs, incurring potential costs and information leaks. With the advent of open-source large models, fine-tuning LLMs has become feasible. In this study, we conducted parameter-efficient fine-tuning on the LLaMA2 model. Distinguishing itself from previous fine-tuning-based table-to-text methods, our approach involves injecting reasoning information into the input by emphasizing table-specific row data. Our model consists of two modules: 1) a table reasoner that identifies relevant row evidence, and 2) a table summarizer that generates sentences based on the highlighted table. To facilitate this, we propose a search strategy to construct reasoning labels for training the table reasoner. On both the FetaQA and QTSumm datasets, our approach achieved state-of-the-art results. Additionally, we observed that highlighting input tables significantly enhances the model's performance and provides valuable interpretability.
△ Less
Submitted 27 April, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking
Authors:
Xiaoyu Tian,
Liangyu Chen,
Na Liu,
Yaxuan Liu,
Wei Zou,
Kaijiang Chen,
Ming Cui
Abstract:
Inspired by the dual-process theory of human cognition, we introduce DUMA, a novel conversational agent framework that embodies a dual-mind mechanism through the utilization of two generative Large Language Models (LLMs) dedicated to fast and slow thinking respectively. The fast thinking model serves as the primary interface for external interactions and initial response generation, evaluating the…
▽ More
Inspired by the dual-process theory of human cognition, we introduce DUMA, a novel conversational agent framework that embodies a dual-mind mechanism through the utilization of two generative Large Language Models (LLMs) dedicated to fast and slow thinking respectively. The fast thinking model serves as the primary interface for external interactions and initial response generation, evaluating the necessity for engaging the slow thinking model based on the complexity of the complete response. When invoked, the slow thinking model takes over the conversation, engaging in meticulous planning, reasoning, and tool utilization to provide a well-analyzed response. This dual-mind configuration allows for a seamless transition between intuitive responses and deliberate problem-solving processes based on the situation. We have constructed a conversational agent to handle online inquiries in the real estate industry. The experiment proves that our method balances effectiveness and efficiency, and has a significant improvement compared to the baseline.
△ Less
Submitted 24 November, 2023; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Calibration-based Dual Prototypical Contrastive Learning Approach for Domain Generalization Semantic Segmentation
Authors:
Muxin Liao,
Shishun Tian,
Yuhang Zhang,
Guoguang Hua,
Wenbin Zou,
Xia Li
Abstract:
Prototypical contrastive learning (PCL) has been widely used to learn class-wise domain-invariant features recently. These methods are based on the assumption that the prototypes, which are represented as the central value of the same class in a certain domain, are domain-invariant. Since the prototypes of different domains have discrepancies as well, the class-wise domain-invariant features learn…
▽ More
Prototypical contrastive learning (PCL) has been widely used to learn class-wise domain-invariant features recently. These methods are based on the assumption that the prototypes, which are represented as the central value of the same class in a certain domain, are domain-invariant. Since the prototypes of different domains have discrepancies as well, the class-wise domain-invariant features learned from the source domain by PCL need to be aligned with the prototypes of other domains simultaneously. However, the prototypes of the same class in different domains may be different while the prototypes of different classes may be similar, which may affect the learning of class-wise domain-invariant features. Based on these observations, a calibration-based dual prototypical contrastive learning (CDPCL) approach is proposed to reduce the domain discrepancy between the learned class-wise features and the prototypes of different domains for domain generalization semantic segmentation. It contains an uncertainty-guided PCL (UPCL) and a hard-weighted PCL (HPCL). Since the domain discrepancies of the prototypes of different classes may be different, we propose an uncertainty probability matrix to represent the domain discrepancies of the prototypes of all the classes. The UPCL estimates the uncertainty probability matrix to calibrate the weights of the prototypes during the PCL. Moreover, considering that the prototypes of different classes may be similar in some circumstances, which means these prototypes are hard-aligned, the HPCL is proposed to generate a hard-weighted matrix to calibrate the weights of the hard-aligned prototypes during the PCL. Extensive experiments demonstrate that our approach achieves superior performance over current approaches on domain generalization semantic segmentation tasks.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
Authors:
Shulin Huang,
Shirong Ma,
Yinghui Li,
Mengzuo Huang,
Wuhe Zou,
Weidong Zhang,
Hai-Tao Zheng
Abstract:
With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning or vertical thinking capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model's lateral thinking within an interactive frame…
▽ More
With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning or vertical thinking capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs with 2 aspects: the quality of questions posed by the model and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle with employing lateral thinking during interactions. For example, even the most advanced model, GPT-4, exhibits the advantage to some extent, yet still maintain a noticeable gap when compared to human. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.
△ Less
Submitted 17 March, 2024; v1 submitted 21 August, 2023;
originally announced August 2023.
-
ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation
Authors:
Cheng Wen,
Xianghui Sun,
Shuaijiang Zhao,
Xiaoquan Fang,
Liangyu Chen,
Wei Zou
Abstract:
This paper presents the development and evaluation of ChatHome, a domain-specific language model (DSLM) designed for the intricate field of home renovation. Considering the proven competencies of large language models (LLMs) like GPT-4 and the escalating fascination with home renovation, this study endeavors to reconcile these aspects by generating a dedicated model that can yield high-fidelity, p…
▽ More
This paper presents the development and evaluation of ChatHome, a domain-specific language model (DSLM) designed for the intricate field of home renovation. Considering the proven competencies of large language models (LLMs) like GPT-4 and the escalating fascination with home renovation, this study endeavors to reconcile these aspects by generating a dedicated model that can yield high-fidelity, precise outputs relevant to the home renovation arena. ChatHome's novelty rests on its methodology, fusing domain-adaptive pretraining and instruction-tuning over an extensive dataset. This dataset includes professional articles, standard documents, and web content pertinent to home renovation. This dual-pronged strategy is designed to ensure that our model can assimilate comprehensive domain knowledge and effectively address user inquiries. Via thorough experimentation on diverse datasets, both universal and domain-specific, including the freshly introduced "EvalHome" domain dataset, we substantiate that ChatHome not only amplifies domain-specific functionalities but also preserves its versatility.
△ Less
Submitted 28 July, 2023;
originally announced July 2023.
-
QTSumm: Query-Focused Summarization over Tabular Data
Authors:
Yilun Zhao,
Zhenting Qi,
Linyong Nan,
Boyu Mi,
Yixin Liu,
Wei** Zou,
Simeng Han,
Ruizhe Chen,
Xiangru Tang,
Yumo Xu,
Dragomir Radev,
Arman Cohan
Abstract:
People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and a…
▽ More
People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and analysis over the given table to generate a tailored summary. We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics. We investigate a set of strong baselines on QTSumm, including text generation, table-to-text generation, and large language models. Experimental results and manual analysis reveal that the new task presents significant challenges in table-to-text generation for future research. Moreover, we propose a new approach named ReFactor, to retrieve and reason over query-relevant information from tabular data to generate several natural language facts. Experimental results demonstrate that ReFactor can bring improvements to baselines by concatenating the generated facts to the model input. Our data and code are publicly available at https://github.com/yale-nlp/QTSumm.
△ Less
Submitted 6 November, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies
Authors:
Linyong Nan,
Yilun Zhao,
Wei** Zou,
Narutatsu Ri,
Jaesung Tae,
Ellen Zhang,
Arman Cohan,
Dragomir Radev
Abstract:
In-context learning (ICL) has emerged as a new approach to various natural language processing tasks, utilizing large language models (LLMs) to make predictions based on context that has been supplemented with a few examples or task-specific instructions. In this paper, we aim to extend this method to question answering tasks that utilize structured knowledge sources, and improve Text-to-SQL syste…
▽ More
In-context learning (ICL) has emerged as a new approach to various natural language processing tasks, utilizing large language models (LLMs) to make predictions based on context that has been supplemented with a few examples or task-specific instructions. In this paper, we aim to extend this method to question answering tasks that utilize structured knowledge sources, and improve Text-to-SQL systems by exploring various prompt design strategies for employing LLMs. We conduct a systematic investigation into different demonstration selection methods and optimal instruction formats for prompting LLMs in the Text-to-SQL task. Our approach involves leveraging the syntactic structure of an example's SQL query to retrieve demonstrations, and we demonstrate that pursuing both diversity and similarity in demonstration selection leads to enhanced performance. Furthermore, we show that LLMs benefit from database-related knowledge augmentations. Our most effective strategy outperforms the state-of-the-art system by 2.5 points (Execution Accuracy) and the best fine-tuned system by 5.1 points on the Spider dataset. These results highlight the effectiveness of our approach in adapting LLMs to the Text-to-SQL task, and we present an analysis of the factors contributing to the success of our strategy.
△ Less
Submitted 21 May, 2023;
originally announced May 2023.
-
Introduction to dynamical mean-field theory of randomly connected neural networks with bidirectionally correlated couplings
Authors:
Wenxuan Zou,
Hai** Huang
Abstract:
Dynamical mean-field theory is a powerful physics tool used to analyze the typical behavior of neural networks, where neurons can be recurrently connected, or multiple layers of neurons can be stacked. However, it is not easy for beginners to access the essence of this tool and the underlying physics. Here, we give a pedagogical introduction of this method in a particular example of random neural…
▽ More
Dynamical mean-field theory is a powerful physics tool used to analyze the typical behavior of neural networks, where neurons can be recurrently connected, or multiple layers of neurons can be stacked. However, it is not easy for beginners to access the essence of this tool and the underlying physics. Here, we give a pedagogical introduction of this method in a particular example of random neural networks, where neurons are randomly and fully connected by correlated synapses and therefore the network exhibits rich emergent collective dynamics. We also review related past and recent important works applying this tool. In addition, a physically transparent and alternative method, namely the dynamical cavity method, is also introduced to derive exactly the same results. The numerical implementation of solving the integro-differential mean-field equations is also detailed, with an illustration of exploring the fluctuation dissipation theorem.
△ Less
Submitted 7 October, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
Cross-View Hierarchy Network for Stereo Image Super-Resolution
Authors:
Wenbin Zou,
Hongxia Gao,
Liang Chen,
Yunchen Zhang,
Mingchao Jiang,
Zhongxin Yu,
Ming Tan
Abstract:
Stereo image super-resolution aims to improve the quality of high-resolution stereo image pairs by exploiting complementary information across views. To attain superior performance, many methods have prioritized designing complex modules to fuse similar information across views, yet overlooking the importance of intra-view information for high-resolution reconstruction. It also leads to problems o…
▽ More
Stereo image super-resolution aims to improve the quality of high-resolution stereo image pairs by exploiting complementary information across views. To attain superior performance, many methods have prioritized designing complex modules to fuse similar information across views, yet overlooking the importance of intra-view information for high-resolution reconstruction. It also leads to problems of wrong texture in recovered images. To address this issue, we explore the interdependencies between various hierarchies from intra-view and propose a novel method, named Cross-View-Hierarchy Network for Stereo Image Super-Resolution (CVHSSR). Specifically, we design a cross-hierarchy information mining block (CHIMB) that leverages channel attention and large kernel convolution attention to extract both global and local features from the intra-view, enabling the efficient restoration of accurate texture details. Additionally, a cross-view interaction module (CVIM) is proposed to fuse similar features from different views by utilizing cross-view attention mechanisms, effectively adapting to the binocular scene. Extensive experiments demonstrate the effectiveness of our method. CVHSSR achieves the best stereo image super-resolution performance than other state-of-the-art methods while using fewer parameters. The source code and pre-trained models are available at https://github.com/AlexZou14/CVHSSR.
△ Less
Submitted 12 April, 2023;
originally announced April 2023.
-
Progressive Content-aware Coded Hyperspectral Compressive Imaging
Authors:
Xuanyu Zhang,
Bin Chen,
Wenzhen Zou,
Shuai Liu,
Yongbing Zhang,
Ruiqin Xiong,
Jian Zhang
Abstract:
Hyperspectral imaging plays a pivotal role in a wide range of applications, like remote sensing, medicine, and cytology. By acquiring 3D hyperspectral images (HSIs) via 2D sensors, the coded aperture snapshot spectral imaging (CASSI) has achieved great success due to its hardware-friendly implementation and fast imaging speed. However, for some less spectrally sparse scenes, single snapshot and un…
▽ More
Hyperspectral imaging plays a pivotal role in a wide range of applications, like remote sensing, medicine, and cytology. By acquiring 3D hyperspectral images (HSIs) via 2D sensors, the coded aperture snapshot spectral imaging (CASSI) has achieved great success due to its hardware-friendly implementation and fast imaging speed. However, for some less spectrally sparse scenes, single snapshot and unreasonable coded aperture design tend to make HSI recovery more ill-posed and yield poor spatial and spectral fidelity. In this paper, we propose a novel Progressive Content-Aware CASSI framework, dubbed PCA-CASSI, which captures HSIs with multiple optimized content-aware coded apertures and fuses all the snapshots for reconstruction progressively. Simultaneously, by map** the Range-Null space Decomposition (RND) into a deep network with several phases, an RND-HRNet is proposed for HSI recovery. Each recovery phase can fully exploit the hidden physical information in the coded apertures via explicit $\mathcal{R}$$-$$\mathcal{N}$ decomposition and explore the spatial-spectral correlation by dual transformer blocks. Our method is validated to surpass other state-of-the-art methods on both multiple- and single-shot HSI imaging tasks by large margins.
△ Less
Submitted 17 March, 2023;
originally announced March 2023.
-
A Class-wise Non-salient Region Generalized Framework for Video Semantic Segmentation
Authors:
Yuhang Zhang,
Shishun Tian,
Muxin Liao,
Zhengyu Zhang,
Wenbin Zou,
Chen Xu
Abstract:
Video semantic segmentation (VSS) is beneficial for dealing with dynamic scenes due to the continuous property of the real-world environment. On the one hand, some methods alleviate the predicted inconsistent problem between continuous frames. On the other hand, other methods employ the previous frame as the prior information to assist in segmenting the current frame. Although the previous methods…
▽ More
Video semantic segmentation (VSS) is beneficial for dealing with dynamic scenes due to the continuous property of the real-world environment. On the one hand, some methods alleviate the predicted inconsistent problem between continuous frames. On the other hand, other methods employ the previous frame as the prior information to assist in segmenting the current frame. Although the previous methods achieve superior performances on the independent and identically distributed (i.i.d) data, they can not generalize well on other unseen domains. Thus, we explore a new task, the video generalizable semantic segmentation (VGSS) task that considers both continuous frames and domain generalization. In this paper, we propose a class-wise non-salient region generalized (CNSG) framework for the VGSS task. Concretely, we first define the class-wise non-salient feature, which describes features of the class-wise non-salient region that carry more generalizable information. Then, we propose a class-wise non-salient feature reasoning strategy to select and enhance the most generalized channels adaptively. Finally, we propose an inter-frame non-salient centroid alignment loss to alleviate the predicted inconsistent problem in the VGSS task. We also extend our video-based framework to the image-based generalizable semantic segmentation (IGSS) task. Experiments demonstrate that our CNSG framework yields significant improvement in the VGSS and IGSS tasks.
△ Less
Submitted 28 December, 2022;
originally announced December 2022.
-
Technical Report -- Competition Solution for Prompt Tuning using Pretrained Language Model
Authors:
Jiang-Long Song,
Wu-He Zou,
Feng Li,
Xiao-Lei Qin,
Wei-Dong Zhang
Abstract:
Prompt tuning recently becomes a hot-spot in the applications of large pretrained language models on specific downstream tasks. Regarding the Language Model as a Service (LMaaS), black-box tuning using derivative-free optimization (DFO) provides a novel approach to expand the practical scenarios of pretrained models and enrich the researches of few-shot learning. In this report, we present our sol…
▽ More
Prompt tuning recently becomes a hot-spot in the applications of large pretrained language models on specific downstream tasks. Regarding the Language Model as a Service (LMaaS), black-box tuning using derivative-free optimization (DFO) provides a novel approach to expand the practical scenarios of pretrained models and enrich the researches of few-shot learning. In this report, we present our solution in this competition that is based on the LMaaS scenario. Our solution consists of several modifications to BBTv2, including multiple label words, selection of P0, rolling update strategy, multi-task loss from MLP classifier, and finally using the ensemble method to further improve generalization ability. We also shared some strategies that we tried but didn't use in the final submission for further discussion. In the end we raised a question about the SNLI dataset and the impact on the results, as well as our concerns about the competition.
△ Less
Submitted 20 December, 2022; v1 submitted 12 December, 2022;
originally announced December 2022.
-
Statistical mechanics of continual learning: variational principle and mean-field potential
Authors:
Chan Li,
Zhenye Huang,
Wenxuan Zou,
Hai** Huang
Abstract:
An obstacle to artificial general intelligence is set by continual learning of multiple tasks of different nature. Recently, various heuristic tricks, both from machine learning and from neuroscience angles, were proposed, but they lack a unified theory ground. Here, we focus on continual learning in single-layered and multi-layered neural networks of binary weights. A variational Bayesian learnin…
▽ More
An obstacle to artificial general intelligence is set by continual learning of multiple tasks of different nature. Recently, various heuristic tricks, both from machine learning and from neuroscience angles, were proposed, but they lack a unified theory ground. Here, we focus on continual learning in single-layered and multi-layered neural networks of binary weights. A variational Bayesian learning setting is thus proposed, where the neural networks are trained in a field-space, rather than gradient-ill-defined discrete-weight space, and furthermore, weight uncertainty is naturally incorporated, and modulates synaptic resources among tasks. From a physics perspective, we translate the variational continual learning into Franz-Parisi thermodynamic potential framework, where previous task knowledge acts as a prior and a reference as well. We thus interpret the continual learning of the binary perceptron in a teacher-student setting as a Franz-Parisi potential computation. The learning performance can then be analytically studied with mean-field order parameters, whose predictions coincide with numerical experiments using stochastic gradient descent methods. Based on the variational principle and Gaussian field approximation of internal preactivations in hidden layers, we also derive the learning algorithm considering weight uncertainty, which solves the continual learning with binary weights using multi-layered neural networks, and performs better than the currently available metaplasticity algorithm. Our proposed principled frameworks also connect to elastic weight consolidation, weight-uncertainty modulated learning, and neuroscience inspired metaplasticity, providing a theory-grounded method for the real-world multi-task learning with deep networks.
△ Less
Submitted 20 June, 2023; v1 submitted 6 December, 2022;
originally announced December 2022.
-
Emerging Threats in Deep Learning-Based Autonomous Driving: A Comprehensive Survey
Authors:
Hui Cao,
Wenlong Zou,
Yinkun Wang,
Ting Song,
Mengjun Liu
Abstract:
Since the 2004 DARPA Grand Challenge, the autonomous driving technology has witnessed nearly two decades of rapid development. Particularly, in recent years, with the application of new sensors and deep learning technologies extending to the autonomous field, the development of autonomous driving technology has continued to make breakthroughs. Thus, many carmakers and high-tech giants dedicated to…
▽ More
Since the 2004 DARPA Grand Challenge, the autonomous driving technology has witnessed nearly two decades of rapid development. Particularly, in recent years, with the application of new sensors and deep learning technologies extending to the autonomous field, the development of autonomous driving technology has continued to make breakthroughs. Thus, many carmakers and high-tech giants dedicated to research and system development of autonomous driving. However, as the foundation of autonomous driving, the deep learning technology faces many new security risks. The academic community has proposed deep learning countermeasures against the adversarial examples and AI backdoor, and has introduced them into the autonomous driving field for verification. Deep learning security matters to autonomous driving system security, and then matters to personal safety, which is an issue that deserves attention and research.This paper provides an summary of the concepts, developments and recent research in deep learning security technologies in autonomous driving. Firstly, we briefly introduce the deep learning framework and pipeline in the autonomous driving system, which mainly include the deep learning technologies and algorithms commonly used in this field. Moreover, we focus on the potential security threats of the deep learning based autonomous driving system in each functional layer in turn. We reviews the development of deep learning attack technologies to autonomous driving, investigates the State-of-the-Art algorithms, and reveals the potential risks. At last, we provides an outlook on deep learning security in the autonomous driving field and proposes recommendations for building a safe and trustworthy autonomous driving system.
△ Less
Submitted 19 October, 2022;
originally announced October 2022.
-
Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate
Authors:
Dongjie Yu,
Wenjun Zou,
Yujie Yang,
Haitong Ma,
Shengbo Eben Li,
**gliang Duan,
Jianyu Chen
Abstract:
Safe reinforcement learning (RL) that solves constraint-satisfactory policies provides a promising way to the broader safety-critical applications of RL in real-world problems such as robotics. Among all safe RL approaches, model-based methods reduce training time violations further due to their high sample efficiency. However, lacking safety robustness against the model uncertainties remains an i…
▽ More
Safe reinforcement learning (RL) that solves constraint-satisfactory policies provides a promising way to the broader safety-critical applications of RL in real-world problems such as robotics. Among all safe RL approaches, model-based methods reduce training time violations further due to their high sample efficiency. However, lacking safety robustness against the model uncertainties remains an issue in safe model-based RL, especially in training time safety. In this paper, we propose a distributional reachability certificate (DRC) and its Bellman equation to address model uncertainties and characterize robust persistently safe states. Furthermore, we build a safe RL framework to resolve constraints required by the DRC and its corresponding shield policy. We also devise a line search method to maintain safety and reach higher returns simultaneously while leveraging the shield policy. Comprehensive experiments on classical benchmarks such as constrained tracking and navigation indicate that the proposed algorithm achieves comparable returns with much fewer constraint violations during training.
△ Less
Submitted 14 October, 2022;
originally announced October 2022.
-
Downlink Compression Improves TopK Sparsification
Authors:
William Zou,
Hans De Sterck,
Jun Liu
Abstract:
Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the lar…
▽ More
Training large neural networks is time consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the largest K components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) direction, the currently accepted belief says that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not only does it reduce the amount of communication per step, but also, counter-intuitively, can improve the upper bound in the convergence analysis. To show this, we revisit non-convex convergence analysis of topK stochastic gradient descent (SGD) and extend it from the unidirectional to the bidirectional setting. We also remove a restriction of the previous analysis that requires unrealistically large values of K. We experimentally evaluate bidirectional topK SGD against unidirectional topK SGD and show that models trained with bidirectional topK SGD will perform as well as models trained with unidirectional topK SGD while yielding significant communication benefits for large numbers of workers.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
AIM 2022 Challenge on Super-Resolution of Compressed Image and Video: Dataset, Methods and Results
Authors:
Ren Yang,
Radu Timofte,
Xin Li,
Qi Zhang,
Lin Zhang,
Fanglong Liu,
Dongliang He,
Fu li,
He Zheng,
Weihang Yuan,
Pavel Ostyakov,
Dmitry Vyal,
Magauiya Zhussip,
Xueyi Zou,
Youliang Yan,
Lei Li,
**gzhu Tang,
Ming Chen,
Shijie Zhao,
Yu Zhu,
Xiaoran Qin,
Chenghua Li,
Cong Leng,
Jian Cheng,
Claudio Rota
, et al. (28 additional authors not shown)
Abstract:
This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. This challenge includes two tracks. Track 1 aims at the super-resolution of compressed image, and Track~2 targets the super-resolution of compressed video. In Track 1, we use the popular dataset DIV2K as the training, validation and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 3…
▽ More
This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. This challenge includes two tracks. Track 1 aims at the super-resolution of compressed image, and Track~2 targets the super-resolution of compressed video. In Track 1, we use the popular dataset DIV2K as the training, validation and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, including the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, there are 12 teams and 2 teams that submitted the final results to Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state-of-the-art of super-resolution on compressed image and video. The proposed LDV 3.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge is at https://github.com/RenYang-home/AIM22_CompressSR.
△ Less
Submitted 25 August, 2022; v1 submitted 23 August, 2022;
originally announced August 2022.
-
Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition
Authors:
Goutham Rajendran,
Wei Zou
Abstract:
We investigate robustness properties of pre-trained neural models for automatic speech recognition. Real life data in machine learning is usually very noisy and almost never clean, which can be attributed to various factors depending on the domain, e.g. outliers, random noise and adversarial noise. Therefore, the models we develop for various tasks should be robust to such kinds of noisy data, whi…
▽ More
We investigate robustness properties of pre-trained neural models for automatic speech recognition. Real life data in machine learning is usually very noisy and almost never clean, which can be attributed to various factors depending on the domain, e.g. outliers, random noise and adversarial noise. Therefore, the models we develop for various tasks should be robust to such kinds of noisy data, which led to the thriving field of robust machine learning. We consider this important issue in the setting of automatic speech recognition. With the increasing popularity of pre-trained models, it's an important question to analyze and understand the robustness of such models to noise. In this work, we perform a robustness analysis of the pre-trained neural models wav2vec2, HuBERT and DistilHuBERT on the LibriSpeech and TIMIT datasets. We use different kinds of noising mechanisms and measure the model performances as quantified by the inference time and the standard Word Error Rate metric. We also do an in-depth layer-wise analysis of the wav2vec2 model when injecting noise in between layers, enabling us to predict at a high level what each layer learns. Finally for this model, we visualize the propagation of errors across the layers and compare how it behaves on clean versus noisy data. Our experiments conform the predictions of Pasad et al. [2021] and also raise interesting directions for future work.
△ Less
Submitted 17 August, 2022;
originally announced August 2022.
-
DSLA: Dynamic smooth label assignment for efficient anchor-free object detection
Authors:
Hu Su,
Yonghao He,
Rui Jiang,
Jiabin Zhang,
Wei Zou,
Bin Fan
Abstract:
Anchor-free detectors basically formulate object detection as dense classification and regression. For popular anchor-free detectors, it is common to introduce an individual prediction branch to estimate the quality of localization. The following inconsistencies are observed when we delve into the practices of classification and quality estimation. Firstly, for some adjacent samples which are assi…
▽ More
Anchor-free detectors basically formulate object detection as dense classification and regression. For popular anchor-free detectors, it is common to introduce an individual prediction branch to estimate the quality of localization. The following inconsistencies are observed when we delve into the practices of classification and quality estimation. Firstly, for some adjacent samples which are assigned completely different labels, the trained model would produce similar classification scores. This violates the training objective and leads to performance degradation. Secondly, it is found that detected bounding boxes with higher confidences contrarily have smaller overlaps with the corresponding ground-truth. Accurately localized bounding boxes would be suppressed by less accurate ones in the Non-Maximum Suppression (NMS) procedure. To address the inconsistency problems, the Dynamic Smooth Label Assignment (DSLA) method is proposed. Based on the concept of centerness originally developed in FCOS, a smooth assignment strategy is proposed. The label is smoothed to a continuous value in [0, 1] to make a steady transition between positive and negative samples. Intersection-of-Union (IoU) is predicted dynamically during training and is coupled with the smoothed label. The dynamic smooth label is assigned to supervise the classification branch. Under such supervision, quality estimation branch is naturally merged into the classification branch, which simplifies the architecture of anchor-free detector. Comprehensive experiments are conducted on the MS COCO benchmark. It is demonstrated that, DSLA can significantly boost the detection accuracy by alleviating the above inconsistencies for anchor-free detectors. Our codes are released at https://github.com/YonghaoHe/DSLA.
△ Less
Submitted 29 September, 2022; v1 submitted 1 August, 2022;
originally announced August 2022.
-
Prior-Guided One-shot Neural Architecture Search
Authors:
Peijie Dong,
Xin Niu,
Lujun Li,
Linzhen Xie,
Wenbin Zou,
Tian Ye,
Zimian Wei,
Hengyue Pan
Abstract:
Neural architecture search methods seek optimal candidates with efficient weight-sharing supernet training. However, recent studies indicate poor ranking consistency about the performance between stand-alone architectures and shared-weight networks. In this paper, we present Prior-Guided One-shot NAS (PGONAS) to strengthen the ranking correlation of supernets. Specifically, we first explore the ef…
▽ More
Neural architecture search methods seek optimal candidates with efficient weight-sharing supernet training. However, recent studies indicate poor ranking consistency about the performance between stand-alone architectures and shared-weight networks. In this paper, we present Prior-Guided One-shot NAS (PGONAS) to strengthen the ranking correlation of supernets. Specifically, we first explore the effect of activation functions and propose a balanced sampling strategy based on the Sandwich Rule to alleviate weight coupling in the supernet. Then, FLOPs and Zen-Score are adopted to guide the training of supernet with ranking correlation loss. Our PGONAS ranks 3rd place in the supernet Track Track of CVPR2022 Second lightweight NAS challenge. Code is available in https://github.com/pprp/CVPR2022-NAS?competition-Track1-3th-solution.
△ Less
Submitted 27 June, 2022;
originally announced June 2022.
-
R2D2: Robust Data-to-Text with Replacement Detection
Authors:
Linyong Nan,
Lorenzo Jaime Yu Flores,
Yilun Zhao,
Yixin Liu,
Luke Benson,
Wei** Zou,
Dragomir Radev
Abstract:
Unfaithful text generation is a common problem for text generation systems. In the case of Data-to-Text (D2T) systems, the factuality of the generated text is particularly crucial for any real-world applications. We introduce R2D2, a training framework that addresses unfaithful Data-to-Text generation by training a system both as a generator and a faithfulness discriminator with additional replace…
▽ More
Unfaithful text generation is a common problem for text generation systems. In the case of Data-to-Text (D2T) systems, the factuality of the generated text is particularly crucial for any real-world applications. We introduce R2D2, a training framework that addresses unfaithful Data-to-Text generation by training a system both as a generator and a faithfulness discriminator with additional replacement detection and unlikelihood learning tasks. To facilitate such training, we propose two methods for sampling unfaithful sentences. We argue that the poor entity retrieval capability of D2T systems is one of the primary sources of unfaithfulness, so in addition to the existing metrics, we further propose NER-based metrics to evaluate the fidelity of D2T generations. Our experimental results show that R2D2 systems could effectively mitigate the unfaithful text generation, and they achieve new state-of-the-art results on FeTaQA, LogicNLG, and ToTTo, all with significant improvements.
△ Less
Submitted 24 May, 2022;
originally announced May 2022.
-
NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results
Authors:
Yawei Li,
Kai Zhang,
Radu Timofte,
Luc Van Gool,
Fangyuan Kong,
Mingxi Li,
Songwei Liu,
Zongcai Du,
Ding Liu,
Chenhui Zhou,
**gyi Chen,
Qingrui Han,
Zheyuan Li,
Yingqi Liu,
Xiangyu Chen,
Haoming Cai,
Yu Qiao,
Chao Dong,
Long Sun,
**shan Pan,
Yi Zhu,
Zhikai Zong,
Xiaoxiao Liu,
Zheng Hui,
Tao Yang
, et al. (86 additional authors not shown)
Abstract:
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of e…
▽ More
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29.00dB on DIV2K validation set. IMDN is set as the baseline for efficiency measurement. The challenge had 3 tracks including the main track (runtime), sub-track one (model complexity), and sub-track two (overall performance). In the main track, the practical runtime performance of the submissions was evaluated. The rank of the teams were determined directly by the absolute value of the average runtime on the validation set and test set. In sub-track one, the number of parameters and FLOPs were considered. And the individual rankings of the two metrics were summed up to determine a final ranking in this track. In sub-track two, all of the five metrics mentioned in the description of the challenge including runtime, parameter count, FLOPs, activations, and memory consumption were considered. Similar to sub-track one, the rankings of five metrics were summed up to determine a final ranking. The challenge had 303 registered participants, and 43 teams made valid submissions. They gauge the state-of-the-art in efficient single image super-resolution.
△ Less
Submitted 11 May, 2022;
originally announced May 2022.
-
Self-Calibrated Efficient Transformer for Lightweight Super-Resolution
Authors:
Wenbin Zou,
Tian Ye,
Weixin Zheng,
Yunchen Zhang,
Liang Chen,
Yi Wu
Abstract:
Recently, deep learning has been successfully applied to the single-image super-resolution (SISR) with remarkable performance. However, most existing methods focus on building a more complex network with a large number of layers, which can entail heavy computational costs and memory storage. To address this problem, we present a lightweight Self-Calibrated Efficient Transformer (SCET) network to s…
▽ More
Recently, deep learning has been successfully applied to the single-image super-resolution (SISR) with remarkable performance. However, most existing methods focus on building a more complex network with a large number of layers, which can entail heavy computational costs and memory storage. To address this problem, we present a lightweight Self-Calibrated Efficient Transformer (SCET) network to solve this problem. The architecture of SCET mainly consists of the self-calibrated module and efficient transformer block, where the self-calibrated module adopts the pixel attention mechanism to extract image features effectively. To further exploit the contextual information from features, we employ an efficient transformer to help the network obtain similar features over long distances and thus recover sufficient texture details. We provide comprehensive results on different settings of the overall network. Our proposed method achieves more remarkable performance than baseline methods. The source code and pre-trained models are available at https://github.com/AlexZou14/SCET.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
Audio Deep Fake Detection System with Neural Stitching for ADD 2022
Authors:
Rui Yan,
Cheng Wen,
Shuran Zhou,
Tingwei Guo,
Wei Zou,
Xiangang Li
Abstract:
This paper describes our best system and methodology for ADD 2022: The First Audio Deep Synthesis Detection Challenge\cite{Yi2022ADD}. The very same system was used for both two rounds of evaluation in Track 3.2 with a similar training methodology. The first round of Track 3.2 data is generated from Text-to-Speech(TTS) or voice conversion (VC) algorithms, while the second round of data consists of…
▽ More
This paper describes our best system and methodology for ADD 2022: The First Audio Deep Synthesis Detection Challenge\cite{Yi2022ADD}. The very same system was used for both two rounds of evaluation in Track 3.2 with a similar training methodology. The first round of Track 3.2 data is generated from Text-to-Speech(TTS) or voice conversion (VC) algorithms, while the second round of data consists of generated fake audio from other participants in Track 3.1, aiming to spoof our systems. Our systems use a standard 34-layer ResNet, with multi-head attention pooling \cite{india2019self} to learn the discriminative embedding for fake audio and spoof detection. We further utilize neural stitching to boost the model's generalization capability in order to perform equally well in different tasks, and more details will be explained in the following sessions. The experiments show that our proposed method outperforms all other systems with a 10.1% equal error rate(EER) in Track 3.2.
△ Less
Submitted 19 April, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Time Domain Adversarial Voice Conversion for ADD 2022
Authors:
Cheng Wen,
Tingwei Guo,
Xingjun Tan,
Rui Yan,
Shuran Zhou,
Chuandong Xie,
Wei Zou,
Xiangang Li
Abstract:
In this paper, we describe our speech generation system for the first Audio Deep Synthesis Detection Challenge (ADD 2022). Firstly, we build an any-to-many voice conversion (VC) system to convert source speech with arbitrary language content into the target speaker%u2019s fake speech. Then the converted speech generated from VC is post-processed in the time domain to improve the deception ability.…
▽ More
In this paper, we describe our speech generation system for the first Audio Deep Synthesis Detection Challenge (ADD 2022). Firstly, we build an any-to-many voice conversion (VC) system to convert source speech with arbitrary language content into the target speaker%u2019s fake speech. Then the converted speech generated from VC is post-processed in the time domain to improve the deception ability. The experimental results show that our system has adversarial ability against anti-spoofing detectors with a little compromise in audio quality and speaker similarity. This system ranks top in Track 3.1 in the ADD 2022, showing that our method could also gain good generalization ability against different detectors.
△ Less
Submitted 19 April, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Audio-Visual Wake Word Spotting System For MISP Challenge 2021
Authors:
Yanguang Xu,
Jianwei Sun,
Yang Han,
Shuaijiang Zhao,
Chaoyang Mei,
Tingwei Guo,
Shuran Zhou,
Chuandong Xie,
Wei Zou,
Xiangang Li,
Shuran Zhou,
Chuandong Xie,
Wei Zou,
Xiangang Li
Abstract:
This paper presents the details of our system designed for the Task 1 of Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, firstly, we take advantage of speech enhancement algorithms such as beamforming and weight…
▽ More
This paper presents the details of our system designed for the Task 1 of Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, firstly, we take advantage of speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to address the multi-microphone conversational audio. Secondly, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video information, the provided region of interest (ROI) is used to obtain visual representation. Then the multi-layer CNN is proposed to learn audio and visual representations, and these representations are fed into our two-branch attention-based network which can be employed for fusion, such as transformer and conformed. The focal loss is used to fine-tune the model and improve the performance significantly. Finally, multiple trained models are integrated by casting vote to achieve our final 0.091 score.
△ Less
Submitted 19 April, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Graph Flow: Cross-layer Graph Flow Distillation for Dual Efficient Medical Image Segmentation
Authors:
Wenxuan Zou,
Muyi Sun
Abstract:
With the development of deep convolutional neural networks, medical image segmentation has achieved a series of breakthroughs in recent years. However, the high-performance convolutional neural networks always mean numerous parameters and high computation costs, which will hinder the applications in clinical scenarios. Meanwhile, the scarceness of large-scale annotated medical image datasets furth…
▽ More
With the development of deep convolutional neural networks, medical image segmentation has achieved a series of breakthroughs in recent years. However, the high-performance convolutional neural networks always mean numerous parameters and high computation costs, which will hinder the applications in clinical scenarios. Meanwhile, the scarceness of large-scale annotated medical image datasets further impedes the application of high-performance networks. To tackle these problems, we propose Graph Flow, a comprehensive knowledge distillation framework, for both network-efficiency and annotation-efficiency medical image segmentation. Specifically, our core Graph Flow Distillation transfer the essence of cross-layer variations from a well-trained cumbersome teacher network to a non-trained compact student network. In addition, an unsupervised Paraphraser Module is integrated to purify the knowledge of the teacher network, which is also beneficial for the stabilization of training procedure. Furthermore, we build a unified distillation framework by integrating the adversarial distillation and the vanilla logits distillation, which can further refine the final predictions of the compact network. With different teacher networks (conventional convolutional architecture or prevalent transformer architecture) and student networks, we conduct extensive experiments on four medical image datasets with different modalities (Gastric Cancer, Synapse, BUSI, and CVC-ClinicDB).We demonstrate the prominent ability of our method which achieves competitive performance on these datasets. Moreover, we demonstrate the effectiveness of our Graph Flow through a novel semi-supervised paradigm for dual efficient medical image segmentation. Our code will be available at Graph Flow.
△ Less
Submitted 29 August, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
High-order tensor flow processing using integrated photonic circuits
Authors:
Shaofu Xu,
**g Wang,
Sicheng Yi,
Weiwen Zou
Abstract:
Tensor analytics lays mathematical basis for the prosperous promotion of multiway signal processing. To increase computing throughput, mainstream processors transform tensor convolutions to matrix multiplications to enhance parallelism of computing. However, such order-reducing transformation produces data duplicates and consumes additional memory. Here, we demonstrate an integrated photonic tenso…
▽ More
Tensor analytics lays mathematical basis for the prosperous promotion of multiway signal processing. To increase computing throughput, mainstream processors transform tensor convolutions to matrix multiplications to enhance parallelism of computing. However, such order-reducing transformation produces data duplicates and consumes additional memory. Here, we demonstrate an integrated photonic tensor flow processor without tensor-matrix transformation, which outputs the convolved tensor as the input tensor 'flows' through the processor. The hybrid manipulation of optical dimensions of wavelength, time, and space enables the direct representation and processing of high-order tensors in optical domain. In the proof-of-concept experiment, processing of multi-channel images and videos is accomplished at the frequency of 20 GHz. A convolutional neural network is demonstrated on the processor, which achieves an accuracy of 97.9 percent on action recognition.
△ Less
Submitted 22 December, 2021;
originally announced December 2021.
-
SDWNet: A Straight Dilated Network with Wavelet Transformation for Image Deblurring
Authors:
Wenbin Zou,
Mingchao Jiang,
Yunchen Zhang,
Liang Chen,
Zhiyong Lu,
Yi Wu
Abstract:
Image deblurring is a classical computer vision problem that aims to recover a sharp image from a blurred image. To solve this problem, existing methods apply the Encode-Decode architecture to design the complex networks to make a good performance. However, most of these methods use repeated up-sampling and down-sampling structures to expand the receptive field, which results in texture informatio…
▽ More
Image deblurring is a classical computer vision problem that aims to recover a sharp image from a blurred image. To solve this problem, existing methods apply the Encode-Decode architecture to design the complex networks to make a good performance. However, most of these methods use repeated up-sampling and down-sampling structures to expand the receptive field, which results in texture information loss during the sampling process and some of them design the multiple stages that lead to difficulties with convergence. Therefore, our model uses dilated convolution to enable the obtainment of the large receptive field with high spatial resolution. Through making full use of the different receptive fields, our method can achieve better performance. On this basis, we reduce the number of up-sampling and down-sampling and design a simple network structure. Besides, we propose a novel module using the wavelet transform, which effectively helps the network to recover clear high-frequency texture details. Qualitative and quantitative evaluations of real and synthetic datasets show that our deblurring method is comparable to existing algorithms in terms of performance with much lower training requirements. The source code and pre-trained models are available at https://github.com/FlyEgle/SDWNet.
△ Less
Submitted 12 October, 2021;
originally announced October 2021.
-
Comb-based photonic neural population for parallel and nonlinear processing
Authors:
Bowen Ma,
Junfeng Zhang,
Weiwen Zou
Abstract:
It is believed that neural information representation and processing relies on the neural population instead of a single neuron. In neuromorphic photonics, photonic neurons in the form of nonlinear responses have been extensively studied in single devices and temporal nodes. However, to construct a photonic neural population (PNP), the process of scaling up and massive interconnections remain chal…
▽ More
It is believed that neural information representation and processing relies on the neural population instead of a single neuron. In neuromorphic photonics, photonic neurons in the form of nonlinear responses have been extensively studied in single devices and temporal nodes. However, to construct a photonic neural population (PNP), the process of scaling up and massive interconnections remain challenging considering the physical complexity and response latency. Here, we propose a comb-based PNP interconnected by carrier coupling with superior scalability. Two unique properties of neural population are theoretically and experimentally demonstrated in the comb-based PNP, including nonlinear response curves and population activities coding. A classification task of three input patterns with dual radio-frequency (RF) tones is successfully implemented in a real-time manner, which manifests the comb-based PNP can make effective use of the ultra-broad bandwidth of photonics for parallel and nonlinear processing.
△ Less
Submitted 25 September, 2021;
originally announced September 2021.
-
BLESER: Bug Localization Based on Enhanced Semantic Retrieval
Authors:
Weiqin Zou,
Enming Li,
Chunrong Fang
Abstract:
Static bug localization techniques that locate bugs at method granularity have gained much attention from both researchers and practitioners. For a static method-level bug localization technique, a key but challenging step is to fully retrieve the semantics of methods and bug reports. Currently, existing studies mainly use the same bag-of-word space to represent the semantics of methods and bug re…
▽ More
Static bug localization techniques that locate bugs at method granularity have gained much attention from both researchers and practitioners. For a static method-level bug localization technique, a key but challenging step is to fully retrieve the semantics of methods and bug reports. Currently, existing studies mainly use the same bag-of-word space to represent the semantics of methods and bug reports without considering structure information of methods and textual contexts of bug reports, which largely and negatively affects bug localization performance.
To address this problem, we develop BLESER, a new bug localization technique based on enhanced semantic retrieval. Specifically, we use an AST-based code embedding model (capturing code structure better) to retrieve the semantics of methods, and word embedding models (capturing textual contexts better) to represent the semantics of bug reports. Then, a deep learning model is built on the enhanced semantic representations. During model building, we compare five typical word embedding models in representing bug reports and try to explore the usefulness of re-sampling strategies and cost-sensitive strategies in handling class imbalance problems. We evaluate our BLESER on five Java projects from the Defects4J dataset. We find that: (1) On the whole, the word embedding model ELMo outperformed the other four models (including word2vec, BERT, etc.) in facilitating bug localization techniques. (2) Among four strategies aiming at solving class imbalance problems, the strategy ROS (random over-sampling) performed much better than the other three strategies (including random under-sampling, Focal Loss, etc.). (3) By integrating ELMo and ROS into BLESER, at method-level bug localization, we could achieve MAP of 0.108-0.504, MRR of 0.134-0.510, and Accuracy@1 of 0.125-0.5 on five Defects4J projects.
△ Less
Submitted 8 September, 2021;
originally announced September 2021.